Zhou et al. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. arXiv 2017
Problem Definition
- Raw Point Cloud:
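Collected with a 64-beam LIDAR sensor (about 10 rotations per second, with roughly 100,000 reflected points per rotation). Each point carries 4 dimensions of data: its coordinates (x, y, z) and its reflection intensity.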
- Problem:
Input: Raw Point Cloud
Output: 3D Bounding Box Prediction
- Dataset:
KITTI Dataset (7000 samples with bounding box annotations)
Architecture
- Step 1: Feature Learning Network
- Input: Raw Point Cloud
- Output: 4D Sparse Tensor (Feature Map)
- Step 2: Convolution Middle Layers
- Input: 4D Sparse Tensor (Feature Map)
- Output: High-Level Feature (Reducing the size)
- Step 3: Region Proposal Networks
- Input: High-Level Feature (4D)
- Output: Class and Bounding Box
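To make the input side of Step 1 more concrete, here is a minimal NumPy sketch of grouping raw points into voxels and scattering one feature vector per voxel into a 4D tensor. It is only an illustration: the function names, the range and voxel size arguments, and the use of a per-voxel mean in place of the learned voxel feature are all assumptions, not the paper's implementation.

```python
import numpy as np

def voxelize(points, lo, hi, size):
    """Group (N, 4) points [x, y, z, intensity] by the voxel they fall into.

    lo, hi, size are hypothetical (x, y, z) range bounds and voxel edge lengths.
    Returns (voxels, grid), where voxels maps (ix, iy, iz) -> (k, 4) array.
    """
    lo, hi, size = (np.asarray(v, dtype=float) for v in (lo, hi, size))
    grid = np.round((hi - lo) / size).astype(int)          # voxels per axis
    inside = np.all((points[:, :3] >= lo) & (points[:, :3] < hi), axis=1)
    kept = points[inside]                                  # drop out-of-range points
    idx = np.floor((kept[:, :3] - lo) / size).astype(int)
    idx = np.minimum(idx, grid - 1)                        # guard against rounding at the edge
    voxels = {}
    for key, p in zip(map(tuple, idx), kept):
        voxels.setdefault(key, []).append(p)
    return {k: np.stack(v) for k, v in voxels.items()}, tuple(grid)

def to_feature_tensor(voxels, grid):
    """Scatter one feature vector per non-empty voxel into a (C, X, Y, Z) tensor.

    The per-voxel mean is a stand-in (assumption) for a learned voxel feature.
    """
    dense = np.zeros((4,) + tuple(grid))
    for (ix, iy, iz), pts in voxels.items():
        dense[:, ix, iy, iz] = pts.mean(axis=0)
    return dense

points = np.array([[0.1, 0.1, 0.1, 0.5],
                   [0.2, 0.3, 0.1, 0.7],
                   [1.5, 0.1, 0.1, 0.2],
                   [5.0, 5.0, 5.0, 1.0]])   # the last point is out of range
voxels, grid = voxelize(points, lo=(0, 0, 0), hi=(2, 2, 2), size=(1, 1, 1))
tensor = to_feature_tensor(voxels, grid)
```

Most voxels stay empty, which is why the feature map is described as sparse.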
Problem:
How to visualize high-dimensional data
Solution:
Giving each datapoint a location in a two- or three-dimensional map
Advantages:
a. easier to optimize than other existing techniques
b. could reduce the tendency to crowd points together in the center of the map
c. could create a single map that reveals structure at many different scales
d. could use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed
Contribution and Discussion
Proposes an iterative online algorithm that solves the dictionary learning problem by efficiently minimizing, at each step, a quadratic surrogate function of the empirical cost over the set of constraints.
This algorithm is faster than previous approaches to dictionary learning on both small and large datasets of natural images.
Method
Background Knowledge -- Sparse Coding
1. Change of basis
In linear algebra, we may change the coordinates from one basis to another in order to make the problem we are solving easier.
Using different coordinates is like looking at the problem from a different view. This may help us find hidden information and structure that lie behind the data.
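As a small worked example of a change of basis, the NumPy sketch below expresses a vector in a new basis by solving a linear system. The basis and vector are arbitrary values chosen for illustration, and `change_basis` is a hypothetical helper name.

```python
import numpy as np

def change_basis(P, v):
    """Return the coordinates of v relative to the basis stored in the columns of P."""
    # v = P @ c, so c is the solution of a linear system.
    return np.linalg.solve(P, v)

# Columns of P are the new basis vectors, written in standard coordinates.
P = np.array([[1.0, 1.0],
              [1.0, -1.0]])
v = np.array([3.0, 1.0])

c = change_basis(P, v)      # coordinates in the new basis: [2, 1]
v_back = P @ c              # mapping back recovers the original vector
```

Here v = 2 * (1, 1) + 1 * (1, -1), so the same vector has coordinates (3, 1) in the standard basis and (2, 1) in the new one.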
2. Over-complete set of basis vectors
A set of basis vectors is over-complete if it is still complete after one vector is removed from it. Complete means that every element of the input set X can be approximated arbitrarily well in norm by finite linear combinations of vectors in the set.
3. Sparse coding (sparse dictionary learning)
Sparse dictionary learning is a representation learning method which aims at finding a sparse representation of the input data in the form of a linear combination of basic elements as well as those basic elements themselves. These elements are called atoms and they compose a dictionary. Atoms in the dictionary are not required to be orthogonal, and they may be an over-complete spanning set.
This problem setup also allows the dimensionality of the signals being represented to be higher than the one of the signals being observed. The above two properties lead to having seemingly redundant atoms that allow multiple representations of the same signal but also provide an improvement in sparsity and flexibility of the representation.
(Copied from Wikipedia)
a. [Adv] Could capture structures and patterns inherent in the input data
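b. [Con] With an over-complete basis, the coefficients a_i are no longer uniquely determined by the input vector
c. The sparse coding cost function on a set of m input vectors is defined as:
$$\min_{a_i^{(j)},\,\phi_i} \sum_{j=1}^{m} \left\| x^{(j)} - \sum_{i=1}^{k} a_i^{(j)} \phi_i \right\|^2 + \lambda \sum_{i=1}^{k} S\left(a_i^{(j)}\right)$$
where S(.) is a sparsity cost function which penalizes a_i for being far from zero.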
(From UFLDL)
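The cost can be evaluated directly in NumPy. The sketch below is illustrative only: it assumes the L1 penalty |a| as the sparsity cost (one common choice), and the function name and array layout are hypothetical. It also shows points a and b above: with an over-complete basis two different coefficient vectors reconstruct the same input, and the sparsity term prefers the sparser one.

```python
import numpy as np

def sparse_coding_cost(X, Phi, A, lam):
    """Reconstruction error plus a sparsity penalty.

    X:   (m, n) input vectors as rows
    Phi: (k, n) basis vectors as rows
    A:   (m, k) coefficients, one row per input
    lam: weight of the sparsity term
    """
    residual = X - A @ Phi
    reconstruction = np.sum(residual ** 2)
    sparsity = lam * np.sum(np.abs(A))   # assumption: L1 penalty as the sparsity cost
    return reconstruction + sparsity

X = np.array([[1.0, 0.0]])
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])             # 3 basis vectors in 2-D: over-complete

A_sparse = np.array([[1.0, 0.0, 0.0]])   # one active coefficient
A_dense = np.array([[0.5, -0.5, 0.5]])   # same reconstruction, three active coefficients

cost_sparse = sparse_coding_cost(X, Phi, A_sparse, lam=0.1)
cost_dense = sparse_coding_cost(X, Phi, A_dense, lam=0.1)
```

Both codes reconstruct X exactly, so the two costs differ only in the penalty term.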
Problem Statement
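Classical dictionary learning problem: given a finite training set of signals x_1, ..., x_n, minimize the empirical cost
$$f_n(D) = \frac{1}{n} \sum_{i=1}^{n} l(x_i, D)$$
where the loss l(x, D) can be defined as the optimal value of the sparse coding problem:
$$l(x, D) = \min_{\alpha \in \mathbb{R}^k} \frac{1}{2} \left\| x - D\alpha \right\|_2^2 + \lambda \left\| \alpha \right\|_1$$
To prevent D from being arbitrarily large (which would lead to arbitrarily small values of α), D is constrained to lie in C, the convex set of matrices:
$$C = \left\{ D \in \mathbb{R}^{m \times k} \;:\; d_j^T d_j \le 1 \ \text{for all } j = 1, \dots, k \right\}$$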
Algorithm Outline of Online Dictionary Learning
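The following is a rough NumPy sketch of an online dictionary learning loop of the kind summarized above, not the authors' implementation. Assumptions: the sparse coding step uses plain ISTA instead of the LARS solver used in the paper, the dictionary is updated with a single pass of block coordinate descent per sample, and all function names and default parameters are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code_ista(x, D, lam, n_iter=100):
    """Approximately solve min_a 0.5 * ||x - D a||^2 + lam * ||a||_1 (ISTA, an assumption)."""
    L = np.linalg.norm(D, 2) ** 2 + 1e-12        # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = soft_threshold(a - grad / L, lam / L)
    return a

def online_dictionary_learning(X, k, lam=0.1, n_passes=1, seed=0):
    """X: (n, m) signals as rows. Returns a dictionary D of shape (m, k)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    D = rng.standard_normal((m, k))
    D /= np.linalg.norm(D, axis=0)               # start with unit-norm atoms
    A = np.zeros((k, k))                          # accumulates alpha alpha^T
    B = np.zeros((m, k))                          # accumulates x alpha^T
    for _ in range(n_passes):
        for i in rng.permutation(len(X)):
            x = X[i]
            alpha = sparse_code_ista(x, D, lam)   # sparse coding with D fixed
            A += np.outer(alpha, alpha)
            B += np.outer(x, alpha)
            # One pass of block coordinate descent on the quadratic surrogate.
            for j in range(k):
                if A[j, j] < 1e-10:               # atom not used yet, leave it
                    continue
                u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
                D[:, j] = u / max(np.linalg.norm(u), 1.0)   # project onto the unit ball
    return D

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 8))
D = online_dictionary_learning(X, k=12, lam=0.1)
```

The projection in the last step of the inner loop is what keeps every atom inside the constraint set, so D cannot grow arbitrarily large.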
Peng et al., ICCV 2016
Problem Definition
Want to use CAD models to generate synthetic 2D images for training. This paper aims to deal with the missing cue information, such as texture, pose and background.
- Field: Few-Shot Object Detection
Cue invariance: The ability of the network to extract the equivalent high-level category information despite missing low-level cues.
Contribution and Discussion
This paper proposed a new method for learning object detectors for new categories that avoids the need for costly large-scale image annotation, based on the following findings:
- Demonstrated that synthetic CAD training of modern deep CNN object detectors can be successful when real-image training data for novel objects or domains is limited.
- Showed that training on synthetic images with simulated cues leads to similar performance as training on synthetic images without cues.
- For new categories, adding synthetic variance and fine-tuning the layers proved useful.
Method
1. Synthetic Generation of Low-Level Cues
Given a set of 3D CAD models for each object, it generates a synthetic 2D image training dataset by simulating various low-level cues.
The chosen subset of cue information is: object texture, color, 3D pose and 3D shape, as well as background scene texture and color. When learning a detection model for a new category with limited labeled real data, the choice of whether or not to simulate these cues in the synthetic data depends on the invariance of the representation. For example, if the representation is invariant to color, grayscale images can be rendered.
To generate a virtual image:
- randomly select a background image from the available background pool
- project the selected background image onto the image plane
- select a random texture image from the texture pool
- map the selected texture image onto the CAD model before rendering the object.
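The steps above can be sketched in a heavily simplified 2D form. This is only an illustration under strong assumptions: no CAD model is actually rendered, a boolean mask stands in for the rendered object silhouette, and the pools, image sizes and the `compose_virtual_image` name are hypothetical.

```python
import numpy as np

def compose_virtual_image(backgrounds, textures, mask, rng):
    """Paste a randomly chosen texture onto a randomly chosen background.

    backgrounds, textures: lists of (H, W, 3) arrays (the two pools)
    mask: (H, W) boolean silhouette standing in for the rendered CAD model
    """
    bg = backgrounds[rng.integers(len(backgrounds))]   # random background
    tex = textures[rng.integers(len(textures))]        # random texture
    image = bg.copy()                                   # background on the image plane
    image[mask] = tex[mask]                             # textured object drawn on top
    return image

rng = np.random.default_rng(0)
backgrounds = [np.zeros((4, 4, 3))]
textures = [np.ones((4, 4, 3))]
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

image = compose_virtual_image(backgrounds, textures, mask, rng)
```

To drop a cue in this toy setting, the pool for that cue could be replaced by a single uniform image, for example a plain gray texture.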
2. Deep Convolutional Neural Network Features
Positive and negative patches are extracted for each object from the synthetic images (and an optional small number of real images).
Each patch is fed into a deep neural network that computes feature activations, which are used to train the final classifier, as in the deep detection method of RCNN.
3. Analysing Cue Invariance of DCNN Features
Here they create two synthetic training sets, one with and one without a particular cue. They extract deep features from both sets, train two object detectors, and compare their performance on real test data.
- Hypothesis: If the representation is invariant to the cue, then similar high-level neurons will activate whether or not that cue is present in the input image, leading to similar category-level information at training and thus similar performance.
Here are their results:
Veit et al., CVPR 2017.
Problem Definition
To learn the mapping between noisy and clean annotations in order to leverage the information contained in the clean set.
- Field: Object Classification
- Assumption: large data with noisy annotations and small cleanly-annotated subset
- Dataset: Open Images (~9 million images, of which ~40k are cleanly annotated)
- Aspects:
- label space is typically highly structured
- many classes can have multiple semantic modes
- Want: To find dependence of annotation noise on the input image
Method
There are two supervised classifiers in this approach. One is the label cleaning network, which uses a residual architecture to learn the difference between the noisy and clean labels. The other is the image classifier, which takes the labels predicted by the label cleaning network as the ground truth when the given image doesn't have a clean label.
The label cleaning network has some cool ideas, listed as follows:
1. Two Separate Inputs
- Inputs:
- the noisy label
- the visual features (extracted by Inception V3)
- Goal:
- To model the label structure and the noise conditioned on the image
- Process:
- The sparse noisy label vector y is treated as a bag of words
- Both y and the visual features are projected into a low-dimensional embedding that encodes the set of labels
- The embedding vectors are concatenated, transformed with a hidden layer, then projected back into the high-dimensional label space
2. Identity-skip connection
This procedure was inspired by another cool paper by K. He et al., Deep Residual Learning for Image Recognition, the best paper of CVPR 2016. They reformulated the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions, and showed that these residual networks are easier to optimize and can gain accuracy from considerably increased depth.
This approach adds the noisy labels from the training set to the output of the cleaning module. As a result, the network only needs to learn the difference between the noisy and clean labels instead of regressing the entire label vector.
When no human rated data is available, the label cleaning network defaults to not changing the noisy labels. As more verified groundtruth becomes available, the network gracefully adapts and cleans the labels.
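The forward pass of such a label cleaning module can be sketched in NumPy as follows. This is a minimal illustration, not the paper's network: the layer sizes, the ReLU hidden layer, the clipping of the output to [0, 1], the zero initialization of the last projection, and all names are assumptions.

```python
import numpy as np

def init_params(n_labels, feat_dim, embed_dim, hidden_dim, rng):
    """Random weights; the output projection starts at zero (assumption)."""
    return {
        "W_y": rng.standard_normal((n_labels, embed_dim)) * 0.1,    # label embedding
        "W_v": rng.standard_normal((feat_dim, embed_dim)) * 0.1,    # visual embedding
        "W_h": rng.standard_normal((2 * embed_dim, hidden_dim)) * 0.1,
        "W_out": np.zeros((hidden_dim, n_labels)),                   # back to label space
    }

def clean_labels(y_noisy, visual_feat, params):
    """Predict cleaned labels as noisy labels plus a learned residual."""
    e_y = y_noisy @ params["W_y"]                  # embed the bag-of-words label vector
    e_v = visual_feat @ params["W_v"]              # embed the visual features
    h = np.maximum(np.concatenate([e_y, e_v]) @ params["W_h"], 0.0)  # hidden layer (ReLU)
    residual = h @ params["W_out"]                 # project back to the label space
    # Identity-skip connection: only the correction to y_noisy is learned.
    return np.clip(y_noisy + residual, 0.0, 1.0)

rng = np.random.default_rng(0)
params = init_params(n_labels=5, feat_dim=8, embed_dim=3, hidden_dim=4, rng=rng)
y_noisy = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
visual_feat = rng.standard_normal(8)

y_clean = clean_labels(y_noisy, visual_feat, params)   # equals y_noisy before training
```

With the residual branch contributing nothing, the module returns the noisy labels unchanged, which matches the default behaviour described above when no verified labels are available.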
Here is the result:
Weekly Paper Reading List
by Ying Li (R05922016), aMMAI course
Lecture 02
- Yunchao Gong and S. Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. CVPR 2011
Lecture 03 -- not enough labeled data
- Veit et al. Learning From Noisy Large-Scale Datasets With Minimal Supervision. CVPR 2017.
- Peng et al., Learning Deep Object Detectors from 3D Models. ICCV 2016.
Lecture 05 -- Latent Semantic Analysis
- Li Wan et al. A Hybrid Neural Network-Latent Topic Model. JMLR 2012
Lecture 06 -- sparse coding
- Mairal et al. Online dictionary learning for sparse coding. ICML 2009
Lecture 07
- Laurens van der Maaten, Geoffrey Hinton. Visualizing Data using t-SNE; The Journal of Machine Learning Research, 2008.
Lecture 10 - Learning from 3D Sensors
- Zhou et al. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. arXiv 2017
Iterative Quantization: A Procrustean Approach to Learning Binary Codes
Yunchao Gong and S. Lazebnik. CVPR 2011
Problem Definition
Learning similarity-preserving binary codes for efficient retrieval in large-scale image collections
1. An effective scheme for learning binary codes should have the following properties:
- The codes should be short (so that they can be stored in memory efficiently).
- The codes should preserve the similarity of images: similar images should map to similar codes.
- The algorithms should be efficient both for learning the parameters of the binary code and for encoding a new test image.
2. Goal
- To learn good binary codes without any supervisory information in the form of class labels.
3. Problem formulation
- To directly minimize the quantization error of mapping the projected data to the vertices of a binary hypercube.
Contribution and Discussion
1. Contribution
- To show that rotating the projected data can improve the performance of PCA-based binary coding schemes.
- To demonstrate an iterative quantization method for refining this rotation.
2. Limitation
- Data-Dependent:
- ITQ (Iterative Quantization) uses one bit per projected data dimension.
Method
Step 1. Dimensionality Reduction
To obtain the PCA projections, this step follows the maximum variance formulation from two papers: Spectral hashing (Y. Weiss et al., NIPS 2008) and Semi-supervised hashing for large-scale image retrieval (J. Wang et al., CVPR 2010).
Here, they want to produce an efficient code in which the variance of each bit is maximized and the bits are pairwise uncorrelated. The variance is maximized by encoding functions that produce exactly balanced bits. This can be obtained through the following continuous objective function, where X is the zero-centered data matrix with n rows and the columns of W are the projection directions:
$$\max_{W} \ \frac{1}{n} \operatorname{tr}\left( W^T X^T X W \right) \quad \text{subject to} \quad W^T W = I$$
Step 2. Binary Quantization
This step preserves the locality structure of the projected data by rotating it so as to minimize the discretization error. The idea is illustrated in Figure 1.
First, the quantization loss is defined in Equation 2, where V = XW is the projected data, R is an orthogonal rotation matrix and B is the binary code matrix with entries in {-1, 1}:
$$Q(B, R) = \left\| B - VR \right\|_F^2$$
This step begins with a random initialization of R, then adopts a k-means-like iterative quantization (ITQ) procedure to find a local minimum of the quantization loss.
In each iteration:
- Fix R and update B: Each data point is assigned to the nearest vertex of the binary hypercube
- Fix B and update R: R is updated to minimize the quantization loss
- Alternate between these two updates to find a locally optimal solution.
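The two alternating updates can be sketched in NumPy as below. This is a minimal illustration rather than the authors' code: the function names, the iteration count and the random test data are assumptions, and the rotation update uses the standard orthogonal Procrustes solution via SVD.

```python
import numpy as np

def pca_project(X, n_bits):
    """Center X (n, d) and project it onto its top n_bits principal directions."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt: principal directions
    return Xc @ Vt[:n_bits].T

def itq(V, n_iter=50, seed=0):
    """Alternate between updating the codes B and the rotation R.

    Returns the binary codes B in {-1, +1}, the rotation R, and the
    quantization loss ||B - V R||_F^2 recorded after each iteration.
    """
    rng = np.random.default_rng(seed)
    c = V.shape[1]
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))     # random orthogonal start
    losses = []
    for _ in range(n_iter):
        # Fix R, update B: nearest hypercube vertex is the sign of the rotated data.
        B = np.where(V @ R >= 0, 1.0, -1.0)
        # Fix B, update R: orthogonal Procrustes problem solved by SVD.
        U, _, Wt = np.linalg.svd(V.T @ B)
        R = U @ Wt
        losses.append(np.linalg.norm(B - V @ R) ** 2)
    B = np.where(V @ R >= 0, 1.0, -1.0)                   # final codes for the final R
    return B, R, losses

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
V = pca_project(X, n_bits=8)
B, R, losses = itq(V)
```

Because each update can only lower the loss or leave it unchanged, the recorded losses never increase, which is the k-means-like behaviour described above.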
Here is one example of using ITQ to encode images and then retrieve images using 32 bits:
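--- Reflections after attending Modern Web Conference 2015
This sentence is now the new Group Photo of the FB group Front-End Developers Taiwan.
Perhaps, for senior engineers, it is a reminder to themselves that they still have a lot to learn?
For me, it says: do not be afraid to ask, and be even less afraid to share.
This community is more open and more welcoming than I had imagined.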
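When I arrived at the venue on the first day, I found a seat in the second row of the conference room. Looking back over the seats behind me, I saw row after row of Apple laptops and small notebooks of every kind. It was not yet nine o'clock and the opening was still more than half an hour away, but only a few scattered seats were left. Screens switched back and forth between Sublime and Facebook; these engineers, who had come on a pilgrimage to see the father of JavaScript, had their heads buried in their laptops, using every spare minute to code.
The lights were bright, yet I suddenly felt both anxious and inexplicably excited: I was sitting among a crowd of talented people, and that realization both thrilled and unsettled me.
Anxious, because as the conversations around me reached my ears, the familiar yet foreign terms, jargon and viewpoints made me feel ever more clearly the gap between me and these people.
Excited, because I knew my efforts would bring me into this circle, one step at a time.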
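Talk after talk, shock after shock: I watched the hottest technologies turn into keywords that came flying at me one after another. I could barely keep up. My fountain pen raced across the page as I tried to hold on to every keyword I could google later and every moment that moved me. I was fascinated by Webduino's remarkable idea of using webComponent to modularize IOT coding, absorbed in the small details that make the mobile web nicer to use, and unable to take my eyes off all the data presented and visualized on the web... I only wished I could write with both hands and grow an extra one to capture it all.
My 20 AF sheets of notes, together with the notes everyone shared on hackpad and the slides, audio and video recordings, have given me a full stock of keywords that I can study and digest slowly for a long time to come.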
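What affected me more deeply, though, was the talk "The Way of Open Source" (開源之道) by Audrey Tang (唐鳳):
"The best" is the enemy of "the better"
Rough: rough consensus. Demanding precision turns it into forced consensus.
Keep publishing your progress. This is not bragging, because once you offer an answer, an expert will come along to correct you. Everything has a crack, and the crack is where the light gets in.
....
And the one that affected me most:
RTFM
RTFM --- go read the damn manual
This part of Audrey Tang's talk taught me two things:
1. Do not be afraid to ask questions.
2. Do not turn away people who ask me questions; answer them seriously.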
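After entering university, I gradually realized that I was not used to asking questions.
I would ask the teacher after class, and occasionally during class. But I usually did not keep asking until I understood. Once I had caught the keywords, I would use them after class to search the textbook and the internet for an answer I could accept.
Then one day I suddenly realized that, although this habit had strengthened my ability to learn on my own, any blind spot I had would simply stay there.
So I started to think back: what had made me unused to asking questions, or even afraid to ask?
Then I remembered my junior high school physics and chemistry class.
The teacher once told us very sternly that she would get angry if we asked about things she had already covered. When I asked questions during or after class, she would also put on a stern face and tell me that she would get to that part shortly and that I should not rush. I know the teacher did not want the pace of the class to suffer from repeating material again and again for students who had not been paying attention. But very often, as students, we could hardly tell which questions were acceptable to ask. "What if the teacher has already covered it?", "What if the teacher is about to cover it?", "What if everyone else knows and I am the only one who doesn't?", "What if this is a stupid question?"...
Gradually, it was not only that I stopped asking questions readily.
I also became unable to understand why some people keep asking questions that google and the textbook could answer.
But there is nothing wrong with asking. I should not be afraid to ask, and still less should I turn away people who ask me. This is a process of learning from and helping one another; only by walking side by side can we go further.
This is a community where you can ask questions, and where someone will spend the time and energy to answer carefully once you do. While being moved by this atmosphere, I can also learn to become one of these people.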
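On the road of growing up, with you and me together, it is not lonely
When I want to slack off or give up, I will think of this conference again.
Talking freely with friends across our respective fields of expertise; exchanging opinions in fluent English when meeting foreign speakers; or standing on the stage myself, sharing what I have learned with fellow enthusiasts...
I clearly recognize that this is the life I aspire to.
So I must not stop lightly.
Late at night, when I am tired, frustrated or weary, I know there is a group of people in the community walking alongside me in a similar direction.
So, my friends, let us walk side by side. When I grow impatient with your questions, please remind me; when I am crying for help inside but dare not speak up, please reach out your hand to me.
On the road of growing up, please let us stay with one another and not become isolated islands.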
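Acknowledgements
Thanks to my teacher, 薛幼苓, for writing my recommendation letter so that I could attend this conference.
Thanks to everyone who organized Modern Web 2015: thank you for making this conference a success, and thank you even more for offering free places to students, which gave me the chance to attend.
Thanks to all the speakers at this conference; your selfless sharing truly taught me a great deal.
Thank you, all of you!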