Neural Architecture Search With Reinforcement Learning,
ICLR 2017, by Barret Zoph, Quoc V. Le (Google Brain)
This paper presents an approach for finding better network architectures using reinforcement learning.
They use a recurrent network (RNN) as the controller to generate the model descriptions of neural networks, then train this RNN to maximize the expected accuracy of the generated architectures with reinforcement learning.
This could spare humans much of the trial and error of changing every possible parameter of a network one by one.
However, as the first paper to solve this problem with reinforcement learning, it requires substantial computational resources to implement.
The main contribution of this paper is the idea of building the controller as an RNN:
Our work is based on the observation that the structure and connectivity of a neural network can be typically specified by a variable-length string. It is therefore possible to use a recurrent network – the controller – to generate such string.
This restricts the search space to networks that can be described by a variable-length string.
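To make the controller idea concrete, here is a minimal sketch of the REINFORCE loop under strong simplifying assumptions. The RNN controller is replaced by an independent softmax policy per decision, training a child network is replaced by a hypothetical `toy_reward`, and the option lists in `CHOICES` are illustrative only. It is not the paper's implementation.

```python
import math
import random

# Illustrative search space: one token per decision in the architecture string.
CHOICES = {
    "filter_height": [1, 3, 5, 7],
    "filter_width": [1, 3, 5, 7],
    "num_filters": [24, 36, 48, 64],
}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_architecture(logits, rng):
    """Sample one option index per decision from the current policy."""
    arch = []
    for row in logits:
        probs = softmax(row)
        arch.append(rng.choices(range(len(row)), weights=probs)[0])
    return arch


def toy_reward(arch, target):
    """Stand-in for validation accuracy: fraction of decisions matching a hidden target."""
    return sum(a == t for a, t in zip(arch, target)) / len(target)


def train_controller(target, steps=3000, lr=0.2, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * len(opts) for opts in CHOICES.values()]
    baseline = 0.0
    for _ in range(steps):
        arch = sample_architecture(logits, rng)
        reward = toy_reward(arch, target)
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
        for row, chosen in zip(logits, arch):
            probs = softmax(row)
            for k in range(len(row)):
                # Gradient of log pi(chosen) with respect to logit k.
                grad = (1.0 if k == chosen else 0.0) - probs[k]
                row[k] += lr * advantage * grad
    return logits
```

After training, `softmax(row)` for each decision should put most of its mass on the option that the toy reward favours, which mirrors how the controller is pushed toward architectures with higher expected accuracy.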
This paper proposes a supervision signal, additive angular margin (ArcFace), which has a better geometrical interpretation than the supervision signals proposed before it.
Moreover, this paper reviews many recent face recognition models and the loss functions they use, giving us a clear overview of this area.
- Three primary attributes make embeddings differ:
- Training data
- Network architecture
- Loss functions
- Loss functions -- From Softmax to ArcFace
Softmax (Only want to do classification well):
Cannot explicitly optimise the features to have higher similarity scores for positive pairs and lower similarity scores for negative pairs, which leads to a performance gap.
L2 weight normalisation improves performance only slightly.
Multiplicative Angular Margin:
The additional dynamic hyper-parameter λ will make the training of SphereFace relatively tricky.
L2 normalisation on features and weights is an important step for hypersphere metric learning. The intuitive insight behind feature and weight normalisation is to remove the radial variation and push every feature to distribute on a hypersphere manifold.
Additive Cosine Margin:
(1) Extremely easy to implement without tricky hyper-parameters
(2) Clearer and able to converge without the Softmax supervision
(3) Obvious performance improvement
Additive Angular Margin:
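The four losses above differ mainly in how they transform the target-class logit as a function of the angle θ between the normalised feature and the normalised class weight. The sketch below compares those target logits. The default margins (m = 4, 0.35, 0.5) are typical values used here as assumptions, the scale `s` is set to 1 for readability, and SphereFace's λ annealing and the handling of θ + m > π in ArcFace are left out.

```python
import math


def softmax_target(theta, s=1.0):
    """Plain normalised Softmax: s * cos(theta)."""
    return s * math.cos(theta)


def sphereface_target(theta, m=4, s=1.0):
    """Multiplicative angular margin: psi(theta) = (-1)^k cos(m*theta) - 2k."""
    k = min(int(theta * m / math.pi), m - 1)
    return s * ((-1) ** k * math.cos(m * theta) - 2 * k)


def cosface_target(theta, m=0.35, s=1.0):
    """Additive cosine margin: s * (cos(theta) - m)."""
    return s * (math.cos(theta) - m)


def arcface_target(theta, m=0.5, s=1.0):
    """Additive angular margin: s * cos(theta + m)."""
    return s * math.cos(theta + m)
```

For the same θ, every margin variant yields a smaller target logit than plain Softmax, so the network has to shrink θ further to reach the same score. ArcFace adds the margin directly to the angle, which is why it corresponds to a constant angular gap on the hypersphere.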
Da-Rong Liu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee. Completely Unsupervised Phoneme Recognition by Adversarially Learning Mapping Relationships from Audio Embeddings
This work presents an approach for mapping audio signals to phoneme sequences without phoneme-labeled audio data.
This framework consists of three parts:
1. Audio to Vector
Divides the audio utterances into segments and obtains an audio embedding for each segment
2. K-means clustering
Clusters all audio embeddings and assigns a cluster index to each
3. Mapping relationship GAN
The generator produces predicted phoneme vector sequences from cluster index sequences, and the discriminator is trained to distinguish the predicted phoneme vector sequences from the real ones collected from text sentences and the lexicon
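As an illustration of step 2, the following sketch runs a plain k-means over toy 2-D "audio embeddings" and returns the cluster index sequence that step 3 would consume. The embeddings, the value of `k`, and the naive initialisation from the first `k` points are assumptions made for the example only.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Return (centroids, assignments); centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every embedding.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# One utterance, already segmented and embedded (toy values).
segment_embeddings = [(0.0, 0.0), (10.0, 10.0), (0.2, 0.1), (9.8, 10.1), (0.1, -0.1)]
_, cluster_index_sequence = kmeans(segment_embeddings, k=2)
```

The resulting `cluster_index_sequence` keeps the segment order of the utterance, which is what lets the generator map it to a phoneme sequence of the same length.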
Zhou et al. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. arXiv 2017
- Raw Point Cloud:
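Collected with a 64-beam LiDAR sensor (about 10 rotations per second, each rotation yielding roughly 100,000 reflected points). Each point carries 4 values: its position (x, y, z) and its reflection intensity.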
Input: Raw Point Cloud
Output: 3D Bounding Box Prediction
KITTI Dataset (7000 samples with bounding box annotations)
- Step 1: Feature Learning Network
- Input: Raw Point Cloud
- Output: 4D Sparse Tensor (Feature Map)
- Step 2: Convolution Middle Layers
- Input: 4D Sparse Tensor (Feature Map)
- Output: High-Level Feature (Reducing the size)
- Step 3: Region Proposal Networks
- Input: High-Level Feature (4D)
- Output: Class and Bounding Box
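The first part of Step 1 groups raw points into voxels before any features are learned. Below is a minimal sketch of that grouping, plus the per-point augmentation with offsets to the voxel centroid. The voxel size and the cap of 35 points per voxel are assumed values, and points over the cap are simply dropped here instead of being randomly sampled.

```python
from collections import defaultdict


def voxelize(points, voxel_size=(0.2, 0.2, 0.4), max_points=35):
    """Group (x, y, z, intensity) points by the voxel that contains them."""
    vx, vy, vz = voxel_size
    voxels = defaultdict(list)
    for x, y, z, intensity in points:
        key = (int(x // vx), int(y // vy), int(z // vz))
        if len(voxels[key]) < max_points:  # assumption: truncate, not sample
            voxels[key].append((x, y, z, intensity))
    return dict(voxels)


def augment_voxel(voxel_points):
    """Append each point's offset from the voxel centroid: 4 -> 7 features."""
    n = len(voxel_points)
    cx = sum(p[0] for p in voxel_points) / n
    cy = sum(p[1] for p in voxel_points) / n
    cz = sum(p[2] for p in voxel_points) / n
    return [(x, y, z, r, x - cx, y - cy, z - cz) for x, y, z, r in voxel_points]
```

Only non-empty voxels are stored, which is why the feature map produced by this step is sparse.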
How to visualize high-dimensional data
By giving each datapoint a location in a two- or three-dimensional map
a. easier to optimize than other existing techniques
b. could reduce the tendency to crowd points together in the center of the map
c. could create a single map that reveals structure at many different scales
d. could use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed
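Assuming these notes describe t-SNE (the visualisation paper in the reading list), point b follows from the heavy-tailed Student-t kernel it uses in the low-dimensional map. The sketch below computes those map similarities q_ij for a handful of 2-D map points. The function name is made up for the example.

```python
def low_dim_affinities(Y):
    """q_ij = (1 + |y_i - y_j|^2)^-1, normalised over all pairs with i != j."""
    n = len(Y)
    weights = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                weights[i][j] = 1.0 / (1.0 + d2)
                total += weights[i][j]
    return [[w / total for w in row] for row in weights]
```

Because the kernel decays polynomially instead of exponentially, moderately distant points still receive a non-negligible q_ij, so they can be placed far apart in the map without a large penalty.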
Li Wan et al. “A Hybrid Neural Network-Latent Topic Model”, JMLR 2012
This paper introduces a hybrid model that combines a neural network with a latent topic model.
This paper combines scale-invariant feature transform (SIFT), a neural network, and Latent Dirichlet allocation (LDA) to perform scene classification, where the hybrid model is shown to outperform models based solely on neural networks or topic models. The main contribution is the way it combines the neural network and LDA.
The neural network here acts as a trainable feature extractor that provides a low-dimensional embedding of the input data, while the topic model captures the group structure of the data.
Contribution and Discussion
Proposes an iterative online algorithm that solves the dictionary learning problem by efficiently minimizing, at each step, a quadratic surrogate function of the empirical cost over the set of constraints.
This algorithm is faster than previous approaches to dictionary learning on both small and large datasets of natural images.
Background Knowledge -- Sparse Coding
1. Change of basis
In linear algebra, we may change the coordinates from one basis to another in order to make the problem we are solving easier.
Using different coordinates is like looking at the problem from a different point of view. This may help us find hidden information and structure in the data.
2. Over-complete set of basis vectors
After removing one basis vector from the set, the set is still complete (every element of the input set X can be approximated arbitrarily well in norm by finite linear combinations of the basis vectors in the set).
3. Sparse coding (sparse dictionary learning)
Sparse dictionary learning is a representation learning method which aims at finding a sparse representation of the input data in the form of a linear combination of basic elements as well as those basic elements themselves. These elements are called atoms and they compose a dictionary. Atoms in the dictionary are not required to be orthogonal, and they may be an over-complete spanning set.
This problem setup also allows the dimensionality of the signals being represented to be higher than the one of the signals being observed. The above two properties lead to having seemingly redundant atoms that allow multiple representations of the same signal but also provide an improvement in sparsity and flexibility of the representation.
(Copied from Wikipedia)
a. [Adv] Could capture structures and patterns inherent in the input data
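b. [Con] With an over-complete basis, the coefficients $a_i$ are no longer uniquely determined by the input vector
c. The sparse coding cost function on a set of m input vectors was defined as:
$$\min_{a_i^{(j)},\,\phi_i} \; \sum_{j=1}^{m} \left( \left\| x^{(j)} - \sum_{i=1}^{k} a_i^{(j)} \phi_i \right\|^2 + \lambda \sum_{i=1}^{k} S\left(a_i^{(j)}\right) \right)$$
where $S(\cdot)$ is a sparsity cost function which penalizes $a_i$ for being far from zero.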
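As a worked example of this cost, the sketch below evaluates the squared reconstruction error plus a sparsity penalty for a given set of coefficients and basis vectors. Using the L1 norm as the sparsity cost and `lam = 0.1` as its weight are assumptions made for illustration.

```python
def sparse_coding_cost(X, A, Phi, lam=0.1):
    """X: m inputs of length n; A: m coefficient vectors of length k;
    Phi: k basis vectors of length n. Sparsity cost assumed to be |a|."""
    total = 0.0
    for x, a in zip(X, A):
        recon = [sum(a[i] * Phi[i][d] for i in range(len(Phi)))
                 for d in range(len(x))]
        total += sum((xd - rd) ** 2 for xd, rd in zip(x, recon))
        total += lam * sum(abs(ai) for ai in a)
    return total


# Over-complete basis in 2-D: three basis vectors for a 2-D input.
Phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X = [[1.0, 0.0]]
sparse_code = [[1.0, 0.0, 0.0]]
dense_code = [[0.5, -0.5, 0.5]]
```

Both `sparse_code` and `dense_code` reconstruct the input exactly, which illustrates that coefficients are not unique with an over-complete basis. The sparsity term is what makes the cost prefer `sparse_code`.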
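Classical dictionary learning problem:
Given a finite training set of signals $x_1, \dots, x_n$ in $\mathbb{R}^m$, optimize the empirical cost function
$$f_n(D) = \frac{1}{n} \sum_{i=1}^{n} \ell(x_i, D)$$
where the loss $\ell(x, D)$ could be defined as the optimal value of the $\ell_1$ sparse coding problem:
$$\ell(x, D) = \min_{\alpha \in \mathbb{R}^k} \; \frac{1}{2} \left\| x - D\alpha \right\|_2^2 + \lambda \left\| \alpha \right\|_1$$
Under the constraint $D \in C$ (C is the convex set of matrices defined below), to prevent D from being arbitrarily large (which would lead to arbitrarily small values of α):
$$C = \left\{ D \in \mathbb{R}^{m \times k} \;:\; d_j^T d_j \le 1 \;\; \forall j = 1, \dots, k \right\}$$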
Algorithm Outline of Online Dictionary Learning
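A rough sketch of the online loop is given below, assuming NumPy is available. Each step draws one sample, sparse-codes it against the current dictionary, accumulates the statistics `A` and `B` that define the quadratic surrogate, and then updates the dictionary column by column with a projection onto the unit ball. Two assumptions are swapped in for brevity: ISTA replaces the LARS solver for the sparse coding step, and the dictionary is initialised with random Gaussian columns.

```python
import numpy as np


def sparse_code(x, D, lam, n_iter=100):
    """ISTA for min_a 0.5*||x - D a||^2 + lam*||a||_1 (stand-in for LARS)."""
    L = np.linalg.norm(D, 2) ** 2 + 1e-12  # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - x) / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return a


def online_dictionary_learning(X, k, lam=0.1, n_steps=200, seed=0):
    """X has one signal per row; returns a dictionary D of shape (m, k)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    D = rng.standard_normal((m, k))
    D /= np.linalg.norm(D, axis=0)  # start with unit-norm columns
    A = np.zeros((k, k))            # running sum of alpha alpha^T
    B = np.zeros((m, k))            # running sum of x alpha^T
    for _ in range(n_steps):
        x = X[rng.integers(len(X))]
        alpha = sparse_code(x, D, lam)
        A += np.outer(alpha, alpha)
        B += np.outer(x, alpha)
        # Block coordinate descent on the surrogate, one column at a time.
        for j in range(k):
            if A[j, j] < 1e-10:
                continue  # atom not used so far; leave it unchanged
            u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
            D[:, j] = u / max(np.linalg.norm(u), 1.0)  # keep ||d_j|| <= 1
    return D
```

The projection in the last line is what keeps every column inside the constraint set, so the dictionary cannot grow without bound while the coefficients shrink.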
Peng et al., ICCV 2016
The goal is to use CAD models to generate synthetic 2D images for training. This paper aims to deal with the missing cue information, such as texture, pose and background.
- Field: Few-Shot Object Detection
Cue invariance: The ability of the network to extract the equivalent high-level category information despite missing low-level cues.
Contribution and Discussion
This paper proposed a new method for learning object detectors for new categories that avoids the need for costly large-scale image annotation, based on the following findings:
- Demonstrated that synthetic CAD training of modern deep CNN object detectors can be successful when real-image training data for novel objects or domains is limited.
- Showed that training on synthetic images with simulated cues leads to performance similar to training on synthetic images without cues.
- For new categories, adding synthetic variance and fine-tuning the layers proved useful.
1. Synthetic Generation of Low-Level Cues
Given a set of 3D CAD models for each object, it generates a synthetic 2D image training dataset by simulating various low-level cues.
The chosen subset of cue information is: object texture, color, 3D pose and 3D shape, as well as background scene texture and color. When learning a detection model for a new category with limited labeled real data, the choice of whether or not to simulate these cues in the synthetic data depends on the invariance of the representation. For example, if the representation is invariant to color, grayscale images can be rendered.
To generate a virtual image:
- randomly select a background image from the available background pool
- project the selected background image onto the image plane
- select a random texture image from the texture pool
- map the selected texture image onto the CAD model before rendering the object.
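The selection steps above can be sketched as a small function that builds a description of one synthetic image. Everything here is hypothetical: the function names, the file names, and the `"uniform_white"` / `"uniform_gray"` placeholders used when a cue is switched off. The actual projection and rendering are left to whatever renderer is used.

```python
import random


def make_render_spec(cad_model, background_pool, texture_pool, rng,
                     real_background=True, real_texture=True):
    """Describe one synthetic image; a renderer would consume this dict."""
    background = rng.choice(background_pool) if real_background else "uniform_white"
    texture = rng.choice(texture_pool) if real_texture else "uniform_gray"
    return {"cad_model": cad_model, "background": background, "texture": texture}


def make_dataset(cad_models, background_pool, texture_pool,
                 per_model=3, seed=0, **cues):
    """Generate per_model render specs for every CAD model."""
    rng = random.Random(seed)
    return [
        make_render_spec(model, background_pool, texture_pool, rng, **cues)
        for model in cad_models
        for _ in range(per_model)
    ]
```

Calling `make_dataset(..., real_texture=False)` produces the "without cue" variant of a training set, which is the kind of paired dataset the cue invariance analysis compares.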
2. Deep Convolutional Neural Network Features
Positive and negative patches are extracted for each object from the synthetic images (and an optional small number of real images).
Each patch is fed into a deep neural network that computes feature activations, which are used to train the final classifier, as in the deep detection method of RCNN.
3. Analysing Cue Invariance of DCNN Features
Here they create two synthetic training sets, one with and one without a particular cue. They extract deep features from both sets, train two object detectors, and compare their performance on real test data.
- Hypothesis: If the representation is invariant to the cue, then similar high-level neurons will activate whether or not that cue is present in the input image, leading to similar category-level information at training and thus similar performance.
Here are their results:
Veit et al., CVPR 2017.
The goal is to learn the mapping between noisy and clean annotations in order to leverage the information contained in the clean set.
- Field: Object Classification
- Assumption: large data with noisy annotations and small cleanly-annotated subset
- Dataset: Open Images (~9 million images, ~40k of them cleanly annotated)
- label space is typically highly structured
- many classes can have multiple semantic modes
- Want: To find dependence of annotation noise on the input image
There are two supervised classifiers in this approach. One is the label cleaning network, which uses a residual architecture to learn the difference between the noisy and clean labels. The other is the image classifier, which takes the labels predicted by the label cleaning network as the ground truth when the given image does not have clean labels.
The label cleaning network has some cool ideas, listed as follows:
1. Two Separate Inputs
- Two separate inputs:
- the noisy label
- the visual features (extracted by Inception V3)
- To model the label structure and the noise conditional on the image
- The sparse noisy label vector y is treated as a bag of words
- Both y and the visual features are projected into a low-dimensional embedding that encodes the set of labels
- The embedding vectors are concatenated, transformed with a hidden layer, then projected back into the high-dimensional label space
2. Identity-skip connection
This procedure was inspired by another cool paper by K. He et al., Deep Residual Learning for Image Recognition, the best paper of CVPR 2016. They reformulated the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions, and showed that these residual networks are easier to optimize and can gain accuracy from considerably increased depth.
This approach adds the noisy labels from the training set to the output of the cleaning module. As a result, the network only needs to learn the difference between the noisy and clean labels instead of regressing the entire label vector.
When no human rated data is available, the label cleaning network defaults to not changing the noisy labels. As more verified groundtruth becomes available, the network gracefully adapts and cleans the labels.
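The two ideas above (two inputs and the identity-skip connection) can be sketched as a single forward pass. This is a toy illustration: the weight matrices are hypothetical plain Python lists, all layers are linear, and clipping the output to [0, 1] is an assumption about how the cleaned labels are kept in range.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def clean_labels(noisy, visual, W_label, W_visual, W_hidden, W_out):
    """Return cleaned labels = clip(noisy + residual(noisy, visual), 0, 1)."""
    label_emb = matvec(W_label, noisy)      # bag-of-words labels -> embedding
    visual_emb = matvec(W_visual, visual)   # visual features -> embedding
    hidden = matvec(W_hidden, label_emb + visual_emb)  # concatenate, then mix
    residual = matvec(W_out, hidden)        # back to the label space
    # Identity skip: the network only has to predict the correction.
    return [min(1.0, max(0.0, y + r)) for y, r in zip(noisy, residual)]
```

With an all-zero `W_out`, the residual vanishes and the function returns the noisy labels unchanged, which matches the default behaviour described above when no clean labels are available.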
Here is the result:
Weekly Paper Reading List
by Ying Li (R05922016), aMMAI course
- Gong et al. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. CVPR 2011
Lecture 03 -- not enough labeled data
- Veit et al. Learning From Noisy Large-Scale Datasets With Minimal Supervision. CVPR 2017.
- Peng et al., Learning Deep Object Detectors from 3D Models. ICCV 2016.
Lecture 05 -- Latent Semantic Analysis
- Li Wan et al. A Hybrid Neural Network-Latent Topic Model, JMLR 2012
Lecture 06 -- sparse coding
- Mairal et al. Online dictionary learning for sparse coding. ICML 2009
- van der Maaten et al. Visualizing Data using t-SNE. The Journal of Machine Learning Research, 2008.
Lecture 10 -- Learning from 3D Sensors
- Zhou et al. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. arXiv 2017
Lecture 11 -- Speech
- Liu et al. Completely Unsupervised Phoneme Recognition by Adversarially Learning Mapping Relationships from Audio Embeddings