
Peng et al., ICCV 2015

Problem Definition

The goal is to use CAD models to generate synthetic 2D images for training. This paper aims to deal with the low-level cue information that such images lack, such as texture, pose and background.
Synthetic Images

  • Field: Few-Shot Object Detection

Cue invariance: The ability of the network to extract the equivalent high-level category information despite missing low-level cues.

Contribution and Discussion

This paper proposed a new method for learning object detectors for new categories that avoids the need for costly large-scale image annotation, based on the following findings:

  • Demonstrated that training modern deep CNN object detectors on synthetic CAD images can be successful when real-image training data for novel objects or domains is limited.
  • Showed that training on synthetic images with simulated cues leads to performance similar to training on synthetic images without those cues.
  • For new categories, adding synthetic variance and fine-tuning the layers proved useful.


Network Architecture

1. Synthetic Generation of Low-Level Cues
Given a set of 3D CAD models for each object, the method generates a synthetic 2D image training dataset by simulating various low-level cues.

The chosen subset of cue information is: object texture, color, 3D pose and 3D shape, as well as background scene texture and color. When learning a detection model for a new category with limited labeled real data, the choice of whether or not to simulate these cues in the synthetic data depends on the invariance of the representation. For example, if the representation is invariant to color, grayscale images can be rendered.
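As a small illustration of dropping the color cue, the sketch below converts an RGB image to grayscale. The function name `to_grayscale` and the image layout (rows of `(r, g, b)` tuples) are assumptions made for this example. The weights are the standard ITU-R BT.601 luma coefficients, which the paper may or may not have used in its renderer.

```python
def to_grayscale(image):
    """Convert an RGB image to a single-channel luma image.

    `image` is assumed to be a list of rows, each row a list of (r, g, b)
    tuples. The weights are the ITU-R BT.601 luma coefficients.
    """
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in image
    ]


# A 1x2 image: one white pixel, one pure red pixel.
gray = to_grayscale([[(255, 255, 255), (255, 0, 0)]])
```

A training set rendered this way keeps shape and pose but removes color, which is the kind of "with cue" versus "without cue" pair the later analysis compares.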

To generate a virtual image:

  1. randomly select a background image from the available background pool
  2. project the selected background image onto the image plane
  3. select a random texture image from the texture pool
  4. map the selected texture image onto the CAD model before rendering the object.
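The four steps above can be sketched roughly as follows. This is a simplified 2D stand-in, not the authors' rendering pipeline: a real implementation maps the texture onto the 3D CAD model and renders it, whereas here a hypothetical `render_mask` callback only returns the silhouette of the projected object and the texture is tiled over that silhouette. All function names and the grid-of-lists image format are assumptions for illustration.

```python
import random


def box_mask(height, width, rng):
    """Hypothetical stand-in for a CAD renderer: a random rectangular silhouette."""
    top = rng.randrange(0, height // 2)
    left = rng.randrange(0, width // 2)
    bottom, right = top + height // 2, left + width // 2
    return [
        [top <= y < bottom and left <= x < right for x in range(width)]
        for y in range(height)
    ]


def generate_virtual_image(backgrounds, textures, render_mask, rng):
    """Compose one synthetic image from a background pool and a texture pool."""
    background = rng.choice(backgrounds)        # step 1: pick a background
    image = [row[:] for row in background]      # step 2: background fills the image plane
    texture = rng.choice(textures)              # step 3: pick a texture
    height, width = len(image), len(image[0])
    mask = render_mask(height, width, rng)
    for y in range(height):                     # step 4 (simplified): texture the object
        for x in range(width):
            if mask[y][x]:
                image[y][x] = texture[y % len(texture)][x % len(texture[0])]
    return image, mask


backgrounds = [[[0] * 8 for _ in range(8)]]     # one 8x8 black background
textures = [[[1, 1], [1, 1]]]                   # one 2x2 uniform texture
image, mask = generate_virtual_image(backgrounds, textures, box_mask, random.Random(0))
```

Returning the mask alongside the image is a deliberate choice in this sketch: because the object is placed synthetically, its location is known exactly and can serve as a free ground-truth annotation.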

2. Deep Convolutional Neural Network Features
Positive and negative patches are extracted for each object from the synthetic images (and, optionally, a small number of real images).
Each patch is fed into a deep neural network that computes feature activations, which are used to train the final classifier, as in the R-CNN deep detection method.
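One common way to split candidate patches into positives and negatives is by their overlap with the known object box, measured as intersection over union (IoU). The sketch below assumes that approach. The function names and the thresholds (0.5 and 0.3) are illustrative choices for this example, not values taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_patches(candidates, ground_truth, pos_thresh=0.5, neg_thresh=0.3):
    """Split candidate boxes into positives and negatives by IoU.

    Boxes whose overlap falls between the two (assumed) thresholds are
    treated as ambiguous and left out of both lists.
    """
    positives, negatives = [], []
    for box in candidates:
        overlap = iou(box, ground_truth)
        if overlap >= pos_thresh:
            positives.append(box)
        elif overlap < neg_thresh:
            negatives.append(box)
    return positives, negatives


ground_truth = (0, 0, 10, 10)
candidates = [(1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 25)]
positives, negatives = label_patches(candidates, ground_truth)
```

The resulting patches would then be passed through the network, and the extracted features used to train the per-category classifier.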

3. Analysing Cue Invariance of DCNN Features
Here they create two synthetic training sets, one with and one without a particular cue. They extract deep features from both sets, train two object detectors, and compare their performance on real test data.

  • Hypothesis: If the representation is invariant to the cue, then similar high-level neurons will activate whether or not that cue is present in the input image, leading to similar category-level information at training and thus similar performance.
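The comparison protocol can be sketched as below. This is a toy stand-in under stated assumptions: a nearest-centroid classifier replaces the detector, plain accuracy replaces detection metrics, and the short feature vectors are made up for illustration. Only the structure follows the text: train once per training set, then evaluate both models on the same real test data.

```python
def train_centroids(features, labels):
    """Fit a nearest-centroid classifier: one mean feature vector per class."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((c - v) ** 2 for c, v in zip(centroids[label], vec)),
    )


def accuracy(centroids, features, labels):
    hits = sum(predict(centroids, vec) == label for vec, label in zip(features, labels))
    return hits / len(labels)


def cue_invariance_gap(set_with_cue, set_without_cue, real_test):
    """Train on each synthetic set, test both on the same real data, report the gap."""
    acc_with = accuracy(train_centroids(*set_with_cue), *real_test)
    acc_without = accuracy(train_centroids(*set_without_cue), *real_test)
    return acc_with, acc_without, abs(acc_with - acc_without)


labels = ["car", "car", "dog", "dog"]
with_cue = ([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], labels)
without_cue = ([[0.8, 0.0], [0.7, 0.1], [0.0, 0.8], [0.1, 0.7]], labels)
real_test = ([[1.0, 0.2], [0.2, 1.0]], ["car", "dog"])
acc_with, acc_without, gap = cue_invariance_gap(with_cue, without_cue, real_test)
```

A small gap is consistent with the hypothesis that the representation is invariant to the cue. A large gap suggests the cue matters and should be simulated in the synthetic data.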

Here are their results:
