9 months ago

Veit et al., CVPR 2017.

Problem Definition

To learn the mapping between noisy and clean annotations in order to leverage the information contained in the clean set.

  • Field: Object Classification
  • Assumption: large data with noisy annotations and small cleanly-annotated subset
  • Dataset: Open Images (~9 million images, ~40k of them cleanly annotated)
  • Aspects:
    1. label space is typically highly structured
    2. many classes can have multiple semantic modes
  • Want: to capture how the annotation noise depends on the input image


Network Architecture

There are two supervised classifiers in this approach. One is the label cleaning network, which uses a residual architecture to learn the difference between the noisy and clean labels. The other is the image classifier, which takes the labels predicted by the label cleaning network as ground truth whenever the given image doesn't have a clean label.
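The choice of supervision target for the image classifier can be sketched as below. This is only an illustration of the rule described above; the function name `classifier_target` and the use of `None` to mark a missing clean label are assumptions, not something taken from the paper.

```python
from typing import List, Optional, Sequence


def classifier_target(
    clean_label: Optional[Sequence[float]],
    cleaned_prediction: Sequence[float],
) -> List[float]:
    """Pick the target the image classifier is trained against.

    clean_label: human-verified label vector, or None if the image has none
                 (the None convention is an assumption of this sketch).
    cleaned_prediction: output of the label cleaning network for the image.
    """
    if clean_label is not None:
        # A verified label exists, so use it directly.
        return list(clean_label)
    # Otherwise fall back to the label cleaning network's prediction.
    return list(cleaned_prediction)


# Image with a verified label: the verified label wins.
with_clean = classifier_target([1.0, 0.0], [0.2, 0.9])
# Image without one: the cleaned prediction is used as ground truth.
without_clean = classifier_target(None, [0.2, 0.9])
```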

The label cleaning network has some cool ideas, listed as follows:
1. Two Separate Inputs

  • Two separate inputs:
    • the noisy label
    • the visual features (extracted by Inception V3)
  • Goal:
    • To model the label structure and the noise conditioned on the image
  • Process:
    • The sparse noisy label vector y is treated as a bag of words
    • Both y and the visual features are projected into a low-dimensional embedding that encodes the set of labels
    • The embedding vectors are concatenated, transformed with a hidden layer, then projected back into the high-dimensional label space
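The process above can be sketched as a small forward pass in plain Python. This is a toy illustration, not the authors' implementation: the class name `LabelCleaningSketch`, the layer sizes, the random initialization, the absence of bias terms, and the ReLU in the hidden layer are all assumptions, and the 3-dimensional feature vector stands in for real Inception V3 features.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def init_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; the scale is an arbitrary choice for this sketch."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


class LabelCleaningSketch:
    """Hypothetical two-input module following the steps listed above."""

    def __init__(self, num_labels: int, feature_dim: int,
                 embed_dim: int = 4, hidden_dim: int = 8, seed: int = 0):
        rng = random.Random(seed)
        self.label_proj = init_matrix(embed_dim, num_labels, rng)
        self.feature_proj = init_matrix(embed_dim, feature_dim, rng)
        self.hidden = init_matrix(hidden_dim, 2 * embed_dim, rng)
        self.output = init_matrix(num_labels, hidden_dim, rng)

    def forward(self, noisy_labels: Vector, visual_features: Vector) -> Vector:
        # Project the bag-of-words (multi-hot) label vector to a low-dim embedding.
        label_emb = linear(self.label_proj, noisy_labels)
        # Project the visual features to an embedding of the same size.
        feature_emb = linear(self.feature_proj, visual_features)
        # Concatenate the two embeddings (list concatenation).
        joint = label_emb + feature_emb
        # Hidden layer (ReLU is an assumption), then back to label space.
        hidden = relu(linear(self.hidden, joint))
        return linear(self.output, hidden)


net = LabelCleaningSketch(num_labels=5, feature_dim=3)
noisy = [1.0, 0.0, 0.0, 1.0, 0.0]   # multi-hot noisy labels
features = [0.2, -0.1, 0.7]         # stand-in for Inception V3 features
out = net.forward(noisy, features)  # one value per label
```

Keeping the two inputs in separate projections before concatenation lets the module see both which labels co-occur and what the image looks like, which matches the stated goal of modeling label structure and image-dependent noise.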

2. Identity-skip connection
This procedure was inspired by another cool paper by K. He et al., Deep Residual Learning for Image Recognition, the best paper of CVPR 2016. They reformulated the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions, and showed empirically that these residual networks are easier to optimize and can gain accuracy from considerably increased depth.

This approach adds the noisy labels from the training set to the output of the cleaning module. As a result, the network only needs to learn the difference between the noisy and clean labels instead of regressing the entire label vector.

When no human-rated data is available, the label cleaning network defaults to not changing the noisy labels. As more verified ground truth becomes available, the network gracefully adapts and cleans the labels.
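The identity-skip connection and its default behavior can be illustrated with a few lines of Python. This is a minimal sketch: the function name `apply_identity_skip` is hypothetical, the residual values are made up for illustration, and clipping the result to [0, 1] is an assumption made here to keep outputs in a valid label range.

```python
from typing import List, Sequence


def apply_identity_skip(noisy_labels: Sequence[float],
                        residual: Sequence[float]) -> List[float]:
    """Add the noisy labels to the cleaning module's output.

    The module only has to predict a correction (residual); clipping to
    [0, 1] is an assumption of this sketch.
    """
    return [min(1.0, max(0.0, y + r)) for y, r in zip(noisy_labels, residual)]


noisy = [1.0, 0.0, 1.0, 0.0]

# With a zero residual (nothing learned yet) the noisy labels pass through unchanged.
unchanged = apply_identity_skip(noisy, [0.0, 0.0, 0.0, 0.0])

# A made-up learned residual: remove the first label, add confidence to the last.
corrected = apply_identity_skip(noisy, [-1.0, 0.0, 0.0, 0.75])
```

Here `unchanged` equals the noisy input, while `corrected` drops the first label to 0.0 and raises the last to 0.75, so the module only needs to express where the noisy labels are wrong.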

Here is the result:
