Learning Deep Features for Discriminative Localization
CVPR 2016
@csail.mit.edu
Contents
Introduction
Problem
In some CNN models, the object localization ability is lost when fully-connected layers are used for classification
Goal
To both classify the image and localize class-specific image regions in a single forward-pass
Related Works
- Weakly-supervised object localization
- Bergamo et al, Cinbis et al, Pinheiro et al, Oquab et al
- They are not trained end-to-end
- Visualizing CNNs
- visualize the internal representation learned by CNNs in an attempt to better understand their properties
- Zeiler et al, Mahendran et al, Dosovitskiy et al
Class Activation Mapping (CAM)
using global average pooling (GAP)
- $f_k(x, y)$: the activation of unit $k$ in the last convolutional layer at spatial location $(x, y)$
- $F_k = \sum_{x, y}f_k(x, y)$: the result of performing global average pooling for unit $k$
- $w_k^c$: the importance of $F_k$ for class $c$
- $S_c = \sum_kw_k^cF_k$: the input of the softmax
- $P_c = \frac{\exp(S_c)}{\sum_{c'} \exp(S_{c'})}$: the output of the softmax for class $c$
- $M_c$: the class activation map for class $c$
$F_k = \sum_{x, y}f_k(x, y)$
$S_c = \sum_k w_k^c \sum_{x,y} f_k(x, y) = \sum_{x,y}\sum_k w_k^cf_k(x,y)$
$M_c(x,y) = \sum_kw_k^cf_k(x,y)$
$S_c = \sum_{x,y} M_c(x,y)$
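The derivation above says a CAM is just a weighted sum of the last conv layer's feature maps, and that the class score $S_c$ can equivalently be computed before or after the spatial sum. A minimal numpy sketch (with made-up toy values for $f_k$ and $w_k^c$) checks this identity:

```python
import numpy as np

def compute_cam(feature_maps, weights):
    """Class activation map M_c(x, y) = sum_k w_k^c * f_k(x, y).

    feature_maps: shape (K, H, W), activations f_k of the last conv layer
    weights: shape (K,), softmax weights w_k^c for a single class c
    """
    return np.tensordot(weights, feature_maps, axes=([0], [0]))  # -> (H, W)

# Toy example: K = 3 units on a 4x4 spatial grid (hypothetical values).
f = np.random.rand(3, 4, 4)
w = np.array([0.5, -0.2, 1.0])
cam = compute_cam(f, w)

# Identity from the derivation: GAP-then-weight equals summing the CAM,
# i.e. S_c = sum_k w_k^c F_k = sum_{x,y} M_c(x, y).
F = f.sum(axis=(1, 2))       # F_k = sum_{x,y} f_k(x, y)
S_c = float(w @ F)
assert np.isclose(S_c, cam.sum())
```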
Global Average Pooling vs. Global Max Pooling
- GAP
- loss encourages the network to identify the extent of the object
- GMP
- loss encourages the network to identify just one discriminative part
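The contrast can be seen on a toy activation map: under max pooling only the single arg-max location determines the score (and receives gradient), while under average pooling every location contributes. The values below are hypothetical:

```python
import numpy as np

# Toy 4x4 activation map: an extended blob plus one sharp peak.
f = np.zeros((4, 4))
f[0:3, 0:3] = 1.0   # extended object region (9 locations)
f[3, 3] = 5.0       # single highly discriminative point

gap = f.mean()      # every location contributes to the GAP score
gmp = f.max()       # only the peak drives the GMP score

# With GMP, the loss gradient reaches only the arg-max location, so the
# network can do well by lighting up one discriminative part; with GAP,
# all locations share the gradient, encouraging the full object extent.
print(gap, gmp)     # → 0.875 5.0
```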
Weakly-supervised Object Localization
Setup
- Originals
- AlexNet
- VGGnet
- GoogLeNet
- Remove the fully-connected layers before the final output and replace them with GAP followed by a fully-connected softmax layer
- AlexNet-GAP
- VGGnet-GAP
- GoogLeNet-GAP
- AlexNet*-GAP
- AlexNet is the most affected by the removal of the fully-connected layers, so two convolutional layers are added just before GAP
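The replacement head is small: conv features go through GAP, then a single fully-connected softmax layer. A numpy forward-pass sketch (shapes chosen to resemble VGGnet-GAP, but the sizes and weights here are hypothetical; the paper writes GAP as a spatial sum, which differs from the mean only by a constant absorbed into the weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def gap_head_forward(conv_features, W):
    """Forward pass of the GAP head that replaces the FC layers.

    conv_features: (K, H, W) activations of the last conv layer
    W: (C, K) weights of the fully-connected softmax layer
    """
    F = conv_features.mean(axis=(1, 2))   # global average pooling -> (K,)
    scores = W @ F                        # S_c = sum_k w_k^c F_k
    e = np.exp(scores - scores.max())     # numerically stable softmax
    return e / e.sum()                    # P_c over the C classes

feats = rng.random((512, 14, 14))          # e.g. 512 maps at 14x14
W = rng.standard_normal((1000, 512)) * 0.01
p = gap_head_forward(feats, W)
```

Because the same $W$ row is reused as $w_k^c$, the CAM for any class comes for free from the trained classifier, with no extra localization supervision.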
Results
Classification
In most cases there is a small classification performance drop of 1–2% when the fully-connected layers are removed from the various networks
Localization
Generate bounding boxes: first segment the regions whose values are above 20% of the max value of the CAM, then take the bounding box that covers the largest connected component in the segmentation map
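This thresholding-plus-largest-component rule can be sketched in plain numpy (the BFS connected-component labeling and the choice of 4-connectivity are implementation assumptions; the paper does not specify them):

```python
import numpy as np
from collections import deque

def cam_to_bbox(cam, threshold=0.2):
    """Box (x0, y0, x1, y1) around the largest connected component of
    the region where CAM >= threshold * max(CAM)."""
    mask = cam >= threshold * cam.max()
    labels = np.zeros(cam.shape, dtype=int)
    best, best_size, next_label = None, 0, 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue                      # already assigned to a component
        next_label += 1
        comp, q = [], deque([start])
        labels[start] = next_label
        while q:                          # BFS over 4-connected neighbors
            y, x = q.popleft()
            comp.append((y, x))
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if (0 <= ny < cam.shape[0] and 0 <= nx < cam.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = next_label
                    q.append((ny, nx))
        if len(comp) > best_size:         # keep the largest component
            best_size, best = len(comp), comp
    ys = [p[0] for p in best]
    xs = [p[1] for p in best]
    return min(xs), min(ys), max(xs), max(ys)

# Toy CAM: one large bright blob and a small distant one (hypothetical).
cam = np.zeros((8, 8))
cam[1:4, 1:5] = 1.0
cam[6, 6] = 0.9
print(cam_to_bbox(cam))   # → (1, 1, 4, 3): box around the larger blob
```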
Deep Features for Generic Localization
Fine-grained Recognition
Pattern Discovery
Discovering informative objects in the scenes
Concept localization in weakly labeled images
Weakly supervised text detector
Interpreting visual question answering
Visualizing Class-Specific Units
Reference
B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. In CVPR, 2016.