Learning Deep Features for Discriminative Localization

CVPR 2016
@csail.mit.edu

Introduction

Problem

In some CNN models, the ability to localize objects is lost when fully-connected layers are used for classification.

Goal

To both classify the image and localize class-specific image regions in a single forward-pass

  • Weakly-supervised object localization
    • Bergamo et al, Cinbis et al, Pinheiro et al, Oquab et al
    • They are not trained end-to-end
  • Visualizing CNNs
    • visualize the internal representation learned by CNNs in an attempt to better understand their properties
    • Zeiler et al, Mahendran et al, Dosovitskiy et al

Class Activation Mapping (CAM)

using global average pooling (GAP)

  • $f_k(x, y)$: the activation of unit $k$ in the last convolutional layer at spatial location $(x, y)$
  • $F_k = \sum_{x, y}f_k(x, y)$: the result of performing global average pooling for unit $k$
  • $w_k^c$: the importance of $F_k$ for class $c$
  • $S_c = \sum_kw_k^cF_k$: the input of the softmax
  • $P_c = \frac{\exp(S_c)}{\sum_{c'} \exp(S_{c'})}$: the output of the softmax for class $c$
  • $M_c$: class activation map for class $c$

$F_k = \sum_{x, y}f_k(x, y)$

$S_c = \sum_k w_k^c \sum_{x,y} f_k(x, y) = \sum_{x,y}\sum_k w_k^cf_k(x,y)$

$M_c(x,y) = \sum_kw_k^cf_k(x,y)$

$S_c = \sum_{x,y} M_c(x,y)$
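The definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the array layout (`features` as K x H x W, `weights` as C x K), the function names, and the random inputs are all assumptions made for the example. It computes $M_c$ as the weighted sum of feature maps and computes $S_c$ through the pooled features, so the identity $S_c = \sum_{x,y} M_c(x,y)$ can be checked numerically.

```python
import numpy as np

def class_activation_map(features, weights, class_idx):
    """Return M_c(x, y) = sum_k w_k^c * f_k(x, y).

    features: array of shape (K, H, W), last conv layer activations f_k (assumed layout)
    weights:  array of shape (C, K), softmax-layer weights w_k^c (assumed layout)
    """
    # Contract the unit axis k of the class weights against the feature maps.
    return np.tensordot(weights[class_idx], features, axes=([0], [0]))

def class_scores(features, weights):
    """Return S_c = sum_k w_k^c * F_k, with F_k = sum_{x,y} f_k(x, y)."""
    pooled = features.sum(axis=(1, 2))  # F_k, one value per unit
    return weights @ pooled             # S_c, one value per class

# Hypothetical toy input: 4 units on a 3x3 grid, 2 classes.
rng = np.random.default_rng(0)
features = rng.random((4, 3, 3))
weights = rng.standard_normal((2, 4))

cam = class_activation_map(features, weights, class_idx=1)
scores = class_scores(features, weights)

# Summing the map over all locations recovers the class score.
print(np.isclose(cam.sum(), scores[1]))
```

In practice the map is usually upsampled to the input image size before being overlaid on the image; that step is omitted here.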

Global Average Pooling vs. Global Max Pooling

  • GAP
    • loss encourages the network to identify the extent of the object
  • GMP
    • loss encourages the network to identify just one discriminative part
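One way to see this difference is to compare how the pooled value depends on each spatial location. The sketch below is an illustrative assumption, not something taken from the paper: it writes out the gradient of the pooled output with respect to a single feature map, using hypothetical helper names `gap_grad` and `gmp_grad`. With average pooling every location contributes, so low activations anywhere pull the output down; with max pooling only the strongest location matters.

```python
import numpy as np

def gap_grad(fmap):
    """Gradient of mean(fmap) w.r.t. each entry: every location gets 1/N."""
    return np.full(fmap.shape, 1.0 / fmap.size)

def gmp_grad(fmap):
    """Gradient of max(fmap) w.r.t. each entry: only the argmax location gets 1."""
    grad = np.zeros(fmap.shape)
    grad[np.unravel_index(np.argmax(fmap), fmap.shape)] = 1.0
    return grad

# Hypothetical 2x2 feature map for one unit.
fmap = np.array([[0.1, 0.9],
                 [0.4, 0.2]])

print(gap_grad(fmap))  # uniform: all four locations receive a training signal
print(gmp_grad(fmap))  # one-hot: only the most active location receives a signal
```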

Weakly-supervised Object Localization

Setup

  • Originals
    • AlexNet
    • VGGnet
    • GoogLeNet
  • Remove the fully-connected layers before the final output and replace them with GAP followed by a fully-connected softmax layer
    • AlexNet-GAP
    • VGGnet-GAP
    • GoogLeNet-GAP
  • AlexNet*-GAP
    • AlexNet is the most affected by the removal of the fully-connected layers
    • add two convolutional layers just before GAP

Results

Classification

In most cases there is a small performance drop of 1-2% when removing the additional layers from the various networks.

Localization

Generate bounding boxes: first segment the regions whose value is above 20% of the max value of the CAM, then take the bounding box that covers the largest connected component in the segmentation map.
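The bounding-box procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `cam_to_bbox`, the use of 4-connectivity for connected components, and the `(row_min, col_min, row_max, col_max)` return format are choices made for the example and are not specified in the notes above. It assumes the CAM has already been resized to the image resolution.

```python
from collections import deque

import numpy as np

def cam_to_bbox(cam, threshold_ratio=0.2):
    """Return (row_min, col_min, row_max, col_max) of the largest connected
    region where cam > threshold_ratio * cam.max(), or None if no pixel passes.
    Uses 4-connectivity (an assumption)."""
    mask = cam > threshold_ratio * cam.max()
    visited = np.zeros(mask.shape, dtype=bool)
    height, width = mask.shape
    best = []  # pixels of the largest component found so far

    for r0 in range(height):
        for c0 in range(width):
            if not mask[r0, c0] or visited[r0, c0]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            component = []
            queue = deque([(r0, c0)])
            visited[r0, c0] = True
            while queue:
                r, c = queue.popleft()
                component.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < height and 0 <= nc < width
                            and mask[nr, nc] and not visited[nr, nc]):
                        visited[nr, nc] = True
                        queue.append((nr, nc))
            if len(component) > len(best):
                best = component

    if not best:
        return None
    rows = [r for r, _ in best]
    cols = [c for _, c in best]
    return min(rows), min(cols), max(rows), max(cols)

# Hypothetical CAM: a 2x3 strong region plus one isolated weaker pixel.
cam = np.zeros((5, 5))
cam[1:3, 1:4] = 1.0
cam[4, 4] = 0.5
print(cam_to_bbox(cam))  # (1, 1, 2, 3)
```

The isolated pixel at (4, 4) passes the 20% threshold but belongs to a smaller component, so it does not affect the box.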

Deep Features for Generic Localization

Fine-grained Recognition

Pattern Discovery

Discovering informative objects in the scenes

Concept localization in weakly labeled images

Weakly supervised text detector

Interpreting visual question answering

Visualizing Class-Specific Units

Author

Tracy Liu

Posted on

2019-09-12

Updated on

2021-03-31
