Deep Residual Learning for Image Recognition

CVPR 2016


1. Introduction

Is learning better networks as easy as stacking more layers?

  • Problem: vanishing/exploding gradients
    • Hamper convergence from the beginning
  • Solutions: normalized initialization and intermediate normalization layers
    • They enable networks with tens of layers to start converging with SGD and backpropagation

Degradation problem

  • With the network depth increasing, accuracy gets saturated and then degrades rapidly
  • Not caused by overfitting
  • Adding more layers to a suitably deep model leads to higher training error

Deep residual learning framework

  • desired underlying mapping: $H(x)$
  • let the stacked layers fit another mapping: $F(x) := H(x) - x$
  • the original mapping is recast as $F(x) + x$
  • If an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers
  • identity mapping: $H(x) = x$, in which case the residual is $F(x) = 0$ (written out below)
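
A compact restatement of the bullets above as one chain (nothing beyond what the paper states):

```latex
% desired mapping H(x); the stacked layers fit the residual instead
F(x) := H(x) - x
\quad\Longrightarrow\quad
H(x) = F(x) + x
% if the optimal mapping is the identity, the residual vanishes:
H(x) = x \;\Longrightarrow\; F(x) = 0
```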

Shortcut connections

  • Shortcut connections are those skipping one or more layers; here they simply perform identity mapping, and their outputs are added to the outputs of the stacked layers

ImageNet

  • Residual nets are easy to optimize, while the counterpart plain nets show higher training error as depth increases
  • Residual nets easily gain accuracy from greatly increased depth
  • 152-layer residual net (the deepest presented at the time)
  • Lower complexity than VGG nets
  • 3.57% top-5 error on the ImageNet test set

2. Related Work

Residual Representations

  • Powerful shallow representations for image retrieval and classification
    • VLAD: encodes by the residual vectors with respect to a dictionary (see the sketch after this list)
    • Fisher Vector: probabilistic version of VLAD
  • Encoding residual vectors is shown to be more effective than encoding original vectors
  • For solving Partial Differential Equations (PDEs)
    • Multigrid method
    • Hierarchical basis preconditioning
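
A rough numpy sketch of the VLAD idea noted above (shapes and names are mine, not from the cited work): each local descriptor is assigned to its nearest codeword in the dictionary, and what gets accumulated is the residual between descriptor and codeword.

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """Toy VLAD encoding: accumulate residuals w.r.t. the nearest codeword.

    descriptors: (n, d) local descriptors of one image
    codebook:    (k, d) dictionary of visual words
    returns:     (k * d,) L2-normalized residual encoding
    """
    # Assign each descriptor to its nearest codeword
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)                      # (n,)

    # Accumulate residuals (descriptor - codeword) per codeword
    k, d = codebook.shape
    encoding = np.zeros((k, d))
    for i, c in enumerate(assignment):
        encoding[c] += descriptors[i] - codebook[c]

    encoding = encoding.ravel()
    return encoding / (np.linalg.norm(encoding) + 1e-12)   # L2 normalization

# Example: 100 SIFT-like descriptors, 16-word dictionary
code = vlad_encode(np.random.randn(100, 128), np.random.randn(16, 128))
```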

Shortcut Connections

  • Unlike gated shortcuts (as in highway networks), the identity shortcuts here are never closed: all information is always passed through, with additional residual functions to be learned

3. Deep Residual Learning

Residual Learning

  • Hypothesis
    • If multiple nonlinear layers can asymptotically approximate complicated functions,
    • then they can equally approximate the residual functions, i.e. $H(x) - x$
  • If the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart
  • Residual learning reformulation
    • If identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings

Identity Mapping by Shortcuts

$y = F(x, \{W_i\}) + x$

$y = F(x, \{W_i\}) + W_s x$

$W_s$ for matching dimensions
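
A minimal PyTorch-style sketch of the two forms above (a hypothetical basic block, not the paper's released code): the identity shortcut adds $x$ directly, and $W_s$ appears as a 1×1 convolution only when dimensions differ.

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Sketch of y = F(x, {W_i}) + x, with W_s (1x1 conv) only when shapes differ."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # F(x, {W_i}): two 3x3 convolutions with batch normalization
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # W_s: linear projection used only to match dimensions
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()  # parameter-free identity shortcut

        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + self.shortcut(x))  # element-wise addition, then ReLU
```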

Network Architectures

  • Plain Network
    • mainly inspired by the philosophy of VGG nets
    • Design rules
      • for the same output feature map size, the layers have the same number of filters
      • if the feature map size is halved, the number of filters is doubled
  • Residual Network
    • based on the plain network
    • insert shortcut connections to turn it into the residual counterpart
    • identity shortcuts add no extra parameters to the network
    • when dimensions increase, use a projection to match the dimensions of $F(x)$ and $x$ (see the sketch after this list)
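
A short sketch of these design rules, reusing the hypothetical `BasicBlock` from the sketch above: within a stage the filter count stays fixed, and when the feature map is halved (stride 2) the number of filters doubles, with a projection shortcut absorbing the dimension change.

```python
import torch.nn as nn

def make_stage(in_ch, out_ch, num_blocks):
    """Downsample once (stride 2) while doubling filters, then keep the size."""
    blocks = [BasicBlock(in_ch, out_ch, stride=2)]           # halve map, double filters
    blocks += [BasicBlock(out_ch, out_ch) for _ in range(num_blocks - 1)]
    return nn.Sequential(*blocks)

# e.g. the 34-layer net's conv3_x stage: 64 -> 128 channels, 4 blocks
stage3 = make_stage(64, 128, num_blocks=4)
```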

Implementation

  • Batch normalization after each convolution and before activation
  • SGD with a mini-batch size of 256
  • Learning rate starts at 0.1 and is divided by 10 when the error plateaus
  • Weight decay of 0.0001 and a momentum of 0.9 (optimizer settings sketched below)
  • Dropout is not used; the fully connected layer holds only a small fraction of the parameters
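
The settings above as a hedged PyTorch snippet; `resnet34` from torchvision stands in for the model, and the milestone schedule is only a common stand-in for the paper's "divide the rate by 10 when the error plateaus" policy.

```python
import torch
from torchvision.models import resnet34  # any ResNet would do for this sketch

model = resnet34()
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,             # initial learning rate
    momentum=0.9,
    weight_decay=1e-4,  # weight decay of 0.0001
)
# Fixed milestones are an assumption standing in for the plateau-based policy.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)
```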

4. Experiments

Plain Networks

  • The 34-layer plain net has higher training error than the 18-layer plain net throughout training: the degradation problem

Residual Networks

  • Three major observations
    • The 34-layer ResNet is better than the 18-layer ResNet: the degradation problem is addressed
    • Training error is successfully reduced, and the gain generalizes to the validation data
    • The 18-layer ResNet converges faster than its plain counterpart (ResNets ease optimization at the early stage)

Identity vs. Projection Shortcuts

  • (A) Zero-padding shortcuts for increasing dimensions; all shortcuts are parameter-free (see the sketch below)
  • (B) Projection shortcuts for increasing dimensions; other shortcuts are identity
  • (C) All shortcuts are projections
  • C > B > A, but only slightly, so projection shortcuts are not essential
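
A sketch of one common way option (A) is implemented (not code from the paper): when dimensions increase, the shortcut subsamples spatially and zero-pads the extra channels, so it remains parameter-free.

```python
import torch.nn.functional as F

def zero_pad_shortcut(x, out_ch, stride=2):
    """Parameter-free shortcut for increasing dimensions (option A).

    Subsample spatially by `stride`, then zero-pad the channel dimension
    so the shortcut can be added to an F(x) with `out_ch` channels.
    """
    x = x[:, :, ::stride, ::stride]            # spatial subsampling
    pad_c = out_ch - x.shape[1]                # number of extra (zero) channels
    # F.pad pads dims from the last one backwards: (W_l, W_r, H_t, H_b, C_front, C_back)
    return F.pad(x, (0, 0, 0, 0, 0, pad_c))
```
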
Deeper Bottleneck Architectures

  • Bottleneck block: a stack of 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers reduce and then restore dimensions (sketched below)
  • Parameter-free identity shortcuts keep the complexity of the bottleneck design low
  • 50-layer ResNets
  • 101-layer and 152-layer ResNets
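
A sketch of the bottleneck block used in the 50/101/152-layer nets (parameter names are mine): a 1×1 convolution reduces the channel count, a 3×3 convolution operates on the reduced width, and a second 1×1 convolution restores the dimensions; the identity shortcut adds no parameters.

```python
import torch.nn as nn

class BottleneckSketch(nn.Module):
    """1x1 (reduce) -> 3x3 -> 1x1 (restore) with an identity shortcut."""
    expansion = 4  # the output has 4x the bottleneck width

    def __init__(self, channels, width):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, width, 1, bias=False),                 # 1x1: reduce dimensions
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1, bias=False),         # 3x3 on the smaller width
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width * self.expansion, 1, bias=False),   # 1x1: restore dimensions
            nn.BatchNorm2d(width * self.expansion),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # identity shortcut; assumes channels == width * expansion (e.g. 256 and 64)
        return self.relu(self.f(x) + x)

# e.g. a conv2_x-style block: 256-d input, 64-d bottleneck, 256-d output
block = BottleneckSketch(channels=256, width=64)
```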

