Conditional Image Generation with PixelCNN Decoders
NIPS 2016
van den Oord et al., Google DeepMind
Contents
Introduction
- Generate images pixel by pixel
- Related Work
- PixelRNN: better density-modeling performance
- PixelCNN: faster to train (convolutions parallelize easily)
- Gated PixelCNN
- Conditional variant of the Gated PixelCNN
PixelRNN and PixelCNN
The Distribution Modeled by PixelRNNs
$p(x) = \prod_{i=1}^{n^2} p(x_i | x_1, \ldots, x_{i-1})$
- $x$: the input image of $n \times n$ pixels
- $x_i$: the $i$-th pixel in raster-scan order (row by row, left to right); a sampling sketch follows below
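At sampling time this product is realized by a raster-scan loop that re-runs the network on the partially generated image for every pixel and color channel. A minimal PyTorch sketch, assuming a hypothetical `model` that maps an image to logits over 256 intensity levels:

```python
import torch

def sample(model, n=32, channels=3, levels=256):
    """Raster-scan sampling: draw x_i ~ p(x_i | x_1, ..., x_{i-1}) by
    repeatedly re-running the network on the partial image. `model` is
    a hypothetical masked CNN returning per-pixel, per-channel logits
    of shape [1, levels, channels, n, n]."""
    x = torch.zeros(1, channels, n, n)
    with torch.no_grad():
        for i in range(n):                 # rows, top to bottom
            for j in range(n):             # columns, left to right
                for c in range(channels):  # R, then G, then B
                    logits = model(x)[0, :, c, i, j]
                    probs = torch.softmax(logits, dim=0)
                    # sample an intensity and write it back, so the next
                    # channel/pixel is conditioned on this choice
                    x[0, c, i, j] = torch.multinomial(probs, 1).item() / (levels - 1)
    return x
```

Training, by contrast, needs only a single forward pass per image: the masked convolutions evaluate all conditionals in parallel.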
Masking
Masks ensure the CNN can only use information about pixels above and to the left of the current pixel, matching the raster-scan ordering of the factorization above.
3 Color Channels
- B conditioned on (R, G)
- G conditioned on R
- first layer: mask A (excludes the current pixel), all subsequent layers: mask B (includes it); see the sketch after this list
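The masks can be implemented by zeroing out kernel weights before each convolution. A minimal PyTorch sketch (the class name is ours; the per-channel R→G→B masking is omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is zeroed below and to the right of the
    center. Mask 'A' (first layer) also zeroes the center weight, so
    the current pixel cannot see itself; mask 'B' (later layers)
    keeps it. Per-channel R -> G -> B masking is omitted."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        mask = torch.ones_like(self.weight)   # [out_ch, in_ch, kh, kw]
        _, _, kh, kw = self.weight.shape
        # center row: zero from the center (mask A) or just after it (mask B)
        mask[:, :, kh // 2, kw // 2 + (mask_type == "B"):] = 0
        mask[:, :, kh // 2 + 1:, :] = 0       # all rows below the center
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

# Usage: mask A only in the first layer, mask B everywhere after
layer1 = MaskedConv2d("A", 3, 64, kernel_size=7, padding=3)
layer2 = MaskedConv2d("B", 64, 64, kernel_size=3, padding=1)
```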
Gated PixelCNN
Gated Convolutional Layers
Gated Activation Unit
$y = \tanh(W_{k,f} * x) \odot \sigma(W_{k,g} * x)$
- $k$: the layer index
- $f$, $g$: the filter and the gate
- $\sigma$: the logistic sigmoid
- $\odot$: element-wise product
- $*$: convolution operator (implementation sketch below)
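In practice the two weight sets $W_{k,f}$ and $W_{k,g}$ are typically realized as a single convolution with twice the output maps, split in half; a sketch:

```python
import torch

def gated_activation(conv_out):
    """y = tanh(W_f * x) ⊙ σ(W_g * x), where `conv_out` holds the
    outputs of both convolutions stacked along the channel axis
    (assumed layout: [B, 2p, H, W])."""
    filt, gate = conv_out.chunk(2, dim=1)  # split into filter and gate halves
    return torch.tanh(filt) * torch.sigmoid(gate)
```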
Blind spot
Stacking masked convolutions leaves a blind spot: pixels above and to the right of the current pixel are never reached by the receptive field, even though the factorization allows conditioning on them. The Gated PixelCNN removes it by combining two stacks of convolutions.
A single layer block of a Gated PixelCNN
- Notations
- green: convolution operations
- red: element-wise multiplications and additions
- blue: splits feature maps
- Left part: vertical stack (conditions on all rows above the current pixel)
- Right part: horizontal stack (conditions on the pixels to the left within the current row); a sketch of the block follows below
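A sketch of one such block, assuming square $k \times k$ kernels; the names and the pad-and-crop trick are ours, one of several equivalent ways to keep the convolutions causal (the conditional bias term is omitted here):

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """One Gated PixelCNN layer block (a sketch). The vertical stack
    sees all rows above the current pixel; the horizontal stack sees
    pixels strictly to its left and additionally receives the vertical
    stack's features, so together they cover the full context."""
    def __init__(self, ch, k=3):
        super().__init__()
        # (k//2+1) x k conv, padded/cropped to see only rows strictly above
        self.v_conv = nn.Conv2d(ch, 2 * ch, (k // 2 + 1, k),
                                padding=(k // 2 + 1, k // 2))
        # 1 x (k//2+1) conv, padded/cropped to see only pixels strictly left
        self.h_conv = nn.Conv2d(ch, 2 * ch, (1, k // 2 + 1),
                                padding=(0, k // 2 + 1))
        self.v_to_h = nn.Conv2d(2 * ch, 2 * ch, 1)  # link: vertical -> horizontal
        self.h_res = nn.Conv2d(ch, ch, 1)           # residual path (red '+')

    @staticmethod
    def _gate(t):
        f, g = t.chunk(2, dim=1)                    # blue split in the figure
        return torch.tanh(f) * torch.sigmoid(g)

    def forward(self, v, h):
        H, W = v.shape[-2:]
        v_feat = self.v_conv(v)[:, :, :H, :]        # crop -> causal in height
        h_feat = self.h_conv(h)[:, :, :, :W]        # crop -> causal in width
        h_feat = h_feat + self.v_to_h(v_feat)       # inject rows-above context
        v_out = self._gate(v_feat)
        h_out = h + self.h_res(self._gate(h_feat))  # residual connection
        return v_out, h_out
```

The 1×1 link from the vertical into the horizontal stack is what gives the horizontal stack access to every row above, removing the blind spot.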
Conditional PixelCNN
$p(x|h) = \prod_{i=1}^{n^2} p(x_i | x_1, \ldots, x_{i-1}, h)$
- $h$: a latent vector describing the image (e.g., a one-hot class label or an embedding)
$y = \tanh(W_{k,f} * x + V_{k,f}^T h) \odot \sigma(W_{k,g} * x + V_{k,g}^T h)$
- Applications of $h$
- class dependent bias: what should be in the image (a sketch follows this list)
- location dependent bias: where it should be (see Location Dependent below)
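For the class-dependent bias, $V_{k,f}^T h$ and $V_{k,g}^T h$ are per-channel biases broadcast over all spatial positions. A sketch (module and attribute names are ours):

```python
import torch
import torch.nn as nn

class ClassConditionalGate(nn.Module):
    """Gated activation with a class-dependent bias. h is a one-hot
    label or any embedding; V_f and V_g map it to per-channel biases
    broadcast over every spatial position, so the bias does not
    depend on pixel location."""
    def __init__(self, ch, h_dim):
        super().__init__()
        self.V_f = nn.Linear(h_dim, ch, bias=False)  # V_{k,f}^T h
        self.V_g = nn.Linear(h_dim, ch, bias=False)  # V_{k,g}^T h

    def forward(self, conv_f, conv_g, h):
        # conv_f, conv_g: [B, ch, H, W] (outputs of W_{k,f} * x, W_{k,g} * x)
        bf = self.V_f(h)[:, :, None, None]           # [B, ch, 1, 1], broadcast
        bg = self.V_g(h)[:, :, None, None]
        return torch.tanh(conv_f + bf) * torch.sigmoid(conv_g + bg)
```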
Location Dependent
Map $h$ to a spatial representation $s = m(h)$, where $s$ has the same width and height as the image; the paper uses a deconvolutional network for $m(\cdot)$.
$y = \tanh(W_{k,f} * x + V_{k,f} * s) \odot \sigma(W_{k,g} * x + V_{k,g} * s)$
- $V_{k,f} * s$, $V_{k,g} * s$: unmasked $1 \times 1$ convolutions
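A sketch of this variant; a linear projection plus bilinear upsampling stands in for the paper's deconvolutional $m(\cdot)$, and all names are ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationConditionalGate(nn.Module):
    """Location-dependent conditioning: m(h) maps the vector h to a
    spatial map s with the image's height and width, and unmasked 1x1
    convolutions V_f, V_g turn s into position-specific gate biases."""
    def __init__(self, ch, h_dim, s_ch=16, size=(32, 32)):
        super().__init__()
        self.s_ch, self.size = s_ch, size
        self.proj = nn.Linear(h_dim, s_ch * 8 * 8)  # coarse 8x8 map of h
        self.V_f = nn.Conv2d(s_ch, ch, 1)           # V_{k,f} * s (1x1 conv)
        self.V_g = nn.Conv2d(s_ch, ch, 1)           # V_{k,g} * s (1x1 conv)

    def forward(self, conv_f, conv_g, h):
        s = self.proj(h).view(-1, self.s_ch, 8, 8)
        s = F.interpolate(s, size=self.size, mode="bilinear",
                          align_corners=False)      # s = m(h), upsampled to HxW
        return torch.tanh(conv_f + self.V_f(s)) * torch.sigmoid(conv_g + self.V_g(s))
```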
PixelCNN Auto-Encoders
- Replace the deconvolutional decoder of a conventional auto-encoder with a conditional PixelCNN, conditioning on the encoder's output $h$; a sketch follows below
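A structural sketch, assuming a hypothetical `decoder` module implementing the conditional PixelCNN $p(x|h)$ from the previous section:

```python
import torch
import torch.nn as nn

class PixelCNNAutoEncoder(nn.Module):
    """Auto-encoder sketch: a convolutional encoder compresses x into
    a latent vector h, and the decoder is a conditional PixelCNN on h
    instead of a deconvolutional stack. `decoder` is a hypothetical
    module computing per-pixel logits for p(x | h)."""
    def __init__(self, decoder, in_ch=3, h_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, h_dim),
        )
        self.decoder = decoder

    def forward(self, x):
        h = self.encoder(x)        # small latent description of the image
        return self.decoder(x, h)  # teacher-forced logits for p(x | h)
```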
Experiments
Unconditional Modeling with Gated PixelCNN
Performance of different models on CIFAR-10
Performance of different models on ImageNet
Conditioning on ImageNet Classes
Conditioning on Portrait Embeddings
PixelCNN Auto-Encoder
Reference
van den Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., & Kavukcuoglu, K. (2016). Conditional Image Generation with PixelCNN Decoders. In Advances in Neural Information Processing Systems 29 (NIPS 2016). arXiv:1606.05328.