Attention is All You Need
NIPS 2017
Introduction
- Recurrent Models
- sequential computation along symbol positions
- hard to parallelize
- the Transformer
- no recurrence
- relies entirely on an attention mechanism
- draws global dependencies between input and output
- allows significantly more parallelization (trained on 8 GPUs)
Self-attention
an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence
Model Architecture
$d_{model}=512$
Encoder
- $N$ = 6
Decoder
- $N$ = 6
- masking: the self-attention layers are only allowed to attend to earlier positions (and the current one) in the output sequence
Attention
Scaled Dot-Product Attention
$Attention(Q, K, V) = softmax(\frac{QK^\top}{\sqrt{d_k}})V$
the dot products are divided by $\sqrt{d_k}$ so they don't grow extremely large and push the softmax into regions with very small gradients
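A minimal NumPy sketch of the formula above (variable names and shapes are my own, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_q, n_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions ~ -inf
    weights = softmax(scores, axis=-1)         # each query's weights sum to 1
    return weights @ V                         # (n_q, d_v)

# tiny example: 3 query positions attending over 4 key/value positions
Q = np.random.randn(3, 64)
K = np.random.randn(4, 64)
V = np.random.randn(4, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 64)
```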
Multi-Head Attention
$MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O$
where $head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$
in this work
- $h = 8$
- $d_k = d_v = d_{model}/h = 64$
- the $h$ projections are computed in parallel (sketched below)
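A rough sketch of multi-head attention with $h = 8$ and $d_{model} = 512$; the projection matrices here are random placeholders, only meant to show the shapes and the concat-then-project step:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(Q, K, V, h=8, d_model=512):
    d_k = d_v = d_model // h                      # 64 per head
    rng = np.random.default_rng(0)
    # per-head projections W_i^Q, W_i^K, W_i^V and the output projection W^O
    WQ = rng.standard_normal((h, d_model, d_k))
    WK = rng.standard_normal((h, d_model, d_k))
    WV = rng.standard_normal((h, d_model, d_v))
    WO = rng.standard_normal((h * d_v, d_model))
    heads = [attention(Q @ WQ[i], K @ WK[i], V @ WV[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ WO    # Concat(head_1..head_h) W^O

x = np.random.randn(10, 512)                      # 10 positions, d_model = 512
print(multi_head_attention(x, x, x).shape)        # (10, 512)
```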
Applications
- encoder-decoder attention
- queries: from previous decoder layer
- keys and values: from encoder
- allows every position in the decoder to attend over all positions in the input sequence
- encoder self-attention
- decoder self-attention
- to prevent leftward information flow: mask out illegal connections (set to $-\infty$) in the input of the softmax
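A small sketch of how such a mask can be built as a lower-triangular matrix; the illegal (future) positions get a large negative number standing in for $-\infty$ before the softmax:

```python
import numpy as np

def causal_mask(n):
    # True where attention is allowed: query i may attend to keys j <= i
    return np.tril(np.ones((n, n), dtype=bool))

n = 5
scores = np.random.randn(n, n)                   # raw QK^T / sqrt(d_k) scores
masked = np.where(causal_mask(n), scores, -1e9)  # future positions ~ -inf
print(causal_mask(n).astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```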
Positional Encoding
$PE_{(pos,2i)} = sin(pos/10000^{2i/d_{model}})$
$PE_{(pos,2i+1)} = cos(pos/10000^{2i/d_{model}})$
- $pos$: position
- $i$: dimension
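A sketch of the sinusoidal encoding: even dimensions use the sine, odd dimensions the cosine, and each pair shares the same wavelength (the `max_len` name is my own):

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions: cos
    return pe

print(positional_encoding(50).shape)  # (50, 512)
```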
Why Self-Attention
Training
dataset
- WMT 2014 English-German dataset
- 4.5 million sentence pairs
- vocabulary: about 37000 tokens
- WMT 2014 English-French dataset
- 36M sentences
- vocabulary: 32000 tokens
hardware and schedule
- 8 GPUs
- base models
- step time 0.4 sec
- 100000 steps (12 hours)
- big models
- step time 1.0 sec
- 300000 steps (3.5 days)
optimizer
- Adam optimizer
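The paper also reports $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$ and a warm-up learning-rate schedule with $warmup\_steps = 4000$; a quick sketch of that schedule:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (100, 4000, 100000):
    print(step, transformer_lr(step))
# rises linearly for the first 4000 steps, then decays as 1/sqrt(step)
```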
Results
(A) vary the number of attention heads and the attention key and value dimensions
(B) reducing the attention key size $d_k$ hurts model quality
(C) and (D) show that bigger models are better and dropout is helpful in avoiding over-fitting
(E) replacing sinusoidal positional encoding with learned positional embeddings gives nearly identical results
Reference
- Attention is All You Need, NIPS 2017
- https://tracyliu1220.github.io/2019/08/16/2019-08-16-Attention-is-All-You-Need/