Attention is All You Need

NIPS 2017, Google

Introduction

  • Recurrent models
    • compute sequentially along symbol positions
    • hard to parallelize within a training example
  • the Transformer
    • no recurrence
    • relies entirely on attention mechanisms
    • models global dependencies between input and output
    • allows much more parallelization (trained on 8 GPUs)

Self-attention

an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence

Model Architecture

$d_{model}=512$

Encoder

  • $N$ = 6

Decoder

  • $N$ = 6
  • masking: the self-attention layers are only allowed to attend to earlier positions in the output sequence
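
To make the configuration concrete, a minimal sketch using PyTorch's built-in `nn.Transformer` (not the authors' original implementation); `dim_feedforward=2048` matches the paper's inner feed-forward size, and the tensor shapes are purely illustrative.

```python
import torch
import torch.nn as nn

# Sketch of the base configuration using PyTorch's built-in nn.Transformer
# (not the authors' original implementation).
model = nn.Transformer(
    d_model=512,            # model dimension
    nhead=8,                # attention heads
    num_encoder_layers=6,   # N = 6 encoder layers
    num_decoder_layers=6,   # N = 6 decoder layers
    dim_feedforward=2048,   # inner feed-forward dimension used in the paper
    dropout=0.1,
)

src = torch.rand(10, 32, 512)   # (source length, batch, d_model)
tgt = torch.rand(20, 32, 512)   # (target length, batch, d_model)
out = model(src, tgt)           # (20, 32, 512); real training would also pass a causal tgt_mask
```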

Attention

Scaled Dot-Product Attention

$Attention(Q, K, V) = softmax(\frac{QK^\top}{\sqrt{d_k}})V$

divided by $\sqrt{d_k}$ to keep the dot products from growing extremely large, which would push the softmax into regions with very small gradients
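
A minimal PyTorch sketch of this formula (illustrative, not the paper's code); the optional `mask` is an additive mask holding 0 or $-\infty$:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., len_q, len_k)
    if mask is not None:
        scores = scores + mask                          # additive mask: 0 or -inf
    weights = torch.softmax(scores, dim=-1)             # attention weights
    return weights @ v                                  # (..., len_q, d_v)

q = k = v = torch.rand(2, 5, 64)                        # (batch, seq, d_k) with d_k = 64
out = scaled_dot_product_attention(q, k, v)             # (2, 5, 64)
```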

Multi-Head Attention

$MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O$

where $head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$

in this work:

  • $h = 8$ parallel attention heads
  • $d_k = d_v = d_{model}/h = 64$
  • the $h$ heads are computed in parallel, and the reduced dimension per head keeps the total cost similar to single-head attention with full dimensionality
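
A rough sketch of multi-head attention with these settings ($h = 8$, $d_{model} = 512$, so $d_k = d_v = 64$); not the authors' implementation, and it leans on `torch.nn.functional.scaled_dot_product_attention` (available in PyTorch ≥ 2.0) for the per-head attention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Illustrative multi-head attention: h = 8 heads over d_model = 512."""

    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h      # d_k = d_v = 64
        # one linear layer per Q/K/V projects all h heads at once
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # W^O

    def _split(self, x):
        # (batch, seq, d_model) -> (batch, h, seq, d_k)
        b, s, _ = x.shape
        return x.view(b, s, self.h, self.d_k).transpose(1, 2)

    def forward(self, q, k, v, mask=None):
        q, k, v = self._split(self.w_q(q)), self._split(self.w_k(k)), self._split(self.w_v(v))
        # scaled dot-product attention applied to every head in parallel
        heads = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        b, _, s, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(b, s, self.h * self.d_k)
        return self.w_o(concat)                 # Concat(head_1, ..., head_h) W^O

mha = MultiHeadAttention()
x = torch.rand(2, 5, 512)
out = mha(x, x, x)                              # self-attention: (2, 5, 512)
```
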
Applications
  • encoder-decoder attention
    • queries: from previous decoder layer
    • keys and values: from encoder
    • allows every position in the decoder to attend over all positions in the input sequence
  • encoder self-attention
  • decoder self-attention
    • to prevent leftward information flow (preserving the auto-regressive property): mask out illegal connections by setting them to $-\infty$ before the softmax (sketched below)
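
A small sketch of that mask: future (rightward) positions get $-\infty$ so the softmax assigns them zero weight. PyTorch's `nn.Transformer.generate_square_subsequent_mask` produces the same pattern; built by hand:

```python
import torch

def causal_mask(size):
    # Entries above the diagonal (future positions) are -inf, the rest 0,
    # so adding this to the attention scores zeroes out rightward attention.
    mask = torch.full((size, size), float("-inf"))
    return torch.triu(mask, diagonal=1)

print(causal_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```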

Positional Encoding

$PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})$

$PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$

  • $pos$: position
  • $i$: dimension
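
A short sketch of these encodings (illustrative): even dimensions take the sine, odd dimensions the cosine, and the result is added to the input embeddings:

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    dim = torch.arange(0, d_model, 2, dtype=torch.float32)          # even indices 2i
    angle = pos / torch.pow(10000.0, dim / d_model)                 # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe                                                       # (max_len, d_model)

pe = sinusoidal_positional_encoding(100)    # added to the (scaled) input embeddings
```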

Why Self-Attention

three criteria for comparing self-attention with recurrent and convolutional layers:

  • total computational complexity per layer
  • amount of computation that can be parallelized
  • path length between long-range dependencies in the network

Training

dataset

  • WMT 2014 English-German dataset
    • 4.5 million sentence pairs
    • shared source-target vocabulary of about 37000 BPE tokens
  • WMT 2014 English-French dataset
    • 36 million sentences
    • 32000 word-piece vocabulary

hardware and schedule

  • 8 NVIDIA P100 GPUs
  • base models: 100,000 steps at about 0.4 s per step (about 12 hours)
  • big models
    • about 1.0 s per step
    • 300,000 steps (3.5 days)

optimizer

  • Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$) with a warmup learning-rate schedule (sketched below)
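
The learning rate warms up linearly and then decays with the inverse square root of the step number; a small sketch with the paper's $warmup\_steps = 4000$:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(transformer_lr(1))        # ~1.7e-07: tiny at the start
print(transformer_lr(4000))     # ~7.0e-04: peak at the end of warmup
print(transformer_lr(100000))   # ~1.4e-04: inverse-square-root decay afterwards
```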

Results

variations on the base model:

  • (A) vary the number of attention heads and the attention key/value dimensions
  • (B) reducing the attention key size $d_k$ hurts model quality
  • (C) and (D): bigger models are better, and dropout is very helpful in avoiding over-fitting
  • (E) replacing the sinusoidal positional encoding with learned positional embeddings gives nearly identical results

Author: Tracy Liu
Posted on: 2019-08-16
Updated on: 2021-03-31
