Effective Approaches to Attention-based Neural Machine Translation

EMNLP 2015
Minh-Thang Luong, Hieu Pham, Christopher D. Manning (Stanford University)


Introduction

  • The attentional mechanism jointly translates and aligns words
  • 2 types of attention-based models
    • global: attends to all source words
    • local: attends to a subset of source words

Neural Machine Translation

Probabilistic model $p(y|x)$

  • input: $x_1, \dots, x_n$
  • output: $y_1, \dots, y_m$

2 components

  • encoder: computes a representation $s$ of the source sentence
  • decoder: generates one target word at a time

$\log p(y|x) = \sum_{j=1}^{m} \log p(y_j \mid y_{<j}, s)$

using stacked LSTMs

the probability of decoding each word $y_j$:

$p(y_j \mid y_{<j}, s) = \mathrm{softmax}(g(h_j))$

$g$ is the transformation function that outputs a vocabulary-sized vector

training objective:

$J_t = \sum_{(x,y)\in\mathbb{D}} -\log p(y|x)$
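
A minimal NumPy sketch of this factorization and objective, using toy sizes and a single linear layer standing in for $g$ (all names here are hypothetical, not from the paper's code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
hidden, vocab, m = 8, 20, 5              # toy sizes (hypothetical)

W_g = rng.normal(size=(vocab, hidden))   # linear layer standing in for g
h = rng.normal(size=(m, hidden))         # decoder hidden states h_1..h_m
y = rng.integers(vocab, size=m)          # gold target word ids y_1..y_m

# log p(y|x) = sum_j log p(y_j | y_<j, s), with p(y_j | ...) = softmax(g(h_j))
log_p = sum(np.log(softmax(W_g @ h[j])[y[j]]) for j in range(m))

# the per-sentence term of the training objective J_t
loss = -log_p
print(loss)
```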

Attention-based Models

  • context vector $c_t$
    • captures relevant source-side information to help predict the current target word $y_t$
    • weighted average over all the source hidden states
  • alignment weights $a_t$

$\tilde{h}_t = \tanh(W_c[c_t;h_t])$

$p(y_t|y_{<t},x) = \mathrm{softmax}(W_s\tilde{h}_t)$
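
A NumPy sketch of these two equations, assuming a dot-product score for the alignment weights (variable names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n = 8, 6                        # hidden size, source length (toy)

h_t   = rng.normal(size=d)         # current target hidden state
h_bar = rng.normal(size=(n, d))    # source hidden states

a_t = softmax(h_bar @ h_t)         # alignment weights (dot score, see below)
c_t = a_t @ h_bar                  # context vector: weighted average of source states

W_c = rng.normal(size=(d, 2 * d))
h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))   # h~_t = tanh(W_c [c_t; h_t])

W_s = rng.normal(size=(20, d))                         # vocab size 20 (toy)
p_y = softmax(W_s @ h_tilde)                           # p(y_t | y_<t, x)
```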

Global Attention

$a_t(s) = \mathrm{align}(h_t, \bar{h}_s) = \frac{\exp(\mathrm{score}(h_t, \bar{h}_s))}{\sum_{s'} \exp(\mathrm{score}(h_t, \bar{h}_{s'}))}$

$$
\mathrm{score}(h_t,\bar{h}_s) =
\begin{cases}
h_t^\top \bar{h}_s & \text{dot} \\
h_t^\top W_a \bar{h}_s & \text{general} \\
v_a^\top \tanh(W_a [h_t; \bar{h}_s]) & \text{concat}
\end{cases}
$$

three scoring functions: dot, general, and concat
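
The three scoring functions side by side, as a NumPy sketch with toy dimensions (the weight names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
h_t, h_s = rng.normal(size=d), rng.normal(size=d)   # target / source hidden states

W_a     = rng.normal(size=(d, d))
W_a_cat = rng.normal(size=(d, 2 * d))
v_a     = rng.normal(size=d)

score_dot     = h_t @ h_s                                            # dot
score_general = h_t @ W_a @ h_s                                      # general
score_concat  = v_a @ np.tanh(W_a_cat @ np.concatenate([h_t, h_s]))  # concat
```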

Local Attention

  • window half-width $D$
    • $c_t$ is derived as a weighted average over the source hidden states within the window $[p_t-D, p_t+D]$
    • $D$ is empirically selected
  • $a_t$
    • fixed dimension $2D+1$
    • Gaussian distribution centered around $p_t$, with $\sigma = D/2$
    • $a_t(s) = \mathrm{align}(h_t, \bar{h}_s)\exp\left(-\frac{(s - p_t)^2}{2\sigma^2}\right)$

Monotonic alignment (local-m)

$p_t = t$

Predictive alignment (local-p)

$p_t = S \cdot \mathrm{sigmoid}(v_p^\top \tanh(W_p h_t))$

  • source sentence length $S$
  • model parameters (to be learned): $v_p$, $W_p$ (a sketch combining these pieces follows)
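
A NumPy sketch of local-p attention: predict $p_t$, restrict attention to the window, and reweight by the Gaussian (toy sizes; the dot score is an illustrative choice):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, S, D = 8, 20, 3                 # hidden size, source length, half-width (toy)

h_t   = rng.normal(size=d)
h_bar = rng.normal(size=(S, d))
v_p, W_p = rng.normal(size=d), rng.normal(size=(d, d))

# predictive alignment: p_t = S * sigmoid(v_p^T tanh(W_p h_t)), so p_t in [0, S]
p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))

# attend only to positions inside the window [p_t - D, p_t + D]
lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
window = np.arange(lo, hi)

align = softmax(h_bar[window] @ h_t)           # dot score within the window
sigma = D / 2                                  # the paper sets sigma = D/2
a_t = align * np.exp(-((window - p_t) ** 2) / (2 * sigma ** 2))
c_t = a_t @ h_bar[window]                      # local context vector
```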

Input-feeding Approach

  • to keep track of which source words have been translated
  • alignment decisions should be made jointly, taking past alignment information into account: the attentional vector $\tilde{h}_{t-1}$ is concatenated with the input at the next time step, as sketched below
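
A minimal sketch of the input construction, assuming a NumPy embedding table (hypothetical names; the LSTM step itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 20
E = rng.normal(size=(vocab, d))       # target embedding table (hypothetical)

h_tilde_prev = np.zeros(d)            # attentional vector h~_{t-1} (zeros at t=1)
y_prev = 3                            # previously emitted target word id (toy)

# input-feeding: the decoder input at step t concatenates the previous word's
# embedding with the previous attentional vector, so past alignment decisions
# are visible when computing the next one
x_t = np.concatenate([E[y_prev], h_tilde_prev])   # shape (2d,), fed to the LSTM
```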

Experiments

Training Details

  • WMT’14 English-German training data
    • 4.5M sentence pairs
    • 116M English words, 110M German words
    • vocabularies limited to the most frequent 50K words; other words mapped to the <unk> token
  • LSTM models
    • 4 layers
    • 1000 cells
  • parameters: uniformly initialized in $[-0.1, 0.1]$
  • 10 epochs using SGD

Results

Author: Tracy Liu
Posted on: 2019-08-16
Updated on: 2021-03-31
