Sequence to Sequence Learning with Neural Networks

NIPS 2014
Ilya Sutskever, Oriol Vinyals, Quoc V. Le (Google)

Abstract

  • Task: an English to French translation task from the WMT-14 dataset
  • Method:
    • a Deep LSTM: maps the input sequence to a vector of a fixed dimensionality
    • another Deep LSTM: decodes the target sequence from the vector
  • Result: BLEU score 34.8
  • Additional findings: reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly

1 Introduction

  • DNNs’ inputs and targets: must be encoded with vectors of fixed dimensionality.
  • The idea: use one LSTM to read the input sequence, one timestep at a time, to obtain a large fixed-dimensional vector representation, and then use another LSTM to extract the output sequence from that vector

2 The model

RNN

  • A natural generalization of feedforward neural networks to sequences
  • Difficult to train plain RNNs here due to the resulting long-term dependencies, which is what motivates the LSTM (the standard recurrence is recalled below)
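
For reference, the standard RNN recurrence the paper starts from computes, for an input sequence $(x_1, \dots, x_T)$,

$$
h_t = \mathrm{sigm}\left(W^{hx} x_t + W^{hh} h_{t-1}\right), \qquad y_t = W^{yh} h_t .
$$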

LSTM

  • 2 LSTMs: one for the input sequence and another for the output sequence
  • Deep LSTMs with 4 layers
  • Reverse the order of the words of the input sentence (a minimal model sketch follows this list)
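
Putting the pieces together: the encoder LSTM reads the reversed source sentence $(x_1, \dots, x_T)$ into a fixed-dimensional vector $v$, and the decoder LSTM models

$$
p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \dots, y_{t-1}).
$$

Below is a minimal PyTorch sketch of this encoder-decoder wiring. The class name, the `batch_first` layout, and the toy sizes are my own choices, not the paper's setup (which uses 4 layers of 1000 cells and 1000-dimensional embeddings):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder LSTM summarizes the (reversed) source into its final (h, c) state;
    the decoder LSTM starts from that state and predicts the target one word at a time."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512, layers=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):
        # src is assumed to be already reversed (see section 3.3)
        _, state = self.encoder(self.src_emb(src))              # fixed-dimensional summary v
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)  # decode conditioned on v
        return self.out(dec_out)                                # logits over the target vocabulary

# toy usage: a batch of 3 source sentences (length 7) and target prefixes (length 5)
model = Seq2Seq(src_vocab=1000, tgt_vocab=800)
logits = model(torch.randint(0, 1000, (3, 7)), torch.randint(0, 800, (3, 5)))
print(logits.shape)  # torch.Size([3, 5, 800])
```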

3 Experiments

3.1 Dataset details

  • 348M French words
  • 304M English words
  • Vector representation
    • 160,000 most frequent words for the source language (English)
    • 80,000 most frequent words for the target language (French)
    • Out-of-vocabulary words: replaced with a special UNK token (a small vocabulary sketch follows this list)
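
A small sketch of how such a frequency-cut vocabulary with an UNK fallback can be built. The function names and the `<unk>` spelling are mine, not the paper's:

```python
from collections import Counter

def build_vocab(sentences, max_size):
    """Keep the max_size most frequent words; everything else maps to <unk>."""
    counts = Counter(w for s in sentences for w in s.split())
    vocab = {"<unk>": 0}
    for i, (w, _) in enumerate(counts.most_common(max_size)):
        vocab[w] = i + 1
    return vocab

def encode(sentence, vocab):
    return [vocab.get(w, vocab["<unk>"]) for w in sentence.split()]

# toy usage; the paper would use max_size=160_000 (English) and 80_000 (French)
vocab = build_vocab(["the cat sat", "the dog sat"], max_size=3)
print(encode("the bird sat", vocab))   # 'bird' is out of vocabulary and maps to <unk>
```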

3.2 Decoding and Rescoring

  • Training objective: maximize the average log probability of a correct translation T given the source sentence S
  • Decoding: a simple left-to-right beam search decoder finds the most likely translation (a sketch follows this list)
  • Rescoring: the LSTM is also used to rescore the 1000-best lists produced by the baseline SMT system
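
At test time the decoder searches for the most likely translation $\hat{T} = \arg\max_{T} p(T \mid S)$; the paper notes that beam sizes of 1 or 2 already work well. A simplified beam search over a generic scoring function (the `log_probs(prefix)` interface is my assumption, not the paper's code):

```python
def beam_search(log_probs, start, end, beam_size=2, max_len=50):
    """Simplified left-to-right beam search.
    log_probs(prefix) must return a dict {token: log p(token | source, prefix)}."""
    beam = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            for tok, lp in log_probs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)   # best partial hypotheses first
        beam = []
        for prefix, score in candidates:
            if prefix[-1] == end:
                finished.append((prefix, score))            # a complete translation
            elif len(beam) < beam_size:
                beam.append((prefix, score))                # keep at most beam_size partials
        if not beam:
            break
    best, _ = max(finished + beam, key=lambda c: c[1])
    return best

# toy usage with a fake bigram "model": prefers "bonjour" followed by end-of-sentence
toy = {"<s>": {"bonjour": -0.1, "salut": -1.0},
       "bonjour": {"</s>": -0.2, "monde": -1.5},
       "salut": {"</s>": -0.3},
       "monde": {"</s>": -0.1}}
print(beam_search(lambda p: toy[p[-1]], "<s>", "</s>"))   # ['<s>', 'bonjour', '</s>']
```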

3.3 Reversing the Source Sentences

  • original order: each source word is far from its corresponding target word, so the problem has a large “minimal time lag”
  • reversed source: the first few words of the source are now very close to the first few words of the target, which greatly reduces the minimal time lag (illustrated below)
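
The reversal itself is trivial; what matters is that only the source side is flipped. A minimal illustration (not the paper's preprocessing code):

```python
def reverse_source(src, tgt):
    """Reverse the source word order, leave the target untouched."""
    return src.split()[::-1], tgt.split()

print(reverse_source("I am happy", "je suis content"))
# (['happy', 'am', 'I'], ['je', 'suis', 'content'])
```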

3.4 Training details

  • LSTMs
    • 4 layers
    • 1000 cells
    • Parameters: initialized from a uniform distribution between -0.08 and 0.08
  • input vocabulary: 160,000
  • output vocabulary: 80,000
  • SGD without momentum, with a fixed learning rate of 0.7
  • 7.5 epochs in total; after 5 epochs, the learning rate is halved every half epoch
  • Batches of 128 sequences
  • Exploding gradients prevented by scaling the gradient down whenever its norm exceeds a threshold
  • Sentences within a minibatch have roughly the same length (a training-loop sketch follows this list)
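
A hedged PyTorch sketch of this recipe, reusing the toy `Seq2Seq` class from the section 2 sketch; the sizes are shrunk so the snippet runs quickly, the minibatch is a random placeholder, and the clipping threshold of 5 follows the paper's hard constraint on the gradient norm:

```python
import torch
import torch.nn as nn

model = Seq2Seq(src_vocab=2000, tgt_vocab=1500)    # paper: 160,000 / 80,000 words, 4 x 1000 LSTM

for p in model.parameters():                       # uniform init in [-0.08, 0.08]
    nn.init.uniform_(p, -0.08, 0.08)

optimizer = torch.optim.SGD(model.parameters(), lr=0.7)   # plain SGD, no momentum
criterion = nn.CrossEntropyLoss()

# placeholder minibatch of 128 roughly same-length sentence pairs (random ids here)
batches = [(torch.randint(0, 2000, (128, 20)),     # reversed source
            torch.randint(0, 1500, (128, 22)),     # target input (shifted right)
            torch.randint(0, 1500, (128, 22)))]    # target output

for half_epoch in range(15):                       # 7.5 epochs = 15 half-epochs
    if half_epoch >= 10:                           # after 5 epochs, halve the lr every half epoch
        for group in optimizer.param_groups:
            group["lr"] /= 2
    for src, tgt_in, tgt_out in batches:
        logits = model(src, tgt_in)
        loss = criterion(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        # prevent exploding gradients: rescale when the gradient norm exceeds the threshold
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
        optimizer.step()
```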

3.5 Parallelization

  • 8-GPU machine
    • 4 GPUs: one per LSTM layer
    • 4 GPUs: parallelizing the softmax
  • Training took about ten days (a toy layer-placement sketch follows this list)
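
A toy illustration of the layer-per-GPU placement (the softmax sharding over the other 4 GPUs is omitted, the sizes are reduced, and the snippet falls back to CPU when no GPUs are visible):

```python
import torch
import torch.nn as nn

n_gpu = torch.cuda.device_count()
devices = [torch.device(f"cuda:{i % n_gpu}") if n_gpu else torch.device("cpu")
           for i in range(4)]                               # one device per LSTM layer

layers = [nn.LSTM(256, 256, batch_first=True).to(d) for d in devices]   # paper: 1000 cells

x = torch.randn(8, 20, 256).to(devices[0])                  # a small batch of activations
for layer, dev in zip(layers, devices):
    x = x.to(dev)                                           # ship activations to this layer's device
    x, _ = layer(x)
```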

3.6 Experimental Results

  • An ensemble of deep LSTMs with a beam search decoder reaches the BLEU score of 34.8 quoted in the Abstract

3.7 Performance on long sentences

  • The LSTM handles very long sentences well (see the Conclusion)

3.8 Model Analysis

  • Turn a sequence of words into a vector
    • Clustered by meaning
    • Sensitive to the order of words
    • Insensitive to replacing an active voice with a passive voice

4 Related Work

  • RNN Language Model (RNNLM)
  • Feedforward Neural Network Language Model (NNLM)
  • Auli et al: combine an NNLM with a topic model of the input sentence
  • Kalchbrenner and Blunsom: map sentences to vectors using convolutional neural networks -> lose the ordering

5 Conclusion

  • A large deep LSTM with a limited vocabulary does well
  • Performance improved markedly by reversing the words in the source sentences -> the reversed problem has many short-term dependencies, which makes optimization easier
  • LSTM can correctly translate very long sentences

Reference

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems 27 (NIPS 2014).

Author: Tracy Liu

Posted on 2019-07-25, updated on 2021-03-31
