4.6 Attention-based NMT

Effective Approaches to Attention-based Neural Machine Translation

Created Date: 2025-05-19

An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT.

This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time.

We demonstrate the effectiveness of both approaches on the WMT translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems that already incorporate known techniques such as dropout. Our ensemble model using different attention architectures yields a new state-of-the-art result in the WMT'15 English-to-German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.

4.6.1 Introduction

Neural Machine Translation (NMT) has achieved state-of-the-art performance in large-scale translation tasks such as English to French (Luong et al., 2015) and English to German (Jean et al., 2015). NMT is appealing since it requires minimal domain knowledge and is conceptually simple. The model by Luong et al. (2015) reads through all the source words until the end-of-sentence symbol <eos> is reached. It then starts emitting one target word at a time, as illustrated in Figure 1.

NMT is often a large neural network that is trained in an end-to-end fashion and has the ability to generalize well to very long word sequences. This means the model does not have to explicitly store gigantic phrase tables and language models as in the case of standard MT; hence, NMT has a small memory footprint. Lastly, implementing NMT decoders is easy, unlike the highly intricate decoders in standard MT (Koehn et al., 2003).

In parallel, the concept of "attention" has gained popularity recently in training neural networks, allowing models to learn alignments between different modalities, e.g., between image objects and agent actions in the dynamic control problem (Mnih et al., 2014), between speech frames and text in the speech recognition task (Chorowski et al., 2014), or between visual features of a picture and its text description in the image caption generation task (Xu et al., 2015).

In the context of NMT, Bahdanau et al. (2015) have successfully applied such an attentional mechanism to jointly translate and align words. To the best of our knowledge, there has not been any other work exploring the use of attention-based architectures for NMT.

In this work, we design, with simplicity and effectiveness in mind, two novel types of attention-based models: a global approach in which all source words are attended to and a local one in which only a subset of source words is considered at a time.

The former approach resembles the model of Bahdanau et al. (2015) but is simpler architecturally. The latter can be viewed as an interesting blend between the hard and soft attention models proposed by Xu et al. (2015): it is computationally less expensive than the global model or soft attention; at the same time, unlike hard attention, the local attention is differentiable almost everywhere, making it easier to implement and train. We also examine various alignment functions for our attention-based models.

Experimentally, we demonstrate that both of our approaches are effective in the WMT translation tasks between English and German in both directions. Our attentional models yield a boost of up to 5.0 BLEU over non-attentional systems which already incorporate known techniques such as dropout.

For English to German translation, we achieve new state-of-the-art (SOTA) results for both WMT'14 and WMT'15, outperforming previous SOTA systems, backed by NMT models and n-gram LM rerankers, by more than 1.0 BLEU.

We conduct extensive analysis to evaluate our models in terms of learning, the ability to handle long sentences, choices of attentional architectures, alignment quality, and translation outputs.

4.6.2 Neural Machine Translation

A neural machine translation system is a neural network that directly models the conditional probability \(p(y|x)\) of translating a source sentence, \(x_1, \cdots, x_n\), to a target sentence, \(y_1, \cdots, y_m\). A basic form of NMT consists of two components: (a) an encoder, which computes a representation \(s\) for each source sentence, and (b) a decoder, which generates one target word at a time and hence decomposes the conditional probability as:

\(\log p(y \mid x) = \sum_{j=1}^{m} \log p(y_j \mid y_{<j}, s)\)
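As a concrete illustration of this decomposition, the sketch below scores a target sentence by summing per-step log-probabilities. The `decoder_step` interface and the `BOS_ID` constant are hypothetical stand-ins for whatever recurrent cell and vocabulary a system actually uses, not the paper's code.

```python
import torch.nn.functional as F

BOS_ID = 0  # assumed id of the sentence-start symbol

def sentence_log_prob(decoder_step, s, y):
    """Compute log p(y|x) = sum_j log p(y_j | y_<j, s).

    decoder_step(prev_word_id, state) -> (logits over target vocab, new state)
    s: decoder state initialized from the source representation.
    y: list of target word ids, ending with <eos>.
    """
    log_prob, state, prev = 0.0, s, BOS_ID
    for y_j in y:
        logits, state = decoder_step(prev, state)  # condition on y_<j and s
        log_prob = log_prob + F.log_softmax(logits, dim=-1)[y_j]
        prev = y_j
    return log_prob
```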

A natural choice to model such a decomposition in the decoder is to use a recurrent neural network (RNN) architecture, which most of the recent NMT work, such as Kalchbrenner and Blunsom (2013), has in common. These approaches differ, however, in terms of which RNN architectures are used for the decoder and how the encoder computes the source sentence representation \(s\).

In this work, following Sutskever et al. (2014), we use the stacking LSTM architecture for our NMT systems, as illustrated in Figure 1. The training objective is formulated as \(J = \sum_{(x, y) \in \mathbb{D}} -\log p(y \mid x)\), with \(\mathbb{D}\) being our parallel training corpus.
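A minimal stacked-LSTM encoder-decoder along these lines might look as follows; the layer count, hidden size, and module names are illustrative assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn

class StackingSeq2Seq(nn.Module):
    """Minimal stacked-LSTM encoder-decoder trained with the NLL
    objective J = sum_{(x, y) in D} -log p(y|x). Sizes are illustrative."""

    def __init__(self, src_vocab, tgt_vocab, dim=512, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # Read the whole source; the final (h, c) of every layer serves as
        # the sentence representation handed to the decoder.
        _, state = self.encoder(self.src_emb(src))
        out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.proj(out)  # per-step logits over the target vocabulary

# Training objective: average -log p(y_j | y_<j, s) over the corpus, e.g.
# loss = nn.CrossEntropyLoss()(logits.flatten(0, 1), tgt_out.flatten())
```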

4.6.3 Attention-based Models

Our various attention-based models are classified into two broad categories, global and local. These classes differ in terms of whether the "attention" is placed on all source positions or on only a few source positions. We illustrate these two model types in Figures 2 and 3, respectively.

Common to both types of models is the fact that at each time step \(t\) in the decoding phase, they first take as input the hidden state \(h_t\) at the top layer of a stacking LSTM.

The goal is then to derive a context vector \(c_t\) that captures relevant source-side information to help predict the current target word \(y_t\). While these models differ in how the context vector \(c_t\) is derived, they share the same subsequent steps.

Specifically, given the target hidden state \(h_t\) and the source-side context vector \(c_t\), we employ a simple concatenation layer to combine the information from both vectors to produce an attentional hidden state as follows:

\(\tilde{h}_t = \tanh(W_c [c_t; h_t])\)
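The sketch below shows one way these shared steps can be realized: alignment weights over the source hidden states, a context vector \(c_t\) as their weighted average, and the concatenation layer above. The dot-product score and the class name are assumptions for illustration; the paper also studies other alignment functions, and the local variant restricts attention to a window of source positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalDotAttention(nn.Module):
    """c_t = sum_s a_t(s) * h_bar_s,  h~_t = tanh(W_c [c_t; h_t])."""

    def __init__(self, dim):
        super().__init__()
        self.W_c = nn.Linear(2 * dim, dim, bias=False)

    def forward(self, h_t, h_bar):
        # h_t: (batch, dim) target state; h_bar: (batch, n, dim) source states
        scores = torch.bmm(h_bar, h_t.unsqueeze(2)).squeeze(2)  # (batch, n)
        a_t = F.softmax(scores, dim=-1)                         # alignment weights
        c_t = torch.bmm(a_t.unsqueeze(1), h_bar).squeeze(1)     # (batch, dim)
        return torch.tanh(self.W_c(torch.cat([c_t, h_t], dim=-1)))
```

In the paper, the attentional vector \(\tilde{h}_t\) is then fed through a softmax layer to produce the predictive distribution over the next target word.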

4.6.3.1 Global Attention

4.6.3.2 Local Attention

4.6.3.3 Input-feeding Approach

4.6.4 Experiments

4.6.5 Analysis

4.6.6 Conclusion