Evolution of Neural Machine Translation

Evolution of machine translation from rule-based systems to transformer-based models

  1. Part 1 : Evolution of Neural Machine Translation (NMT) Systems

  2. Part 2 : Pretrained Models for Neural Machine Translation (NMT) Systems

  3. Part 3 : Neural Machine Translation (NMT) using Hugging Face Pipeline

  4. Part 4 : Neural Machine Translation (NMT) using EasyNMT library

📌 Overview of Machine Translation

Machine Translation is an NLP task (specifically an NLG task) that involves translating a text sequence in the source language into a text sequence in the target language.

As shown in the figure, the model translates the sentence from the English language to the Telugu language. Here the source language is English, and the target language is Telugu. Telugu is the fourth most spoken language in India.

The best example of a machine translation (MT) system is Google Translate. As shown in the figure, we can use Google Translate to translate text, a complete document, or even a web page. Google Translate supports more than 120 languages. At the top left, you first select the option (Text, Documents, or Websites) depending on whether you want to translate a piece of text, a document, or a web page. On the left side, you choose the source language, and on the right side, you choose the target language. Once you enter the text, upload the document, or specify the web page URL, you get the translated version in the desired target language.
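To get a quick feel for what an NMT system does in code, here is a minimal sketch using the Hugging Face transformers pipeline (covered in detail in Part 3 of this series). The checkpoint name is an assumption for illustration; the English-to-Telugu pair from the figure would need a model that supports Telugu, such as a multilingual checkpoint.

```python
# Minimal translation sketch (assumes the `transformers` library is installed
# and the Helsinki-NLP/opus-mt-en-hi checkpoint is available; swap in any
# English->target-language checkpoint you prefer).
from transformers import pipeline

# Build a translation pipeline for a specific source/target language pair.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")

# Translate an English sentence into the target language.
result = translator("Machine translation converts text from one language to another.")
print(result[0]["translation_text"])
```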

📌 Evolution of Machine Translation (from Rule-based to Neural Network-based)

The evolution of MT systems started with rule-based systems in the 1950s. Rule-based systems are challenging to develop and maintain because they require framing many rules and exceptions. In the 1990s, these rule-based systems were replaced by Statistical Machine Translation (SMT) systems. Though SMT systems are better than rule-based systems, they are not end-to-end; they rely on statistical models whose parameters are estimated by analyzing a bilingual text corpus.

With the success of deep learning models in other NLP tasks, Neural Machine Translation (NMT) systems based on seq2seq models started to replace SMT systems. Unlike SMT systems, NMT systems are end-to-end, i.e., a single model receives the text sequence to translate and generates the translated text sequence. By the end of 2016, companies like Google had replaced their SMT systems with NMT systems. The figure shows the evolution of NMT systems from 2014 to the present.

A Seq2Seq model consists of an encoder and a decoder. The encoder is a sequential deep learning model such as an RNN, LSTM, or GRU, and the decoder is likewise an RNN, LSTM, or GRU. The encoder processes the input tokens sequentially, and the hidden state vector from the last input time step is treated as the aggregate representation of the entire input sequence.
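A minimal sketch of this encoder-decoder idea in PyTorch is shown below. The vocabulary sizes and dimensions are illustrative assumptions, not values from the article: the encoder GRU reads the source tokens, and its final hidden state is handed to the decoder GRU as the aggregate representation of the input sequence.

```python
# Minimal seq2seq sketch in PyTorch (toy sizes, not a full NMT model).
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1200, 64, 128  # assumed toy sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src_ids):                 # src_ids: (batch, src_len)
        _, last_hidden = self.rnn(self.embed(src_ids))
        return last_hidden                      # (1, batch, HID): the "aggregate" vector

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, tgt_ids, hidden):         # tgt_ids: (batch, tgt_len)
        outputs, hidden = self.rnn(self.embed(tgt_ids), hidden)
        return self.out(outputs), hidden        # token scores at each step

# One forward pass with random token ids, just to show the data flow.
src = torch.randint(0, SRC_VOCAB, (2, 7))       # batch of 2 source sentences
tgt = torch.randint(0, TGT_VOCAB, (2, 5))       # shifted target sentences
context = Encoder()(src)                        # fixed-size summary of the source
logits, _ = Decoder()(tgt, context)             # (2, 5, TGT_VOCAB)
print(logits.shape)
```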

With this aggregate vector as input, the decoder sequentially generates the translated text sequence. The main drawback of this architecture is the use of a fixed vector from the last time step as the aggregate representation: (a) a single fixed vector cannot capture all the information in the input sequence, and (b) at each time step in the decoder, the output token actually depends only on specific input tokens rather than on the whole sequence equally. To overcome this information bottleneck, the attention mechanism was introduced in Seq2Seq models.

The attention layer helps the decoder focus on selective input tokens at each time step. Although the attention mechanism improved results, sequential deep learning models (RNN, LSTM, and GRU) still suffer from inherent drawbacks such as the vanishing gradient problem and the inability to take full advantage of the parallel processing power of advanced hardware like GPUs and TPUs. To overcome these drawbacks, the transformer, based on the self-attention mechanism, was proposed.
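The sketch below illustrates one common form of this idea, dot-product (Luong-style) attention, using the same toy dimensions as the seq2seq sketch above: at a decoder step, the decoder state is scored against every encoder output, and the softmax-weighted sum of encoder outputs becomes the context vector for that step.

```python
# Dot-product attention sketch: one decoder step attending over encoder outputs.
import torch
import torch.nn.functional as F

batch, src_len, hid = 2, 7, 128                     # assumed toy sizes
encoder_outputs = torch.randn(batch, src_len, hid)  # one vector per source token
decoder_state = torch.randn(batch, hid)             # current decoder hidden state

# Score each source position against the decoder state (dot product).
scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)   # (batch, src_len)

# Normalize the scores into attention weights over the source tokens.
weights = F.softmax(scores, dim=1)                  # (batch, src_len)

# Context vector: weighted sum of encoder outputs, recomputed at every decoder step.
context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)        # (batch, hid)
print(weights.shape, context.shape)
```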

In the transformer model, the source text sequence is encoded using a stack of encoder layers. The output vectors from the final encoder layer represent the source tokens enriched with rich contextual information via the self-attention mechanism. The decoder receives these encoder output vectors and, at each time step, generates the target tokens auto-regressively.
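A compact way to see this encode-once, decode-auto-regressively pattern is with PyTorch's built-in nn.Transformer module. The sketch below is only illustrative (toy vocabulary, greedy decoding, positional encodings omitted), not a full translation model.

```python
# Transformer sketch: encode the source once with the encoder stack, then
# generate target tokens one at a time (toy sizes and ids are assumptions).
import torch
import torch.nn as nn

VOCAB, D_MODEL, BOS = 1000, 128, 1               # assumed toy vocabulary and BOS id
embed = nn.Embedding(VOCAB, D_MODEL)
model = nn.Transformer(d_model=D_MODEL, nhead=8,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
to_vocab = nn.Linear(D_MODEL, VOCAB)

src = torch.randint(0, VOCAB, (1, 9))            # one source sentence of 9 tokens
memory = model.encoder(embed(src))               # contextual vectors from the encoder stack

generated = torch.tensor([[BOS]])                # start the target with a BOS token
for _ in range(5):                               # generate a few tokens greedily
    # Causal mask so each position only attends to earlier target tokens.
    tgt_mask = nn.Transformer.generate_square_subsequent_mask(generated.size(1))
    dec_out = model.decoder(embed(generated), memory, tgt_mask=tgt_mask)
    next_id = to_vocab(dec_out[:, -1]).argmax(dim=-1, keepdim=True)
    generated = torch.cat([generated, next_id], dim=1)

print(generated)                                 # BOS followed by 5 generated token ids
```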
