Exploring Language Models for Neural Machine Translation (Part One): From RNN to Transformers

The main aim of this article is to introduce you to language models, starting with neural machine translation (NMT) and working towards generative language models.

For the purposes of this tutorial, even with limited prior knowledge of NLP or recurrent neural networks (RNNs), you should be able to follow along and catch up with these state-of-the-art language modeling techniques.


To follow along, you should have:

  • Basic knowledge of statistics (e.g. you know the basics of concepts like frequency and probability)
  • An understanding of word embeddings
  • The knowledge and ability to train a normal neural network in PyTorch


By the end, you should be able to:

  • Understand the statistical nature of language
  • Build a simple neural machine translation system for your native language
  • Understand how positional encoding comes about
  • Build intuition on how attention mechanisms work

As we all know, language is central to human cognitive reasoning, so any system we call intelligent needs to understand it.

Luckily for us, language is statistically structured; or rather, the way words are distributed in a sentence follows statistical patterns.

For every language that exists, there are ways in which words in a sentence agree with each other. In the English language, we call this concord.

What is Concord?

Since we aren’t trying to explore the ins and outs of linguistics, we just need a few examples to get the general gist.

Of the two sentences below, which is correct?

  • The cats is on the table
  • The cats are on the table

Based on the rules of concord (subject-verb agreement), the second sentence is correct. Since the subject is plural, the verb must also be plural.

However, we have different rules that guide the use of words in different languages. But the major idea is that whenever a particular word is used in a sentence, there is a particular word that is frequently used after it. For example, whenever the word The appears, it’s usually followed by a noun.

This concept—how a word is frequently used after another word—can aid in the statistical modeling of language.

With this basic intuition, you might think it would be easier to build a system that models language without deep learning. We could just build a system that keeps a statistical record (i.e. the frequency with which each word follows another) and extract a result from that record.

If you want to know a bit more about this concept and how it applies to language modeling, check out the following post (but I’d suggest reading this article first before moving to this one):

While this is a start, such a system does not capture the full statistical nature of language. We need a system that can learn these rules on its own and build intuition by itself without explicit teaching.

Building a system that learns the rules of grammar and language representation without explicit teaching or hardcoding is where neural models and deep learning techniques come in.

Deep Learning and Neural Machine Translation

In deep learning, when it comes to NMT, we use a neural network architecture known as a recurrent neural network to model language sequences.

What’s a sequence? Using both the English and mathematical definitions we can capture the meaning:

Here we have the literal meaning. Language is a sequence, music is a sequence—put another way, there’s an order to which you arrange musical notes to form a perfect rhythm, and the same thing applies to words.

And here’s the mathematical definition from Wikipedia:

But we can’t simply pass words into an RNN and expect it to work its magic. Essentially, computers don’t understand any of this gibberish we humans use.

Computers deal with numbers—they don’t see images, nor do they see the words you type.

So for us to interact with the computer, we need to give it what it understands. For this task, each word will be represented by a number.

For example, let’s use the same example as before: The cats are on the table.

For us to convert it to a form the model will accept, we can let:

  • 1 represents the
  • 5 represents cats
  • 3 represents are
  • 4 represents on
  • 6 represents table

It’s important to note that we picked these numbers arbitrarily; any unique number per word would do. Hence, the sentence “The cats are on the table” is now represented by the list of numbers [1,5,3,4,1,6].
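As a minimal sketch of this mapping (the names `word_to_id` and `encode` are illustrative and not part of the project code), the lookup can be written as:

```python
# Illustrative vocabulary: the ids are arbitrary, exactly as in the list above.
word_to_id = {"the": 1, "cats": 5, "are": 3, "on": 4, "table": 6}

def encode(sentence):
    """Lowercase the sentence, split it on spaces, and look up each word's id."""
    return [word_to_id[word] for word in sentence.lower().split(" ")]

encoded = encode("The cats are on the table")
print(encoded)  # [1, 5, 3, 4, 1, 6]
```

Note that "the" appears twice in the sentence and gets the same id both times: the number identifies the word, not its position.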

If you’ll recall, earlier I mentioned that we can’t just build a system that uses merely the frequency of how words follow each other to model language. Such a statistical system would not be detailed enough.

For example, the system takes the form of P(X | Y). This is interpreted as: given a word Y, what’s the probability of word X coming next? If we assume the word Y to be “the”, then the probability of a noun following it will be very high, and the probability of a verb will be very low.
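To make P(X | Y) concrete, here is a rough sketch of such a frequency-based model. The three-sentence corpus is made up for illustration, and the names `follow_counts` and `prob` are assumptions of this sketch rather than anything from the project code:

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, used only to illustrate the counting idea.
corpus = [
    "the cats are on the table",
    "the dog is on the mat",
    "the cats are happy",
]

# follow_counts[y][x] = number of times word x came directly after word y.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split(" ")
    for prev_word, next_word in zip(words, words[1:]):
        follow_counts[prev_word][next_word] += 1

def prob(next_word, prev_word):
    """Estimate P(next_word | prev_word) from the raw counts."""
    total = sum(follow_counts[prev_word].values())
    return follow_counts[prev_word][next_word] / total if total else 0.0

print(prob("cats", "the"))  # 0.4 (a noun after "the" is likely)
print(prob("are", "the"))   # 0.0 (a verb after "the" was never seen)
```

The model only ever looks one word back, which is exactly the limitation discussed next.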

But such a model isn’t good enough for our aim—we want a model that captures things we can’t see and understands patterns.

This is where the RNN architecture comes into play. Specifically, it adds an extra term to the probability function P(X|Y). This term is computed by learnable weights, so it can pick up whatever the model needs to know about the words seen so far.

The term is called the state h. Some call it memory. It describes the previous part of the sequence: you can think of it as a running summary of what the sequence has been about so far.

Hence, the RNN is using the state and the previous input to determine the next output:

Let the blue box represent P(X|Y). Hence, if we take the first word, The, the model will look like this: P(Cats | The).

The model does the same for all other words. For example, P(on|The) will be very low.

Based on the option that gives the highest probability, we get our output, which is Cats. But where does the state h come in?

Well, at the initial stage state h will be zero, given that the model has no prior information. That is, the initial stage is P(Cats |The, h=0), but after the initial stage, state h now has some information to store.

We don’t discard the state h; instead, we pass it to the next computation of the model on the next sequence input—that’s what the line connecting the blue box in the image above does.

Then the next computation will look like this: P(are | Cats, The, h_{t-1}). In this case, we use the previous state and the previous word to compute the next output.

The shared state h makes it possible for the model to understand the language better than the simple statistical model I discussed before.
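The way the state is carried from one word to the next can be sketched with a toy, scalar version of the recurrence. This is only an illustration: the weights `w_x`, `w_h` and `b` are fixed, made-up numbers (a real RNN learns them), and a real model feeds in an embedding vector rather than a raw word id.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrent step: mix the current input with the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

tokens = [1, 5, 3, 4, 1, 6]  # "The cats are on the table" as ids
h = 0.0                      # initial state: no prior information
states = []
for x in tokens:
    h = rnn_step(x, h)       # the new state depends on the input AND the old state
    states.append(h)

# "the" (id 1) appears at positions 0 and 4, yet its state differs,
# because the second occurrence also carries the history before it.
print(states[0], states[4])
```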

To learn more about RNN with a graphical illustration check out this post on Towards Data Science:

The intuition behind sequence-to-sequence (seq2seq) with RNN

If language (words, sentences) is a sequence, then what do we call the translation of one language to another? We call it sequence-to-sequence (seq2seq).

The main aim of this modeling technique is to have a function f that takes in a word in a language X and outputs the corresponding word in the other language Y.

Let’s say we have a language in which the word human is the same as the word animal. We want a function that takes in the word human and outputs the word animal…f(“human”) = “animal”.

We could try to use our simple statistical model, but we now know it won’t be able to capture the desired patterns. Language is more complex than the way it’s described in basic terms.

For example, some words can have different meanings based on various contexts, and there are some languages in which words of the same spelling have multiple and different meanings based on their diacritics (signs or accent marks on words indicating different pronunciations).

So how do we build an effective system that translates words from one language to another?

The image above describes it all. We need to build what’s known as an encoder-decoder model.


An encoder works with a simple concept: we pass the source word (the word we want to translate to another language) into the encoder model, and the encoder compresses it by summarizing what the source word is all about.

When I say summarize here, I don’t mean it in the more traditional sense. Previously, we discussed that the state h keeps a description of previous words in a sequence. What an encoder does is reduce the whole sentence into a state h, which is now given a new name called context.


If you’ll also recall, we noted that at the initial stage, state h is always zero, because it does not have any prior knowledge. But now, instead of starting from zero, the state h in the decoder RNN begins with the context from the encoder.

That is, the initial state h for the decoder is no longer zero but the context state from the encoder, so the decoder starts out with information about the source word.

The decoder model will look like this: P(is | cat, the, h_{t-1}, c). As such, we need the context c, the previous state h, and the previous words in order to predict the next word.

To go deeper into using RNNs for seq2seq, check out the PyTorch tutorial. For now, continue reading; all links will be provided in the reference section.

Shortcomings of seq2seq with RNN

The encoder reduces the whole sequence to a context of fixed length, and the decoder heavily depends on the context from the encoder; hence, the model finds it difficult to deal with longer sentences. We can assume that for longer sentences, the initial context will be lost before the end of the given sequence.

This problem leads us to the concept of attention mechanisms. With this mechanism, instead of summarizing the state, the decoder has access to the states of all the input words (source words).

To recover the context that gets lost while translating, the model must be able to work out which words in the source sentence the decoder should pay more attention to at each step.

Let’s look at an example:

One key question in terms of context: what does the word “it” stand for? Is it the trophy or the bag?

How can we teach our model to decipher which of the words to focus on in order to get what “it” stands for in the sentence? This is the problem attention models attempt to solve.

The original paper provides a more concise explanation of how an attention mechanism is used in NMT.

In the above image, we can see that the word “it” attends more strongly to the phrase “The animal” than to the other words. The model was able to achieve that using an attention mechanism.

Drawbacks of RNNs

The introduction of an attention mechanism to an RNN makes the sequence2sequence model produce better results, but RNNs themselves have two major drawbacks.

First, they’re hard to train. Specifically, RNNs are computationally expensive; hence, they aren’t hardware friendly.

One typical way to reduce this compute burden would be parallelization, but an RNN can’t be parallelized across time steps because it’s a sequential model: it must finish one step before moving on to the next. This is the second major drawback.

These drawbacks led to the introduction of a model called the Transformer, which supports parallelization and therefore trains faster.


The transformer model relies entirely on the attention mechanism, but unlike an RNN, it isn’t sequential. In this section, we will discuss the major working principles of the transformer architecture.

The transformer model architecture is like the encoder-decoder model we discussed previously, except that this time we aren’t using an RNN but linear layers and attention.

Let’s try to build our own small neural machine translator with a transformer, and along the way, we’ll learn the ins and outs of the architecture.

Project Base

First, we need to preprocess our dataset for use by the transformer model. To do this, we go through the steps listed below:

  • Creating a word vocabulary
  • Tokenizing the text
  • Padding the sequences
  • Converting to a PyTorch dataset

Before doing anything, we need to get our dataset. For this project, I used the Jehovah’s Witnesses Bible in English and Yoruba.

Apart from that, there’s a free online corpus of a variety of different languages with the corresponding translations. Check out this website called OPUS (open parallel corpus) for more (thanks to Duru Goodness for the recommendation).

The code for this article is framed in a way that any paired language data can be used without much adjustment.

After getting the language pair (in our case English and Yoruba), we want to then build a word vocab for both languages.

We only use the en and yo columns for training.

class Lang:
    def __init__(self, name):
        self.name = name
        # indices 0-2 are reserved for padding and the sequence markers
        self.word2index = {"<blank>":0, "SOS":1,"EOS":2}
        self.word2count = {}
        self.index2word = {0:"<blank>", 1: "SOS", 2: "EOS"}
        self.n_words = 3  # Count <blank>, SOS and EOS

    def addSentence(self, sentence):
        # register every space-separated word of the sentence
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            # new word: give it the next free index
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            # known word: only update its frequency
            self.word2count[word] += 1

Thanks to the PyTorch seq2seq tutorial for the utility class.

The Lang class will be instantiated for each of the languages we’re using (English and Yoruba).

self.word2index maps each word to its numerical representation; hence, it serves as a lookup table for building the sequence for each sentence.

We also have self.index2word, which converts the index (numerical representation) back to its corresponding words. This is needed to convert the sequence outputted by the model to a sentence.

In self.index2word and self.word2index, you can see the pre-defined items:

  • SOS means start of sentence—we use it to denote the beginning of a sequence.
  • EOS means end of sequence—used to denote where the sequence ends.

Why do we need all this? Not every sentence has the same length, and we want to know where the sequence starts and ends using a common notation. This helps the model know when to stop translating.

And since sentences differ in length, we have to make them the same length before passing them to the model. Specifically, we can’t pass varying-length inputs to the model, so the maximum length needs to be pre-defined.

To do that, we pad shorter sentences with 0 until they reach the maximum length. That’s why we need the pre-defined item <blank>: it shows that whatever follows the EOS in a sequence isn’t useful.

addSentence is called on each sentence in our dataset; it splits the sentence into words and registers each one using addWord. During this process, we update self.word2index and self.index2word.

We can now process the languages. But first, we pair the English and the Yoruba sentences using zip:

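zdata = list(zip(data["en"].values,data["yo"])) ## zip the two languages together

## create lang
eng_lang = Lang("en")
yor_lang = Lang("yor")

## add every sentence of each pair to its language vocabulary
for eng, yor in zdata:
    eng_lang.addSentence(eng)
    yor_lang.addSentence(yor)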

Before adding each of the words to our word vocab, we have the option of normalizing them. Some languages (like Yoruba) have diacritics like this:

We can normalize them:

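import unicodedata
def unicodeToAscii(s):
    # NFD splits each accented character into a base letter plus combining marks;
    # dropping category 'Mn' (nonspacing marks) removes the accents
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )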

We pass in the string to the function to remove the signs and accent marks on each word.
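A quick usage sketch follows. The function is repeated under the hypothetical name `strip_diacritics` only so that the snippet runs on its own; the example word is chosen for illustration.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters (NFD), then drop the combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_diacritics("Yorùbá"))  # Yoruba
```

Text without accent marks passes through unchanged, so the function is safe to apply to the English side as well.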

We can also do cool stuff with this idea—for instance, we could build a system that converts normalized text in your language to un-normalized (diacritic) text.


To get the sequence, we need to tokenize each sentence. The most basic form of tokenization is to split a sentence on the spaces between its words.

Remember that each of these words has its numeric representation in the self.word2index. Our expected output will be:

And here’s the code:

def tokenize(lang,sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]

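This tokenization is for the source sentence, which is in English. To tokenize the target language, we use a second function, tokenize2:

def tokenize2(lang,sentence):
    ## wrap the target tokens in the SOS and EOS markers
    return [lang.word2index["SOS"]] + [lang.word2index[word] for word in sentence.split(' ')] + [lang.word2index["EOS"]]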

The start-of-sequence (SOS) and end-of-sequence (EOS) are added to the target text.
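To show what the two kinds of output look like, here is a self-contained sketch with a hypothetical miniature vocabulary. The dictionary and the names `tokenize_source` and `tokenize_target` are stand-ins for illustration, not the project's own objects:

```python
# Hypothetical miniature vocabulary following the Lang layout: 0, 1, 2 are reserved.
word2index = {"<blank>": 0, "SOS": 1, "EOS": 2, "the": 3, "cats": 4, "are": 5}

def tokenize_source(sentence):
    """Source side: plain lookup of each space-separated word."""
    return [word2index[w] for w in sentence.split(" ")]

def tokenize_target(sentence):
    """Target side: the same lookup, wrapped in SOS ... EOS markers."""
    return [word2index["SOS"]] + tokenize_source(sentence) + [word2index["EOS"]]

print(tokenize_source("the cats are"))  # [3, 4, 5]
print(tokenize_target("the cats are"))  # [1, 3, 4, 5, 2]
```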

Now we can run the tokenize function on the data pair:

import numpy as np

pairs2 = np.array(zdata) ## convert the pairs to numpy array

input_token = list(map(lambda x: tokenize(eng_lang,x),pairs2[:,0]))

output_token = list(map(lambda x: tokenize2(yor_lang,x),pairs2[:,1]))


After generating the sequence needed from tokenization, let’s pad the sequence to the same length:


MAX_LENGTH = 100  ## maximum number of tokens kept per sequence

input_tokenPad = np.zeros((len(input_token),MAX_LENGTH))
output_tokenPad = np.zeros((len(output_token),MAX_LENGTH))

for i,v in enumerate(input_token):
    ## anything beyond MAX_LENGTH is truncated
    for j, token in enumerate(v[:MAX_LENGTH]):
        input_tokenPad[i,j] = token

for i,v in enumerate(output_token):
    for j, token in enumerate(v[:MAX_LENGTH]):
        output_tokenPad[i,j] = token

First, we defined the maximum length (MAX_LENGTH) per sequence as 100 here—just a choice of mine. This means that if a sequence is more than 100 it will be truncated, and this is bad since some information will be lost.

To prevent truncation, we can easily define the maximum length from the sequence:

You can do the same thing for the input_token. That is, you can define a maximum length for each. And you can choose a number that’s greater than both and use it as the general MAX_LENGTH.
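A minimal sketch of deriving the length from the data, with small stand-in lists in place of the real input_token and output_token (the names `max_in`, `max_out` and `padded` are illustrative):

```python
# Stand-in token lists; in the article these would be input_token and output_token.
input_token = [[4, 9, 7], [5, 6]]
output_token = [[1, 8, 3, 6, 2], [1, 7, 2]]

# Longest sequence on each side, then one shared length that fits both.
max_in = max(len(seq) for seq in input_token)
max_out = max(len(seq) for seq in output_token)
MAX_LENGTH = max(max_in, max_out)

# Pad with 0 (the <blank> index) up to MAX_LENGTH.
padded = [seq + [0] * (MAX_LENGTH - len(seq)) for seq in input_token]
print(MAX_LENGTH, padded)  # 5 [[4, 9, 7, 0, 0], [5, 6, 0, 0, 0]]
```

The trade-off is memory: one unusually long sentence makes every row that long, so a fixed cap can still be the more practical choice on a large corpus.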

The MAX_LENGTH and the length of the input and output token arrays are used to create a zero matrix, which will be used to store the tokens.

We loop through the tokens and assign them to the corresponding index in the zeros matrix.

PyTorch Dataset

The data is ready to be processed into the PyTorch dataset.

  • For this project, I take a subset of the data for training, then I split that into training and validation sets.
  • In splitting the dataset, we take 20 percent of the data as validation data; the data is shuffled before the split.
from sklearn.model_selection import train_test_split

input_tokenPad1 = input_tokenPad[:20000]
output_tokenPad1 = output_tokenPad[:20000]

train_eng, valid_eng,train_yor,valid_yor = train_test_split(input_tokenPad1,output_tokenPad1,test_size=0.2,shuffle=True)
  • We convert the dataset to a TensorDataset, casting the sequences to long (integer) tensors rather than floats.
  • We then create a Dataloader with a batch size of 100.
import torch
from torch.utils.data import TensorDataset, DataLoader

train_data = TensorDataset(torch.from_numpy(train_eng).long(),torch.from_numpy(train_yor).long())
valid_data = TensorDataset(torch.from_numpy(valid_eng).long(),torch.from_numpy(valid_yor).long())

batch_size = 100

train_loader= DataLoader(train_data,shuffle=True,batch_size=batch_size,)
valid_loader =DataLoader(valid_data,shuffle=True,batch_size=batch_size,)

In this section, we’ve discussed the basic NLP processes (such as word vocabulary, tokenization, and padding) for structuring our dataset for use by the model.

Although there exist more advanced mechanisms for tokenization, such as byte pair encoding, I won’t discuss them here. I plan on exploring these in the second part of this series.

Having set up the dataset into a proper structure for use by the model, let’s dive into the working principles of the transformer.

Next, we’ll be discussing the core components of a transformer model.

Scaled Attention

To talk about scaled attention, it’ll be useful to quickly define queries, keys, and values. We’ve all used Google to search the internet. To search for information, you type in a query containing some relevant keys necessary for the search engine to give you the result you want, which is the value you desired.

But for the search engine to fetch the right website containing the value you want, it needs to measure how relevant each website is to the keys in your query. It does this by weighting the value each website offers by how well it matches each key in your query.

The illustration above shows how scaled dot-product attention works. The Q, K, and V in the image above represent query, key, and value.

In this section, we’ll dive into the working principles of scaled dot-product attention.

There are a few things the model (encoder-decoder) must pay attention to:

  • Attention between the input and the output tokens
  • Attention between the input tokens
  • Attention between the output tokens

The attention between the input (or output) tokens is called self-attention.

But now, instead of just using the previous word, we want to know which of the previous words affects the next word the most.

In the previous example (The trophy can’t fit into the bag because it is too big), to obtain the meaning of “it” in the sentence, we need to pay attention to the previous words, and the most relevant word to pay attention to is “trophy”. This will aid in the translation of the sentence to another language.

The same thing occurs in the decoder portion of the network while translating—here, we also need to pay attention to the previous word being generated, to correctly construct the text being translated.

In the image Q stands for query, k stands for key, while v stands for value. These entities can all come from the same source.

So how do we come by the query, key, and value? They are generated from the embedding layer.

For self-attention, they all come from the embedding layer’s interaction with a linear layer as illustrated above. The query and the key are multiplied together, and then softmax is used to produce the attention weight.

From the image, we just use three words out of the sentence—normally the query, key, and value have numbers of rows equal to the sequence length, and the number of columns is the number of states. Put another way, we can say that the query, key, and value contain the state change for each of the words in the sequence.

We multiply the query state by the key state and pass the product through a softmax function, which gives us the relevance score of each word relative to the others.

From the image, we pinpoint that for the word “it”, the array contains 12 items that represent the number of words in the whole sentence: “The trophy can’t fit into the bag because it is too big”.

Each column in the array gives the probability score (also called a relevance score) for each word with respect to the word it. And we can see that trophy has the highest score.

One step I skipped: before the softmax, the product is divided by the square root of the query/key dimension. This keeps the scores on a stable scale, with a mean close to 0 and a standard deviation close to 1, instead of growing with the dimension. In simple terms, it prevents the explosion of the product value.

The output of the softmax is then multiplied by the value state, which gives the summary state of each word based on its relevance score to the other words.
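Put together, these steps match the standard scaled dot-product attention formula from the original Transformer paper, where d_k is the number of columns in the query and key matrices:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```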

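import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None, dropout=0.0):
  # query, key, value: tensors of shape (..., seq_len, d_k)
  d_k = query.size(-1)
  scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
  if mask is not None:
    # a large negative score makes the softmax weight of masked positions ~0
    scores = scores.masked_fill(mask == 0, -1e9)
  p_attn = F.softmax(scores, dim=-1)
  p_attn = F.dropout(p_attn, p=dropout)
  return torch.matmul(p_attn, value), p_attn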

Before we move on to multi-head attention, we need to quickly discuss masking.

Masking is necessary for two things. First, we need to prevent calculating a score for the blank space in the sequence. Recall that each sentence is padded in order to prevent inputting a sequence of varying length to the model:

We don’t want to calculate the softmax probability for the zeros, so we mask them out:

Before the softmax, the score at every padded position is replaced with a very large negative number (-1e9), so its softmax probability is effectively zero. We avoid float(‘-inf’) here because a row in which every position is masked would then come out as nan, and we can’t pass nan through the model.
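To see why the fill value matters, here is a small pure-Python sketch comparing three choices: a large negative fill (-1e9), a tiny negative fill (-1e-9), and negative infinity. The helper names `softmax` and `masked` are made up for this illustration and are not part of the model code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked(scores, mask, fill):
    """Replace scores at positions where mask == 0 with the fill value."""
    return [s if keep else fill for s, keep in zip(scores, mask)]

scores = [2.0, 1.0, 0.5]
mask = [1, 1, 0]  # the last position is padding

good = softmax(masked(scores, mask, -1e9))    # padding weight underflows to 0.0
leaky = softmax(masked(scores, mask, -1e-9))  # padding still receives real weight
all_pad = softmax(masked(scores, [0, 0, 0], float("-inf")))  # every entry is nan

print(good)
print(leaky)
print(all_pad)
```

A fill of -1e-9 is almost exactly zero, so it behaves like an ordinary score and the padded position keeps a share of the attention; only a large negative value actually removes it.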

The second reason why we use masking is that we don’t want to see all the words at once. As such, we create what’s known as a subsequent mask, which allows us to expose target words to the model one at a time — or “subsequently”.

For example, let’s say the phrase “The trophy can’t fit” is the target. With subsequent masking, we’ll have:

But since we’re using an integer sequence, we represent it like this:

This helps the decoder leverage self-attention. Remember that to predict the next word we need the previous words, and this mask lets the model work out which of those previous words to pay attention to.
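As a sketch of what such a mask looks like, the pure-Python helper below (the name `subsequent_mask` and the list-of-lists layout are choices made for this illustration) builds a lower-triangular pattern in which row i can only see positions 0 to i:

```python
def subsequent_mask(size):
    """Row i has 1s for positions 0..i (visible) and 0s for later positions (hidden)."""
    return [[1 if col <= row else 0 for col in range(size)] for row in range(size)]

target = ["The", "trophy", "can't", "fit"]
mask = subsequent_mask(len(target))
for row in mask:
    visible = [word for word, keep in zip(target, row) if keep]
    print(row, visible)
# [1, 0, 0, 0] ['The']
# [1, 1, 0, 0] ['The', 'trophy']
# [1, 1, 1, 0] ['The', 'trophy', "can't"]
# [1, 1, 1, 1] ['The', 'trophy', "can't", 'fit']
```

In PyTorch, the same pattern can be produced with torch.tril(torch.ones(size, size)).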

For proper translation, the decoder still needs to pay attention to the source text. Hence, the model also performs attention between the decoder and the encoder: the query comes from the decoder, while the keys and values come from the encoder.

The attention we’ve discussed so far is just a single projection. But what if we ran the attention mechanism over several different projections? By doing so, we’d be able to capture more kinds of relationships between words.

The transformer model makes use of multi-head attention to perform scaled dot-product attention over different projections. The number of projections can be specified, and each one is called a head.

An overview of the multi-head attention approach:

  • Different states are being created, each containing its own query, key, and value
  • We pass each state through the scaled dot attention mechanism independently
  • The outputs of all the states from the scaled dot attention are concatenated into a single state
  • The concatenated output is passed through a linear layer
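
import copy
import torch.nn as nn

def clones(module, N):
    # helper not shown in the article; this version follows The Annotated Transformer
    # and returns N independent copies of a layer
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class MultiHeadedAttention(nn.Module):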

  def __init__(self, h, d_model, dropout=0.1):
    super(MultiHeadedAttention, self).__init__()
    assert d_model % h == 0

    self.d_k = d_model // h
    self.h = h
    self.p = dropout
    self.linears = clones(nn.Linear(d_model,d_model),4)
    self.attn = None

  def forward(self, query, key, value, mask=None):

    if mask is not None:
      mask = mask.unsqueeze(1)

    nbatches = query.size(0)

    query, key, value = [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1,2)
                        for l,x in zip(self.linears, (query, key, value))]
    # keep the attention weights on the module so they can be inspected later
    x, self.attn = attention(query, key, value, mask=mask, dropout=self.p)

    x = x.transpose(1,2).contiguous().view(nbatches, -1, self.h * self.d_k)

    return self.linears[-1](x)

Positional Encoding

RNNs have a built-in notion of time steps, which captures the order of words within a sequence. When we use a transformer, we lose this order.

This is bad news because we won’t be able to capture how words follow each other, and without this, we can’t capture their state properly. And if the states are not captured well, the output will be gibberish.

To ensure the state is well-captured, we need the order in the sequence. To ensure this, we use a technique known as positional encoding.

If you’re coming from a background in machine learning using tabular data, you’ll know that a key feature engineering technique is converting dates, months, and hours into something that captures orderliness.

Dates, months, and days of the week are cyclical data. Using label encoding, we lose the information about this cyclical property.

For example, this is what it would look like to encode days of the week:

Look at the way they’re structured: the difference between consecutive days is simply 1. This looks okay at first glance, since there’s order here.

But look at Sunday and Saturday—the difference between their encodings is 6. That would represent that Sunday is actually far from Saturday, but we of course know that Sunday is just the next day after Saturday. But the encoding method we used here didn’t capture the cyclic nature of days of the week.

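The same thing happens with times of day. The difference between 22:00 and 23:00 is just one hour, but if you consider 23:00 and 0:00, the difference in the encoding becomes 23 hours, when in reality it’s just one hour.

In machine learning, we use sine and cosine together to capture this cyclic nature. Sine alone is ambiguous, because two different hours can share the same sine value, so we pair it with cosine: together, the two values identify a unique point on a circle. In positional encoding, the two functions alternate across the positions in the embedding: counting from 1, sine fills the odd positions (1, 3, 5, etc.) and cosine fills the even positions (2, 4, 6, etc.).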

Check out the image below for an illustration of this:

Draw the sine:

And if we join the sine and cosine dataset together we can see that it captures the cyclic property of the data:
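The effect is easy to check numerically. In this sketch (the helper names `encode_hour` and `distance` are illustrative), each hour becomes a point on the unit circle, and the distance between 23:00 and 0:00 comes out the same as between 22:00 and 23:00:

```python
import math

def encode_hour(hour, period=24):
    """Map an hour to a point (sin, cos) on the unit circle."""
    angle = 2 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Straight-line distance between two encoded hours."""
    (s1, c1), (s2, c2) = encode_hour(a), encode_hour(b)
    return math.hypot(s1 - s2, c1 - c2)

print(distance(22, 23))  # one hour apart
print(distance(23, 0))   # also one hour apart, same distance
print(distance(0, 12))   # opposite sides of the clock, the largest distance
```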

If you’re interested in how this can improve your machine learning model, check out the full tutorial here.

Now let’s transfer this intuition over to the transformer architecture. Remember we’re using word embeddings, which create an embedding matrix for each word:

The embedding dimension sets the size of the vector that represents each word. Let’s say we use an embedding dimension of 100; then each word will be represented by a vector of shape (1,100).

Next we want to encode position across the embedding dimension. In the hour example, we only use a single dimension:

For positional encoding, we create the same matrix as the embedding layer:

Remember, sine fills the odd positions and cosine the even ones:

With this, we fill in the odd and even position across each dimension with the encoding generated by the equation below:

This is then added to the word embedding to capture the position of words across all the dimensions.
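For reference, here is a pure-Python sketch of the sinusoidal encoding. It assumes the formulation from the original Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), with dimensions indexed from 0; the function name `positional_encoding` and the list-of-lists layout are choices made for this illustration.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)          # 1st, 3rd, 5th ... dimension
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)  # 2nd, 4th, 6th ... dimension
    return table

pe = positional_encoding(max_len=4, d_model=6)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Each row of the table is added element-wise to the embedding of the word at that position, so two occurrences of the same word end up with different vectors.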


We’ve discussed some of the major techniques of the transformer model and why it can replace RNNs for sequence2sequence models.

Some of the key points to remember are:

  • The common NLP processes for language models are word vocabulary, tokenization, and padding.
  • RNN is sequential, which makes it hard to parallelize.
  • It’s best not to remove the diacritics (accent marks, etc.) of the language, in order to capture its tone.
  • For self-attention, the query, key, and value come from the same source (embedding layer). This is true except for the attention mechanism between decoder and encoder, in which the query is from the decoder embedding layer and the keys and values are from the encoder.
  • Heads in multi-head attention signify the number of projections over which scaled dot-product attention is run.
  • Positional encoding gives the transformer the word-order information that an RNN gets from its time steps.

Output example:

One amazing thing about the model is that, even without being given a single standalone word during training, it was able to translate a single word.

To get a better result from training the model, there exist some other techniques that aren’t actually part of the Transformer model architecture, but they help in training language models.

Let’s wrap it up here. In the next article, we’ll discuss how to create a language model like GPT-2 from the same transformer we used for neural machine translation in this tutorial.
