In this tutorial, we’ll build a Recurrent Neural Network (RNN) in PyTorch that will classify people’s names by their languages. We assume that the reader has a basic understanding of PyTorch and machine learning in Python.
By the end of this tutorial, we’ll be able to predict the language a name comes from based on its spelling. The dataset of names used in this tutorial can be downloaded here.
This tutorial has been adapted from PyTorch’s official docs; check out more about the implementation there.
Data Pre-processing
As is the case with any machine learning task, we’ll kick off by loading and preparing our dataset. Upon downloading the dataset, we notice that there’s a folder called names inside the data folder. It contains text files with surnames in eighteen different languages.
In order to load all the files in one go, we’ll use the Python glob module. The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell; results are returned in an arbitrary order. We’ll use it to load all the files in the folder that end with .txt.
import glob
all_text_files = glob.glob('data/names/*.txt')
print(all_text_files)
Currently the names are in Unicode format, so we have to convert them to plain ASCII. This removes the diacritics from the words; for example, the French name Béringer becomes Beringer.
import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

def unicode_to_ascii(s):
    # Decompose accented characters with NFD, then drop the combining marks ('Mn')
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicode_to_ascii('Béringer'))
In the next step, we create a dictionary with a list of names for each language.
category_languages = {}
all_categories = []

def readLines(filename):
    # Read a file, split it into lines, and convert each name to ASCII
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicode_to_ascii(line) for line in lines]

for filename in all_text_files:
    category = filename.split('/')[-1].split('.')[0]
    all_categories.append(category)
    languages = readLines(filename)
    category_languages[category] = languages

no_of_languages = len(all_categories)
print('There are {} languages'.format(no_of_languages))
We can view the first fifteen names in the French list, as shown below.
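Assuming the French file was loaded under the key 'French', a simple slice does the job:
print(category_languages['French'][:15])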
Turning the Names into PyTorch Tensors
When working with data in PyTorch, we have to convert it to PyTorch tensors, which are very similar to NumPy arrays. In our case, we have to convert each letter into a torch tensor. This will be a one-hot vector filled with 0s except for a 1 at the index of the current letter. Let’s show how this is done and then convert the letter M to a one-hot vector.
import torch

def letter_to_tensor(letter):
    # One-hot vector of shape (1, n_letters) with a 1 at the letter's index
    tensor = torch.zeros(1, n_letters)
    letter_index = all_letters.find(letter)
    tensor[0][letter_index] = 1
    return tensor

def line_to_tensor(line):
    # Stack the one-hot letter vectors into a (line_length, 1, n_letters) tensor
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        letter_index = all_letters.find(letter)
        tensor[li][0][letter_index] = 1
    return tensor
In order to represent a whole name, we join its one-hot letter vectors into a single tensor. Note that line_to_tensor returns a 3D tensor of shape (line_length, 1, n_letters); the extra middle dimension of size 1 is the batch dimension that PyTorch expects.
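For example, here is the letter M as a one-hot vector, along with the size of the tensor for a whole name ('Mwangi' is just an illustrative example):
print(letter_to_tensor('M'))            # a 1 x 57 one-hot row vector
print(line_to_tensor('Mwangi').size())  # torch.Size([6, 1, 57])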
Building the RNN
When creating a neural network in PyTorch, we subclass torch.nn.Module, the base class for all neural network modules. torch.autograd provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions. torch.nn.LogSoftmax applies the log(softmax(x)) function to an n-dimensional input tensor.
import torch.nn as nn
from torch.autograd import Variable

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        # Both layers take the current input concatenated with the previous hidden state
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def init_hidden(self):
        return Variable(torch.zeros(1, self.hidden_size))
Testing the RNN
We kick off by creating an instance of the RNN class and passing in the arguments as required.
n_hidden = 128
rnn = RNN(n_letters, n_hidden, no_of_languages)
We’d like the network to give us the probability of each language. To achieve this, we pass in the tensor for a single letter together with an initial hidden state.
input = Variable(letter_to_tensor('D'))
hidden = rnn.init_hidden()
output, next_hidden = rnn(input, hidden)
print('output.size =', output.size())
input = Variable(line_to_tensor('Derrick'))
hidden = Variable(torch.zeros(1, n_hidden))
output, next_hidden = rnn(input[0], hidden)
print(output)
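In both cases, the output is a 1 x 18 tensor of log-probabilities, one entry per language. Note that when processing a whole name, it is more efficient to build the line tensor once with line_to_tensor and index into it, rather than creating a new tensor for every letter.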
Training the RNN
The output holds the log-likelihood of each category. To interpret it, we use Tensor.topk to get the index of the greatest value, i.e. the most likely category.
def category_from_output(output):
    # topk(1) returns the largest value and its index
    top_n, top_i = output.data.topk(1)
    category_i = top_i[0][0].item()
    return all_categories[category_i], category_i

print(category_from_output(output))
Next, we need a quick way to get a random training example: a name together with the language it belongs to.
import random

def random_training_pair():
    category = random.choice(all_categories)
    line = random.choice(category_languages[category])
    category_tensor = Variable(torch.LongTensor([all_categories.index(category)]))
    line_tensor = Variable(line_to_tensor(line))
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = random_training_pair()
    print('category =', category, '/ line =', line)
The next step is to define the loss function and create an optimizer that will update the parameters of the model according to its gradients. We also specify a learning rate for our model.
criterion = nn.NLLLoss()
learning_rate = 0.005
optimizer = torch.optim.SGD(rnn.parameters(), lr=learning_rate)
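As an aside, nn.NLLLoss is designed to pair with the LogSoftmax layer at the end of our network: the two together compute the same quantity as nn.CrossEntropyLoss applied to raw scores. A quick illustrative check with made-up numbers:
# Hypothetical sanity check, not part of the original tutorial
logits = torch.randn(1, no_of_languages)  # fake raw scores
target = torch.tensor([0])                # fake target class
log_probs = nn.LogSoftmax(dim=1)(logits)
print(torch.allclose(nn.NLLLoss()(log_probs, target),
                     nn.CrossEntropyLoss()(logits, target)))  # True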
Next, we define a function that runs a name through the network, compares the final output to the target category, and back-propagates the resulting loss.
def train(category_tensor, line_tensor):
    rnn.zero_grad()
    hidden = rnn.init_hidden()
    # Feed the name through the network one letter at a time, carrying the hidden state
    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)
    loss = criterion(output, category_tensor)
    loss.backward()
    optimizer.step()
    return output, loss.item()
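Before launching the full training loop, it’s worth sanity-checking train on a single random example (an illustrative smoke test, not part of the original tutorial):
category, line, category_tensor, line_tensor = random_training_pair()
output, loss = train(category_tensor, line_tensor)
print('%s -> %s (loss %.4f)' % (line, category_from_output(output)[0], loss))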
The next step is to run several examples using this train function as we keep track of the losses for later plotting.
import time
import math

n_epochs = 100000
print_every = 5000
plot_every = 1000

current_loss = 0
all_losses = []

def time_since(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for epoch in range(1, n_epochs + 1):
    category, line, category_tensor, line_tensor = random_training_pair()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    if epoch % print_every == 0:
        guess, guess_i = category_from_output(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (epoch, epoch / n_epochs * 100, time_since(start), loss, line, guess, correct))

    if epoch % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0
Plotting the Results
We plot the losses using Matplotlib’s pyplot. The plot shows how the network’s loss decreased over the course of training.
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline
plt.figure()
plt.plot(all_losses)
Evaluating the Results
We’ll create a confusion matrix in order to see how the network performed on different categories. Bright spots off the main diagonal show the languages the network guessed incorrectly.
confusion = torch.zeros(no_of_languages, no_of_languages)
n_confusion = 10000

def evaluate(line_tensor):
    hidden = rnn.init_hidden()
    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)
    return output

for i in range(n_confusion):
    category, line, category_tensor, line_tensor = random_training_pair()
    output = evaluate(line_tensor)
    guess, guess_i = category_from_output(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1

for i in range(no_of_languages):
    confusion[i] = confusion[i] / confusion[i].sum()

fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)
ax.set_xticklabels([''] + all_categories, rotation=90)
ax.set_yticklabels([''] + all_categories)
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
plt.show()
Predicting New Names
We’ll define a function that will take in a name and return the likely languages the name is from.
def predict(input_line, n_predictions=3):
    print('\n> %s' % input_line)
    output = evaluate(Variable(line_to_tensor(input_line)))
    # Get the top n_predictions categories and their log-likelihood values
    topv, topi = output.data.topk(n_predictions, 1, True)
    predictions = []
    for i in range(n_predictions):
        value = topv[0][i]
        category_index = topi[0][i]
        print('(%.2f) %s' % (value, all_categories[category_index]))
        predictions.append([value, all_categories[category_index]])
    return predictions
predict('Austin')
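Feel free to try other names as well; any names will do, for example:
predict('Dovesky')
predict('Jackson')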
Conclusion
If you’d like to learn more about PyTorch, there are plenty of tutorials in its official docs. If you’re looking to learn more about RNNs, Brian Mwangi has written a fantastic guide as well.
Discuss this post on Hacker News and Reddit.