There are various techniques for handling text data in machine learning. In this article, we’ll look at one such technique: working with word embeddings in Keras. For a deeper introduction to Keras, refer to this tutorial:
We’ll use the IMDB Reviews dataset for this tutorial. In an earlier tutorial, we worked with the same dataset using other techniques, such as the bag-of-words model. Here, we’ll use a technique known as word embedding.
Word embedding is a technique for representing the words in a document as dense vectors of real numbers. Words that are semantically similar are mapped close to each other in the vector space. There are pre-trained word embedding models ready for us to use, such as Word2Vec and GloVe. However, in this tutorial, we’re going to use Keras to train our own word embedding model.
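To make the idea of “closeness” concrete, here’s a minimal sketch that measures the similarity between word vectors with cosine similarity. The three-dimensional vectors below are invented purely for illustration; real embeddings are learned from data and usually have tens to hundreds of dimensions.

import numpy as np

# Toy 3-dimensional embeddings, invented purely for illustration.
embeddings = {
    'good':  np.array([0.9, 0.1, 0.3]),
    'great': np.array([0.8, 0.2, 0.4]),
    'awful': np.array([-0.7, 0.9, -0.2]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: values near 1 mean "similar".
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings['good'], embeddings['great']))  # high (~0.98)
print(cosine_similarity(embeddings['good'], embeddings['awful']))  # much lower (negative)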
Let’s get the ball rolling by importing our dataset and checking its head.
import pandas as pd
import numpy as np
# Tab-separated file: the review text and a 0/1 sentiment label.
df = pd.read_csv('imdb_labelled.txt', delimiter='\t', engine='python',
                 quoting=3, names=['Review', 'Status'])
df.head()
Next, we’ll import a couple of modules that we’ll need for this exercise.
- array from NumPy to convert the dataset to NumPy arrays
- one_hot to encode the words as lists of integers
- pad_sequences to pad the sentence sequences to the same length
- Sequential to initialize the neural network
- Dense to add fully connected layers to the neural network
- Flatten to flatten the embedding output into a single dimension
- Embedding to implement the embedding layer
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
Next, we create variables with the reviews and the labels.
docs = df['Review']
labels = array(df['Status'])
We’ll use scikit-learn to split our dataset into a training set and a test set. We’ll train the word embedding model on 80% of the data and test it on 20%.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size=0.20)
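By default, train_test_split shuffles and splits the data differently on every run, so the exact review printed in the next step may vary. If you’d like the split to be reproducible, you can optionally fix the random seed, a minor variation on the call above:

# Optional: fix the random seed so the train/test split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.20, random_state=42)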
Let’s now look at one of the reviews. We’ll compare this sentence with its transformed versions as we move through the tutorial.
print(X_train[1])
We already know that machine learning models only accept numerical data. Therefore, in the next step, we’ll use one_hot to convert the words into sequences of integers (a quick toy example follows the argument list below). The one_hot function expects the following arguments:
- text, the text to be converted
- n, the size of the vocabulary
- filters, the characters to filter out, such as punctuation marks
- lower, whether the text should be converted to lowercase
- split, the character on which the text is split
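Before applying it to our reviews, here’s a minimal sketch of what one_hot returns for a toy sentence (the sentence and the vocabulary size of 50 are made up for illustration). Under the hood, one_hot hashes each word to an integer, so two different words can occasionally collide onto the same number.

from keras.preprocessing.text import one_hot

# Toy sentence encoded as integers in the range [1, 50).
# The exact numbers depend on the hash, so your output may differ.
print(one_hot('the movie was surprisingly good', 50))

With that in mind, let’s encode the training and test reviews with a vocabulary size of 500: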
vocab_size = 500
X_train = [one_hot(d, vocab_size, filters='!"#$%&()*+,-./:;<=>?@[]^_`{|}~', lower=True, split=' ') for d in X_train]
X_test = [one_hot(d, vocab_size, filters='!"#$%&()*+,-./:;<=>?@[]^_`{|}~', lower=True, split=' ') for d in X_test]
Let’s now take a look at what our sentence looks like.
print(X_train[1])
Let’s also take a moment to look at a different sentence in its numerical form. The integer representation is generated by the one_hot function, with each number standing in for one word of the original sentence.
print(X_train[5])
We can clearly see that the sentences are of different lengths. Keras expects inputs of the same length, so we’ll use pad_sequences from Keras to make them uniform. It does this by adding zeros before or after a sentence’s integer representation (a toy example follows the list below). It takes the following arguments:
- sequences, the list of integer lists we just created
- maxlen, the maximum length of all the sequences
- padding, which indicates whether to add the zeros before ('pre') or after ('post') the sequence
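For intuition, here’s a tiny sketch of the difference between 'pre' and 'post' padding on two made-up integer sequences:

from keras.preprocessing.sequence import pad_sequences

# Two toy sequences of different lengths.
toy = [[3, 7, 12], [5, 9]]

print(pad_sequences(toy, maxlen=4, padding='pre'))
# [[ 0  3  7 12]
#  [ 0  0  5  9]]
print(pad_sequences(toy, maxlen=4, padding='post'))
# [[ 3  7 12  0]
#  [ 5  9  0  0]]

We’ll pad our reviews at the front, with a maximum length of 25: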
max_length = 25
X_train = pad_sequences(X_train, maxlen=max_length, padding='pre')
X_test = pad_sequences(X_test, maxlen=max_length, padding='pre')
Looking at both sequences again, we can see that they are now the same length, with zeros added at the beginning.
print(X_train[1])
print(X_train[5])
Let’s now start creating the model. We initialize it using Sequential and then add the embedding layer. This layer takes the following parameters:
- input_dim, the size of the vocabulary
- output_dim, the size of the dense embedding vectors
- input_length, the length of the input sequences
The next thing we do is flatten the output of the embedding layer before passing it to the dense layer. When compiling the model, we use the Adam optimizer and binary cross-entropy, since this is a binary classification problem. We then print the model summary and fit the model to our dataset.
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
print(model.summary())
model.fit(X_train, y_train, epochs=20, verbose=0)
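As an optional variation, you can also pass the test set to fit as validation data so that Keras reports the validation loss and accuracy after every epoch. This is purely for monitoring; the model is still trained on the training data alone.

# Optional: monitor performance on the held-out set during training.
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=20, verbose=1)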
Let’s now check the model accuracy on the training set.
loss, accuracy = model.evaluate(X_train, y_train, verbose=1)
print('Training Accuracy is {}'.format(accuracy*100))
The next thing we can do is check its accuracy on the test set.
loss, accuracy = model.evaluate(X_test, y_test)
print('Testing Accuracy is {}'.format(accuracy*100))
We can get better results on the testing set by using pre-trained word embeddings such as Word2Vec and GloVe. These techniques have been covered by JMangia in this tutorial:
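As a rough sketch of how that could look, assume you’ve downloaded a GloVe file such as glove.6B.100d.txt (the file name and the 100-dimensional vectors here are just an example). You build a word index, fill an embedding matrix with the corresponding GloVe vectors, and pass it to the Embedding layer as fixed weights. Note that matching words to pre-trained vectors requires an explicit word-to-index mapping, so you’d typically switch from one_hot to Keras’ Tokenizer:

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding

# Build an explicit word index (one_hot hashes words, so its integers
# can't be matched against pre-trained vectors).
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(docs)

embedding_dim = 100  # must match the dimensionality of the GloVe file
embedding_matrix = np.zeros((vocab_size, embedding_dim))

# Each line of the GloVe file is a word followed by its vector components.
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word, vector = values[0], np.asarray(values[1:], dtype='float32')
        index = tokenizer.word_index.get(word)
        if index is not None and index < vocab_size:
            embedding_matrix[index] = vector

# Use the pre-trained vectors as fixed (non-trainable) weights.
embedding_layer = Embedding(vocab_size, embedding_dim,
                            weights=[embedding_matrix],
                            input_length=max_length,
                            trainable=False)

You would then encode the reviews with tokenizer.texts_to_sequences instead of one_hot and swap this layer in for the trainable Embedding layer in the model above.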
What’s Next
Since machine learning models don’t understand text data, converting sentences into word embeddings is a crucial skill in NLP. Now that you’re familiar with this technique, you can try generating word embeddings for the same dataset using pre-trained embeddings such as Word2Vec.
You can then compare your results with those obtained when training your own word embeddings. More often than not, you’ll notice that the pre-trained models achieve higher accuracy on the testing set.