Preparing textual data for machine learning is a little different from preparing tabular data, mainly because text comes in the form of strings.
We therefore have to find a good way to represent it numerically. In this piece, we'll see how to prepare textual data using TensorFlow. Eventually, we'll build a bidirectional long short-term memory (LSTM) model to classify the text data.
Import the Necessary Packages
As always, we kick off by importing the packages and modules we’ll use for this exercise:
- Tokenizer for preprocessing the text data
- pad_sequences for ensuring that all sequences have the same length
- Sequential for initializing the stack of layers
- Dense for creating fully connected layers
- LSTM for creating the LSTM layers
- Bidirectional for ensuring that information is passed in both directions
- pandas for loading in the text file
- numpy for converting the data into NumPy arrays
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Bidirectional
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
Load the Data
Let's first import the dataset. The data and code are available in this repository.
df = pd.read_csv('imdb_labelled.txt', delimiter='\t', engine='python', quoting=3)
We can look at a sample from the dataset. This shows us that we have a Review and a Status column.
df.sample()
Split the Data
Next, we split the dataset into a training set and a testing set. But before we can do that, we first have to define the features X and the target y.
X = df['Review']
y = df['Status']
Let’s now use 80% of the data for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
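If you'd like to confirm the 80/20 split, an optional quick check is to print the sizes of the two sets:
print(len(X_train), len(X_test))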
Data Preprocessing
At this point we need to perform a couple of operations on the text data:
- convert it to lowercase
- filter out punctuation marks such as ? and !
- remove any special characters such as @ and $
- convert the text into a preferred number representation
Now, as we’ll see, doing those operations in TensorFlow is quite straightforward.
In a minute we’ll tokenize the text data to obtain the 1000 most common words. Before we get there, let’s create a couple of variables that we’ll need.
- vocab_size is the number of common words we want to keep, i.e. the maximum number of words that will be used from the word index
- oov_token is the token used to represent out-of-vocabulary words, i.e. words that are not found in the word index; this can happen, for example, when converting the testing set into sequences. The OOV token is mapped to the index 1
- max_length is the maximum length of each sequence
- padding_type determines whether zeros are added at the beginning or at the end of a sequence
- truncation_type indicates whether sentences longer than max_length are truncated at the beginning or at the end
vocab_size = 1000
oov_token = "<OOV>"
max_length = 100
padding_type = "post"
truncation_type = "post"
Now let's tokenize, i.e. convert the text into a numerical representation. We do this by creating a Tokenizer instance and calling fit_on_texts on the training set. The tokenizer removes punctuation marks and special characters and converts the words to lowercase.
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(X_train)
We can see the mapping of the words and text using the word index.
word_index = tokenizer.word_index
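To get a feel for the mapping, you can print a few entries of the word index (the exact words and indices will depend on your training split):
print(dict(list(word_index.items())[:10]))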
Create Sequences
Let’s now convert the sentences into tokenized sequences. This is done using the texts_to_sequences function.
X_train_sequences = tokenizer.texts_to_sequences(X_train)
If we now inspect a few of the sequences, we see that each sentence has been converted into a sequence of numbers, one for each word.
X_train_sequences[10:15]
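Any word that is not in the word index is mapped to the OOV token's index, which is 1. A quick way to see this is to convert a sentence containing a made-up word (the word below is purely for demonstration):
tokenizer.texts_to_sequences(["is this zzzmadeupword great"])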
Pad the Sequences
We can clearly see that the sequences are not all of the same length, so let's pad them to a common length. This is important before we can pass the data to a deep learning model.
X_train_padded = pad_sequences(X_train_sequences, maxlen=max_length,
                               padding=padding_type, truncating=truncation_type)
We can now see that there are zeros at the end of the sequences, which makes them all the same length. The padding is taking place at the end of the sequence because we specified the padding type as post.
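You can verify this by checking the shape and looking at one of the padded sequences:
print(X_train_padded.shape)   # (number of training samples, max_length)
print(X_train_padded[10])     # zeros appear at the end because padding_type is "post"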
Now let’s do the same thing to the testing set.
X_test_sequences = tokenizer.texts_to_sequences(X_test)
X_test_padded = pad_sequences(X_test_sequences, maxlen=max_length,
                              padding=padding_type, truncating=truncation_type)
Prepare GloVe Embeddings
Previously, we’ve seen how we can train our own embedding layer. However, in this article, we’ll use pre-trained embeddings—specifically GloVe embeddings. You can download them from here.
We’ll use the guide from the official Keras blog to create an embedding layer from the pre-trained embeddings.
We start by loading the GloVe embeddings into a dictionary that maps each word to its vector.
embeddings_index = {}
f = open('glove.6B.100d.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))
Next, we need to create an embedding matrix that holds the embedding vector for each word in the training set's word index. This is done by looking up each word in embeddings_index.
For example, we can first look up the embedding vector for the word 'attention':
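embeddings_index.get('attention')
This returns a 100-dimensional NumPy array, or None for a word that is not in GloVe.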
# the glove.6B.100d vectors are 100-dimensional, so the matrix has one 100-dimensional row per word index
glove_dim = 100

embedding_matrix = np.zeros((len(word_index) + 1, glove_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in the embedding index will be all zeros
        embedding_matrix[i] = embedding_vector
Words not found in the embedding index will have a matrix representation with all zeros.
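If you're curious how many words were actually found in GloVe (an optional check), you can count the non-zero rows of the matrix:
covered = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
print('%d of %d words have a GloVe vector.' % (covered, len(word_index)))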
Embedding Layer
We can now prepare the embedding layer:
- We set trainable to False because we are using pre-trained word embeddings
- We set the weights to the embedding_matrix we created above
- len(word_index) + 1 is the size of the vocabulary; we add one because index 0 is never used, as it is reserved for padding
- glove_dim (100) is the output dimension of the layer, matching the dimensionality of the pre-trained GloVe vectors
- input_length is the length of the input sequences
embedding_layer = Embedding(len(word_index) + 1,
                            glove_dim,
                            weights=[embedding_matrix],
                            input_length=max_length,
                            trainable=False)
Define the Model
Now, the moment we’ve all been waiting for. We define the model with the embedding layer being the first layer, followed by two bidirectional LSTM layers. The bidirectional layers ensure that the model processes the sequence from start to end, as well as backwards.
This is very important, especially when working with problems such as predicting the next word in a sentence, because the context of the words before and after a certain word is useful in predicting the next word.
After that, we define a dense layer with 6 units and a final single-unit output layer with a sigmoid activation, since this is a binary classification problem.
lstm_units = 16

model = Sequential([
    embedding_layer,
    Bidirectional(LSTM(lstm_units, return_sequences=True)),
    Bidirectional(LSTM(lstm_units)),
    Dense(6, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
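To inspect the architecture and the parameter counts referred to below, print the model summary:
model.summary()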
Checking the summary, we can see that the two bidirectional layers have an output dimension of 32, even though we defined 16 units. This is because the information moves both forwards and backwards, so the number of units is doubled. We can also see the non-trainable parameters contributed by the embedding layer.
Train the Model
We can now train and visualize the performance of the model.
num_epochs = 10  # number of training epochs; this value is a placeholder, tune it as needed

history = model.fit(X_train_padded, y_train, epochs=num_epochs,
                    validation_data=(X_test_padded, y_test))
Let’s visualize the model’s accuracy.
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()
We see that the training accuracy starts at around 50% and increases steadily. The validation accuracy starts off at around 60% and increases gradually, but settles a little lower than the training accuracy.
Let's also look at the loss, which can be plotted in the same way (see the code below). We see that both losses decrease gradually, but the validation loss settles slightly higher than the training loss. The model can definitely be improved with further fine-tuning.
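The loss plot mirrors the accuracy plot, using the 'loss' and 'val_loss' entries of the history object:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()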
Final Thoughts
In this piece, we’ve walked through an example of processing textual data, converting it into a numerical representation, and using pre-trained embeddings to train a model. You can check the performance using different pre-trained word embeddings.
Other things to try include changing the padding and truncation type, as well as changing the size of the vocabulary and the embedding dimensions.