Modeling a Language Translation System using LSTM for Mobile Devices or Web

Language Translation on your Mobile Device

Running machine learning models in a Jupyter Notebook is fine, but have you ever thought about how these models can run on your mobile device, which has limited space and processing power?

If we bundle models directly in our mobile application, they will increase the size of the app too much. So how do we manage this on a mobile device? I am going to go through the answers to these questions in this article.

As data gets more complex in machine learning, we need more complex models to deal with it. But when we use a complex model, as in deep learning, the number of layers and neurons increases, and this leads to an increase in model size.

TensorFlow offers a way to reduce the model size and make it suitable for mobile devices, and we can achieve this using TensorFlow Lite.

TensorFlow Lite is an open-source deep learning framework for mobile device inference. It is essentially a set of tools to help us run TensorFlow models on mobile, embedded, and IoT devices. TensorFlow Lite enables on-device machine learning inference with low latency and small binary size.

TensorFlow Lite has two main components:

  • TensorFlow Lite interpreter: The interpreter runs optimized Lite models on many different hardware types, including mobile phones and embedded devices.
  • TensorFlow Lite converter: The converter converts the TensorFlow model into an efficient form to be used by the interpreter. This approach can introduce optimizations to improve binary size as well as performance.

Now, to convert a TensorFlow model to a TensorFlow Lite model, we first need to build and train the TensorFlow model. Here, we will train our language translation model and then convert it to TensorFlow Lite so that we can use it on mobile devices.

Data

We will use an English to Hindi translation dataset, which has around 3,500 conversational sentence pairs of the kind we use in our day-to-day life. The data can be grabbed from any open-source resource. You can also get it from my GitHub. Rows are separated by newlines ("\n"), and each row contains a question and its answer separated by a tab ("\t"). We need to convert the data into question and answer lists, where the questions will be in our native language and the answers will be in the translated language.

import numpy as np
import tensorflow as tf
import pickle
from tensorflow.keras import layers, activations, models, preprocessing
from tensorflow.keras import preprocessing, utils
import os
import yaml
import json
import pandas as pd
from tensorflow.keras.callbacks import ModelCheckpoint

# read the file and split it into rows; each row holds a tab-separated question/answer pair
docs = open(r'DATASET.txt', encoding='utf-8').read().split("\n")
c = docs[0].strip().split("\t")

Our data is in plain text format. We can use Python to split it by row and then create one list for our questions and another for our answers.

Setting Vector Size & Formatting Data

Our data has a considerable size, so we feed it to the deep learning model in parts (batches) so that our machine doesn’t run out of memory; the batch size can be customized based on our machine configuration. The vector size is the length of the output embedding array; we need to define it so that all output arrays have the same size. The vector size can vary based on the word embedding model used.

questions_for_token = list()
answers_for_token = list()
c = 1
for con in docs:
    con = con.strip()
    if c == 2868 or not con:  # skip row 2868 and any empty lines
        pass
    else:
        con = con.split("\t")
        questions_for_token.append(con[0])
        answers_for_token.append(con[1])
    c += 1

embed_size = 100     # vector size; depends on the word embedding used (100 for glove.6B.100d)
max_features = 6000  # restrict the number of unique words
maxlen = 100         # maximum sequence length after padding

Processing Our Data

The data needs to be processed well so that it can give better results. Here, in our language translation system, we apply a few text cleaning steps:

  1. Converting all text to lowercase.
  2. Removing URLs, mentions, and punctuation.
  3. Dropping tokens that don’t start with a letter, which also removes purely numeric data.

import re

def processTweet(chat):
    chat = chat.lower()
    chat = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', chat)  # remove URLs
    chat = re.sub(r'@[^\s]+', '', chat)                            # remove @mentions
    chat = re.sub(r'[\s]+', ' ', chat)                             # collapse whitespace
    chat = re.sub(r'#([^\s]+)', r'\1', chat)                       # strip the # from hashtags
    chat = re.sub('[.!:?\'"/-]', '', chat)                         # remove punctuation
    chat = chat.strip('\'"')
    return chat

def replaceTwoOrMore(s):
    # look for 2 or more repetitions of a character and replace with the character itself
    pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
    return pattern.sub(r"\1", s)

def getFeatureVector(chat):
    chat = processTweet(chat)
    featureVector = []
    # split the sentence into words
    words = chat.split()
    for w in words:
        # replace repeated characters with a single occurrence
        w = replaceTwoOrMore(w)
        # strip punctuation
        w = w.strip('\'"?,.')
        # keep the word only if it starts with a letter
        val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", w)
        if val is None:
            continue
        else:
            featureVector.append(w.lower())
    return " ".join(list(featureVector))

Defining Word Embedding

To convert our text data into numerical form, we will use the word embedding technique. Here, we will use GloVe, which performs comparably to the word2vec skip-gram model on word analogy tasks. Pre-trained GloVe vectors are freely available online (for example, from the Stanford NLP project) as a plain text file that we can load with the code below. In the file name “glove.6B.100d.txt”, “100d” denotes the size of each output word vector. GloVe embeddings are relatively small and are more than enough for our day-to-day conversational text.

def emb_mat(nb_words):
    EMBEDDING_FILE = "glove.6B.100d.txt"
    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')
    # build a dictionary mapping each word to its pre-trained GloVe vector
    embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBEDDING_FILE, encoding="utf8"))

    all_embs = np.stack(embeddings_index.values())
    emb_mean, emb_std = all_embs.mean(), all_embs.std()

    # start from random values, then overwrite the rows of words that have a GloVe vector
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words + 1, embed_size))
    for word, i in word_index.items():  # word_index is built by the tokenizer in prepare_data
        if (i >= max_features) or i == nb_words:
            continue
        embedding_vector = embeddings_index.get(word)  # GloVe vector for this word, if available
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

    return embedding_matrix

Tokenization of Data

After initializing our word embedding model, we need to tokenize our data. The tokenizer assigns each word an integer index, and the embedding model then converts each word into a fixed-size vector of numbers. Machine learning and deep learning models work on numeric data, so any text data has to be converted to numbers by mapping each word to a specific vector that we can later look up. A nice property of word embeddings is that they capture the similarity between words, which we can check with the short sketch below.
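
To see this similarity property in action, here is a small illustrative sketch. It assumes the glove.6B.100d.txt file from the previous section is available locally and simply compares two word vectors with cosine similarity.

# load the GloVe vectors into a dictionary (same parsing as in emb_mat above)
glove = {}
with open("glove.6B.100d.txt", encoding="utf8") as f:
    for line in f:
        parts = line.strip().split()
        glove[parts[0]] = np.asarray(parts[1:], dtype='float32')

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: closer to 1.0 means more similar
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# related words tend to score higher than unrelated ones
print(cosine_similarity(glove["king"], glove["queen"]))
print(cosine_similarity(glove["king"], glove["banana"]))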

def tokenized_data(questions,answers,VOCAB_SIZE,tokenizer):
    # encoder_input_data
    tokenized_questions = tokenizer.texts_to_sequences( questions )
    maxlen_questions = max( [ len(x) for x in tokenized_questions ] )
    padded_questions = preprocessing.sequence.pad_sequences( tokenized_questions , maxlen=maxlen , padding='post' )
    encoder_input_data = np.array( padded_questions )
    #print( encoder_input_data.shape , maxlen_questions )

    # decoder_input_data
    tokenized_answers = tokenizer.texts_to_sequences( answers )
    maxlen_answers = max( [ len(x) for x in tokenized_answers ] )
    padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers , maxlen=maxlen , padding='post' )
    decoder_input_data = np.array( padded_answers )
    #print( decoder_input_data.shape , maxlen_answers )

    # decoder_output_data
    tokenized_answers = tokenizer.texts_to_sequences( answers )
    for i in range(len(tokenized_answers)) :
        tokenized_answers[i] = tokenized_answers[i][1:] # remove <start> take rest
    padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers , maxlen=maxlen , padding='post' )
    onehot_answers = utils.to_categorical( padded_answers , VOCAB_SIZE)
    decoder_output_data = np.array( onehot_answers )
    #print( decoder_output_data.shape )
    
    return [encoder_input_data,decoder_input_data,decoder_output_data,maxlen_answers]

Preparing Data for Modeling

We now apply the data processing steps defined above to clean our data and use the tokenized_data function to convert it into padded token sequences. Here, we take the questions (native language) and answers (target language) as input, apply the text cleaning steps, and finally let the embedding layer, initialized with the pre-trained GloVe vectors, assign each word its vector.

We also need to mark the start and end of each target sentence so that our model can understand where a particular sentence begins and where it ends, which helps the model during inference.
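
For instance, a hypothetical pair (purely for illustration) looks like this after tagging:

# hypothetical example pair, purely for illustration
question = "how are you"
answer = "<START> आप कैसे हैं <END>"  # the decoder learns to begin at <START> and stop at <END>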

def prepare_data(questions, answers):
    answers = pd.DataFrame(answers, columns=["Ans"])
    questions = pd.DataFrame(questions, columns=["Question"])
    questions["TokQues"] = questions["Question"].apply(getFeatureVector)

    answers = list(answers["Ans"])
    questions = list(questions["TokQues"])

    answers_with_tags = list()
    for i in range(len(answers)):
        if type(answers[i]) == str:
            answers_with_tags.append(answers[i])
        else:
            # drop the question whose answer is not a valid string
            questions.pop(i)

    # wrap every answer with the start and end tags
    answers = list()
    for i in range(len(answers_with_tags)):
        answers.append('<START> ' + answers_with_tags[i] + ' <END>')

    tokenizer = preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(questions + answers)

    word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))

    VOCAB_SIZE = len(tokenizer.word_index) + 1

    tok_out = tokenized_data(questions, answers, VOCAB_SIZE, tokenizer)
    encoder_input_data = tok_out[0]
    decoder_input_data = tok_out[1]
    decoder_output_data = tok_out[2]
    maxlen_answers = tok_out[3]

    return [encoder_input_data, decoder_input_data, decoder_output_data, maxlen_answers, nb_words, word_index, tokenizer]

Model Training

It’s time to train our model. We pass the cleaned and tokenized data to a sequence-to-sequence (encoder-decoder) LSTM model, which is trained over all the conversations using the batch size defined in the fit call below.

Prepared_data=prepare_data(questions_for_token,answers_for_token)
encoder_input_data=Prepared_data[0]
decoder_input_data=Prepared_data[1]
decoder_output_data=Prepared_data[2]
maxlen_answers=Prepared_data[3]
nb_words=Prepared_data[4]
word_index=Prepared_data[5]
tokenizer=Prepared_data[6]
embedding_matrix=emb_mat(nb_words)
encoder_inputs = tf.keras.layers.Input(shape=( None , ))
encoder_embedding = tf.keras.layers.Embedding( nb_words+1, embed_size , mask_zero=True, weights=[embedding_matrix]) (encoder_inputs)
encoder_outputs , state_h , state_c = tf.keras.layers.LSTM( 200 , return_state=True )( encoder_embedding )
encoder_states = [ state_h , state_c ]

decoder_inputs = tf.keras.layers.Input(shape=( None ,  ))
decoder_embedding = tf.keras.layers.Embedding( nb_words+1, embed_size , mask_zero=True,weights=[embedding_matrix]) (decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM( 200 , return_state=True , return_sequences=True )
decoder_outputs , _ , _ = decoder_lstm ( decoder_embedding , initial_state=encoder_states )

decoder_dense = tf.keras.layers.Dense( nb_words+1 , activation=tf.keras.activations.softmax ) 
output = decoder_dense ( decoder_outputs )

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output )


# define the checkpoint callback referenced below (the weights filename here is just an example)
checkpoint = ModelCheckpoint('model_weights.h5', monitor='loss', save_best_only=True)
callbacks_list = [checkpoint]

model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit([encoder_input_data, decoder_input_data], decoder_output_data, batch_size=10, epochs=20, callbacks=callbacks_list)
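
The training graph alone does not translate new sentences, so for completeness here is a minimal greedy-decoding sketch. It follows the standard Keras encoder-decoder inference pattern and assumes the variables defined above (encoder_inputs, encoder_states, decoder_inputs, decoder_embedding, decoder_lstm, decoder_dense, tokenizer, maxlen, maxlen_answers) are still in scope; treat it as a sketch rather than part of the original training script.

# inference-time encoder: maps a padded input sequence to the LSTM states
encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)

# inference-time decoder: consumes one token plus the previous states at each step
decoder_state_input_h = tf.keras.layers.Input(shape=(200,))
decoder_state_input_c = tf.keras.layers.Input(shape=(200,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
dec_outputs, state_h, state_c = decoder_lstm(decoder_embedding, initial_state=decoder_states_inputs)
dec_outputs = decoder_dense(dec_outputs)
decoder_model = tf.keras.models.Model([decoder_inputs] + decoder_states_inputs,
                                      [dec_outputs, state_h, state_c])

def translate(sentence):
    # clean and tokenize the input the same way as the training questions
    seq = tokenizer.texts_to_sequences([getFeatureVector(sentence)])
    seq = preprocessing.sequence.pad_sequences(seq, maxlen=maxlen, padding='post')
    states = encoder_model.predict(seq)

    # greedy decoding: start from <START> and stop at <END> or after maxlen_answers steps
    target = np.zeros((1, 1))
    target[0, 0] = tokenizer.word_index['start']
    decoded = []
    for _ in range(maxlen_answers):
        out, h, c = decoder_model.predict([target] + states)
        idx = int(np.argmax(out[0, -1, :]))
        word = tokenizer.index_word.get(idx, '')
        if word in ('end', ''):
            break
        decoded.append(word)
        target[0, 0] = idx
        states = [h, c]
    return ' '.join(decoded)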

Making Model Ready for Mobile Devices

As of now, our TensorFlow model is ready to run in a terminal or Jupyter Notebook, but we still need to make it ready for mobile devices. To do that, TensorFlow provides functions that convert our TensorFlow model to TFLite.

# Convert the model.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
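
If you need an even smaller binary, the converter can also apply optimizations such as post-training quantization. The flags below are standard TensorFlow Lite converter options; the SELECT_TF_OPS fallback is sometimes required for LSTM ops depending on your TensorFlow version, so treat this as an optional sketch rather than a required step.

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# post-training quantization to shrink the model further
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# allow falling back to TensorFlow ops if some LSTM ops are not natively supported in TFLite
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()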

Saving TFLite Model

Finally, we can save our TFLite model so that we can serve it to mobile devices through Google Firebase or bundle it locally in your Android or iOS app. Our TFLite model is around 13 MB, so we can also use it locally: on-device inference is faster, and a model of this size won’t increase the app size to a big extent.

This choice depends on your requirements. If you don’t want to increase the size of your app and can live with a model that takes some extra time for inference, you can deploy the model to Firebase and call it from your Android app through API calls. On the other hand, if you are fine with the extra size in your Android app and need faster inference, then bundle the model locally.

# Save the model.
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
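
To sanity-check the saved file, we can load it back with the TensorFlow Lite interpreter, the same component that runs on the device. Below is a minimal sketch, assuming model.tflite was written as above; because the Keras inputs have a dynamic sequence length, the input tensors are resized to the padded length first, and you may need to check input_details[i]['name'] to see which input is the encoder sequence and which is the decoder sequence.

interpreter = tf.lite.Interpreter(model_path='model.tflite')
input_details = interpreter.get_input_details()

# the model was built with dynamic sequence length, so fix both inputs to the padded length
for d in input_details:
    interpreter.resize_tensor_input(d['index'], [1, maxlen])
interpreter.allocate_tensors()
output_details = interpreter.get_output_details()

# run one forward pass with the first training pair (check the 'name' field if the order differs)
interpreter.set_tensor(input_details[0]['index'],
                       encoder_input_data[:1].astype(input_details[0]['dtype']))
interpreter.set_tensor(input_details[1]['index'],
                       decoder_input_data[:1].astype(input_details[1]['dtype']))
interpreter.invoke()
predictions = interpreter.get_tensor(output_details[0]['index'])
print(predictions.shape)  # one softmax distribution over the vocabulary per time step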

Chatbot Demo

Here is the final demo of our chatbot that runs on a mobile device. We can also integrate it into a website. It works like Google Translate, and the main thing is that you can use it for free using API calls: You don’t need to pay any third party to use their models. I have used the Hindi language to train the model. You can play with this in your own native language.

Until then, keep exploring data science!

