Training a Core ML Model for Sentiment Analysis

Why is Sentiment Analysis Important?

Sentiment analysis has become a hot topic in the fields of Natural Language Processing and machine learning. Sentiment analysis is the process of examining a piece of text for opinions and feelings.

Sentiment analysis has countless real-life use cases, including understanding how consumers feel about a product or service, looking for signs of depression, and seeing how people respond to particular ad and political campaigns.

The applications of sentiment analysis are broad and powerful. It can give businesses valuable insights into how people feel about their brand. When applied to social media channels, it can reveal what people think about a certain topic or advertisement.

It’s also a very important tool in the mental health field and can be implemented to identify signs of depression or anxiety in a person and inform family members.

Training the Model

In this article, we’ll train an ML model using a Scikit-learn pipeline to identify whether a piece of text is positive, negative, or neutral, and then convert the trained model to a Core ML model using coremltools.

We’ll be using the dataset, which can be downloaded here. It’s a .csv file that contains labelled reviews of products.

Let’s start with the code.

We start by importing the required libraries:

import re
import coremltools
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk import word_tokenize
from string import punctuation
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

We import re, which is the regular expressions library and provides the matching operations that make it easy to handle text data. The nltk library functions are imported to format the text as per our requirements, and sklearn offers the machine learning tools required.

The coremltools library helps us convert the sklearn model to a Core ML model.

You might need to download nltk data before you can use the library. Open a Python instance in your terminal and run the following commands:
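The commands themselves are not reproduced here. A plausible minimal set, inferred from the stopwords and word_tokenize imports above (an assumption, not the author’s original listing), would be:

```python
import nltk

# Assumed resources: the English stop-word list and the tokenizer models
# that word_tokenize relies on.
for resource in ("stopwords", "punkt"):
    nltk.download(resource)
```

Recent NLTK releases may ask for a differently named tokenizer resource (punkt_tab). If word_tokenize raises a LookupError, the message names the resource to download.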

After importing the required libraries, we get the dataset .csv file using pandas and then convert it to an array:

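# Load the labelled reviews into a pandas dataframe.
reviews = pd.read_csv('epinions.csv')
# Convert to a numpy array; as_matrix() was removed in pandas 1.0, so use to_numpy().
reviews = reviews.to_numpy()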

The above code reads the .csv file as a pandas dataframe and then converts it to a numpy array that includes all rows and columns.
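To see what this conversion produces without the real file, here is a small sketch on an in-memory CSV. The two rows and the column names are hypothetical. The only thing taken from the article is the layout used later (label in the first column, text in the second). It calls to_numpy() because as_matrix() was removed in pandas 1.0.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the real file: label first, text second.
sample_csv = io.StringIO(
    "label,text\n"
    "Pos,Great phone with a long battery life\n"
    "Neg,The screen cracked after two days\n"
)

sample = pd.read_csv(sample_csv).to_numpy()

print(sample.shape)   # (2, 2): two reviews, two columns
print(sample[:, 0])   # the labels
print(sample[:, 1])   # the review texts
```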

Now that we have the dataset ready for use, we define a function that helps us extract features from the data. In machine learning, features are the measurable properties of the phenomena being observed. Choosing good features is an integral part of machine learning, and the accuracy of a model depends heavily on it.

For our model, we use the following function for feature extraction:

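def features(sentence):
    # Words and punctuation marks that are dropped before counting.
    stop_words = stopwords.words('english') + list(punctuation)
    # Split the sentence into lowercase tokens.
    words = word_tokenize(sentence)
    words = [w.lower() for w in words]
    # Remove stop words, punctuation, and purely numeric tokens.
    filtered = [w for w in words if w not in stop_words and not w.isdigit()]
    # Count how often each remaining word occurs.
    words = {}
    for word in filtered:
        if word in words:
            words[word] += 1.0
        else:
            words[word] = 1.0
    return words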

First, we store the stop words and punctuation marks that need to be ignored in a variable. Stop words are common words that contribute little to the meaning of a sentence, so they are ignored.

We then tokenize the sentence into words so that they can be used individually. All the words are converted to lower case and then filtered to remove stop words, punctuation, and digits, as these do not affect the sentiment of a sentence.

The for loop at the end counts how many times each remaining word occurs and stores the counts in the ‘words’ dictionary, with the word as the key.
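To make the shape of the output concrete, here is a dependency-free sketch of the same idea. It is not the function above: the regular-expression tokenizer, the tiny stop-word set, and the name bag_of_words are all assumptions made so the example runs without NLTK data.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop-word set stands in for NLTK's English list.
STOP_WORDS = {"the", "is", "a", "and", "it", "this", "but"}


def bag_of_words(sentence):
    """Return {word: count} for the tokens that are neither stop words nor digits."""
    # Assumption: a simple regex replaces word_tokenize; punctuation is dropped here.
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    kept = [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]
    return {word: float(count) for word, count in Counter(kept).items()}


print(bag_of_words("The battery is great, and the screen is great too!"))
# {'battery': 1.0, 'great': 2.0, 'screen': 1.0, 'too': 1.0}
```

Each review becomes a dictionary of word counts, which is the input format the DictVectorizer step of the pipeline expects.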

Now that we have the function to extract features ready, let’s get the features from the dataset:

features = np.vectorize(features)

X = features(reviews[:, 1])

y = reviews[:, 0]

We begin by vectorizing the features function with np.vectorize, which lets us call it on a whole numpy array at once and get back an array with one result per element. Note that np.vectorize is provided for convenience rather than performance: it is essentially a Python for loop under the hood, so it does not move the work down to C level or make the code faster.

After vectorizing the function, we extract the features of every sentence in the array (the second column) and store them in ‘X’.

At the end, we set the target variable. In machine learning, the target variable is the variable that is or should be the output. In this case, we take the sentiment label of every sentence from the first column of the array and store it in ‘y’.
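As a quick check of what np.vectorize returns for a function that produces dictionaries, the sketch below compares it with a plain list comprehension. NumPy’s documentation describes np.vectorize as a convenience wrapper that is essentially a for loop, so the two give the same results. The word_counts helper and the two sample sentences are hypothetical stand-ins for the real feature function and data.

```python
import numpy as np


def word_counts(sentence):
    """Hypothetical stand-in for the article's feature function."""
    counts = {}
    for word in sentence.lower().split():
        counts[word] = counts.get(word, 0.0) + 1.0
    return counts


sentences = np.array(["good good phone", "bad screen"], dtype=object)

# otypes=[object] tells numpy up front that each result is a Python object.
vectorized = np.vectorize(word_counts, otypes=[object])
via_vectorize = vectorized(sentences)

# The equivalent explicit loop.
via_loop = [word_counts(s) for s in sentences]

print(via_vectorize.dtype)             # object
print(list(via_vectorize) == via_loop)  # True
```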

We have everything we require for training the model, so let’s proceed to the trickier part of the tutorial.

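# Pipeline: turn the word-count dictionaries into vectors, then classify them.
clf = Pipeline([("dct", DictVectorizer()), ("svc", LinearSVC())])
# Candidate values for the C parameter of the "svc" step.
params = {
    "svc__C": [1e15, 1e13, 1e11, 1e9, 1e7, 1e5, 1e3, 1e1, 1e-1, 1e-3, 1e-5]
}
gs = GridSearchCV(clf, params, cv=10, verbose=2, n_jobs=-1)
gs.fit(X, y)
model = gs.best_estimator_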

In machine learning, we often need to perform a sequence of transformations on our data before actually training a model. The Scikit-learn pipeline gives us a single interface for all the transformation steps and the final estimator. It encapsulates all the transformers and predictors inside.

In our case, we create a pipeline with DictVectorizer and LinearSVC. DictVectorizer, as the name suggests, converts the dictionary to vectors. LinearSVC stands for Linear Support Vector Classification. It’s an implementation of Support Vector Classification for linear kernels.
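A small sketch of what DictVectorizer does on its own may help. The two sample dictionaries are hypothetical. Each distinct key becomes one column, and each dictionary becomes one row of a sparse matrix.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical word-count dictionaries for two short reviews.
samples = [
    {"good": 2.0, "phone": 1.0},
    {"bad": 1.0, "phone": 1.0},
]

vec = DictVectorizer()
matrix = vec.fit_transform(samples)  # a SciPy sparse matrix by default

# vocabulary_ maps each word to its column index (sorted alphabetically).
print(sorted(vec.vocabulary_.items(), key=lambda item: item[1]))
# [('bad', 0), ('good', 1), ('phone', 2)]

print(matrix.toarray())
# [[0. 2. 1.]
#  [1. 0. 1.]]
```

Words that were not seen during fitting are silently ignored when new dictionaries are transformed, so the model can still handle reviews with unfamiliar vocabulary.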

At this point, we can simply train the model using our classifier, but we won’t do so. Instead, we’ll use GridSearchCV to select the best model from a family of models, parametrized by a grid of parameters. In simple words, GridSearchCV is a model selection tool.

We’ll create a dictionary containing parameter name (String) as the key and a list of parameter settings as the value.

Next, we define a variable as GridSearchCV with 5 parameters:

  1. clf is our estimator.
  2. params is the param_grid we created.
  3. cv determines the cross-validation strategy. In our case, cv=10 specifies 10-fold cross-validation; because the estimator is a classifier, Scikit-learn uses stratified folds.
  4. verbose controls the verbosity. The higher the value, the more messages are printed.
  5. n_jobs defines the number of jobs to be run in parallel; -1 means using all available processors.

After this, we fit our model on the feature and target variable we created earlier.

At the end, we store the model with the best cross-validation score in a variable using the best_estimator_ attribute of GridSearchCV.
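The whole search can be tried on a toy problem before running it on the real dataset. In this sketch the eight samples, their labels, the three C values, and cv=2 are all assumptions chosen so that it finishes instantly. They are not the settings used above.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: word-count dictionaries and their labels.
toy_X = [
    {"good": 1.0}, {"great": 1.0},
    {"good": 1.0, "great": 1.0}, {"love": 1.0, "good": 1.0},
    {"bad": 1.0}, {"awful": 1.0},
    {"bad": 1.0, "awful": 1.0}, {"hate": 1.0, "bad": 1.0},
]
toy_y = ["Pos"] * 4 + ["Neg"] * 4

pipeline = Pipeline([("dct", DictVectorizer()), ("svc", LinearSVC())])

# "svc__C" means: the C parameter of the pipeline step named "svc".
grid = {"svc__C": [0.1, 1.0, 10.0]}

# cv=2 only because the toy set is tiny.
search = GridSearchCV(pipeline, grid, cv=2)
search.fit(toy_X, toy_y)

print(search.best_params_)   # the C value with the best mean CV score
print(search.best_estimator_.predict([{"good": 1.0}]))
```

After fitting, best_estimator_ is the pipeline refitted on all of the data with the winning parameters, so it can be used for prediction directly.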

This is all we need to do to train a sentiment analysis model. Cross-Validation and Grid-Search are advanced topics that might confuse you in the beginning, so for more information on these refer here.

We have the model ready, but now we need to convert it to a Core ML model before we can use it in our app. We do so by using coremltools.

# Convert the trained Scikit-learn pipeline to a Core ML model.
coreml_model = coremltools.converters.sklearn.convert(model)
# Attach metadata that is shown when the model is inspected.
coreml_model.author = 'Your Name'
coreml_model.license = 'MIT'
coreml_model.short_description = 'Sentiment polarity LinearSVC.'
coreml_model.input_description['input'] = 'Features extracted from the text.'
coreml_model.output_description['classLabel'] = 'The most likely polarity (positive or negative), for the given input.'
coreml_model.output_description['classProbability'] = 'The probabilities for each class label, for the given input.'
# Write the model to the working directory.
coreml_model.save('SentimentPolarity.mlmodel')

As you can see, the code is pretty self-explanatory. We create a variable to store the converted model using the coremltools library.

We define the author, license, and description of the model, and also give a description of what the model takes as an input and produces as an output. At the end, we save the model as a Core ML model with the .mlmodel extension in the working directory.

That’s it! We’re done. After you run this code, you’ll have your own sentiment analysis Core ML model ready to be implemented in your iOS app.

If you want to know how to implement a Core ML model in your app, check out the official documentation:

If you face any issue in training or implementing your model, please let me know!
