Evaluating Unsupervised Sentiment Analysis Tools Using Labeled Data

TextBlob, VADER Sentiment, and IBM Watson


Sentiment analysis is one of the most popular natural language processing (NLP) applications in the business world. Also known as opinion mining, it's a subfield of NLP that analyzes texts and attempts to classify them as positive or negative.

In supervised learning, this would be called a classification problem, wherein the texts have already been labeled and we use these labels to train machine learning models in order to generalize and classify unseen datasets successfully.

Sentiment analysis also exists in unsupervised learning, where tools/libraries classify opinions with no cheat sheet of already-labeled output. This makes these tools hard to evaluate, as there are no pre-prepared answers to compare against. As a result, deciding which tool or model to use to analyze the sentiment of unlabeled text data can be hard to justify.

But what if we could compare the performance of these tools using labeled data—i.e. using an unsupervised learning tool/model to analyze texts that have been correctly classified already? In essence, this would require us to evaluate these models with classification metrics and see how they perform. I decided to carry out this experiment, and I’m excited to share my results with you in this post.


For this experiment, we’ll be using three sentiment analyzers in Python: TextBlob, VADER Sentiment, and the IBM Watson analyzer.

(All three can be installed as Python libraries, but you’ll need to get an API key for the IBM Watson analyzer to work. For more details on that, check this link.)

The dataset we’ll use is a combination of reviews from Yelp, Amazon, and IMDB. All three data CSV files have just two columns: the textual reviews, and the labels, where 1 is positive and 0 is negative. Since the goal is to evaluate these tools as if this were a classification problem, the Scikit-learn library is needed as well. Let’s get started.


Our analysis and code will be broken down into 3 phases:

  • Getting acquainted with the data
  • Building the analyzers
  • Evaluating and interpreting

1. Get acquainted with the data

As mentioned above, the data we’re using is a combination of these companies’ reviews, which can be found using this Kaggle link. Altogether, the data has 2748 rows and 2 columns. Below is the code used for this phase:

In the above code, this phase starts with importing the necessary libraries and analyzers. The glob module, along with Pandas, is used to read all 3 CSV files and combine them into a single dataframe.

There’s a class imbalance in this dataset: the positive labels make up more of the dataset than the negative labels, so we carried out undersampling. This led to a new dataframe of shape (2724, 2). This is what the dataframe looks like after undersampling:
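As a rough sketch of this phase (the file pattern and column names here are assumptions; the actual Kaggle files may be laid out differently), the loading and balancing steps could look like:

```python
import glob
import pandas as pd

def load_reviews(pattern="data/*_labelled.txt"):
    """Read every labeled file matching the pattern into one dataframe.
    Assumes two tab-separated columns: the review text and a 0/1 label."""
    frames = [
        pd.read_csv(path, sep="\t", header=None, names=["review", "label"])
        for path in glob.glob(pattern)
    ]
    return pd.concat(frames, ignore_index=True)

def undersample(df, label_col="label"):
    """Randomly drop rows from the majority class so both classes
    end up with the same number of rows."""
    n = df[label_col].value_counts().min()
    return (
        df.groupby(label_col, group_keys=False)
          .apply(lambda g: g.sample(n=n, random_state=42))
          .reset_index(drop=True)
    )
```

The `random_state` is fixed only so the undersampled dataframe is reproducible from run to run.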

2. Build the analyzers

Below is the code used for this phase:

In the above code, the first step is to instantiate each analyzer. Then a function is created to analyze all the reviews using the respective analyzers.

To capture the predictions of each analyzer, new columns are created in the dataframe by using Pandas’ apply method to call that function on every review. This is what the dataframe looks like after the above steps:
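A minimal sketch of this phase for the two locally installable analyzers might look as follows. The `score_to_label` threshold and the function names are assumptions, and the IBM-Watson call is left out because it requires an API key and per-account client setup:

```python
def score_to_label(score, threshold=0.0):
    """Map a continuous sentiment score to the dataset's 1/0 labels.
    Treating a score of exactly 0 as positive is an assumption."""
    return 1 if score >= threshold else 0

def textblob_label(text):
    # TextBlob's polarity ranges from -1 (negative) to +1 (positive).
    from textblob import TextBlob  # pip install textblob
    return score_to_label(TextBlob(text).sentiment.polarity)

def vader_label(text):
    # VADER's "compound" score also ranges from -1 to +1.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    return score_to_label(SentimentIntensityAnalyzer().polarity_scores(text)["compound"])

# The IBM-Watson predictions would be added the same way, via a wrapper
# function around the authenticated Watson client:
# df["textblob"] = df["review"].apply(textblob_label)
# df["vader"] = df["review"].apply(vader_label)
```

The analyzer imports are kept inside the wrapper functions so the module loads even when only some of the libraries are installed.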

3. Evaluate and interpret

For this experiment, we’ll be using 4 classification metrics: confusion matrix, accuracy score, precision, and recall. Below is the code used for calculating the confusion matrix:

In the above code, a function is used to iterate through the analyzer columns of the dataframe, calculate each of their confusion matrices, and store the results in a list. This list is then used to create a new dataframe that shows the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) for all 3 analyzers. Below is the resulting dataframe, sorted from the highest TP and TN to the lowest:
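The confusion matrix step can be sketched like this, assuming the prediction columns are named `textblob`, `vader`, and `watson` (hypothetical names):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

def confusion_table(df, true_col, analyzer_cols):
    """Build one row of TP/TN/FP/FN counts per analyzer column."""
    rows = []
    for col in analyzer_cols:
        # ravel() flattens the 2x2 matrix into (tn, fp, fn, tp).
        tn, fp, fn, tp = confusion_matrix(df[true_col], df[col]).ravel()
        rows.append({"analyzer": col, "TP": tp, "TN": tn, "FP": fp, "FN": fn})
    return (
        pd.DataFrame(rows)
          .sort_values(["TP", "TN"], ascending=False)
          .reset_index(drop=True)
    )

# conf_df = confusion_table(df, "label", ["textblob", "vader", "watson"])
```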

As we can see, the IBM-Watson analyzer seems to be the best analyzer, as it has the highest TP and TN and the lowest FP and FN, followed by the VADER Sentiment analyzer. But what does this really mean?

To help understand more, the precision, recall, and accuracy scores will be measured from this confusion matrix for all 3 analyzers. See the code used below:

To calculate the measures for all 3 analyzers, a for loop was used to iterate through the rows of the above confusion matrix dataframe and to store all measures of each analyzer in 3 Python lists. These lists were then converted into a dataframe sorted from highest to lowest.
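A sketch of the metric calculation, working off a TP/TN/FP/FN table like the one built in the previous step (column names assumed):

```python
import pandas as pd

def metric_table(conf_df):
    """Compute accuracy, precision, and recall for each analyzer
    from its TP/TN/FP/FN counts."""
    rows = []
    for _, r in conf_df.iterrows():
        total = r["TP"] + r["TN"] + r["FP"] + r["FN"]
        rows.append({
            "analyzer": r["analyzer"],
            "accuracy": (r["TP"] + r["TN"]) / total,
            # precision: of everything predicted positive, how much was right
            "precision": r["TP"] / (r["TP"] + r["FP"]),
            # recall: of everything actually positive, how much was found
            "recall": r["TP"] / (r["TP"] + r["FN"]),
        })
    return pd.DataFrame(rows).sort_values("accuracy", ascending=False)
```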

From the above measurement dataframe, we can see that the ranking of the analyzers matches the confusion matrix dataframe. Some important insights that can be drawn are:

  • For accuracy, the IBM-Watson analyzer hit 88%; that is, for every 100 reviews it classified/predicted, 88 were correctly classified.
  • For precision, the IBM-Watson analyzer has a rate of 0.91 for positive reviews. This means that out of all the reviews the analyzer predicted as positive, 91% of them were actually positive reviews.
  • For recall, the IBM-Watson analyzer has a rate of 0.84 for positive reviews. This means that out of all the actual positive reviews, only 84% were correctly predicted.

Although the IBM-Watson analyzer has lower recall than precision, it has the best evaluation score in all 3 metrics.

What if we could utilize the scores of all 3 analyzers by building some sort of majority-vote classifier?

This would mean choosing the most frequent category among the 3 analyzers for each review as the right category or class. This can be easily done by finding the mode of the 3 analyzer labels and storing each value in a new column like this:
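A minimal sketch of the majority vote, assuming the three prediction columns hold 0/1 labels (column names are hypothetical):

```python
import pandas as pd

def add_majority_vote(df, cols, out="mode"):
    """Add a column holding the row-wise mode of the given prediction
    columns. With 3 binary votes there is always a unique majority,
    so the first mode column is the vote."""
    df[out] = df[cols].mode(axis=1)[0].astype(int)
    return df

# add_majority_vote(df, ["textblob", "vader", "watson"])
```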

This new Mode classifier can also be evaluated. Below is what the evaluation dataframe looks like after adding the metrics from the Mode classifier:

From the above, we see that the Mode classifier performed better than VADER Sentiment and TextBlob in accuracy, recall, and precision, but not as well as the IBM-Watson classifier. The IBM-Watson analyzer still comes out on top of all 4 approaches (including the majority-vote/Mode classifier).

Please note that most of the reviews used were not complex, didn’t contain long sentences, and weren’t sarcastic; all of these may be factors in VADER Sentiment performing better than the TextBlob analyzer.

Please find all the code used for this project in this GitHub repo.


In our experiment, the IBM-Watson sentiment analyzer performed the best, followed by VADER Sentiment, and then TextBlob. We also took the mode of the 3 analyzers’ predictions for each review and got better evaluation scores than TextBlob and VADER Sentiment—but still not as good as IBM-Watson.

So IBM-Watson seems to be the better choice among the 3 options used. However, the performance of all the analyzers may be improved by feature engineering and cleaning the text data before analyzing it.


