Sentiment Analysis of Speech Using PyDub and SpeechRecognition in Python

Speech-to-text, EDA, and sentiment analysis on text

Speech recognition (speech-to-text) is the ability of a machine or program to identify spoken words and transcribe them into readable text. In this tutorial, I will walk you through analyzing speech data and converting it to text suitable for sentiment analysis, using the pydub and SpeechRecognition libraries in Python.

Sentiment analysis is the use of natural language processing to classify people's opinions. It classifies text (originally written or spoken) as positive, negative, or neutral, depending on the use case. Applied to a product, for example, it helps reveal what users are saying so that the necessary steps can be taken to mitigate any problems.


To follow this tutorial, you will need basic knowledge of Python programming and the fundamentals of natural language processing.


We will need to install the necessary libraries for this project:

Note: To install FFmpeg for Windows, kindly follow the guide here
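Before moving on, a quick check like the one below can confirm the setup. It is a minimal sketch, assuming the libraries are importable as pydub, speech_recognition, and nltk, and that FFmpeg is exposed as an ffmpeg executable on your PATH; check_setup is a hypothetical helper name, not part of any of these libraries.

```python
import importlib.util
import shutil


def check_setup(modules=("pydub", "speech_recognition", "nltk"), tools=("ffmpeg",)):
    """Report which Python modules and command-line tools can be found."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        status[name] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        status[tool] = shutil.which(tool) is not None
    return status


for name, found in check_setup().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

Anything reported as missing needs to be installed before continuing. The VADER analyzer used later also needs NLTK's vader_lexicon resource, which nltk.download('vader_lexicon') fetches.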

Import and convert audio data to .wav format

At this point, it is assumed you have the dependencies installed successfully. Hence, let’s write a function that reads our audio data and converts the data format to a .wav file from any other audio format (mp3, mp4, etc.).

import os

from pydub import AudioSegment
import speech_recognition as sr

def convert_to_wav(filename):

  """Takes an audio file of non .wav format and converts to .wav"""
  # Import audio file (non-.wav formats are decoded through FFmpeg)
  audio = AudioSegment.from_file(filename)

  # Create new filename by swapping the extension for .wav
  new_filename = os.path.splitext(filename)[0] + ".wav"

  # Export file as .wav
  print(f"Converting {filename} to {new_filename}...")
  audio.export(new_filename, format='wav')
  return new_filename

In the block of code above, the AudioSegment class was imported from the pydub library (it contains many of the methods you will be using) and the SpeechRecognition library was imported as sr. The function takes the argument filename (the name of the audio file) and uses the from_file method of the AudioSegment class to read the file into the audio variable. The next line uses os.path.splitext to separate the filename from its extension and then appends ".wav" with the ‘+’ operator, which concatenates the strings; unlike splitting on the first ".", this also works for names or paths that contain more than one dot. The result is saved in the variable new_filename. Lastly, the audio is exported in the .wav format and the new filename is returned.
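To see how the extension swap behaves on names with more than one dot, here is a small sketch that compares cutting at the first dot with os.path.splitext. The helper wav_name and the filenames are hypothetical and only serve as illustration.

```python
import os


def wav_name(filename):
    """Return `filename` with its last extension replaced by .wav."""
    root, _extension = os.path.splitext(filename)
    return root + ".wav"


# Hypothetical filenames, for illustration only
for name in ["call.mp3", "call.v2.mp3", "recordings/2021.05/call.mp3"]:
    first_dot = name.split(".")[0] + ".wav"  # cuts at the first dot
    print(f"{name}: split -> {first_dot}, splitext -> {wav_name(name)}")
```

For a simple name such as call.mp3 both approaches agree; they differ as soon as a dot appears earlier in the name or in a folder name.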

Exploratory Data Analysis (EDA)

Exploratory data analysis is the act of analyzing a dataset to show its main attributes or characteristics. For this project, we will use pydub, a Python library for audio manipulation with a simple interface, to extract the following from the audio data: channels, sample width, frame rate, frame width, and length.

The function below will generate the above-listed attributes of the audio data:

def show_pydub_stats(filename):
  """Prints audio attributes of an audio file and returns the AudioSegment."""
  # Create AudioSegment instance
  audio_segment = AudioSegment.from_file(filename)

  # Print audio attributes and return AudioSegment instance
  print(f"Channels: {audio_segment.channels}")
  print(f"Sample width: {audio_segment.sample_width}")
  print(f"Frame rate (sample rate): {audio_segment.frame_rate}")
  print(f"Frame width: {audio_segment.frame_width}")
  print(f"Length (ms): {len(audio_segment)}")
  return audio_segment

Expected output:

Running the function on the filename prints the number of channels, the sample width in bytes, the frame rate in Hz, the frame width in bytes, and the length in milliseconds.
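Once the audio has been converted to .wav, the same attributes can be cross-checked without pydub. The sketch below uses Python's standard-library wave module instead, so it only works on uncompressed .wav files; wav_stats is a hypothetical helper name.

```python
import wave


def wav_stats(path):
    """Read basic attributes of a PCM .wav file with the stdlib wave module."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_width = wav_file.getsampwidth()  # bytes per sample
        frame_rate = wav_file.getframerate()    # frames per second (Hz)
        n_frames = wav_file.getnframes()

    return {
        "channels": channels,
        "sample_width": sample_width,
        "frame_rate": frame_rate,
        # One frame holds one sample for every channel
        "frame_width": channels * sample_width,
        "length_ms": round(1000 * n_frames / frame_rate),
    }
```

The values should agree with what show_pydub_stats prints for the same .wav file, which makes this a handy sanity check after conversion.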

You can also slice the audio content. Slicing audio works the same way as slicing a sequence in Python, but the indices are in milliseconds. Therefore, to get the first two seconds, you slice up to 2000; two minutes would be 120000.
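Because the millisecond units are easy to get wrong, a small conversion helper can make slices more readable. This is a minimal sketch; seconds_to_ms and slice_seconds are hypothetical helper names, and the segment argument is assumed to be a pydub AudioSegment such as the one returned by show_pydub_stats.

```python
def seconds_to_ms(seconds):
    """Convert seconds to the millisecond units used for audio slicing."""
    return int(round(seconds * 1000))


def slice_seconds(segment, start_s, end_s):
    """Return the part of `segment` between start_s and end_s (in seconds).

    `segment` is expected to be a pydub AudioSegment, whose slice
    indices are interpreted as milliseconds.
    """
    return segment[seconds_to_ms(start_s):seconds_to_ms(end_s)]
```

For example, slice_seconds(audio_segment, 0, 2) is equivalent to audio_segment[:2000], and calling len() on the result gives its length in milliseconds.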

In the same way, you can increase or decrease the volume of the audio by adding or subtracting a number of decibels.
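A volume change can be sketched as follows, assuming the segment is a pydub AudioSegment. The helper names change_volume and db_to_amplitude_ratio are hypothetical; the second one only illustrates what a gain in decibels means for the amplitude.

```python
def change_volume(segment, gain_db):
    """Return a copy of `segment` made louder (gain_db > 0) or quieter (gain_db < 0).

    Assumes `segment` is a pydub AudioSegment; `segment + gain_db` is the
    operator form of the same call.
    """
    return segment.apply_gain(gain_db)


def db_to_amplitude_ratio(gain_db):
    """Convert a gain in decibels to the factor applied to the amplitude."""
    return 10 ** (gain_db / 20)
```

A gain of +6 dB roughly doubles the amplitude and -6 dB roughly halves it, so small numbers already make an audible difference.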

Speech to text

One of the challenges of working with audio data is transcribing it to text for further analysis. The accuracy of the transcribed text has a direct impact on the output of the analysis, so it is important to ensure the audio is correctly transcribed. The SpeechRecognition library offers several speech-to-text methods backed by different providers. For this project, we will use Google’s speech-to-text, wrapped in the recognize_google method of the library. Let’s write the transcribe_audio function:

def transcribe_audio(filename):

  """Takes a .wav format audio file and transcribes it to text."""
  # Setup a recognizer instance
  recognizer = sr.Recognizer()

  # Import the audio file and convert to audio data
  audio_file = sr.AudioFile(filename)
  with audio_file as source:
    audio_data = recognizer.record(source)

  # Return the transcribed text
  return recognizer.recognize_google(audio_data)

The function above takes a .wav file as an argument and creates an instance of the Recognizer class of SpeechRecognition in the recognizer variable. The AudioFile class of SpeechRecognition is then used to open the file, and the result is saved as audio_file (an AudioFile data structure).

The AudioFile data structure needs to be converted to AudioData (the data structure used for transcription), and this is done by the .record method of the recognizer. Lastly, since the data is now ready for transcription, Google speech-to-text is used to convert it to a block of text.
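Transcription depends on an online service, so it can fail. The sketch below is one possible way to handle the two exceptions that SpeechRecognition raises in that case; transcribe_audio_safely is a hypothetical variant of the function above, and returning None on failure is an assumption of this sketch rather than library behavior.

```python
def transcribe_audio_safely(filename, language="en-US"):
    """Transcribe a .wav file, returning None when no text can be produced."""
    # Imported inside the function so the sketch stays self-contained
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(filename) as source:
        audio_data = recognizer.record(source)

    try:
        return recognizer.recognize_google(audio_data, language=language)
    except sr.UnknownValueError:
        # The service could not make out any words in the audio
        print(f"Could not understand the audio in {filename}")
    except sr.RequestError as error:
        # The service was unreachable or rejected the request
        print(f"Speech service request failed: {error}")
    return None
```

Checking for None before running sentiment analysis avoids scoring an empty or missing transcript.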

Sentiment analysis of text

At this stage, you should have your audio converted to text and ready for analysis. In this project, you will use VADER, a pre-built sentiment analyzer available in NLTK that is tuned for social media text such as tweets, rather than training a model yourself. This is because the main objective is to show how to work with the audio data format; once the spoken words have been transcribed, the process for analyzing sentiment is the same as for any other text.

It is assumed that you have successfully installed the necessary packages from nltk as specified in the installation section earlier. Therefore, let’s analyze some sentiment!

The audio data can be listened to here before running the code.

from speech_helpers import convert_to_wav, show_pydub_stats, transcribe_audio
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

filename = 'ex4_call_1_stereo_mp3.mp3'
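new_name = 'ex4_call_1_stereo_mp3.wav'

# Convert the audio to .wav, then transcribe it
convert_to_wav(filename)
trans_text = transcribe_audio(new_name)
print(trans_text)

# Score the sentiment of the transcribed text
print(sid.polarity_scores(trans_text))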

In the code above, I imported the functions we wrote earlier (saved in a module named speech_helpers) and the SentimentIntensityAnalyzer from the VADER model of NLTK. An instance of the analyzer is stored in sid, the name of the audio file is stored in filename (to be used as the argument of the functions), and the name of the .wav version is stored in new_name (this file is generated when the audio is converted with the convert_to_wav function). I then transcribed the audio to text with transcribe_audio and used the polarity_scores method of the SentimentIntensityAnalyzer to get the sentiment scores.

Expected output

Running the code above produces two results. The first is the transcribed text of the audio data. The second is the result of the sentiment analysis, rendered as a dictionary with four keys: negative (neg), neutral (neu), positive (pos), and compound.

The value for each key shows the degree to which the text is negative, neutral, or positive, while compound is a combined score. From the results, we can see that the text is mostly neutral. The compound score is scaled between -1 and +1: the closer it is to -1, the more negative the text, and the closer it is to +1, the more positive.
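If a single label is more convenient than four numbers, the compound score can be turned into one. The sketch below uses a cutoff of ±0.05, which is the value commonly suggested in VADER's own documentation; treat it as a tunable assumption. label_sentiment is a hypothetical helper, and the example scores are made up for illustration.

```python
def label_sentiment(scores, threshold=0.05):
    """Map a VADER-style score dictionary to a coarse sentiment label."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores, for illustration only (not output from the tutorial audio)
example_scores = {"neg": 0.0, "neu": 0.8, "pos": 0.2, "compound": 0.4}
print(label_sentiment(example_scores))  # positive
```

In this tutorial, the dictionary returned by sid.polarity_scores(trans_text) could be passed in directly.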


The quality of the analysis can be improved by using a paid API that transcribes the audio into separate sentences. Sentiment analysis can then be performed on each sentence instead of on the whole text at once.
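Sentence-level scoring could look roughly like the sketch below. It assumes the transcript already contains punctuation, which may not be the case for text returned by recognize_google, and it uses a naive regular expression rather than a proper sentence tokenizer. split_sentences and score_sentences are hypothetical helpers; the scorer argument can be any function that takes a string.

```python
import re


def split_sentences(text):
    """Naively split punctuated text into sentences on ., ! or ?"""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def score_sentences(text, scorer):
    """Apply `scorer` to every sentence and pair each sentence with its result."""
    return [(sentence, scorer(sentence)) for sentence in split_sentences(text)]
```

With the analyzer from this tutorial, score_sentences(trans_text, sid.polarity_scores) would return one score dictionary per sentence, which makes it easier to spot where in a call the tone changes.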

Cheers for making it to the end of this tutorial. Happy learning!
