Introduction to Text Wrangling Techniques for Natural Language Processing

Sentence Splitting, Tokenization, Stemming, Lemmatization, and Stop Word Removal

What is Text Wrangling?

Although it has many forms, text wrangling is basically the pre-processing work done to prepare raw text data for training. Simply put, it’s the process of cleaning your data so that your program can read it, and then formatting it accordingly.

Many of you may already be wrangling text without knowing it. In this tutorial, I will teach you how to clean up your text in Python. I will show you how to perform the most common forms of text wrangling: sentence splitting, tokenization, stemming, lemmatization, and stop word removal.

Prerequisites

Obviously, you’ll need a little bit of Python know-how in order to run the code I’ll show below. I’ll be using a Google Colab notebook to host all my code. I’ll share the link at the end so you can see how your code compares. To create a new notebook, click here.

If you’ve worked with natural language processing code in Python before, you’re probably familiar with the Python package nltk, the Natural Language Toolkit. It’s an amazing library with many functions for building Python programs that work with human language data. Let’s begin by typing the following code:
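    import nltk            # the Natural Language Toolkit
    nltk.download('punkt') # punkt is the pre-trained sentence tokenizer we'll use next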

In this cell, we are importing the library and asking our notebook to download punkt. This is a tokenizer that divides a text into a list of sentences. This brings us to our first example of text wrangling—Sentence Splitting.

Sentence Splitting

If you’ve ever been given a large paragraph of text, you know that the best way to analyze it is by splitting the text into sentences. In real-life conversations, we also process information at the sentence level, piecing meaning together from the words within each one. However, splitting paragraphs of text into sentences is surprisingly difficult to do in raw code. Luckily, with nltk, we can do this quite easily. Type the following code:
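The sample text here is just something I made up, so feel free to use your own; I’ve included the name “Mr. Jones” on purpose, and you’ll see why in a moment.

    # sampleString is my own example -- swap in any text you like
    sampleString = "Mr. Jones went to the store. He bought some fresh fruit. Then he walked home!"
    from nltk.tokenize import sent_tokenize
    tokenized_sent = sent_tokenize(sampleString)
    print(tokenized_sent)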

This code might be self-explanatory, but it’s okay if this is your first time. Here is what we typed line by line:

  1. First, we define a variable called sampleString that contains a couple of sentences. You can change the text in this variable to whatever you wish.
  2. Next, we import sent_tokenize, which is the sentence tokenization function from the nltk library.
  3. We call the sent_tokenize function on our sampleString. This runs the tokenization function over our string and saves the results to a new variable called tokenized_sent.
  4. Finally, we print tokenized_sent to the log. You should receive an output that looks like this:
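    ['Mr. Jones went to the store.', 'He bought some fresh fruit.', 'Then he walked home!']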

As you can see, we were able to split the paragraph into its individual sentences. What’s even more fascinating is that the code knows the difference between a period used to end a sentence and a period used in an abbreviation, like the one in the name Mr. Jones.

Tokenization

By now, you’re probably wondering what tokenization is. Well, a token is the smallest unit of text that a machine can process. Therefore, every chunk of text needs to be tokenized before you can run natural language programs on it. Sometimes it makes sense for the smallest unit to be a word, and sometimes a single letter. In the previous section, we tokenized the paragraph into sentences.

For a language like English, it can be easy to tokenize text, especially with nltk to guide us. Here’s how we can tokenize text using just a few lines of code:
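The message itself is another sample of my own; I’ve slipped the number 10 into it, which will come in handy a little later.

    # msg is a made-up sample message; the outputs below assume this exact text
    msg = "Hello world! We are learning text wrangling with 10 lines of code."
    print(msg.split())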

Like before, we define a variable called msg (short for message). Then, we run a function called split over this chunk of text and print the results to the console. You should receive an output like this:
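    ['Hello', 'world!', 'We', 'are', 'learning', 'text', 'wrangling', 'with', '10', 'lines', 'of', 'code.']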

The split() function is one of the simplest tokenizers. It uses whitespace as the delimiter (the boundary between tokens) and takes the words on either side of it. However, we can take this to the next level with more specialized functions. Type the following:
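    from nltk.tokenize import word_tokenize, regexp_tokenize

    print(word_tokenize(msg))   # reusing the msg variable from the previous cell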

  1. We import two functions from the nltk.tokenize module.
  2. First, let’s look at the word_tokenize() function. This is very similar to the split() function, with one key difference: instead of only using whitespace as the delimiter, it also splits off punctuation, treating exclamation points and periods as tokens of their own.

This is what your output should look like:
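    ['Hello', 'world', '!', 'We', 'are', 'learning', 'text', 'wrangling', 'with', '10', 'lines', 'of', 'code', '.']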

Finally, let’s take a look at the regexp_tokenize() function. This is an even more advanced tokenizer that can be customized to fit your needs. Let’s take a look at an example:
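    print(regexp_tokenize(msg, pattern=r"\w+"))   # keep only runs of letters and digits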

You might notice that we pass an extra parameter to this function called pattern. This is where developers can choose how they want to tokenize the text. The pattern \w+ means that we want every run of letters and digits to be a single token, while symbols like punctuation are ignored. This is why our output looks like this:
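    ['Hello', 'world', 'We', 'are', 'learning', 'text', 'wrangling', 'with', '10', 'lines', 'of', 'code']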

Now, let’s try a different pattern:
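    print(regexp_tokenize(msg, pattern=r"\d+"))   # keep only runs of digits
    # ['10']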

Just like before, we have the same function, but with a different pattern: \d+. This keeps only the digits, which is why our output contains nothing but the number 10.

These are the most common tokenizers you’ll need to clean up your text. Next, let’s move on to stemming, another crucial step in text wrangling.

Stemming

Stemming is exactly what it sounds like: cutting a token down to its root stem. For instance, take the word “running”. It can be reduced to its root: “run”. However, “run” itself has many variations: runs, ran, etc. With stemming, we can group all of these variations under a single root. Let’s look at the code to do this:
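    from nltk.stem import PorterStemmer

    porter = PorterStemmer()
    print(porter.stem("running"))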

  1. First, we import the PorterStemmer from the toolkit. There are many algorithms for stemming words, and PorterStemmer implements just one of them. However, I’ve found it to be quite precise, since it applies a large set of suffix-stripping rules.
  2. Next, we define a variable called porter and set it equal to the PorterStemmer().
  3. Finally, we ask the stemmer to stem the word “running”. You should receive the following output:
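    run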

Now, you could skip to the next section, but I’d like to take a moment and show you two more stemmers that use different algorithms. The first is the Lancaster stemmer. It’s very easy to implement, and its results are close to those of the Porter stemmer. Here’s a look:
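    from nltk.stem import LancasterStemmer

    lancaster = LancasterStemmer()   # the variable name is my choice; anything works
    print(lancaster.stem("eating"))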

You should recognize this code by now. It’s the same as the previous example, only this time we import LancasterStemmer. Running the stemmer on the word “eating” gives us an output of:
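    eat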

Now, the last stemmer I want to show you is the SnowballStemmer. What makes this stemmer unique is that it supports many languages, with rule sets for English, German, French, Russian, and many others. Here’s how you implement it. It’s a little different from the previous two stemmers:
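Here I’ll point the stemmer at English and reuse the word “running”; the word to stem is my own pick, so use any word you like.

    from nltk.stem.snowball import SnowballStemmer

    snowball = SnowballStemmer("english")   # tell the stemmer which language to use
    print(snowball.stem("running"))         # the word to stem is my own example
    # run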

  1. Instead of nltk.stem, we import SnowballStemmer from nltk.stem.snowball, since it lives in its own submodule.
  2. We define snowball as our stemmer. However, when we do so, we specify which language the stemmer should use.
  3. Finally, we stem the word using our newly-created stemmer.

Stemming is great for its simplicity in NLP-related tasks. However, if we want to get more complex, stemming won’t be the best technique to use. Instead, this is where lemmatization shines.

Lemmatization

Lemmatization is much more advanced than stemming because rather than just following rules, this process also takes into account context and part of speech to determine the lemma, or the root form of the word. Here’s a perfect example to show the difference between lemmatization and stemming:
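In this sketch, the pos="v" argument is how I tell the lemmatizer to treat “ate” as a verb; without it, WordNet assumes a noun and leaves the word unchanged.

    nltk.download('wordnet')                # WordNet data for the lemmatizer

    from nltk.stem import WordNetLemmatizer

    lem = WordNetLemmatizer()
    print(lem.lemmatize("ate", pos="v"))    # pos="v" treats "ate" as a verb
    print(porter.stem("ate"))               # the PorterStemmer from earlier, for comparison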

  1. First, we download wordnet from the toolkit. WordNet is a massive lexical database that the lemmatizer uses to look up the lemmas of words.
  2. Next, we import the WordNetLemmatizer from nltk.stem.
  3. We define our variable lem to be the lemmatization function.
  4. Finally, we lemmatize the word “ate”, telling the lemmatizer to treat it as a verb, and ask for the result to be printed to the console.
  5. For comparison, we use our previously created PorterStemmer to stem the same word and print the result to the console.

WordNet is constantly being updated, but at the time of writing, this is what my console displayed:
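    eat
    ate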

So we can see that through lemmatization, we can account for the tense of a word and recover its base form in the present tense (“eat”), while the stemmer leaves “ate” untouched, all with a few lines of code. Lemmatization is one of the many powerful techniques in text wrangling.

Stop Word Removal

Finally, we come to the last section of this tutorial: stop word removal. Stop words are commonly used words that are usually ignored because they occur so frequently. Most of these words are articles and prepositions, such as “the”, “a”, “in”, etc.

These words can end up taking extra space or eating up processing time without adding much meaning. Luckily, nltk has a list of stop words in 16 different languages. We can use this list to parse paragraphs of text and remove the stop words from them. Here’s how to do it:
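The paragraph below is a made-up sample of mine, so any text will do; also note that, following the walkthrough below, the variable is named list, which shadows Python’s built-in list().

    nltk.download('stopwords')

    from nltk.corpus import stopwords

    list = stopwords.words('english')   # note: this shadows the built-in list()

    # a made-up sample paragraph -- use any text you like
    paragraph = "Text wrangling is one of the most important steps in the process of preparing your data for a machine learning model."

    postPara = [word for word in paragraph.split() if word not in list]
    print(postPara)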

This is perhaps the most complex code in this tutorial, so I’ll run through it piece-by-piece:

  1. First, we download the stopwords from the toolkit.
  2. Next, we import all the stop words from nltk.corpus. A corpus is a large dataset of texts.
  3. Next, we define a variable list and set this to contain all the English stop words.
  4. As with any text wrangling technique, we need a sample text, so we type up a short paragraph and define it as the variable paragraph.
  5. We create a new variable called postPara, which contains all the words in paragraph, split on whitespace, excluding any word that appears in list.
  6. Finally, we print postPara to our console:
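    ['Text', 'wrangling', 'one', 'important', 'steps', 'process', 'preparing', 'data', 'machine', 'learning', 'model.']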

As you can see, our text is split up into different words, but the stop words are removed, showing you only the words deemed important. Most articles and prepositions are gone!

Conclusion

As you can see, text wrangling can be essential in making sure you have the best data to work with. With NLTK, it’s easier than ever to run complex algorithms on your text using only a few lines of code. You can split up your text however you want, weed out the unnecessary parts, and even reduce words to the form that makes the most sense for your computations.

We’ve barely scratched the surface in terms of what can be done with NLTK. I’d suggest taking a look at the official NLTK website. Feel free to leave me a message in the comments if you’ve got a question or need some help! For reference, here is the link to the complete Colab notebook.
