NLP Chronicles: Introduction to Natural Language Processing with NLTK

What is Natural Language Processing?

Natural Language Processing (NLP) helps computers “read and understand” text or speech by simulating human language abilities.

In recent years, NLP has grown rapidly thanks to an abundance of data. As more and more unstructured data becomes available, the field has gained immense popularity.

Introduction To The Series

The purpose of this article series, NLP Chronicles, is to introduce you to the field of NLP while offering practical knowledge of how to tackle various NLP-related tasks. The series has no fixed destination; it’s best to think of it as wandering into the wilderness of NLP.

Since this is an introductory article, we’ll start with basic NLP tasks and how to perform them with NLTK, a Python NLP library. As the series progresses, we’ll cover more complex topics.

We’ll be using Python throughout this series because it has a rich toolset for every aspect of NLP. Java also has a fairly good toolset, but in my opinion it’s not as effective as Python’s.

Prerequisites

  • Python 3.x
  • Jupyter Notebook

Natural Language Toolkit (NLTK)

Natural Language Toolkit (NLTK) is by far the most popular Python toolkit for dealing with NLP-related tasks. It comes with numerous examples and a great API that’s clear and concise. It also includes numerous corpora and other tools to cater to most NLP-related tasks.

To install NLTK on Anaconda, follow the instructions in the link below.

We also need to install the packages and corpora that come with NLTK separately. The following link walks through that process:
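
In short, the setup looks roughly like this (a minimal sketch; conda install nltk, pip install nltk, and the nltk.download() downloader are the standard routes, but see the linked guides for your environment):

# Install NLTK into the active environment (either command works)
#   conda install nltk
#   pip install nltk

# Then download the bundled corpora and models
import nltk

nltk.download('all')        # everything (large); or pick individual packages:
# nltk.download('punkt')                        # sentence/word tokenizer models
# nltk.download('stopwords')                    # stop word lists
# nltk.download('wordnet')                      # WordNet corpus
# nltk.download('averaged_perceptron_tagger')   # POS tagger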

Basic Tasks of Natural Language Processing

In this article, we’ll be discussing how to perform fundamental NLP tasks. We’ll be discussing each task in detail and also demonstrating how to perform it using NLTK.

  1. Tokenization
  2. Word Stemming and Lemmatization
  3. Part-Of-Speech (POS) Tagging
  4. Chunking
  5. Stop Word Removal
  6. Named Entity Recognition

Tokenization

Tokenization (also known as word segmentation) is the process of breaking text into smaller meaningful elements called tokens. These so-called tokens can be words, numbers, or punctuation marks.

This process is done using a tokenization algorithm, which identifies word or sentence boundaries and splits the text at those boundaries. Tokenization is a crucial step in most NLP-related tasks. In most cases, it functions as a pre-processing step.

Since tokenization is relatively easy and uninteresting compared to other NLP tasks, it’s overlooked most of the time. However, errors made in this phase will propagate into later stages and cause problems.

Sentence Tokenization

Sentence tokenization is the process of tokenizing a text into sentences. To perform sentence-level tokenization, NLTK provides a method called sent_tokenize. This method uses an instance of PunktSentenceTokenizer.

We import the sent_tokenize method as depicted in the code snippet below. The method takes a string as a parameter and returns a list of sentences. The tokenizer comes pre-trained for English and a few other European languages.

from nltk import sent_tokenize

text = "Success is not final. Failure is not fatal. It is the courage to continue that counts."

sentence_tokens = sent_tokenize(text)
print(sentence_tokens)
# ['Success is not final.', 'Failure is not fatal.', 'It is the courage to continue that counts.']

for sentence in sentence_tokens:
    print(sentence)
# Success is not final.
# Failure is not fatal.
# It is the courage to continue that counts.

Word Tokenization

Word tokenization is the process of tokenizing sentences or text into words and punctuation. NLTK provides several ways to perform word-level tokenization.

It provides a method called word_tokenize, which splits text on punctuation and non-alphabetic characters. This method is a wrapper around the TreebankWordTokenizer, so the results from both are identical.

NLTK also provides other tokenizers, such as WordPunctTokenizer and WhitespaceTokenizer. WordPunctTokenizer also separates punctuation from words, but unlike the TreebankWordTokenizer, it splits every punctuation character (including apostrophes) into its own token. WhitespaceTokenizer, as the name suggests, splits the text on whitespace only. There are a few other tokenizers available as well.

from nltk.tokenize import word_tokenize

sentence = "Let's see how the tokenizer split's this!"

word_tokens = word_tokenize(sentence)
print(word_tokens)
# ['Let', "'s", 'see', 'how', 'the', 'tokenizer', 'split', "'s", 'this', '!']

from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer, WhitespaceTokenizer
tree_tokenizer = TreebankWordTokenizer()
word_punct_tokenizer = WordPunctTokenizer()
white_space_tokenizer = WhitespaceTokenizer()

word_tokens = tree_tokenizer.tokenize(sentence)
print(word_tokens)
# ['Let', "'s", 'see', 'how', 'the', 'tokenizer', 'split', "'s", 'this', '!']

word_tokens = word_punct_tokenizer.tokenize(sentence)
print(word_tokens)
# ['Let', "'", 's', 'see', 'how', 'the', 'tokenizer', 'split', "'", 's', 'this', '!']

word_tokens = white_space_tokenizer.tokenize(sentence)
print(word_tokens)
# ["Let's", 'see', 'how', 'the', 'tokenizer', "split's", 'this!']

Word Stemming and Lemmatization

The goal of both stemming and lemmatization is to reduce an inflected (or derived) word’s form to its root or base form. It’s essential in many NLP-related tasks such as information retrieval, text summarization, topic extraction, and more.

Even though the goal is similar, the process by which it’s done is different.

Stemming

Stemming is a heuristic process in which a word’s endings are chopped off in the hope of arriving at its base form. Stemming acts on words without knowing their context. Therefore, it’s faster but doesn’t always yield the desired result.

Stemming isn’t as easy as we might presume. If it were, there would be only one implementation. Sadly, stemming is an imprecise science, which leads to issues such as understemming and overstemming.
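
To make that concrete, here’s a quick illustration with PorterStemmer (the classic example; the outputs in the comments are what the Porter algorithm typically produces, so treat them as indicative):

from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Overstemming: loosely related words collapse to the same stem.
print(porter.stem('universal'))   # univers
print(porter.stem('university'))  # univers
print(porter.stem('universe'))    # univers

# Understemming: closely related words end up with different stems.
print(porter.stem('alumnus'))     # alumnu
print(porter.stem('alumni'))      # alumni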

NLTK provides several stemmers, the most prominent being PorterStemmer, which implements the Porter stemming algorithm. It’s the most widely used mainly because it tends to produce better results than the other stemmers.

Other stemmers include SnowballStemmer and LancasterStemmer. It’s worth mentioning that SnowballStemmer supports languages other than English as well. The following code snippet compares the aforementioned stemmers.

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('lying'))  # lie
print(porter_stemmer.stem('lies'))   # lie
print(porter_stemmer.stem('lied'))   # lie

lancaster_stemmer = LancasterStemmer()
print(lancaster_stemmer.stem('lying'))  # lying
print(lancaster_stemmer.stem('lies'))   # lie
print(lancaster_stemmer.stem('lied'))   # lied

snowball_stemmer = SnowballStemmer('english')
print(snowball_stemmer.stem('lying'))  # lie
print(snowball_stemmer.stem('lies'))   # lie
print(snowball_stemmer.stem('lied'))   # lie
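
Since SnowballStemmer supports languages beyond English, you can list the available languages and build a stemmer for any of them. A minimal sketch (SnowballStemmer.languages is part of NLTK; the German example output is indicative, so verify it on your own install):

from nltk.stem import SnowballStemmer

# Languages shipped with the Snowball stemmer (contents depend on your NLTK version)
print(SnowballStemmer.languages)
# e.g. ('danish', 'dutch', 'english', 'finnish', 'french', 'german', ...)

german_stemmer = SnowballStemmer('german')
print(german_stemmer.stem('katzen'))  # the German plural is reduced to its stem, e.g. 'katz'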

Lemmatization

Lemmatization is a process that uses vocabulary and morphological analysis of words to remove inflected endings and return a word’s base (dictionary) form, which is known as the lemma.

It’s a much more complicated and expensive process that requires an understanding of the context in which words appear in order to decide what they mean. Because it consults a lexical vocabulary to derive the root form, it’s more time-consuming than stemming, but it’s also more likely to yield accurate results.

Lemmatization can be done with NLTK using WordNetLemmatizer, which uses a lexical database called WordNet (a detailed explanation of the WordNet database will be in a later section).

NLTK provides an interface for the WordNet database. WordNetLemmatizer uses the interface to derive the lemma of a given word.

When using the WordNetLemmatizer, we should specify the part of speech in order to derive the correct lemma. A word can be treated as a noun (n), adjective (a), verb (v), or adverb (r). The following code snippet shows lemmatization in action.

The lemmatize function below demonstrates how the lemma changes with the part of speech given.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running"))  # running

def lemmatize(word):
    lemmatizer = WordNetLemmatizer()
    print("verb form: " + lemmatizer.lemmatize(word, pos="v"))
    print("noun form: " + lemmatizer.lemmatize(word, pos="n"))
    print("adverb form: " + lemmatizer.lemmatize(word, pos="r"))
    print("adjective form: " + lemmatizer.lemmatize(word, pos="a"))

lemmatize("ears")
# verb form: ears
# noun form: ear
# adverb form: ears
# adjective form: ears

lemmatize("running")
# verb form: run
# noun form: running
# adverb form: running
# adjective form: running

Stemming vs Lemmatization

Whether to use stemming or lemmatization mostly depends on the situation at hand. If speed is required, it’s better to resort to stemming. But if accuracy is required, it’s best to use lemmatization.

The following code snippet shows the comparison between stemming and lemmatization.

from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("deactivating"))  # deactiv
print(stemmer.stem("deactivated"))   # deactiv
print(stemmer.stem("deactivates"))   # deactiv

print(lemmatizer.lemmatize("deactivating", pos="v"))  # deactivate
print(lemmatizer.lemmatize("deactivated", pos="v"))   # deactivate
print(lemmatizer.lemmatize("deactivates", pos="v"))   # deactivate

print(stemmer.stem('stones'))    # stone
print(stemmer.stem('speaking'))  # speak
print(stemmer.stem('bedroom'))   # bedroom
print(stemmer.stem('jokes'))     # joke
print(stemmer.stem('lisa'))      # lisa
print(stemmer.stem('purple'))    # purpl

print(lemmatizer.lemmatize('stones'))    # stone
print(lemmatizer.lemmatize('speaking'))  # speaking
print(lemmatizer.lemmatize('bedroom'))   # bedroom
print(lemmatizer.lemmatize('jokes'))     # joke
print(lemmatizer.lemmatize('lisa'))      # lisa
print(lemmatizer.lemmatize('purple'))    # purple

Part-Of-Speech (POS) Tagging

Part-of-speech (POS) tagging is another very important component of NLP. The purpose of POS tagging is to label each token (a word, in this case) with its grammatical category, such as noun, verb, adjective, or adverb. Most of these categories are further divided into sub-classes.

POS tagging can be treated as a supervised machine learning problem, mainly because taggers take features like the previous word, the next word, and capitalization into consideration when assigning a POS tag to a word.

The most popular tag set for POS tagging is the Penn Treebank tag set, and most trained POS taggers for English use it. The following link lists the available Penn Treebank POS tags.
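
You can also look up tag descriptions without leaving Python: NLTK ships the tag set documentation as a downloadable package, for example:

import nltk
# nltk.download('tagsets')  # one-time download of the tag set documentation

nltk.help.upenn_tagset('JJS')   # prints the description and examples for the superlative-adjective tag
nltk.help.upenn_tagset('NN.*')  # a regular expression works too: describes all the noun tags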

NLTK provides a function called pos_tag to perform POS tagging of sentences, but it requires the sentence to be tokenized first. The following code snippet shows how POS tagging can be performed with NLTK:

from nltk import word_tokenize, pos_tag

sentence = "The hardest choices require the strongest wills"

sentence_tokens = word_tokenize(sentence)
print(sentence_tokens)
# ['The', 'hardest', 'choices', 'require', 'the', 'strongest', 'wills']

pos_tag(sentence_tokens)
# [('The', 'DT'),
#  ('hardest', 'JJS'),
#  ('choices', 'NNS'),
#  ('require', 'VBP'),
#  ('the', 'DT'),
#  ('strongest', 'JJS'),
#  ('wills', 'NNS')]

Chunking

Chunking, or shallow parsing, is a process that extracts phrases from a text sample. Rather than identifying the full grammatical structure of a sentence, we extract chunks that carry meaning on their own. This is different from, and more advanced than, tokenization because it extracts phrases instead of individual tokens.

As an example, the phrase “North America” can be extracted as a single chunk rather than the two separate tokens “North” and “America” that tokenization produces.

Chunking requires POS-tagged input and produces chunks of phrases as output. As with POS tags, there’s a standard set of chunk tags, such as Noun Phrase (NP) and Verb Phrase (VP).

As an example, let’s consider noun phrase chunking. We search for chunks corresponding to individual noun phrases according to a given rule. To create NP chunks, we define the chunk grammar over POS tags, expressed as a regular expression rule: NP: {<DT>?<JJ>*<NN>}

The rule states that whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN), an NP chunk should be formed.

This way, we can use grammar rules to extract NPs from POS-tagged sentences:

from nltk import pos_tag, word_tokenize, RegexpParser

sentence = "the big vicious dog barked at the small feeble cat"
# sentence = "the little yellow hard tight dog barked at the cat"

# Define your grammar using regular expressions
grammar = ('''NP: {<DT>?<JJ>*<NN>} # NP''')

chunkParser = RegexpParser(grammar)
tagged = pos_tag(word_tokenize(sentence))
tagged
# [('the', 'DT'),
#  ('big', 'JJ'),
#  ('vicious', 'JJ'),
#  ('dog', 'NN'),
#  ('barked', 'VBD'),
#  ('at', 'IN'),
#  ('the', 'DT'),
#  ('small', 'JJ'),
#  ('feeble', 'JJ'),
#  ('cat', 'NN')]

tree = chunkParser.parse(tagged)

for subtree in tree.subtrees():
    print(subtree)
# (S
#   (NP the/DT big/JJ vicious/JJ dog/NN)
#   barked/VBD
#   at/IN
#   (NP the/DT small/JJ feeble/JJ cat/NN))
# (NP the/DT big/JJ vicious/JJ dog/NN)
# (NP the/DT small/JJ feeble/JJ cat/NN)

tree.draw()
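
The same mechanism extends to other chunk types. Below is a small sketch (not from the original snippet) that adds prepositional-phrase (PP) and verb-phrase (VP) rules on top of the NP rule; RegexpParser applies the clauses in order, so later rules can refer to chunks built by earlier ones. The exact bracketing may vary slightly between NLTK versions:

from nltk import pos_tag, word_tokenize, RegexpParser

sentence = "the big vicious dog barked at the small feeble cat"

# A cascaded grammar: NPs first, then PPs built from NPs, then VPs.
grammar = r"""
  NP: {<DT>?<JJ>*<NN>}   # noun phrase
  PP: {<IN><NP>}         # prepositional phrase: preposition + NP
  VP: {<VB.*><NP|PP>*}   # verb phrase: verb followed by NPs/PPs
"""

chunk_parser = RegexpParser(grammar)
tree = chunk_parser.parse(pos_tag(word_tokenize(sentence)))
print(tree)
# Roughly: (S (NP the/DT big/JJ vicious/JJ dog/NN)
#             (VP barked/VBD (PP at/IN (NP the/DT small/JJ feeble/JJ cat/NN))))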

Stop Word Removal

Stop words are simply words that carry very little meaning on their own and mostly serve the grammatical structure of a sentence. Words like “the”, “a”, “an”, and “in” are considered stop words.

Even though it doesn’t seem like much, stop word removal plays an important role in tasks such as sentiment analysis. The process is also used by search engines when indexing the entries of a search query.

NLTK comes with a stopwords corpus that contains stop word lists for 16 different languages. NLTK doesn’t provide a direct function for removing stop words, but we can use these lists to remove them from sentences programmatically.

If we’re dealing with many sentences, the text must first be split into sentences using sent_tokenize. Then, using word_tokenize, we can break the sentences into words and remove the stop words using the list. The following code snippet depicts this process:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

print(stopwords.words('english'))
# ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",
#  "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself',
#  'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
#  'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll",
#  'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has',
#  'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
#  'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against',
#  'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from',
#  'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once',
#  'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
#  'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than',
#  'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now',
#  'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn',
#  "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn',
#  "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't",
#  'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

sentence = "Success is not final. Failure is not fatal. It is the courage to continue that counts."

word_tokens = word_tokenize(sentence)
print(word_tokens)
# ['Success', 'is', 'not', 'final', '.', 'Failure', 'is', 'not', 'fatal', '.', 'It', 'is', 'the', 'courage', 'to', 'continue', 'that', 'counts', '.']

clean_tokens = word_tokens[:]
for token in word_tokens:
    if token in stopwords.words('english'):
        clean_tokens.remove(token)

print(clean_tokens)
# ['Success', 'final', '.', 'Failure', 'fatal', '.', 'It', 'courage', 'continue', 'counts', '.']
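
One small refinement worth making in practice (not in the original snippet): look up stop words in a set instead of calling stopwords.words('english') inside the loop, and compare in lowercase so capitalized words like “It” are caught as well. A minimal sketch:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))  # set membership tests are fast

sentence = "Success is not final. Failure is not fatal. It is the courage to continue that counts."
word_tokens = word_tokenize(sentence)

clean_tokens = [token for token in word_tokens if token.lower() not in stop_words]
print(clean_tokens)
# ['Success', 'final', '.', 'Failure', 'fatal', '.', 'courage', 'continue', 'counts', '.']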

Named Entity Recognition

Named entity recognition (NER) is the process of identifying entities such as names, locations, dates, or organizations in an unstructured text sample.

The purpose of NER is to map the extracted entities against a knowledge base, or to extract relationships between different entities, e.g. who did what, where something took place, or at what time something occurred.

NER is a very important task in information extraction. Other applications where NER is used include:

  • Classifying content (in news, law domains)
  • For efficient search algorithms
  • In content recommendation algorithms
  • Chatbots, voice assistants, etc.

For domain-specific entities, in a field like medicine or law, we’ll need to train our own NER algorithm.

For casual use, NLTK provides us with a method called ne_chunk to perform NER on a given text. In order to use ne_chunk, the text first needs to be tokenized into words and then POS tagged. After NER, the tagged words are annotated with their respective entity types. In the example below, Mark and John are of type PERSON, Google and Yahoo are of type ORGANIZATION, and New York City is of type GPE (which indicates a location).

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Mark who works at Yahoo and John who works at Google decided to meet at New York City"

print(ne_chunk(pos_tag(word_tokenize(sentence))))
# (S
#   (PERSON Mark/NNP)
#   who/WP
#   works/VBZ
#   at/IN
#   (ORGANIZATION Yahoo/NNP)
#   and/CC
#   (PERSON John/NNP)
#   who/WP
#   works/VBZ
#   at/IN
#   (ORGANIZATION Google/NNP)
#   decided/VBD
#   to/TO
#   meet/VB
#   at/IN
#   (GPE New/NNP York/NNP City/NNP))

sentence = "The Avengers began as a group of extraordinary individuals who were assembled to defeat \
Loki and his chitauri army in New York City. "

print(ne_chunk(pos_tag(word_tokenize(sentence))))
# (S
#   The/DT
#   (ORGANIZATION Avengers/NNP)
#   began/VBD
#   as/IN
#   a/DT
#   group/NN
#   of/IN
#   extraordinary/JJ
#   individuals/NNS
#   who/WP
#   were/VBD
#   assembled/VBN
#   to/TO
#   defeat/VB
#   (PERSON Loki/NNP)
#   and/CC
#   his/PRP$
#   chitauri/NN
#   army/NN
#   in/IN
#   (GPE New/NNP York/NNP City/NNP)
#   ./.)
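
If you want the entities as plain Python data rather than a printed tree, you can walk the ne_chunk output: named-entity chunks are Tree objects, while everything else stays a (word, tag) tuple. A small sketch (the extract_entities helper is mine, not part of NLTK):

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.tree import Tree

def extract_entities(text):
    """Collect (entity text, entity label) pairs from the ne_chunk output."""
    entities = []
    for node in ne_chunk(pos_tag(word_tokenize(text))):
        if isinstance(node, Tree):  # named-entity chunks are subtrees
            entity = " ".join(token for token, tag in node.leaves())
            entities.append((entity, node.label()))
    return entities

print(extract_entities("Mark who works at Yahoo and John who works at Google decided to meet at New York City"))
# Based on the run above, this should give:
# [('Mark', 'PERSON'), ('Yahoo', 'ORGANIZATION'), ('John', 'PERSON'), ('Google', 'ORGANIZATION'), ('New York City', 'GPE')]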

WordNet Interface

WordNet is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

NLTK provides an interface for the WordNet database through its corpus module. WordNet is composed of approximately 155,200 words and 117,600 synonym sets that are logically related to each other.

As an example, a word like computer has two synsets in WordNet: one for the machine that performs computations, and one for the calculator sense, which is lexically associated with computer. The first is identified by computer.n.01, known as the lemma code name (the letter n indicates that the word is a noun).

We can further analyze a synset to find other words associated with it. As you can see in the output below, all the words closely associated with (and in the same sense as) computer are listed:

Using WordNet, we’re able to find the definition of a particular word and also its usage examples (the database may or may not contain usage examples for a given word).

We can also use it to find synonyms and antonyms of words. The following snippet contains all the code mentioned here and also shows how to retrieve synonyms and antonyms for a particular word:

from nltk.corpus import wordnet

wordnet.synsets("computer")
# [Synset('computer.n.01'), Synset('calculator.n.01')]

syn = wordnet.synset('computer.n.01')
syn.lemma_names()
# ['computer',
#  'computing_machine',
#  'computing_device',
#  'data_processor',
#  'electronic_computer',
#  'information_processing_system']

syn.definition()
# 'a machine for performing calculations automatically'

wordnet.synset("car.n.01").examples()
# ['he needs a car to get to work']

synonyms = []
for syn in wordnet.synsets('large'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)
# ['large', 'large', 'big', 'large', 'bombastic', 'declamatory', 'large', 'orotund', 'tumid', 'turgid',
#  'big', 'large', 'magnanimous', 'big', 'large', 'prominent', 'large', 'big', 'enceinte', 'expectant',
#  'gravid', 'great', 'large', 'heavy', 'with_child', 'large', 'large', 'boastfully', 'vauntingly', 'big', 'large']

antonyms = []
for syn in wordnet.synsets("large"):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(antonyms)
# ['small', 'little']

Conclusion

In this introductory article, we discussed how to use NLTK to perform some basic but useful Natural Language Processing tasks. We covered tokenization, stemming, lemmatization, stop word removal, POS tagging, chunking, named entity recognition, and some basics of the WordNet interface.

Hope you found the article useful!

The source code used to create this post can be found below.

If you have any problems or questions regarding this article, please do not hesitate to leave a comment below or drop me an email.

Email address: [email protected]
