NLP Chronicles: spaCy, the NLP Library Built for Production

The Up-and-Coming Champion of Natural Language Processing

If you’re familiar with natural language processing or are starting to learn about it, you might have come across NLTK (Natural Language Toolkit), Stanford CoreNLP, etc.

Think of it this way. Natural language processing (in Python) is a kingdom, and NLTK is the king. Everyone admires the king and respects what he has done for the kingdom. But there comes a time when every king should step down and make way for the next generation.

spaCy, the prince, is an emerging champion built to succeed the reigning king. This library has gained popularity over the past couple of years and is steadily gaining the admiration of NLP practitioners.

In this article, I’ll show you what spaCy is, what makes it special, and how you can use it for NLP tasks.

What is spaCy?

spaCy is a fairly new arrival in the NLP world, but it’s gaining popularity quite steadily, and there are some really good reasons for this momentum.

What makes spaCy special?

spaCy claims to be an industrial-strength NLP library. This means a few things:

  • You can use spaCy in production environments, and it will work as efficiently as expected.
  • It’s very fast. Of course, it should be: it’s written in Cython (a superset of Python that compiles to C, bringing performance close to C level).
  • It’s very accurate. In fact, it’s one of the most accurate NLP libraries to date.
  • It’s minimalistic and opinionated. spaCy doesn’t bombard you with many options to choose from. It just provides one algorithm for each task. And that algorithm is often the best (and it constantly gets perfected and improved). So instead of choosing what algorithm to use, you can be productive and just get your work done.
  • It’s highly extensible. With the bloom of machine learning (ML) and deep learning (DL), text data comes into play for many of its applications. spaCy can be used alongside other popular ML and DL libraries such as scikit-learn, TensorFlow, and more.
  • It supports several languages. In addition to English, spaCy currently supports German, Spanish, Greek, French, Italian, Dutch, and Portuguese. For a complete list, follow this link.
  • It’s customizable. You can add custom components or add your own implementation where needed with spaCy.

How is spaCy different from NLTK?

Purpose

The primary difference between spaCy and NLTK is the purposes that they were built for.

NLTK was built with learning in mind. It is a great toolkit for teaching, learning, and experimenting with NLP. But spaCy was built with production-readiness in mind, focusing more on efficiency and performance.

Ease of use and learning time

NLTK provides several algorithms for each task and lets you choose whichever suits it best. This selection process can be time-consuming, and sometimes it’s a chore you’d rather avoid.

spaCy, however, doesn’t make you choose. Instead, it provides what is usually the best and most efficient algorithm for a particular task, so no time is wasted.

Approach to handling text

NLTK processes and manipulates strings to perform NLP tasks. It has methods for each task—sent_tokenize for sentence tokenizing, pos_tag for part-of-speech tagging, etc. You have to select which method to use for the task at hand and feed in relevant inputs.

On the other hand, spaCy follows an object-oriented approach in handling the same tasks. The text is processed in a pipeline and stored in an object, and that object contains attributes and methods for various NLP tasks. This approach is more versatile and in alignment with modern Python programming.
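
To make the contrast concrete, here’s a rough sketch of the two styles (assuming NLTK’s tokenizer and tagger data have already been downloaded):

    # NLTK: pass strings to task-specific functions, one per step.
    import nltk
    tokens = nltk.word_tokenize("spaCy is written in Cython.")
    tagged = nltk.pos_tag(tokens)

    # spaCy: one call runs the whole pipeline and returns a Doc object
    # whose tokens carry the annotations as attributes.
    import spacy
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("spaCy is written in Cython.")
    tagged = [(token.text, token.tag_) for token in doc]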

How to use spaCy for NLP tasks

Now comes the exciting part where you get to see spaCy in action. In this section, I’ll demonstrate how to perform basic NLP tasks with spaCy using practical examples.

But before we get started, you might want to brush up on the basics of natural language processing.

If you need a refresher, head on over to the first article in this series, NLP Chronicles: Intro to NLP with NLTK.

Here’s what we’ll cover (you can jump to a given section by following its link):

  • Prerequisites
  • spaCy Installation
  • Tokenization
  • Dependency Parsing
  • Chunking
  • Sentence Boundary Detection
  • Part-of-Speech (POS) Tagging
  • Named Entity Recognition
  • Lemmatization

Prerequisites

You need to have the following libraries installed on your machine:

  • Python 3.x
  • Jupyter Notebook

Also, you can use Google Colab instead of setting up the machine on your own. Here’s a getting started guide for Colab.

spaCy Installation

spaCy installation is quite easy. Just run the following commands, and you’ll have spaCy installed in no time.
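
If you use pip, a typical setup looks like this (conda and source installations are also supported; see spaCy’s installation docs):

    pip install -U spacy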

But installing just spaCy won’t be enough: in order to work with spaCy, you’ll also need to download a language-specific model manually.

The following code will download the English language model. In the same way, you can download models for other available languages:
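
    python -m spacy download en_core_web_sm

Here, en_core_web_sm is the small English model that the rest of this article’s snippets load.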

Tokenization

Tokenization is the process of segmenting text into words, punctuation marks, and other meaningful units. spaCy tokenizes the text, processes it, and stores the data in a Doc object.

The following figure shows the process of tokenization in spaCy.

In the following code snippet, you can observe that each token’s text is accessible through the text attribute of the Token object.

First, you have to import spaCy and load the model we downloaded earlier with spacy.load (details about the model will be explained in the next section). Calling the resulting nlp object on a text returns a processed Doc.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    doc = nlp(u"The foundation stones for a balanced success are honesty, character, integrity, faith, love and loyalty.")
    for token in doc:
        print(token.text)

Output:

    The
    foundation
    stones
    for
    a
    balanced
    success
    are
    honesty
    ,
    character
    ,
    integrity
    ,
    faith
    ,
    love
    and
    loyalty
    .

Here’s something interesting: after processing the text, spaCy keeps all the information about the original text intact within the Doc object.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    doc = nlp(u"The foundation stones for a balanced success are honesty,   character, integrity, faith, love and loyalty.  ")
    for token in doc:
        print(token.text, token.idx)

Output:

    The 0
    foundation 4
    stones 15
    for 22
    a 26
    balanced 28
    success 37
    are 45
    honesty 49
    , 56
       58
    character 60
    , 69
    integrity 71
    , 80
    faith 82
    , 87
    love 89
    and 94
    loyalty 98
    . 105
      107

token.idx holds each token’s character offset within the original text. As you can see from the code snippet above, all the extra and trailing spaces are preserved as well.

Because of this clever design, you can reconstruct the original text, whitespace included. It also helps in situations where you need to replace words in the original text, or when annotating it.
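
Since each token also remembers its trailing whitespace, a minimal sketch of reconstructing the input, reusing the doc from the snippet above, looks like this:

    # text_with_ws is the token text plus any trailing whitespace,
    # so joining all tokens reproduces the original string exactly.
    reconstructed = "".join(token.text_with_ws for token in doc)
    assert reconstructed == doc.text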

For more information about tokenization, follow this link.

Dependency Parsing

Dependency parsing is the process of assigning syntactic dependency labels that describe the relationships between individual tokens, like subject or object.

After you call nlp in spaCy, the input text is first tokenized and the Doc object is created.

The Doc object goes through several phases of processing in a pipeline. This pipeline, unsurprisingly, is called the processing pipeline.

The dependency parser also uses the statistical model to predict the dependency labels for the tokens.
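
You can inspect which components the loaded pipeline runs; the names below are what the small English model typically reports:

    # The processing pipeline applied after tokenization.
    print(nlp.pipe_names)
    # e.g. ['tagger', 'parser', 'ner']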

    import spacy

    nlp = spacy.load("en_core_web_sm")

    doc = nlp(u"First American Financial exposed 16 years worth of mortgage paperwork including bank accounts.")
    for token in doc:
        print(token.text, token.dep_, token.head.text, token.head.pos_, [child for child in token.children])

Output:

    First compound Financial PROPN []
    American compound Financial PROPN []
    Financial nsubj exposed VERB [First, American]
    exposed ROOT exposed VERB [Financial, worth, .]
    16 nummod years NOUN []
    years npadvmod worth ADJ [16]
    worth prep exposed VERB [years, of, including]
    of prep worth ADJ [paperwork]
    mortgage compound paperwork NOUN []
    paperwork pobj of ADP [mortgage]
    including prep worth ADJ [accounts]
    bank compound accounts NOUN []
    accounts pobj including VERB [bank]
    . punct exposed VERB []

  • text → text: The original token text.
  • dep_ → dep: The syntactic relation connecting child to head.
  • head.text → head text: The original text of the token head.
  • head.pos_ → head POS: The part-of-speech tag of the token head.
  • children → children: The immediate syntactic dependents of the token.
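
As a small illustration of how these attributes combine, here’s a minimal sketch, reusing the doc from the snippet above, that finds the main verb and its subject:

    # The root of the parse is labeled "ROOT" (it is its own head).
    root = [token for token in doc if token.dep_ == "ROOT"][0]
    # Its nominal subject, if any, sits among its children.
    subjects = [child for child in root.children if child.dep_ == "nsubj"]
    print(root.text, subjects)  # exposed [Financial]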

spaCy provides a convenient way to view the dependency parser in action, using its own visualization library called displaCy.

    import spacy
    from spacy import displacy

    nlp = spacy.load("en_core_web_sm")

    doc = nlp(u"First American Financial exposed 16 years worth of mortgage paperwork including bank accounts.")
    displacy.render(doc, style='dep', jupyter=True, options={'distance': 100})

[displaCy renders the dependency tree as labeled arcs over the sentence: compound arcs link First and American to Financial, an nsubj arc links Financial to the root exposed, and nummod, npadvmod, prep, compound, and pobj arcs connect the remaining tokens.]

The dependency parser is also used for sentence boundary detection, and it lets you iterate over computed noun chunks.

Follow this link for more information about dependency parsing in spaCy.

Chunking

Chunking is the process of extracting noun phrases from the text.

spaCy can identify noun phrases (or noun chunks) as well. You can think of a noun chunk as a noun plus the words describing it. It’s also possible to identify and extract the base noun of a given chunk.

For example, in Example 01 in the following code snippet, “Tall big tree is in the vast garden”, the words “tall” and “big” describe the noun “tree”, and “vast” describes the noun “garden”.

    import spacy

    nlp = spacy.load("en_core_web_sm")

Example 01:

    doc = nlp(u"Tall big tree is in the vast garden.")
    for chunk in doc.noun_chunks:
        print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)

Output:

    Tall big tree tree nsubj is
    the vast garden garden pobj in

Example 02:

    doc = nlp(u"First American Financial exposed 16 years worth of mortgage paperwork including bank accounts.")
    for chunk in doc.noun_chunks:
        print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)

Output:

    First American Financial Financial nsubj exposed
    mortgage paperwork paperwork pobj of
    bank accounts accounts pobj including

  • text → text: The original noun chunk text.
  • root.text → root text: The original text of the word connecting the noun chunk to the rest of the parse.
  • root.dep_ → root dep: Dependency relation connecting the root to its head.
  • root.head.text → root head text: The text of the root token’s head.

You can find more information about chunking at this link.

Sentence Boundary Detection

This is the process of identifying and splitting text into individual sentences.

Typically, most NLP libraries use a rule-based approach to obtain sentence boundaries. spaCy, however, follows a different approach for this task.

spaCy uses the dependency parse from its statistical model to detect sentence boundaries. This is more accurate than the classical rule-based approach.

Traditional rule-based sentence splitting works well on general-purpose text, but it may not work as intended on social media or conversational text. Since spaCy uses a prediction-based approach, its sentence splitting tends to be more accurate.

By accessing the Doc.sents property of the Doc object, we can get the sentences as in the code snippet below.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    doc = nlp(u"Success is not final. Failure is not fatal. It is the courage to continue that counts.")
    for sent in doc.sents:
        print(sent)

Output:

    Success is not final.
    Failure is not fatal.
    It is the courage to continue that counts.

Second example:

    doc = nlp(u"Success is not final :) :) Failure is not fatal :( :( It is the courage to continue that counts !!!")
    for sent in doc.sents:
        print(sent)

Output:

    Success is not final :) :)
    Failure is not fatal :( :(
    It is the courage to continue that counts !!!

In the second example, I’ve added a few emoticons to the text. As you can see, spaCy identifies them correctly and splits the text into sentences as intended.

Part-of-Speech (POS) Tagging

POS tagging is the process of assigning word types, like verb or noun, to tokens.

After tokenization, the text goes through parsing and tagging. With the use of the statistical model, spaCy can predict the most likely tag/label for a token in a given context.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    doc = nlp(u"First American Financial exposed 16 years worth of mortgage paperwork, including bank accounts.")
    for token in doc:
        print(token.text, token.pos_, token.tag_, token.shape_, token.is_alpha, token.is_stop)

Output:

    First PROPN NNP Xxxxx True False
    American PROPN NNP Xxxxx True False
    Financial PROPN NNP Xxxxx True False
    exposed VERB VBD xxxx True False
    16 NUM CD dd False False
    years NOUN NNS xxxx True False
    worth ADJ JJ xxxx True False
    of ADP IN xx True True
    mortgage NOUN NN xxxx True False
    paperwork NOUN NN xxxx True False
    , PUNCT , , False False
    including VERB VBG xxxx True False
    bank NOUN NN xxxx True False
    accounts NOUN NNS xxxx True False
    . PUNCT . . False False

In the above code snippet, the attributes of the Token object represent the following:

  • text → text: The original word text.
  • pos_ → POS: The simple part-of-speech tag.
  • tag_ → tag: The detailed part-of-speech tag.
  • shape_ → shape: The word shape — capitalization, punctuation, digits.
  • is_alpha → is alpha: Is the token made up of alphabetic characters?
  • is_stop → is stop: Is the token part of a stop list, i.e., one of the most common words of the language?

You can get a description of the pos_ or tag_ by using the following command:
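
For instance, spacy.explain returns a short, human-readable description of a tag or label (the exact wording may vary between spaCy versions):

    spacy.explain("NNP")  # 'noun, proper singular'
    spacy.explain("VBD")  # 'verb, past tense'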

Named Entity Recognition

NER labels words/tokens that name “real-world” objects, like persons, companies, or locations.

spaCy’s statistical model has been trained to recognize various types of named entities, such as names of people, countries, products, etc.

The predictions for these entities won’t always be perfect, because the statistical model may not have been trained on examples like yours. In such a case, you can tune the model to suit your needs.

Follow this link for a full list of named entities supported by spaCy.

    import spacy

    nlp = spacy.load("en_core_web_sm")

Example 01:

    doc = nlp(u"I am planning to go to London in the morning at 10am, I have to buy a HP laptop and 2 speakers for less than 1000 dollars. I hope America and China tradewar won't affect prices.")
    for ent in doc.ents:
        print(ent.text, ent.label_, ent.start_char, ent.end_char)

Output:

    London GPE 23 29
    10am TIME 48 52
    HP PRODUCT 70 72
    2 CARDINAL 84 85
    less than 1000 dollars MONEY 99 121
    America GPE 130 137
    China GPE 142 147

Example 02:

    doc = nlp(u"Russian armies were mobilized because of ISIS attacks in Syria on 25th of May")
    for ent in doc.ents:
        print(ent.text, ent.label_, ent.start_char, ent.end_char)

Output:

    Russian NORP 0 7
    Syria GPE 57 62
    25th of May DATE 66 77

As you can see, spaCy can accurately identify most entities. Using displaCy, you can view the identified entities:

{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "named-entity-recognition-visualization-spacy.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href="https://colab.research.google.com/gist/LahiruTjay/46f55fd7791a23625a789e6326093c71/named-entity-recognition-visualization-spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wjEjGeh4j5o0",
        "colab_type": "text"
      },
      "source": [
        "# Visualizing Named Entitiesn"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "jbq6hKSCdbZ_",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import spacyn",
        "from spacy import displacy"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "fygTg7hSdhn9",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "nlp = spacy.load("en_core_web_sm")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "467Y6GGqnvVM",
        "colab_type": "text"
      },
      "source": [
        "## Example 01"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zxw0W9hGdkEY",
        "colab_type": "code",
        "outputId": "901cfe24-eaba-4c89-c2fe-658c6039791b",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 86
        }
      },
      "source": [
        "doc = nlp(u"I am planning to go to London in the morning at 10am, I have to buy a HP laptop and 2 speakers for less than 1000 dollars. I hope America and China tradewar won't affect prices.")n",
        "displacy.render(doc, style='ent', jupyter=True)"
      ],
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div class="entities" style="line-height: 2.5">I am planning to go to n",
              "<mark class="entity" style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">n",
              "    Londonn",
              "    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">GPE</span>n",
              "</mark>n",
              " in the morning at n",
              "<mark class="entity" style="background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">n",
              "    10amn",
              "    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">TIME</span>n",
              "</mark>n",
              ", I have to buy a n",
              "<mark class="entity" style="background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">n",
              "    HPn",
              "    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PRODUCT</span>n",
              "</mark>n",
              " laptop and n",
              "<mark class="entity" style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">n",
              "    2n",
              "    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">CARDINAL</span>n",
              "</mark>n",
              " speakers for n",
              "<mark class="entity" style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">n",
              "    less than 1000 dollarsn",
              "    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">MONEY</span>n",
              "</mark>n",
              ". I hope n",
              "<mark class="entity" style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">n",
              "    American",
              "    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">GPE</span>n",
              "</mark>n",
              " and n",
              "<mark class="entity" style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">n",
              "    Chinan",
              "    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">GPE</span>n",
              "</mark>n",
              " tradewar won't affect prices.</div>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vQP7jVdRn2QK",
        "colab_type": "text"
      },
      "source": [
        "## Example 02"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "E9o0KhhFmktZ",
        "colab_type": "code",
        "outputId": "9a9aa7f2-38b5-4826-a100-ef4cb8a02dce",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 52
        }
      },
      "source": [
        "doc = nlp(u"Russian armies were mobilized because of ISIS attacks in Syria on 25th of May")n",
        "displacy.render(doc, style='ent', jupyter=True)"
      ],
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div class="entities" style="line-height: 2.5">n",
              "<mark class="entity" style="background: #c887fb; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">n",
              "    Russiann",
              "    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">NORP</span>n",
              "</mark>n",
              " armies were mobilized because of ISIS attacks in n",
              "<mark class="entity" style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">n",
              "    Syrian",
              "    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">GPE</span>n",
              "</mark>n",
              " on n",
              "<mark class="entity" style="background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">n",
              "    25th of Mayn",
              "    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">DATE</span>n",
              "</mark>n",
              "</div>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    }
  ]
}
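displaCy is great for visualizing entities, but you can also access them programmatically: the processed Doc object exposes the recognized entities through its ents attribute. As a minimal sketch (my own illustration, not part of the notebooks above):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Russian armies were mobilized because of ISIS attacks in Syria on 25th of May")

# Each entity is a Span carrying its text, label, and character offsets
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)

This is handy when you need the entities as data rather than as a visualization.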

Lemmatization

Lemmatization is the process of assigning words their base forms (lemmas). For example: “was” → “be”, or “cats” → “cat”.

To perform lemmatization, the text needs to be processed into a Doc object; each token in the processed Doc then carries its lemma.

import spacy

nlp = spacy.load("en_core_web_sm")

Example 01

doc = nlp(u"Success is not final.")
for token in doc:
    print(token.text, token.lemma_, token.dep_)

Output:

Success success nsubj
is be ROOT
not not neg
final final acomp
. . punct

Example 02

doc = nlp(u"Men are climbing up the trees.")
for token in doc:
    print(token.text, token.lemma_, token.dep_)

Output:

Men man nsubj
are be aux
climbing climb ROOT
up up prt
the the det
trees tree dobj
. . punct
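A common use of lemmas is to normalize text before further processing, so that inflected forms such as “climbing” and “climbed” collapse into the same base word. As a minimal sketch (my own illustration, not from the notebook above):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Men are climbing up the trees.")

# Join the lemma of every non-punctuation token into a normalized string
normalized = " ".join(token.lemma_ for token in doc if not token.is_punct)
print(normalized)  # man be climb up the tree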

Word Vectors Similarity

Word vector similarity is determined by comparing the vector representations of words. These vectors can be generated with an algorithm such as word2vec.

This feature also needs a statistical model. However, the small default model doesn’t come with word vectors, so you’ll have to download a larger model that does.

The following command downloads such a model:

python -m spacy download en_core_web_lg

You can access the vector of a word as follows:
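As a minimal sketch (assuming the en_core_web_lg model downloaded above), each Token exposes its embedding through the vector attribute:

import spacy

nlp = spacy.load('en_core_web_lg')
token = nlp(u'man')[0]

# token.vector is a NumPy array; en_core_web_lg uses 300-dimensional vectors
print(token.vector.shape)  # (300,)
print(token.vector[:5])    # first five components of the embedding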

The model’s vocabulary contains vectors for most words in the language. Words like “man”, “vehicle”, and “school” are fairly common, and their vectors can be accessed as shown below.

If a word isn’t in the vocabulary, it doesn’t have a vector representation. In the example below, the made-up word “jfido” is such a word. We can check whether a word is out of the vocabulary using the is_oov attribute.

import spacy

nlp = spacy.load('en_core_web_lg')

tokens = nlp(u'man vehicle school jfido')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

Output:

man True 6.352939 False
vehicle True 7.1416836 False
school True 6.7380905 False
jfido False 0.0 True

spaCy can also compare two objects and predict how similar they are. Doc, Span, and Token objects all expose a .similarity method for computing this score.

As you can see from the snippet below, the similarity between “laptop” and “computer” is 0.677216, while the similarity between “bus” and “laptop” is 0.2695869. More closely related objects receive a higher similarity score, while less related objects receive a lower one.

import spacy

nlp = spacy.load('en_core_web_lg')

tokens = nlp(u'car bus computer laptop')
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

Output:

car car 1.0
car bus 0.48169604
car computer 0.3188663
car laptop 0.32531086
bus car 0.48169604
bus bus 1.0
bus computer 0.33506277
bus laptop 0.2695869
computer car 0.3188663
computer bus 0.33506277
computer computer 1.0
computer laptop 0.677216
laptop car 0.32531086
laptop bus 0.2695869
laptop computer 0.677216
laptop laptop 1.0

In a similar way, we can also find the similarity of sentences:

import spacy

nlp = spacy.load('en_core_web_lg')

target = nlp("Cats are beautiful animals.")

doc1 = nlp("Dogs are awesome.")
doc2 = nlp("Some gorgeous creatures are felines.")
doc3 = nlp("Dolphins are swimming mammals.")

print(target.similarity(doc1))
print(target.similarity(doc2))
print(target.similarity(doc3))

Output:

0.8901766262114666
0.9115828449161616
0.7822956256736615
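Note that the highest score goes to the sentence about felines, which is semantically closest to the target even though it shares almost no words with it. As a minimal sketch (my own example, not from the notebook above) of putting these scores to work, here is one way to pick the most similar candidate:

import spacy

nlp = spacy.load('en_core_web_lg')

target = nlp("Cats are beautiful animals.")
candidates = [
    nlp("Dogs are awesome."),
    nlp("Some gorgeous creatures are felines."),
    nlp("Dolphins are swimming mammals."),
]

# Pick the candidate doc with the highest similarity to the target
best = max(candidates, key=lambda doc: target.similarity(doc))
print(best.text)  # Some gorgeous creatures are felines.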

Conclusion

Even though NLTK and other NLP libraries are great, spaCy is likely to emerge as a favorite because of its focus on production-level tasks and applications.

As a recap, spaCy provides:

  • Fast performance and efficiency
  • Industrial-grade, production-ready usage
  • High accuracy
  • State-of-the-art algorithms for NLP tasks

spaCy keeps evolving and improving, which makes it more exciting to work with. I personally have fallen in love with spaCy and its capabilities.

If you are an NLP practitioner who hasn’t tried spaCy yet, you should definitely give it a try. I’m sure you will start loving it as well.

In this article, we barely scratched the surface of spaCy’s abilities. We can do much more with spaCy, and I plan to discuss these more advanced features and usages in a future article.

If you have any problems or questions regarding this article, please do not hesitate to leave a comment below or drop me an email ([email protected]).
