What a year for natural language processing! We’ve seen great improvement in terms of accuracy and learning speed, and more importantly, large networks are now more accessible thanks to Hugging Face and their wonderful Transformers library, which provides a high-level API to work with BERT, GPT, and many more language model variants.
With the advent of attention-based networks like BERT and GPT, and the subword (WordPiece) tokenizer described by Wu et al. (2016), we saw a small revolution in the world of NLP: it changed how words are represented, which in turn opened the door to more advanced neural networks.
Tokenization and normalization
Before we can start modeling or using any advanced neural network, we need to go through two important steps — tokenization and normalization.
A tokenizer is a tool that segments text. It cuts text into units called tokens, each of which is a distinct, easily manipulated label. Tokenization is language dependent and is part of the process of normalizing the input text so that it is easier to manipulate and to extract meaning from later, during training.
When you have a dataset, you’re never 100% sure that the text is clean and normalized. Using a good tokenizer helps ensure that the text fed to the network is clean and consistent.
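To make the two steps concrete, here is a minimal sketch using only the Python standard library. The normalization rules (lowercasing, accent stripping, whitespace collapsing) and the regular expression are assumptions chosen for illustration, not the rules any particular Hugging Face tokenizer applies.

```python
import re
import unicodedata
from typing import List


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace (one possible normalization)."""
    # NFD splits accented characters into a base character plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Collapse runs of whitespace into single spaces, trim, and lowercase.
    return re.sub(r"\s+", " ", stripped).strip().lower()


def tokenize(text: str) -> List[str]:
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


raw = "  Héllo,   WORLD!\tTokenizers are   fun. "
clean = normalize(raw)
tokens = tokenize(clean)
print(clean)
print(tokens)
```

Even this tiny pipeline turns messy input into a predictable sequence of tokens, which is the property a network needs from its input.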
In some cases, it becomes too difficult to capture meaningful units with just a few rules (vocabulary is a typical example), so a learning approach can be used. An annotated corpus makes it possible to learn the tokens and then generalize them to all incoming texts.
Thus, using tokenizers pre-trained on large datasets that include compound and rare words makes it possible to avoid incorrectly splitting expressions such as “bow tie” or “father-in-law”.
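As a rough illustration of why a learned inventory of units helps, the sketch below keeps compounds together by matching against a small lexicon. The `COMPOUNDS` set is a hypothetical, hand-written stand-in for units that would normally be learned from a corpus, and the greedy longest-match strategy is just one simple way to apply it.

```python
import re

# Hypothetical toy lexicon standing in for units learned from a corpus.
COMPOUNDS = {("bow", "tie"), ("father", "-", "in", "-", "law")}
MAX_LEN = max(len(c) for c in COMPOUNDS)


def merge_compounds(text):
    """Rule-based split, then rejoin any run of pieces found in the lexicon."""
    # First pass: words and punctuation marks become separate pieces.
    matches = list(re.finditer(r"\w+|[^\w\s]", text))
    pieces = [m.group().lower() for m in matches]
    tokens, i = [], 0
    while i < len(pieces):
        merged = False
        # Try the longest candidate first so longer compounds win.
        for n in range(min(MAX_LEN, len(pieces) - i), 1, -1):
            if tuple(pieces[i:i + n]) in COMPOUNDS:
                # Slice the original text so spacing and hyphens are preserved.
                start, end = matches[i].start(), matches[i + n - 1].end()
                tokens.append(text[start:end])
                i += n
                merged = True
                break
        if not merged:
            tokens.append(matches[i].group())
            i += 1
    return tokens


result = merge_compounds("My father-in-law wore a bow tie.")
print(result)
```

A purely rule-based split would have produced seven pieces for "father-in-law" and "bow tie"; with the lexicon they survive as two tokens.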
When building neural networks, you have to choose what kind of data the network will be trained on. Most of the time, existing tokenizers will do the job, but in some cases you want to have the freedom to create your own tokenizer from your own dataset, or maybe your own technique of splitting words. That’s where Hugging Face’s new tokenizer library comes in handy.
Hugging Face’s Tokenizer Library
The Hugging Face team chose to write the library in pure Rust. That’s a smart move, as Python isn’t known for its speed. It’s also a surprisingly bold one, since they could have used C or C++.
But since the ML community is in love with Python, and it’s still the king of the field, they also created Python wrappers that bind to the Rust implementation.
Setting up the environment
- Create a directory and cd into it:
- Create a virtual environment:
- Install Rust [macOS and Linux]:
- Install the package using pip:
After installing the package, run pip freeze. You should see all installed packages and their versions:
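If you prefer to check from Python rather than scanning the pip freeze output, a small sketch like the following works with the standard library alone. It assumes the package is distributed under the name tokenizers; the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the install worked, otherwise None.
print("tokenizers:", installed_version("tokenizers"))
```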
We’re all set now.
Let’s start by importing the main tokenizers that are already implemented by the package and defining a sentence that we’ll use as our main test input:
- Import functions:
- Instantiate a sentence:
- Download various vocabularies in the current directory using the terminal:
- Start tokenizing:
Here’s the expected outcome:
Every word is tokenized, and each resulting token gets an id and an offset:
- id: the integer that identifies the token in the vocabulary
- offset: the start and end character positions of the token in the sentence
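To show what these two fields mean, here is a toy sketch in plain Python. The sentence, the regex split, and the on-the-fly vocabulary are all assumptions for illustration; with the real library, the ids come from the pre-trained vocabulary file rather than from the input itself.

```python
import re

sentence = "Tokenizers are fast, and tokenizers are fun."

# Pieces with their character spans, from a simple rule-based split.
matches = list(re.finditer(r"\w+|[^\w\s]", sentence))
tokens = [m.group().lower() for m in matches]
offsets = [m.span() for m in matches]

# Toy vocabulary: each distinct token gets the next free integer id.
vocab = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))
ids = [vocab[tok] for tok in tokens]

for tok, idx, (start, end) in zip(tokens, ids, offsets):
    print(f"{tok!r:14} id={idx:<2} offset=({start}, {end})")
```

Note that a repeated token such as "tokenizers" gets the same id both times but a different offset, and that slicing the sentence with an offset gives back the original text of that token.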
You can also train an existing tokenizer on your own data, or even build your own from scratch; that’s the level of versatility the package provides to researchers and engineers.
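To give a feel for what "training" a tokenizer involves, here is a toy version of byte-pair encoding (BPE), one common subword algorithm. It is a from-scratch sketch in plain Python: it does not use the library's API, it is not meant to mirror the library's Rust implementation, and the tiny corpus and function names are made up for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in `symbols` into one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of characters, weighted by its frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low", "low", "lower", "lowest", "newer", "newest"]
merges = train_bpe(corpus, num_merges=4)
print(merges)
print(apply_bpe("lowest", merges))
```

Frequent character sequences end up as single tokens while rarer words are split into smaller, reusable pieces, which is the basic idea behind subword vocabularies.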
Vive Hugging Face 🤗 and NLP!
In the last two years, I’ve been more focused on image processing and convolutional neural networks (CNNs). But I think 2020 will be the year of democratization of powerful NLP tools like Hugging Face’s Tokenizer library and many more.
Tokenizers give us a starting point to experiment and explore new techniques in word embedding and tokenizing. Perhaps something I haven’t had the chance to emphasize enough is how the Hugging Face team has managed to create something simple and incredibly fast.
To illustrate this, Steven van de Graaf does a great job of comparing performance metrics:
Now that you can easily use either existing tokenizers or create your own, you can start exploring transformers and build your own network—maybe even the next BERT or GPT-2.
You can also perform various tasks using massive networks implemented in the Transformers library, such as:
- Text generation
- Classification (Topic, sentiment analysis, etc.)
- Next-sentence prediction (whether one sentence is a continuation of another)
- Question answering
Thank you for reading this article. If you have any questions, don’t hesitate to send me an email.