As someone who has designed neural networks for real-time use on mobile devices, I have always considered state-of-the-art NLP models like transformers off limits.
Standard configurations of a model like BERT can easily tip the scales at over 1GB, with more extreme versions 10 to 100 times larger than that. Recent advances in distillation and quantization, however, motivated me to reconsider my assumption that NLP was only accessible with multiple high-end GPUs.
In this post, I’ll show you how I built a Slackbot that can identify toxic and hateful messages in under 300ms with a 90MB transformer model built with HuggingFace and running in an inexpensive AWS Lambda function.
Successful machine learning projects always start with a well-defined task. In this case, my goal was to build a Slackbot which can detect toxic or hateful speech in real-time.
Slack users trigger the bot by adding a reaction to a message (Slack lets users tag messages with emojis as a way of reacting to them), and the bot posts a determination of whether or not the message is toxic.
This can be framed as a classification task where a model takes in a sequence of words and outputs a binary classification (toxic speech or not). Other constraints for this task include a 100MB size limit for the model and latency of less than 300ms to ensure it feels like it’s running in “real time.”
Toxic and hateful speech detection is a very hot topic in NLP research. Thankfully, this has led to a large number of open source datasets in multiple languages. You can browse some of them here.
I decided to focus on English language datasets scraped from social networking sites like Twitter as I believe they more closely resemble the vocabulary and structure of Slack messages.
Selecting a Model
Transformer models have sparked a revolution in natural language processing, demonstrating incredible accuracy on a variety of tasks including text classification. Since the transformer architecture was introduced in the now-famous Attention is All You Need paper in 2017, dozens of different transformer models have been proposed.
For this project, I chose to use a model from the BERT family, as my classification task would benefit from bidirectional pre-training that looks at the entire context of input sequences. This contrasts with generative models, which are typically unidirectional: they predict the next word from the preceding context only. Specifically, I chose a RoBERTa model, which is essentially a BERT model with optimized hyperparameters and pre-training tasks.
I decided to use HuggingFace’s exceptional libraries for data processing pipelines, transformer implementations (available in both PyTorch and TensorFlow), and model training. Before going through the process of preparing data and training a model, I decided to check the HuggingFace Model Hub to see if someone had already trained a model suitable for my task.
It turns out someone had, and in the end I didn’t need to do any training at all. Instead, I could focus my energy on optimizing an existing model to meet my size and latency constraints.
One final note is that data pre-processing in NLP is highly dependent on the model architecture. Input sequences are “tokenized” (converted to sequences of numbers that can be used in mathematical operations) before being fed into our model and the tokenizer used must match the model architecture. Again, HuggingFace makes all of this simple, allowing us to load both the tokenizer and model with just a few lines of code:
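The original snippet is not reproduced here. As a minimal sketch, assuming the `transformers` library is installed and using a placeholder checkpoint ID (the post does not name the checkpoint it used), loading the matching pair might look like this:

```python
# Hypothetical Hub ID: replace with the checkpoint you actually want to use.
CHECKPOINT = "your-org/distilroberta-toxicity"


def load_tokenizer_and_model(checkpoint=CHECKPOINT):
    """Load a tokenizer and a sequence classifier from the same checkpoint."""
    # Imported inside the function so this module can be imported
    # without paying for the heavy dependency up front.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    model.eval()  # inference only: disables dropout
    return tokenizer, model
```

Loading both objects from the same checkpoint ID is what keeps the tokenizer and the model architecture in sync.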
Optimization 1: Distillation
Distillation is a model compression technique where the knowledge from a larger model is passed down to a smaller model via a student-teacher training process. Input data is fed into both models, large and small, and the smaller model is trained using a special loss function that includes terms related to how well it predicts the correct answer and how well its output matches that of the larger model.
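To make that loss function concrete, here is a small pure-Python sketch of one common formulation: a cross-entropy term against the true label plus a temperature-softened KL term against the teacher. The `alpha` weighting and `temperature` defaults are illustrative assumptions, not the settings used to train any particular distilled checkpoint.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Blend 'predict the right answer' with 'match the teacher'."""
    # Hard term: cross-entropy between the student and the true label.
    hard = -math.log(softmax(student_logits)[label])

    # Soft term: KL(teacher || student) on temperature-softened outputs.
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

    # The T^2 factor keeps the soft term's gradient scale comparable
    # to the hard term when the temperature changes.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the label term remains.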
You can read more about distillation here. Incredibly, large models like BERT can be made 40% smaller through distillation while retaining 97% of their accuracy.
While HuggingFace provides tools that make it easy to distill a large language model, the pre-trained checkpoint I found in the previous section had already been distilled.
The DistilRoBERTa model checkpoint weighed ~330MB, considerably smaller than the 1GB original RoBERTa model, but still more than three times larger than my 100MB constraint.
To understand run-time latency, I wrote a quick benchmarking function to test multiple model configurations and runtimes:
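The original benchmarking function is not shown here. A minimal sketch of the idea, using only the standard library, is below; the function name, the warm-up count, and the reported statistics are my own choices rather than the author's.

```python
import statistics
import time


def benchmark(fn, n_warmup=5, n_runs=50):
    """Call `fn()` repeatedly and return latency statistics in milliseconds."""
    # Warm-up calls are excluded so one-time setup costs do not skew results.
    for _ in range(n_warmup):
        fn()

    timings_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }
```

Because it accepts any zero-argument callable, the same helper can time a PyTorch forward pass, a quantized model, or an ONNX session, for example `benchmark(lambda: model(**inputs))`.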
In terms of run-time speed, I benchmarked this model at ~47ms for a single sequence containing 10 words. That is pretty fast, but it may not be enough to hit my latency target for longer sequences once round-trip requests are accounted for.
Optimization 2: Quantization
During training, most neural network weights are stored as 32-bit or even 64-bit floating point numbers. This is way more precision than we actually need. We can save considerable space and speed up execution by getting rid of some of these unnecessary decimal places. This process is known as quantization.
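As a toy illustration of the idea (not how PyTorch or ONNX Runtime implement it internally), the sketch below maps a list of float weights onto 8-bit integers with a single scale factor and then maps them back. The symmetric, per-list scaling scheme is an assumption chosen for simplicity.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]
```

Each weight now needs one byte instead of four, and the round-trip error per weight is at most half of the scale factor.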
There are a number of different precisions we can quantize to (32-bit floats, 16-bit floats, 8-bit ints, etc.) and the best choice depends on our accuracy requirements and what type of hardware our final model will run on. Most GPUs do all of their calculations on fixed precision floating point numbers so quantizing our model to integers will make it smaller, but not necessarily faster. CPUs, on the other hand, often have optimized instructions that make integer operations much, much faster at the price of model accuracy.
For this project, I wanted the final model to run on inexpensive CPUs, so I decided to perform 8-bit, post-training quantization. (Note: it is possible to buy back some lost accuracy by making training quantization-aware, but this would require retraining the model from scratch, so I decided against it.)
Quantizing a PyTorch transformer model now only requires a single line of code:
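The original line is not reproduced here. A likely candidate, assuming PyTorch's dynamic quantization API was used, is `torch.quantization.quantize_dynamic` applied to the model's linear layers; the wrapper function below is my own and only exists to keep the sketch self-contained.

```python
def quantize_for_cpu(model):
    """Return a copy of `model` whose Linear layers use 8-bit integer weights."""
    # Imported lazily so this module loads even where torch is not installed.
    import torch

    # Dynamic quantization: weights are stored as int8, while activations
    # are quantized on the fly at inference time.
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```

Only `torch.nn.Linear` layers are targeted here because they hold most of a transformer's parameters.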
That line alone reduced the CPU runtime of my model from 47ms to 23ms, a reduction of ~50% with little noticeable loss in accuracy. It also reduced the size of the model to just 91MB, about 9% under my 100MB constraint.
Optimization 3: ONNX Runtime
While PyTorch and TensorFlow are the two most popular frameworks for running models, they are not the only options. ONNX is an open model format, and ONNX Runtime is an open-source inference engine for it, primarily maintained by Microsoft, that has made major strides in model performance. By switching to ONNX Runtime, I was able to leverage graph optimizations like fusing layers together to reduce the total operations required, as well as an optimized quantized inference engine.
HuggingFace provides a conversion tool to create an ONNX model from a model checkpoint. You will then need to install the onnx and onnxruntime packages to use it.
With an ONNX checkpoint in hand, we can optimize certain operations using the onnxruntime library.
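One way to do this, assuming the transformer-specific optimizer that ships with `onnxruntime` is what was used, is sketched below. The function name, the file paths, and the default `num_heads` and `hidden_size` values (12 and 768, which match a DistilRoBERTa-base architecture) are assumptions to adjust for your own checkpoint.

```python
def optimize_onnx_model(input_path, output_path, num_heads=12, hidden_size=768):
    """Fuse transformer subgraphs in an exported ONNX model and save the result."""
    # Imported lazily so this module loads even where onnxruntime is missing.
    from onnxruntime.transformers import optimizer

    optimized = optimizer.optimize_model(
        str(input_path),
        model_type="bert",  # RoBERTa-style encoders use the BERT fusion rules
        num_heads=num_heads,
        hidden_size=hidden_size,
    )
    optimized.save_model_to_file(str(output_path))
    return output_path
```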
The optimized ONNX model can then be quantized with:
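The original call is not shown here. A minimal sketch, assuming `onnxruntime.quantization.quantize_dynamic` was used with 8-bit integer weights, follows; the helper that derives the output filename is hypothetical.

```python
from pathlib import Path


def quantized_path(model_path):
    """Derive an output filename such as model.onnx -> model.quant.onnx."""
    path = Path(model_path)
    return path.with_name(f"{path.stem}.quant{path.suffix}")


def quantize_onnx_model(model_path):
    """Write an int8-weight copy of the ONNX model next to the original."""
    # Imported lazily so this module loads even where onnxruntime is missing.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    output_path = quantized_path(model_path)
    quantize_dynamic(str(model_path), str(output_path), weight_type=QuantType.QInt8)
    return output_path
```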
Using ONNX Runtime with an optimized, quantized model resulted in an inference speed of just 17ms, nearly 4X faster than the full-precision model. This model weighs 84MB, just 25% of the size of the distilled version and 9% of the size of the original model.
Out of curiosity, I ended up benchmarking this model under many different runtime and hardware configurations inside of Google Colab. You can see the results below. All configurations were tested with a batch size of 1 and a sequence length of 10. They roughly conform to HuggingFace’s official benchmarks.
Building the Bot on AWS Lambda
With our optimized model in hand, it’s time to get it into production. We want the bot to be triggered whenever someone reacts to a Slack message with a particular emoji. I won’t go into detail on Slack’s API (you can read more here), but the event API has a `reaction_added` event for exactly this purpose. To summarize the flow:
- A user reacts to a message in Slack.
- Slack fires a POST request to a URL we provide with data on the reaction (what emoji it was, who sent it, in which channel, etc.)
- Our API will parse this event and, if the reaction was the one specific to our bot, will query Slack for the text of the message and run it through our model.
- The bot will post a message in Slack with a determination on whether the text is toxic or not based on some score threshold.
You’ll notice Step 3 requires us to build an API that will receive events from Slack. Rather than manage my own web server, I decided to use AWS Lambda and API Gateway to take care of a lot of the boilerplate for me.
Here is how the lambda function looks:
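The author's handler is not reproduced here. The sketch below shows one way the flow described above could be wired up, assuming an API Gateway proxy event whose `body` is the JSON payload from Slack. The trigger emoji name, the reply wording, and the three injected helper functions are hypothetical, and request signature verification is omitted for brevity.

```python
import json

TRIGGER_REACTION = "toxic-check"  # hypothetical emoji name the bot listens for


def handle_slack_event(body, get_message, classify, post_message):
    """Process one parsed Slack Events API payload and build an HTTP response."""
    # Slack verifies a new endpoint by sending a challenge to echo back.
    if body.get("type") == "url_verification":
        return {"statusCode": 200, "body": body.get("challenge", "")}

    event = body.get("event", {})
    item = event.get("item", {})
    is_trigger = (
        event.get("type") == "reaction_added"
        and event.get("reaction") == TRIGGER_REACTION
        and item.get("type") == "message"
    )
    if not is_trigger:
        return {"statusCode": 200, "body": "ignored"}

    # The event carries only the channel and timestamp, not the text.
    text = get_message(item["channel"], item["ts"])
    if text is None:
        return {"statusCode": 200, "body": "ignored"}

    verdict = "toxic" if classify(text) else "not toxic"
    post_message(item["channel"], f"That message looks {verdict}.")
    return {"statusCode": 200, "body": "ok"}


def make_lambda_handler(get_message, classify, post_message):
    """Bind the helpers and return a function with Lambda's handler signature."""
    def lambda_handler(event, context):
        body = json.loads(event.get("body") or "{}")
        return handle_slack_event(body, get_message, classify, post_message)
    return lambda_handler
```

Passing the helpers in as arguments keeps the routing logic testable without a Slack token or a loaded model.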
There are some helper functions that warrant a deeper dive. First is the `_get_message` function. Slack’s event API doesn’t actually provide the text of the message that the reaction was added to. This means we have to query Slack for it. Luckily, their Python SDK provides this functionality. Each message is uniquely identified by its channel and timestamp.
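A minimal sketch of such a lookup is below. It assumes `client` is a `slack_sdk.WebClient` (or anything exposing the same `conversations_history` method) and uses a hypothetical function name rather than the author's. Messages inside threads may need `conversations_replies` instead, which this sketch does not handle.

```python
def get_message_text(client, channel, ts):
    """Fetch the text of the message identified by (channel, ts), or None."""
    # Ask for the single message at or before `ts`, including `ts` itself.
    response = client.conversations_history(
        channel=channel, latest=ts, inclusive=True, limit=1
    )
    messages = response["messages"]
    # Guard against receiving an older message when `ts` is not found.
    if not messages or messages[0].get("ts") != ts:
        return None
    return messages[0].get("text", "")
```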
The next custom function is `_classify_message`, which actually runs the model. The meat of the function is below:
Let’s break this down. The first couple of lines of the `_classify_message` function load the tokenizer and the ONNX model itself with the help of some global constants defined in a configuration file. Code for the `create_model_for_provider` function can be found here. At the top, I’ve defined a class to store the model results and make them more convenient to access. Note that the `is_toxic` property implies a threshold of 0.5.
Finally, the classify function takes the message string, tokenizes it, then runs it through the model before performing softmax on the output (the ONNX model converter does not include the softmax operation on the model output).
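The steps described above can be sketched roughly as follows. Unlike the original, this version receives the tokenizer and the ONNX Runtime session as arguments instead of loading them itself. The class name, the function name, and the assumption that index 1 of the output is the toxic class are all hypothetical.

```python
import math
from dataclasses import dataclass

TOXIC_INDEX = 1  # assumption: the second logit is the "toxic" class


@dataclass
class ToxicityResult:
    score: float            # probability assigned to the toxic class
    threshold: float = 0.5  # the cut-off implied in the post

    @property
    def is_toxic(self):
        return self.score > self.threshold


def softmax(logits):
    """Convert raw logits to probabilities (the ONNX export omits this step)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_message(text, tokenizer, session):
    """Tokenize `text`, run the ONNX session, and wrap the toxic probability."""
    # return_tensors="np" yields NumPy arrays, which ONNX Runtime accepts.
    # The keys must match the input names of the exported model.
    inputs = dict(tokenizer(text, return_tensors="np"))
    # run(None, ...) returns every model output; take the first output,
    # then the first (and only) row of the batch.
    logits = session.run(None, inputs)[0][0]
    probabilities = softmax([float(x) for x in logits])
    return ToxicityResult(score=probabilities[TOXIC_INDEX])
```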
Back in our lambda handler, we build a response based on the result of the model and again use the Slack Python SDK to send a message as Toxic Bot to the channel the message was in.
That’s the basics of our system! Putting it all together we get:
Actually deploying this function to AWS Lambda requires a bit of devops work that I won’t get into here, but I will provide a few tips below and encourage you to work through this tutorial.
- AWS Lambda functions can be uploaded in two ways: via a zip file or via a custom container (Docker image). The zip file method has a 250MB size limit which is nowhere near enough to hold the ~4GB of dependencies (PyTorch, numpy, scipy, etc.) along with the model weights themselves. Your best bet is to create a container image with all of the dependencies and model weights baked in.
- When you are creating your container, build two images, one on top of the other. The first should start from a standard Python image and contain all of your dependencies and your model weights. This will be a huge image (~4GB). The second should use the first as its base and contain just the application code we wrote above. Every time you update your lambda function, you need to build and upload the image. Without separating your dependencies from the application, this means building and uploading 4GB images for every little change. Not fun. Split them up and you only need to update the application image (unless requirements or model weights change).
- Slack’s Event API waits three seconds for your API to return a response. If it doesn’t receive a 200 by then, it will retry sending you the event up to three times. This can cause duplicate messages to appear.
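One way to avoid those duplicates is to acknowledge retried deliveries without processing them again. Slack marks retries with an `X-Slack-Retry-Num` header, so a handler can check for it first. The sketch below assumes an API Gateway proxy event with a `headers` dictionary; the function names and the `process_event` callback are hypothetical.

```python
def is_slack_retry(headers):
    """Return True when the request is a retried delivery from Slack."""
    # Header casing can vary between gateways, so compare in lowercase.
    normalized = {str(key).lower() for key in (headers or {})}
    return "x-slack-retry-num" in normalized


def handle_with_retry_guard(event, process_event):
    """Acknowledge retries immediately; run the real work only once."""
    if is_slack_retry(event.get("headers")):
        return {"statusCode": 200, "body": "duplicate delivery ignored"}
    return process_event(event)
```

The trade-off is that an event whose first delivery genuinely failed is dropped rather than retried, which seems acceptable for a bot like this but may not be for other workloads.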
Measuring Overall Performance
As I mentioned at the beginning, I wanted the total latency of the system to stay under ~300ms. AWS Lambda provides metrics on the runtime of each event. Events lower down occur before those higher up.
Reading this, there is considerable latency while the function is warming up (~18s), followed by a response time of about 1s after that, which is slower than our goal. What to do?
You may have noticed in our inference code we loaded the model and the tokenizer on every inference. That’s an 84MB model and a 50,000-token vocabulary each time the model is run. If we controlled the entire server, we would probably want to load the model once and use that same instance to serve multiple requests. By design, though, AWS Lambda functions should be stateless. They are ephemeral compute kernels that get created and torn down on each request. However, the engineers at Amazon have smartly built in automatic caching of functions so that kernels can be reused if multiple requests come in fast enough.
We can exploit this by moving the tokenizer and model initialization outside of the classify function:
Now the model and tokenizer are loaded only on a cold start, that is, when the function is called for the first time in a while. Looking at the data, requests are now returned in just ~200ms, well below our goal.
This system design isn’t perfect, but we’ve achieved our goals of running a sub-100MB transformer model in roughly 200ms in most cases.
To summarize, I built a Slackbot that can identify toxic and hateful messages. I used a pre-trained distilled RoBERTa model checkpoint from the HuggingFace Model Hub and applied optimizations, quantization, and conversion to the ONNX runtime to reduce the model size by 75% and speed up runtime on a CPU by 4X.
I then deployed an AWS Lambda function to handle requests to classify Slack messages and leveraged some clever caching to reduce latency by 80% and achieve my overall performance goal of sub-300ms response times.