As most ML practitioners realize, developing a predictive model in a Jupyter notebook and running predictions on Excel data will not get you to the predictive models required at enterprise scale. To build a model at that scale, you need to consider several requirements and use tools and frameworks designed for the purpose.
Most online tutorials describe productionizing ML models as exposing them as a REST service with Flask. In reality, the requirements are much steeper. In this article, I will explain the key challenges to consider and then provide a containerized, enterprise-scale, ‘fully loaded’ ML jumpstart kit that you can readily deploy for your own model productionization.
I will also show you how to set up the environment with Docker containers, so I assume readers are familiar with Docker and its commands.
You will also need Docker installed on your system.
The entire jumpstart kit with the code is available here and you can refer to it as needed.
Now, let’s get started!
Challenges to Consider and How We are Addressing Them
Challenge 1 — Production Grade App Server
First things first: Flask is a web framework, not a web server or app server, and according to the official documentation its built-in development server is not intended for production use. So, exposing a model as a REST service with Flask alone may not be the right choice. If you want to use Flask, you need to run it behind Gunicorn as the WSGI app server and NGINX as the web server to meet production needs.
How are we addressing this in our jumpstart kit?
A new Python micro-framework, FastAPI, was recently launched and claims to be more than 300% faster than Flask. Apart from performance, it has several advantages: it is easy to learn (very similar to Flask), standards-based, robust, and supports asynchronous execution. Refer to this blog for additional details on the benefits of FastAPI over Flask:
FastAPI runs on Uvicorn, an ASGI-based app server, and supports an asynchronous mode of execution, which suits long-running ML-based tasks.
Challenge 2 — Loosely Coupled Execution
Typical machine learning processes such as model training, data processing, and (in some cases) prediction are long-running tasks, and exposing them directly as a web service may not help you scale. Suppose you have exposed your model as a service and, at the peak of your business, hundreds of calls hit it at once. Your server may not be equipped to handle all the requests and could end up crashing.
A better way to handle this scenario is by using a ‘message queue’ to queue the incoming requests and address them in an asynchronous way with loosely coupled worker processes.
As shown below, every call to the REST API is queued in a message queue, and the consumer takes the messages one by one to execute the tasks. We can add consumers as needed, and because pending requests wait in the queue, the workers are not overloaded.
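The decoupling described above can be illustrated with Python's standard library alone. This is only a toy, in-process sketch: `queue.Queue` stands in for the message broker, threads stand in for the worker containers, and `slow_square` is a hypothetical placeholder for a long-running ML task.

```python
import queue
import threading


def run_workers(tasks, num_workers, handle):
    """Queue every task, then let num_workers threads drain the queue."""
    task_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            task = task_queue.get()
            if task is None:              # sentinel: no more work for this worker
                task_queue.task_done()
                break
            outcome = handle(task)
            with results_lock:
                results.append(outcome)
            task_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for task in tasks:                    # the "API" side only enqueues and returns
        task_queue.put(task)
    for _ in threads:                     # one sentinel per worker
        task_queue.put(None)
    task_queue.join()
    for thread in threads:
        thread.join()
    return results


def slow_square(x):
    # Hypothetical stand-in for a long-running ML task
    return x * x
```

Calling `run_workers(range(10), 3, slow_square)` processes all ten tasks with three workers; the order of the results depends on thread scheduling. Raising `num_workers` is the in-process analogue of adding consumers.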
For queue-based execution, Redis is a popular, simple option. It is an in-memory key-value store that can also hold queued messages. However, because it keeps its data in memory, it is not ideal from a disaster recovery standpoint unless persistence is configured.
Suppose you have messages queued in Redis and your server suddenly goes down. How do you retrieve the messages and serve the requests already placed once it recovers?
Also, suppose a worker has taken a message from the queue for processing and then becomes unstable. How does the master know whether the task completed? Is there any acknowledgment from the worker that the task is done?
How are we addressing this challenge in our jumpstart kit?
RabbitMQ is one of the most widely used message brokers and has several features that address the challenges mentioned above. Queues can be declared ‘durable’ and messages marked ‘persistent’, which lets queued messages survive a broker restart and be recovered after a disaster. It also has a management console for viewing the details of the queues, the status of workers, and so on.
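As a rough sketch of what ‘durable’ and ‘persistent’ look like in code, the snippet below uses the third-party `pika` client. This is an assumption: the article does not say which RabbitMQ client the kit uses. The queue name, host, and tracking ID are hypothetical, and `channel` is assumed to be a pika blocking channel.

```python
QUEUE_NAME = "prediction_tasks"  # hypothetical queue name


def declare_durable_queue(channel):
    """Declare the queue as durable so its definition survives a broker restart."""
    channel.queue_declare(queue=QUEUE_NAME, durable=True)


def publish_persistent(channel, tracking_id, properties):
    """Publish a tracking ID to the queue through the default exchange."""
    channel.basic_publish(
        exchange="",
        routing_key=QUEUE_NAME,
        body=tracking_id.encode("utf-8"),
        properties=properties,
    )


def main():
    # Imported here so the helpers above load even where pika is not installed.
    import pika

    # Assumes a broker reachable on localhost with default credentials.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    declare_durable_queue(channel)
    persistent = pika.BasicProperties(delivery_mode=2)  # 2 = persistent message
    publish_persistent(channel, "example-tracking-id", persistent)
    connection.close()
```

Both settings are needed together: a persistent message published to a non-durable queue is still lost when the broker restarts, because the queue itself disappears.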
Challenge 3 — Setting Up an Environment
Setting up an environment and managing the right software versions is indeed a big task.
How are we addressing it in our jumpstart kit?
The entire jumpstart kit has been dockerized with docker-compose, so you only need to execute a single command to spin up the environment.
Enterprise-Grade ML Jumpstart Kit:
We have discussed in detail the existing challenges and how we will be addressing them in our jumpstart kit. Now, let’s dive deep into the architecture of our kit.
As mentioned, we will be dockerizing our entire kit and the following are the docker components we will have:
Docker 1 – Web Server:
The web server is our main component and combines Uvicorn, FastAPI, and a RabbitMQ client. It exposes the API, which acts as the entry point.
When we call the API with the required data, it stores the data in a database and generates a unique ID for the record.
The unique ID is sent to the RabbitMQ queue as the payload.
The RabbitMQ queue is declared as ‘durable’ and messages are sent as ‘persistent’ to ensure the data can be recovered in case of disaster.
Finally, it responds to the API call with the unique ID as a token for later retrieval of the prediction results.
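The request flow above can be sketched independently of any web framework. In this minimal sketch, `save_record` and `publish` are hypothetical callables standing in for the database write and the RabbitMQ publish, the record layout is illustrative, and generating the ID with `uuid` is an assumption (the kit may rely on a database-generated ID instead).

```python
import uuid


def submit_prediction_request(payload, save_record, publish):
    """Store the input, queue its ID, and return the ID as a tracking token."""
    tracking_id = str(uuid.uuid4())
    # Store first, so a worker that picks up the ID always finds the record.
    save_record({"_id": tracking_id, "input": payload, "status": "queued"})
    # Only the ID travels through the queue; the data stays in the database.
    publish(tracking_id)
    return {"tracking_id": tracking_id}
```

In a FastAPI application, a function like this would typically be called from a POST endpoint, with its return value sent back as the JSON response.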
Docker 2 – RabbitMQ:
RabbitMQ plays the role of message queue server and queues the incoming requests.
Refer to the article below to learn about the ‘RabbitMQ Management Console’, an excellent UI for monitoring the status of queues and messages.
Docker 3 – MongoDB:
MongoDB will act as a database for us, but we can use any other database in its place. The reason behind storing the data in a database is to persist the data and to retrieve the prediction results later.
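As an illustration of how the database ties a request to its later result, here is a small sketch written against the `insert_one`, `update_one`, and `find_one` methods of a pymongo-style collection. The document fields (`input`, `status`, `prediction`) are hypothetical and not taken from the kit.

```python
def save_request(collection, tracking_id, payload):
    """Persist the incoming data under its tracking ID."""
    collection.insert_one({"_id": tracking_id, "input": payload, "status": "queued"})


def save_prediction(collection, tracking_id, prediction):
    """Attach the prediction to the stored record and mark it as done."""
    collection.update_one(
        {"_id": tracking_id},
        {"$set": {"prediction": prediction, "status": "done"}},
    )


def get_prediction(collection, tracking_id):
    """Return the prediction, or None if the record is missing or still pending."""
    doc = collection.find_one({"_id": tracking_id})
    if doc is None or doc.get("status") != "done":
        return None
    return doc["prediction"]
```

Because the functions only depend on these three methods, another database could be swapped in behind the same interface, as the paragraph above suggests.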
Docker 4 – Model Server:
The model server is the working component that does all the heavy lifting for us.
Multiple replicas of this component can be created, and each of them polls the queue for messages and takes them for processing.
This component has access to the model. Once it retrieves a unique ID from the queue, it queries the corresponding data from the database and feeds it to the model for prediction.
This component can be used for training, data preprocessing, prediction, and so on. In the given example, a model that forecasts the upcoming month's sales based on past sales has been deployed. Please refer to the link below for additional details about the model:
Once the prediction is done, results will be written back to the database.
At the end of the task, acknowledgement will be sent back to the queue indicating the task is done.
If, for some reason, the consumer dies, RabbitMQ will not receive the acknowledgement and it will send the same task to a different worker for processing.
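The worker steps above can be sketched as a consumer callback. The callback signature and the `basic_qos`, `basic_consume`, and `basic_ack` calls follow the third-party `pika` client, which is an assumption about the kit's client library. `load_record`, `predict`, and `save_result` are hypothetical callables for the database read, the model, and the database write.

```python
def make_callback(load_record, predict, save_result):
    """Build a message handler that predicts, stores the result, then acknowledges."""

    def on_message(channel, method, properties, body):
        tracking_id = body.decode("utf-8")
        record = load_record(tracking_id)
        prediction = predict(record["input"])
        save_result(tracking_id, prediction)
        # Acknowledge only after the result is stored. If the worker dies
        # before this line, the broker redelivers the message.
        channel.basic_ack(delivery_tag=method.delivery_tag)

    return on_message


def start_worker(channel, queue_name, callback):
    """Consume messages one at a time with manual acknowledgements."""
    channel.basic_qos(prefetch_count=1)  # at most one unacknowledged message per worker
    channel.basic_consume(queue=queue_name, on_message_callback=callback)
    channel.start_consuming()            # blocks until consuming is stopped
```

Acknowledging last means a task may occasionally run twice after a crash, so writing the result should be safe to repeat for the same tracking ID.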
All the individual docker components will be executed as a service in a common network using the docker-compose feature. The file is available below:
Key points to note in the docker-compose file are:
- Each Docker component is defined as a service.
- All the Docker services are defined in a common network.
- The WebServer and ModelServer services are defined to depend on RabbitMQ and DBServer.
- RabbitMQ takes 15–20 seconds to start up, and the dependency declaration only controls start order, not readiness. Connecting to RabbitMQ before it is ready results in an error, so I have added an on-failure restart policy for WebServer and ModelServer.
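The kit handles the RabbitMQ start-up delay with a restart policy. A complementary approach, not part of the kit as described, is to retry the connection inside the application. The sketch below assumes a hypothetical `connect` callable that raises an exception until the broker is ready.

```python
import time


def connect_with_retry(connect, attempts=10, delay_seconds=3.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as error:  # a real client would catch its specific connection error
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)  # wait before the next attempt
    raise ConnectionError(f"broker not reachable after {attempts} attempts") from last_error
```

With the defaults, the function waits up to 27 seconds in total, which covers the 15–20 second start-up window mentioned above.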
So, we are all set. Run the command below and let the services spin up for you:
Now, the instances will fire up and the API will be available for your consumption.
Open ‘http://localhost:8080/docs/’ on your machine to see the page below with the details of the API.
Now take the ‘tracking_id’ from the response and call the ‘prediction_result’ API to retrieve the prediction value, as shown below:
Scaling Up the Server
We now have a functioning ML environment with all the required components, but we are still missing one critical need: scalability. Suppose you have deployed the ML solution and there is a sudden surge in the number of people accessing your prediction service. In that case, you would want to scale the ML environment out to multiple workers so that the messages in the queue are processed in parallel.
Achieving this in the jumpstart kit is a breeze and we will be leveraging ‘Docker Swarm’ for this.
Enable ‘Docker Swarm’ in docker with the following command:
Then execute the command below to deploy the Docker stack, which creates all the Docker services.
Check the status of docker services with the below command:
Now, as discussed earlier, suppose you want to increase the number of worker components to 4. You can do that with the command below:
Now, check the status of docker services with the below command:
As you can see above, you have just spun up 4 instances of the model server, and the tasks in the queue will now be served by 4 workers.
Thus, we have come to the end of this article, where we discussed in detail the key challenges to consider while productionizing ML models for enterprise needs.
We have also walked through the enterprise-scale ML jumpstart kit comprising FastAPI + RabbitMQ + MongoDB + Docker + Docker Swarm. The kit is robust and can scale to meet your enterprise needs.
Please let me know if you have any queries/suggestions with regard to the jumpstart kit.