Machine learning, deep learning, neural networks, artificial intelligence. You can’t work a day in tech without coming across one of these buzzwords.
For a developer just looking to get started, it’s hard to wade through the jargon and ever-changing tools. This post is the first in a broader series we’re working on at Heartbeat to help people get up to speed or stay current.
First, some disclaimers. I came to machine learning by way of physics and economics, not computer science. When it comes to software development, I’m self-taught. My language of choice is Python and the same is true for most machine learning specialists.
That said, there are a lot of tools out there in almost every language and you can build amazing things with all of them. Don’t @ me. Leave a comment and I’ll add more tools to the list.
Second, there are a lot of great startups building tools for this ecosystem (I run one of them), but this list is meant to feature ones you are most likely to run into on a daily basis.
Now it’s time for the good stuff. A glossary of machine learning tools as they stand in early 2019. I’ve organized this list from lowest level and most customizable, to highest level and most abstracted.
On your laptop
If you’re new to Python you’re going to need to get familiar with the Scientific Python stack. You’ll rely on these packages for data exploration, preprocessing, debugging, prototyping, and visualization.
Numpy is the foundation of scientific Python. It defines N-dimensional array objects and an entire world of methods to manipulate them. Everything from multiplying matrices to Fourier transforms is included. The linear algebra is implemented in C / C++ and Fortran, providing great performance. You’ll want to avoid writing loops by vectorizing your code, which can be a little tricky at first. Link
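To make the vectorization advice concrete, here is a minimal sketch comparing a plain Python loop with the equivalent Numpy expression. The data and function names are made up for illustration.

```python
import numpy as np

# Hypothetical data: one million evenly spaced values between 0 and 1.
x = np.linspace(0.0, 1.0, 1_000_000)

def sum_of_squares_loop(values):
    """Loop version: every multiply and add runs in the Python interpreter."""
    total = 0.0
    for v in values:
        total += v * v
    return total

def sum_of_squares_vectorized(values):
    """Vectorized version: the same arithmetic runs in Numpy's compiled code."""
    return float(np.dot(values, values))

# Matrix multiplication uses the @ operator on 2-D arrays.
a = np.arange(6.0).reshape(2, 3)   # shape (2, 3)
b = np.arange(6.0).reshape(3, 2)   # shape (3, 2)
product = a @ b                    # shape (2, 2)
```

Both functions compute the same number. On large arrays the vectorized one is usually much faster, though the exact speedup depends on your machine.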
Whereas Numpy provides basic data structures and linear algebra operations, Scipy contains a broader set of scientific computing algorithms covering probability, statistics, optimization, integration, interpolation, and more. Again, C / C++ and Fortran implementations are under the hood. Link
Pandas is by far the most popular data exploration and manipulation package in Python. At its core is the concept of a DataFrame which acts as an in-memory relational database. You can do basic operations on columns and rows as well as aggregations on arbitrary groups. Basic time series routines are also included. Pandas is a bit of a memory hog, so be careful. Link
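Here is a small sketch of the column, row, and group operations described above. The table contents and column names are invented for the example.

```python
import pandas as pd

# Hypothetical sales records; column names are invented for this example.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "units": [10, 20, 5, 15, 25],
    "price": [2.0, 2.0, 3.0, 3.0, 3.0],
})

# Column operation: derive a new column from two existing ones.
df["revenue"] = df["units"] * df["price"]

# Row filtering with a boolean mask, similar to a SQL WHERE clause.
big_orders = df[df["units"] >= 15]

# Aggregation over groups, similar to SQL GROUP BY.
revenue_by_region = df.groupby("region")["revenue"].sum()

# Rough check of the in-memory footprint, in bytes.
memory_bytes = df.memory_usage(deep=True).sum()
```

Checking memory_usage on a sample of your data before loading the full set is one way to avoid the memory surprises mentioned above.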
Matplotlib is the de facto plotting library for scientific Python. It contains a large number of predefined plotting types, but has low level APIs that make it possible to build GUIs if you want to. Link
NetworkX is an easy to use, well supported network package. It has support for almost any type of graph and has methods for computing almost every graph metric under the sun. The one downside to NetworkX is that it’s implemented in pure Python and can be slow and a big memory hog. Link
Graph-tool is a faster, C++ based alternative, but isn’t as well supported and is a pain to install on some systems. Link
Jupyter notebooks are interactive documents you can use to write, execute, and share code. You can think of them as an interactive Python shell in your browser. All of the code and output is saved locally as JSON. Notebooks have become the default way for data scientists and machine learning engineers to share prototypes and examples. With GitHub now rendering notebooks natively, you’ll be sure to run into them. Link
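Because a notebook is just JSON, you can read one with nothing but the standard library. The sketch below builds a stripped-down notebook by hand, assuming version 4 of the notebook format. Real files carry more metadata than this, so treat the structure as an approximation.

```python
import json

# A hand-written, minimal notebook (assumed nbformat 4 layout, simplified).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print(1 + 1)"],
            # Output is stored right next to the code that produced it.
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}

# Round-trip through text, which is what saving and opening a notebook does.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)

# Pull the source out of every code cell.
code_cells = [c for c in loaded["cells"] if c["cell_type"] == "code"]
sources = ["".join(c["source"]) for c in code_cells]
```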
General Machine Learning
Before neural networks came back into vogue, it was all about boring old machine learning. Support vector machines, random forests, even linear regression are still useful techniques you should try first before jumping straight into deep learning. Many of these tools will use Numpy and Scipy routines under the hood and all of them come with a bunch of algorithms right out of the box.
Sklearn is the most popular general machine learning package for Python. It contains high level APIs for training a huge variety of models such as linear regressions, GLMs, Random Forests, and SVMs just to name a few. There are also plenty of utilities to compute confusion matrices, ROC and AUC, and other statistics related to your model. Sklearn is supported by default on all of the major cloud-based machine learning platforms as well as Apple’s Core ML mobile friendly format. Link
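As a rough sketch of that workflow, the example below trains a random forest on synthetic data and computes a confusion matrix and AUC. The dataset is generated on the fly and the parameter values are arbitrary choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (made up for this example).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train a model with the usual fit / predict pattern.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_test, predictions)

# AUC needs scores rather than hard labels, so use the class-1 probability.
scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, scores)
```

Swapping in a different model is usually a one-line change, since most Sklearn estimators share the same fit and predict methods.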
XGBoost is an open source implementation of distributed Gradient Boosting on nearly every data science and big data platform. Gradient Boosting is an extremely powerful technique that combines multiple “weak” models into a large ensemble. Though each individual model may not be very accurate, combining the predictions of all models in the ensemble often produces great results that generalize well to broader sets of data you find in production. XGBoost has been used to win more Kaggle competitions than any other modeling technique, but can be a bit of a black box. Link
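To take some of the mystery out of the black box, here is a toy version of the boosting idea written with Numpy. It is not XGBoost's actual algorithm, which adds regularization, full trees, and many optimizations. It only shows how weak one-split models, each fit to the errors left by the previous ones, add up to a better fit. The data and all names are invented.

```python
import numpy as np

def fit_stump(x, target):
    """Fit a one-split weak model: pick the threshold with the lowest squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:   # skip the max so both sides are non-empty
        left, right = target[x <= threshold], target[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]                       # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_boosted(x, y, n_rounds, learning_rate=0.1):
    """Each round fits a stump to the errors left over by the ensemble so far."""
    base = float(y.mean())
    prediction = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, learning_rate, stumps

def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Hypothetical noisy data following a sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 200)
y = np.sin(x) + rng.normal(0.0, 0.1, size=200)

def mse(model):
    """Mean squared error on the training data."""
    return float(np.mean((y - predict_boosted(model, x)) ** 2))

single_mse = mse(fit_boosted(x, y, n_rounds=1))
ensemble_mse = mse(fit_boosted(x, y, n_rounds=100))
```

The ensemble's training error comes out lower than the single stump's. Note that this only measures fit on the training data, so it says nothing about how well either model generalizes.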
Statsmodels is a high level API used to construct and fit statistical models like multivariate regression, GLMs, and more. Statsmodels will feel familiar to those who have defined multivariate linear models in R or used modeling software like SPSS or STATA. The library also has a bunch of useful utility functions for performing hypothesis testing, regression diagnostics, and specification tests. Statsmodels doesn’t have a converter to mobile formats like Core ML so if that’s your end destination, it’s best to use Sklearn. Link
Pomegranate is a new kid on the block providing a high level API for building probabilistic models from data. This includes things like mixture models, bayes classifiers, and Markov models. Link
Spark / Hadoop
Two years ago you couldn’t talk about machine learning without hearing about tools like Hadoop and Spark. Now, they barely get any mention at all as deep learning and neural networks dominate the hype cycle. Plenty of problems can be solved with traditional machine learning algorithms in Spark’s MLlib and you should absolutely consider one of these solutions before jumping straight to TensorFlow. Many companies are already using these tools to populate data warehouses and dashboards, making it easier to get results than starting from scratch with another service to manage. Link
SystemML is a former IBM, now Apache project aimed at making it easier to create optimized machine learning and analytics pipelines. It comes with a bunch of ML algorithms out of the box that can be used for classification or regression problems and has high level APIs to primitives that let users write their own.
SystemML uses Spark as a backend by default, but can be used with others like Hadoop. SystemML is most closely related to the MLlib library included with Spark, but with some additional abstractions and optimizations. Link
If you’ve just got to try the latest and greatest, deep learning is it. Convolutional neural networks have proven incredibly capable when it comes to tasks like image recognition, object detection, and language translation. There is no shortage of tools to get started with.
TensorFlow is one of the fastest growing and most popular open source software projects of all time. You’ve probably heard TensorFlow in association with neural networks and deep learning, but it is a general framework for executing numeric operations using data flow graphs.
In other words, you can perform almost any mathematical operations on matrix-like data. More practically, beginners should be aware that TensorFlow has a deferred execution model that can be tricky. You execute code in two parts: build your computation graph then feed data through it.
This can take some getting used to and it can make it hard to track down errors.
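If the two-step model sounds abstract, the toy below mimics it in plain Python. None of this is TensorFlow's API. The Node class, placeholder, and run are hypothetical stand-ins meant only to show the build-then-feed pattern and why errors show up late.

```python
class Node:
    """One operation in a toy computation graph (hypothetical, not TensorFlow)."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = inputs
        self.name = name

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))

def placeholder(name):
    """A slot in the graph that gets its value only when the graph is run."""
    return Node("placeholder", name=name)

def run(node, feed):
    """Evaluate a node by recursively evaluating its inputs."""
    if node.op == "placeholder":
        return feed[node.name]   # raises KeyError if the value was never fed
    left, right = (run(n, feed) for n in node.inputs)
    return left + right if node.op == "add" else left * right

# Step 1: build the graph. No arithmetic happens here.
x = placeholder("x")
w = placeholder("w")
b = placeholder("b")
y = x * w + b                    # y is a Node, not a number

# Step 2: feed data through the graph.
result = run(y, {"x": 3.0, "w": 2.0, "b": 1.0})
```

Forgetting to feed a value only fails inside run, far from the line that defined the graph. That gap is a small-scale version of why debugging deferred execution can be frustrating.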
Keras is a high level API for building neural networks. Keras makes it simple to chain together various predefined layers or write your own. Actual computation is done using one of two supported backends: TensorFlow or the now defunct Theano. If you’re just getting started with neural networks, don’t need to do anything crazy, and want to stay within the TensorFlow family, Keras is your best bet. Link
Caffe (1 and 2)
Caffe is a deep learning framework originally developed at Berkeley. It was an early favorite with some heavyweights like Facebook, but lately it’s lost some luster as other entrants like TensorFlow have gained popularity. Caffe APIs are available in C++ and Python. Facebook has continued development with Caffe 2, adding better support for distributed training and inference, newer hardware, and mobile deployment. When you’re looking through Caffe projects, be aware of the version number (1 or 2). While converters work fairly well, it’s not a guarantee that you can translate freely between the two. Link
PyTorch is a relative newcomer to the deep learning scene, celebrating its second birthday in January 2019. In that short time, though, it’s gained considerable traction and is the framework of choice for popular deep learning courses like fast.ai. The biggest difference between PyTorch and other frameworks is that PyTorch is imperative. When you execute a line of code, it’s actually executed. Models don’t get compiled and evaluated later. Many programmers find this a lot more intuitive. As you may have guessed from its name, PyTorch is Python first and only (for now). Link
Theano is an older framework similar to TensorFlow in that it builds and executes arbitrary computation graphs for Tensors. It was one of the first packages that made transparent use of GPUs rather than having to write GPU calculations yourself. Shortly after releasing version 1.0 in late 2017, maintainers announced that active development would cease as it had become clear that TensorFlow had won. Theano is Python only. Link
Neon is a deep learning framework developed by Intel. It’s unclear if it offers anything at the API level that would warrant switching from something you are currently running on, but they offer a lot of optimization for Intel chips (big surprise) so if that’s a big concern for you, it might be worth checking out. Link
Turi Create is the highest level unmanaged deep learning toolkit available. Open sourced after Apple’s acquisition of its namesake, Turi Create makes it extremely easy to train custom models that do very specific tasks like image recognition. Turi Create is ten times easier to use than Keras which is ten times easier to use than raw TensorFlow. If you just want to get a model that does something, anything, for a project, start here. Turi Create only supports Python right now, but uses MXNet under the hood so you can leverage GPUs and port your model to other platforms later. Check out this guide on making a model to identify hotdogs and nothotdogs to see just how easy it is to use. Link
Released by Apple in the summer of 2018, Create ML is a high level Swift-based toolkit for training Core ML models directly in Xcode. A combination of notebook style programming via Swift playgrounds and a drag and drop GUI makes it very easy for beginners to get started. Models are mobile-first, designed to be used directly in apps. Link
Unmanaged, Distributed Clusters
All of the tools above this point can be used just fine on your laptop. Large deep learning models, though, can take days or weeks to train and chances are, you don’t want to leave your laptop running untouched for that long. When you’re ready to move things off of your laptop, you’ve got options. We’ll start with the least managed ones.
Bespoke — DIY Hardware
Due to the high cost of powerful GPU instance types on major cloud platforms, many have found it economical to build their own deep learning rigs. At the time I’m writing this, it costs between $2500 and $4000 to run the smallest GPU instance continuously for a year in the cloud. For that money, you can build an impressively fast machine yourself (crank up the settings on your video games at the same time). If you’re willing to fight the crypto miners for their precious GPUs, check out this guide for help picking the right one.
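The arithmetic behind that comparison is easy to run yourself. The sketch below converts the annual figures above into hourly rates and estimates a break-even point. The $3000 rig price is a hypothetical number picked for illustration, and the calculation ignores electricity, maintenance, and resale value.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours of continuous use

def hourly_rate(annual_cost):
    """Implied hourly price of an instance that costs annual_cost per year."""
    return annual_cost / HOURS_PER_YEAR

def break_even_months(rig_cost, annual_cloud_cost):
    """Months of continuous cloud use that would cost as much as the rig."""
    return rig_cost * 12 / annual_cloud_cost

low = hourly_rate(2500)    # roughly $0.29 per hour
high = hourly_rate(4000)   # roughly $0.46 per hour

# Hypothetical $3000 rig compared against the $4000-per-year cloud figure.
months = break_even_months(rig_cost=3000, annual_cloud_cost=4000)
```

If you only train a few hours a week, the numbers tilt back toward the cloud, so plug in your own expected usage before buying parts.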
Bespoke — Cloud
Not too long ago, the only way to serve or train machine learning models in the cloud was to spin up an EC2 or Google Compute Engine instance and install everything by hand. This could get very complicated very quickly if you needed to handle extra hardware like GPUs and maintain queues to handle potentially long processing times. I still have a few cloud storage volumes saved with specific configurations of various frameworks, but hopefully you won’t ever need to go through the process of installing GPU drivers on a virtual machine again.
The container craze has finally made it to machine learning. From custom deep learning AMIs on AWS to Docker files you can deploy and scale with Kubernetes, nearly every major framework can be deployed this way. If you’re just prototyping on your laptop, I wouldn’t bother with the overhead, but if you have semi-regular training jobs or plan on serving machine learning powered APIs from the cloud, containers are definitely the way to go.
Management / Optimization
Unlike other types of programming, developing machine learning models comes with a lot less instant gratification. Your code might execute fine, but a bug in your math means that training is never going to converge on a good solution. It can take days or weeks of training before you can evaluate results, so catching errors early is extremely important. There are a few tools that help you debug, track training progress, and monitor model accuracy along the way.
NVIDIA DIGITS is a tool to help you design, train, and visualize deep learning. DIGITS lets you use most of the major deep learning frameworks and provides distributed GPU support out of the box. You can build networks, explore training data, track training progress, and evaluate resulting models all from a locally hosted web app. Link
Similar to DIGITS, TensorBoard is a tool provided by the TensorFlow team to help you visualize learning. You can track progress of model training, visualize the computation graph you’re executing, and visualize data as it flows through your code. Link
Horovod is an open source framework developed by Uber to optimize distributed training of TensorFlow models. Horovod achieves some impressive speedups by swapping out the message passing systems that TensorFlow uses to keep distributed computation in sync. Setup is as easy as wrapping TensorFlow optimizers and variables in Horovod objects. Link
Also from Uber, Michelangelo is an entire machine learning platform. Its goal is to provide a scalable, robust way to go through every step of the development process from accessing raw data to monitoring models in production. It integrates with data warehouses and allows users to train models with almost all of the frameworks we’ve discussed above. Michelangelo also provides automatic visualization of features and models through a web interface. Link
Partially Managed Services
If you don’t want to tune and manage giant clusters, all of the major cloud providers now offer partially managed services to deal with things like auto-scaling and deployment for you. You still need to develop your algorithms locally, but training and running them in production on the cloud is taken care of.
Google Cloud ML
Google Cloud has come a long way. Cloud ML gives you access to Google Cloud resources specifically configured for machine learning workloads. You can grab data from Google Cloud Storage, use it to train models using TensorFlow and GPU instances, then deploy the finished product to Compute Engine. Google even provides a nice set of command line tools to help you get models you’ve prototyped locally up into the cloud for more intensive workloads. Because it’s Google, only TensorFlow is supported when it comes to deep learning, but Sklearn is available for other model types. Google offers a product called Datalab which functions like a Jupyter Notebook running remotely inside Google Cloud ML. Link
Azure ML Studio
Microsoft’s Azure ML Studio has a very specific take on machine learning. They’ve created a visual interface for training and serving models. Every step of the process, from data transformation to prediction, is constructed by connecting different tasks in a visual computation graph.
Individual operations like “select these columns from the database” and “run k-means clustering” are defined by the user and the system deals with the backend computation. You can choose from over 100 algorithms out of the box or insert your own arbitrary scripts.
Beware, though: developing models with Azure ML Studio is an entirely different workflow than with all the other ML frameworks. Right now, there is no support for deep learning, but Microsoft has an Azure Batch AI product in beta. Link
SageMaker is one of the newest services available in the AWS family. It competes directly with Google Cloud ML to help you train, deploy, and manage machine learning models in the cloud. You get hosted Jupyter notebooks and can pipe data directly from S3 into predefined algorithms or your own custom defined model in any framework. Amazon claims that all of their algorithms have been optimized to run on AWS. SageMaker takes care of setting up clusters, auto-scaling them, and running them in production. As of 2019, SageMaker also has a tool called Neo, which optimizes ML models specifically for mobile runtimes. Link
Managed Hadoop / Spark
If you’re not doing deep learning and are using Hadoop or Spark for analytics and standard ML, Amazon’s EMR and Google’s Dataproc services are easy ways to get more resources for your calculations. You can prototype locally, then use command line tools to deploy tasks to the cloud. The cloud provider takes care of setup and teardown for you and you only pay for servers while the job is running. These products are best used for batch jobs and won’t help you when serving online inferences with trained models.
If you want tools built by a company that does machine learning and only machine learning, H2O might be a good option for you. They provide web based tools much like AWS SageMaker and Azure ML Studio to help you manage the entire machine learning process from data exploration to model selection and training to deployment. Link
Fully Managed Services
As you can probably tell from this post, machine learning is still really complicated. Automatic machine learning promises to make things easier. The idea is to give a computer all of the data you have and what you want to predict with it, and let the machine find the best model and architecture for the task, no coding required.
Google got into a bit of PR trouble for the name, but AutoML is their first attempt at automated machine learning. Point the service at a bucket of labeled images, and it will automatically train and deploy a model for the task using a combination of transfer learning and automated architecture search. You don’t need to prototype or program a thing. Link
Microsoft Custom Vision Service
Microsoft is testing a new service that allows anyone to upload labeled images and create a custom neural network trained to classify them. Models can be served from an API created in Azure or exported for use somewhere else, including mobile via Core ML. Link
IBM Watson Studio
IBM Watson Studio is a new set of tools released by IBM to help data scientists build and deploy models with as little code as possible. You upload your data to the IBM cloud, select various features and things you are trying to predict, and Watson Studio will automatically attempt to build a model for that task. You can then deploy to the cloud and receive an API or export certain models in mobile friendly formats like Core ML. Link
DataRobot’s approach to automatic machine learning involves some brute force. Point it to data, pick what you want to predict and the features you’ll be using, and DataRobot tries hundreds of different algorithms and parameters to automatically find the best performing ones. Once you’ve chosen the model you want, DataRobot will build and manage an API endpoint for you. Link
Not wanting to get left out of the AI game, Salesforce offers Einstein, an AI tool that claims to automatically generate insights about your business and customers from your Salesforce data. This is really just a feature of Salesforce, but it’s one of the few completely automated business insight tools available. Link
Pre-trained Managed APIs
The problem you’re tackling may have already been solved for you. Things like general image recognition with thousands of object categories, language translation, and voice recognition are all offered as APIs by the major cloud providers.
Thanks to the enormous amount of data it owns, Google has been winning the AI war from the start. They give you access to some of their best models through web APIs in Google Cloud. Google Cloud Vision lets you send images and receive almost a dozen attributes from object labels to logo detection to OCR. Cloud Speech API converts speech to text. Translation converts text from one language to another. Cloud Natural Language will help you parse unstructured text, and Cloud Video Intelligence will even create metadata from videos. Depending on the exact thing you want to use services for, it’s best to Google around for head to head comparisons as there can be quite large differences between providers. For example, just because one provider has the best label detection doesn’t mean it will provide the best face detection.
Amazon’s flagship API is Lex, a conversational UI builder. Lex offers automatic voice recognition and natural language processing so you can build services like Amazon’s Alexa. AWS also offers Rekognition for image and vision, Polly for text to speech, Transcribe for transcription of large audio files, and Comprehend for natural language processing.
Algorithmia takes a broad approach to fully managed machine learning APIs. Acting like the GitHub for hosted machine learning models, they allow anyone to upload a pre-trained model and access it through an API endpoint. Individual developers or organizations can then charge per request to use the model. Public and private endpoints are available. Link
Clarifai was one of the first startups to offer image analysis APIs. They have a number of proprietary models with excellent results for specific image recognition tasks like identifying apparel, celebrities, or food in photos. If you’re trying to do something slightly more specific than what Google or AWS will allow, check Clarifai before building from scratch. They recently announced a mobile SDK as well. Link
The cloud is definitely the way to go for training your models, but there are plenty of great reasons to deploy your models directly on to mobile devices. You cut out network latency meaning things just run faster, providing a better user experience.
You save money by not having to run expensive GPU instances in the cloud. And your user’s data remains safely on the device itself. This is one of the newest areas of machine learning and things can change fast.
One final point of note: most of the machine “learning” on mobile isn’t actually “learning”. Pre-trained models are simply being used on new data gathered by mobile apps. There are a few research projects for on-device training, but we aren’t there yet.
Apple gives you a few models right out of the box including their Vision Framework which has face detection, text detection (just bounding boxes, not OCR), barcode detection, and object tracking. ARKit also provides computer vision models that can detect horizontal surfaces and track position in 3D space. Link
Android has a very similar set of computer vision functions available through their Mobile Vision API. They don’t offer anything in the way of natural language processing. ARCore is Android’s answer to ARKit and has absorbed the software side of the now defunct Tango project.
OpenCV is one of the most mature, well supported computer vision frameworks. Both iOS and Android have been supported for years and there are tons of algorithms already implemented including the latest neural networks. I was really surprised at the breadth of implementations. Link
TensorFlow Lite / Mobile
There are two mobile friendly flavors of TensorFlow. The first is TensorFlow Mobile, a version of TensorFlow that compiles with mobile in mind. Android and iOS are both supported. TensorFlow Mobile will get you up and running, but it’s only capable of utilizing your phone’s CPU and will ignore newer AI specific hardware. As of Android 8.1, Google released TensorFlow Lite as an answer to Apple’s Core ML. TFLite supports a smaller set of operations but takes advantage of hardware accelerated ML APIs opened up by the operating system.
Core ML / Metal
Core ML was announced by Apple at WWDC ’17. It is a specification for trained models that Apple devices can parse and compile into hardware accelerated machine learning code. Converters now exist to transform models from most frameworks (e.g. TensorFlow or Caffe) into Core ML files which can be added to your iOS project. Metal is Apple’s framework for interacting directly with GPUs on devices. The most recent version of Core ML allows developers to write custom operations that leverage all available hardware, making it possible to implement almost any model if you’re willing to dig into low level GPU code.
With Caffe2, Facebook added support for integration with both iOS and Android. Surprisingly, Facebook reports that their optimized CPU implementation actually outperforms older GPU frameworks on devices like the iPhone 6s. I haven’t seen any direct comparisons with Core ML and iOS 11. Link
Accelerate was the original way of doing hardware accelerated anything on iOS or macOS. There are low level functions for building and evaluating neural networks and you may run into apps that still use these APIs as there isn’t an easy way to switch over to Core ML. Link
Every framework has its own format for importing and exporting models. Converting between them is still a bit of a chore, but support is getting better.
- Core ML (.mlmodel): A protobuf that gets compiled into an .mlmodelc file for execution in your app.
- TensorFlow (.pb): A plain protobuf file describing the TensorFlow compute graph and any trained parameters.
- TensorFlow Lite (.tflite): For TensorFlow Lite Google created another storage format called a FlatBuffer which saves extra space over protobufs.
- MXNet (json): We all know and love json.
- Microsoft MMdnn (.pb, .npy, json): An intermediate format from Microsoft used for translating between frameworks.
- Caffe (.caffemodel): A binary proto file. Caffe can also export to HDF5 files as well.
- ONNX (.onnx): An open source neural network intermediate format maintained by a collaboration between Microsoft and Facebook.
Machine learning models all use a very specific set of linear algebra operations. To improve speed and performance, many frameworks come with compilers specifically designed to optimize these functions. If you’ve got a model and speed is holding you back, there are a few options.
XLA is an experimental accelerated linear algebra compiler built by the TensorFlow team. It allows for device specific implementations of various tensor operations so that execution is optimized for each device’s computation and memory constraints. Link
NNVM is another open source compiler released by Amazon. It is now the default compiler for MXNet, and was responsible for speeding up neural networks on small devices like the Raspberry Pi by 10X. Link
NGraph is Intel Nervana’s compiler for optimizing deep learning frameworks to run on Intel chips. It can be used as a backend for TensorFlow and Intel’s own framework, Neon. Link
OpenAI, Elon Musk’s AI initiative, has open sourced some fairly unique tools that do a good job of staying platform agnostic. OpenAI Gym helps with developing reinforcement learning algorithms, RoboSchool is a simulation framework for learning control algorithms for robots, and Universe contains tools for training and measuring the performance of AI agents on extremely complex, real-world tasks.
Sources of data to actually train your algorithms will be coming in another post.
I hope this helps get you started with machine learning. If I missed your favorite tool or got something wrong, comment below!