The 5 Deep Learning Frameworks Every Serious Machine Learner Should Be Familiar With

Deep learning requires a lot of computation. It typically involves neural networks with many nodes, and every node has many connections that must be updated constantly during learning. In other words, at each layer of the network, hundreds or thousands of identical artificial neurons perform the same computation.

Therefore, the structure of a neural network fits very well with the kind of computation that a GPU (Graphics Processing Unit) performs efficiently: GPUs are designed to execute the same instructions in parallel across many data elements.
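
To make the "same computation, many neurons" point concrete, here is a small NumPy sketch of my own (the sizes and variable names are just assumptions for illustration). It computes one layer's pre-activations neuron by neuron, then as a single matrix-vector product, which is the form that parallel hardware handles well.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(500)          # one input example
W = rng.standard_normal((50, 500))    # one row of weights per neuron

# Neuron by neuron: every neuron runs the same dot product on its own weights.
one_at_a_time = np.array([W[i] @ x for i in range(W.shape[0])])

# All neurons at once: the same work expressed as one matrix-vector product.
all_at_once = W @ x

print(np.allclose(one_at_a_time, all_at_once))
```

Both versions produce the same 50 numbers; the second simply hands the whole layer to the linear algebra backend in one call.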

As the deep learning and AI fields have moved extremely fast in the last few years, we’ve also seen the introduction of many deep learning frameworks. Deep learning frameworks are created with the goal of running deep learning systems efficiently on GPUs. They all rely on the concept of computational graphs, which define the order of the computations that need to be performed.

What you have in these frameworks is a language that sets up the computational graph and an execution mechanism that’s separate from the host language. The graph can then be optimized and run, in parallel, on the target GPU.
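
As a rough illustration of that split between building a graph and executing it, here is a toy sketch in plain Python. The `Node`, `placeholder`, `add`, `mul`, and `run` names are hypothetical helpers I made up for this post; no real framework is implemented this simply, and there is no optimization or GPU execution here.

```python
# Minimal define-then-run computational graph (illustrative only).

class Node:
    """A graph node: an operation plus the nodes that feed into it."""
    def __init__(self, op, inputs=(), name=None):
        self.op = op            # callable that computes this node's value
        self.inputs = inputs    # upstream nodes
        self.name = name        # set only for placeholders

def placeholder(name):
    # A placeholder has no op; its value is supplied at run time.
    return Node(op=None, name=name)

def add(a, b):
    return Node(lambda x, y: x + y, (a, b))

def mul(a, b):
    return Node(lambda x, y: x * y, (a, b))

def run(node, feed):
    """Evaluate `node` by recursively evaluating its inputs first."""
    if node.op is None:
        return feed[node.name]
    args = [run(n, feed) for n in node.inputs]
    return node.op(*args)

# Step 1: define the graph. Nothing is computed here.
x = placeholder("x")
w = placeholder("w")
b = placeholder("b")
y = add(mul(x, w), b)

# Step 2: run the same graph with different inputs.
result_1 = run(y, {"x": 2.0, "w": 3.0, "b": 1.0})
result_2 = run(y, {"x": 4.0, "w": 0.5, "b": -1.0})
print(result_1, result_2)
```

Because the graph exists as a data structure before anything runs, a real framework can inspect and optimize it, then dispatch the work to a device.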

In this post, I want to introduce to you the 5 frameworks that are the workhorses of deep learning development. They make it easier for data scientists and engineers to build deep learning solutions for complex problems and perform tasks of greater sophistication. These are just a small selection of a wide range of open-source frameworks, backed by different tech giants, which push each other to innovate faster.

1 — TensorFlow (Google)

TensorFlow was originally developed by researchers and engineers working on the Google Brain Team. Its purpose is geared towards deep neural networks and machine intelligence research. The library has officially been open-sourced on GitHub since late 2015. TensorFlow is extremely useful to do graph-based computations quickly. The flexible TensorFlow API can deploy models across multiple devices with its GPU-supported architecture.

In brief, the TensorFlow ecosystem has 3 main components:

The TensorFlow API, written in C++, contains the API to define the models and train the models with data. It also has a user-friendly Python interface.
TensorBoard is a visualization toolkit to help with analyzing, visualizing, and debugging TensorFlow graphs.
TensorFlow Serving is a flexible, high-performance serving system used to deploy pre-trained machine learning models in production. Also written in C++ and accessible with a Python interface, Serving is able to switch from old to new models instantaneously.

TensorFlow has been used widely in academic research and industrial applications. Some notable current uses include Deep Speech, RankBrain, SmartReply, and On-Device Computer Vision. You can check out some of the best official uses, research models, samples, and tutorials of TensorFlow at this GitHub repo.

Let’s take a look at a running example. Here I train a 2-layer ReLU network on random data with L2 loss in TensorFlow, using the graph-and-session API of TensorFlow 1.x.

import numpy as np
import tensorflow as tf

# Batch size = 32, Input Dimension = 500, Hidden Dimension = 50

# Define computational graph
# Create placeholders
x = tf.placeholder(tf.float32, shape=(32, 500))
y = tf.placeholder(tf.float32, shape=(32, 500))
w1 = tf.placeholder(tf.float32, shape=(500, 50))
w2 = tf.placeholder(tf.float32, shape=(50, 500))

# Forward pass
h = tf.maximum(tf.matmul(x, w1), 0)
y_pred = tf.matmul(h, w2)
diff = y_pred - y

# Use predefined common losses (arguments are labels, then predictions)
loss = tf.losses.mean_squared_error(y, y_pred)

# Compute gradient of the loss with respect to the weights
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Run the graph multiple times
with tf.Session() as sess:
  # Create numpy arrays
  values = {x: np.random.randn(32, 500),
            w1: np.random.randn(500, 50),
            w2: np.random.randn(50, 500),
            y: np.random.randn(32, 500),}
  
  # Train the network
  learning_rate = 1e-5
  # Run the graph in a loop
  for t in range(20):
    out = sess.run([loss, grad_w1, grad_w2], feed_dict=values)
    loss_val, grad_w1_val, grad_w2_val = out
    values[w1] -= learning_rate * grad_w1_val
    values[w2] -= learning_rate * grad_w2_val

There are 2 main components of this code: defining the computational graph and running this graph many times. While defining the graph, I create placeholders for the input x, the weights w1 and w2, and the targets y. Then during the forward pass, I compute the prediction y_pred for the target y and the loss, which is the mean squared error (L2 loss) between y and y_pred.

Lastly, I tell TensorFlow to compute the gradient of the loss with respect to w1 and w2. Once the graph is built, I enter a session to run it. Here I create numpy arrays that will fill in the placeholders above, feeding them for x, y, w1, and w2. In order to train the network, I run the graph over and over, getting numpy arrays for loss, grad_w1, and grad_w2 and using the gradients to update the weights.
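
If you are wondering what "the gradient of the loss with respect to w1 and w2" actually contains, the sketch below may help. It is plain NumPy, not TensorFlow: I write the chain rule out by hand for the same kind of network and compare it against finite differences. The smaller sizes and the helper names (`loss_fn`, `analytic_grads`, `numeric_grad`) are my own assumptions for the sake of a quick check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Smaller sizes than the article's example so the check runs quickly.
x = rng.standard_normal((4, 6))
y = rng.standard_normal((4, 6))
w1 = rng.standard_normal((6, 3))
w2 = rng.standard_normal((3, 6))

def loss_fn(w1, w2):
    """Mean squared error of a 2-layer ReLU network."""
    h = np.maximum(x @ w1, 0)
    y_pred = h @ w2
    return np.mean((y_pred - y) ** 2)

def analytic_grads(w1, w2):
    """Gradients of loss_fn with respect to w1 and w2, via the chain rule."""
    a = x @ w1
    h = np.maximum(a, 0)
    y_pred = h @ w2
    grad_y_pred = 2.0 * (y_pred - y) / y.size
    grad_w2 = h.T @ grad_y_pred
    grad_h = grad_y_pred @ w2.T
    grad_a = grad_h * (a > 0)
    grad_w1 = x.T @ grad_a
    return grad_w1, grad_w2

def numeric_grad(f, w, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        old = w[idx]
        w[idx] = old + eps
        up = f()
        w[idx] = old - eps
        down = f()
        w[idx] = old
        grad[idx] = (up - down) / (2 * eps)
    return grad

grad_w1, grad_w2 = analytic_grads(w1, w2)
num_w1 = numeric_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(w1, w2), w2)

max_err = max(np.abs(grad_w1 - num_w1).max(), np.abs(grad_w2 - num_w2).max())
print("max abs difference:", max_err)
```

A framework derives these gradient expressions from the graph automatically, so you never have to write `analytic_grads` yourself.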

Keras: The High-Level Wrapper

Deep learning frameworks operate at 2 levels of abstraction: low level, where mathematical operations and neural network primitives are implemented (TensorFlow, Theano, PyTorch etc.), and high level, where low-level primitives are used to implement neural network abstractions such as models and layers (Keras).

Keras is a wrapper over its backend libraries, which can be TensorFlow or Theano. This means that if you’re using Keras with the TensorFlow backend, you’re running TensorFlow code. Keras takes care of a lot of the nitty-gritty details for you, as it’s geared towards neural network technology consumers and is well suited for those practicing data science. It allows for easy and fast prototyping, supports multiple neural network architectures, and runs seamlessly on CPU/GPU.

import keras
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD

# Batch size = 32, Input Dimension = 500, Hidden Dimension = 50

# Create the model (Keras 2 style: first argument is the number of output units)
model = Sequential()
model.add(Dense(50, input_dim=500))
model.add(Activation('relu'))
model.add(Dense(500))

# Define optimizer object
optimizer = SGD(lr=1e0)

# Compile the model
model.compile(loss='mean_squared_error', optimizer=optimizer)

# Randomize data
x = np.random.randn(32, 500)
y = np.random.randn(32, 500)

# Fit the model
model.fit(x, y, epochs=50, batch_size=32, verbose=0)

This example trains a network similar to the one above. I first define the model object as a sequence of layers, then define the optimizer object. Next, I compile the model with a loss function and train it with a single ‘fit’ line.
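
To show what a high-level wrapper is doing on your behalf, here is a toy re-creation in NumPy. The `Dense`, `ReLU`, and `Sequential` classes below are hypothetical stand-ins that I wrote for illustration; they borrow the Keras names but are not how Keras is implemented, and the learning rate is an arbitrary choice.

```python
import numpy as np

class Dense:
    """Fully connected layer: out = x @ W + b."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_in, n_out)) * np.sqrt(1.0 / n_in)
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x
        return x @ self.W + self.b

    def backward(self, grad_out, lr):
        grad_in = grad_out @ self.W.T
        # Plain SGD update on this layer's parameters.
        self.W -= lr * (self.x.T @ grad_out)
        self.b -= lr * grad_out.sum(axis=0)
        return grad_in

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad_out, lr):
        return grad_out * self.mask

class Sequential:
    """Chains layers and hides the training loop behind fit()."""
    def __init__(self, layers):
        self.layers = layers

    def predict(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def fit(self, x, y, epochs, lr):
        history = []
        for _ in range(epochs):
            y_pred = self.predict(x)
            history.append(np.mean((y_pred - y) ** 2))
            grad = 2.0 * (y_pred - y) / y.size
            for layer in reversed(self.layers):
                grad = layer.backward(grad, lr)
        return history

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 500))
y = rng.standard_normal((32, 500))

model = Sequential([Dense(500, 50, rng), ReLU(), Dense(50, 500, rng)])
history = model.fit(x, y, epochs=50, lr=1e-1)
print(history[0], history[-1])
```

The caller only sees a list of layers and a `fit` call; the forward pass, backward pass, and weight updates all live inside the wrapper.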

2 — Theano (University of Montreal)

Theano is another Python library for fast numerical computation that can be run on the CPU or GPU. It is an open-source project developed by the Montreal Institute for Learning Algorithms group at the University of Montreal. Some of its most prominent features include transparent use of GPU, tight integration with NumPy, efficient symbolic differentiation, speed / stability optimizations, and extensive unit testing.

Unfortunately, Yoshua Bengio (head of the MILA lab) announced in September 2017 that the lab would no longer actively maintain or develop Theano. The reason is that most of the innovations Theano introduced across the years have now been adopted and perfected by other frameworks. If interested, you can still contribute to its open-source library.

Theano is similar in many ways to TensorFlow. So let’s take a look at another code example training a neural network using the same batch size and input dimension, this time with a 5-class softmax output:

import numpy as np
import theano
import theano.tensor as T

# Batch size = 32, Input Dimension = 500, Hidden Dimension = 50, Number of Classes = 5

# Define symbolic variables
x = T.matrix('x')
y = T.vector('y', dtype='int64')
w1 = T.matrix('w1')
w2 = T.matrix('w2')

# Forward pass: compute scores
a = x.dot(w1)
a_relu = T.nnet.relu(a)
scores = a_relu.dot(w2)

# Forward pass: compute softmax loss
probs = T.nnet.softmax(scores)
loss = T.nnet.categorical_crossentropy(probs, y).mean()

# Backward pass: compute gradients
dw1, dw2 = T.grad(loss, [w1, w2])

# Compile function
f = theano.function(
  inputs = [x, y, w1, w2],
  outputs = [loss, scores, dw1, dw2],
)

# Run the function
xx = np.random.rand(32, 500)
yy = np.random.randint(5, size=32)
ww1 = 1e-2 * np.random.randn(500, 50)
ww2 = 1e-2 * np.random.randn(50, 5)

learning_rate = 1e-1
for t in range(20):
  # Use new names so the symbolic `loss` and `scores` are not overwritten
  loss_val, scores_val, dww1, dww2 = f(xx, yy, ww1, ww2)
  print(loss_val)
  ww1 -= learning_rate * dww1
  ww2 -= learning_rate * dww2

I first define Theano symbolic variables (similar to TensorFlow placeholders). For the forward pass, I compute predictions and loss; for the backward pass, I compute gradients. Then I compile a function that computes loss, scores, and gradients from data and weights. Lastly, I run this function many times to train the network.
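
The softmax loss in the forward pass is worth unpacking. Below is a NumPy sketch of the math it stands for; this is my own illustration with hypothetical helper names (`softmax`, `cross_entropy`, `grad_scores`), not Theano's implementation.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the correct class."""
    n = probs.shape[0]
    return -np.log(probs[np.arange(n), labels]).mean()

def grad_scores(probs, labels):
    """Gradient of the mean cross-entropy with respect to the scores."""
    n = probs.shape[0]
    grad = probs.copy()
    grad[np.arange(n), labels] -= 1.0
    return grad / n

rng = np.random.default_rng(0)
scores = rng.standard_normal((32, 5))
labels = rng.integers(0, 5, size=32)

probs = softmax(scores)
loss = cross_entropy(probs, labels)
dscores = grad_scores(probs, labels)

# With all-zero scores every class gets probability 1/5, so the loss is log(5).
uniform_loss = cross_entropy(softmax(np.zeros((32, 5))), labels)
print(loss, uniform_loss)
```

The `log(5)` value is a handy sanity check: a freshly initialized 5-class network with small weights should start with a loss close to it.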

3 — PyTorch (Facebook)

PyTorch is a relatively new deep learning framework that is quickly becoming popular among researchers. The Facebook AI Research team developed it to address challenges in the adoption of its predecessor library, Torch. Due to the low popularity of the programming language Lua, Torch never experienced the growth that Google’s TensorFlow has.

Thus, PyTorch adopted the native Python imperative programming style, which is already familiar to many researchers, developers, and data scientists. It also supports dynamic computation graphs, a feature that makes it attractive to researchers and engineers working with time-series and natural language processing data.

The best adoption so far has come from Uber, which has built Pyro — a universal probabilistic programming language using PyTorch as its backend. PyTorch’s dynamic ability to perform differentiation and construct gradients is extremely valuable for random operations in a probabilistic model.

PyTorch’s 3 Levels of Abstraction:

Tensor: Imperative ndarray, but runs on GPU
Variable: Node in a computational graph; stores data and gradient
Module: A neural network layer; may store state or learnable weights

Here I am going to focus on the Tensor abstraction level. PyTorch Tensors are just like numpy arrays, but they can run on GPU. At this level there is no built-in notion of a computational graph, gradients, or deep learning. Here we fit a 2-layer net using PyTorch Tensors:

import torch

# Batch Size = 32, Input Dimension = 500, Hidden Dimension = 50, Output Dimension = 5

dtype = torch.FloatTensor

# Create random tensors for data and weights
x = torch.randn(32, 500).type(dtype)
y = torch.randn(32, 5).type(dtype)
w1 = torch.randn(500, 50).type(dtype)
w2 = torch.randn(50, 5).type(dtype)

learning_rate = 1e-6
for t in range(250):
  # Forward pass: Compute predictions and loss
  h = x.mm(w1)
  h_relu = h.clamp(min=0)
  y_pred = h_relu.mm(w2)
  loss = (y_pred - y).pow(2).sum()
  
  # Backward pass: Compute gradients
  grad_y_pred = 2.0 * (y_pred - y)
  grad_w2 = h_relu.t().mm(grad_y_pred)
  grad_h_relu = grad_y_pred.mm(w2.t())
  grad_h = grad_h_relu.clone()
  grad_h[h < 0] = 0
  grad_w1 = x.t().mm(grad_h)
  
  # Gradient descent step on weights
  w1 -= learning_rate * grad_w1
  w2 -= learning_rate * grad_w2

As you can see, I first create random tensors for data and weights. Then I compute predictions and loss during the forward pass, and compute gradients manually during the backward pass. I also add a gradient descent step on each weight. Finally, I train the network by running this loop many times.
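
The manual backward pass is exactly the bookkeeping that the Variable level takes off your hands. To give a feel for how, here is a tiny reverse-mode differentiation sketch in pure Python. The `Var` class is a hypothetical toy that I wrote for this post; it works on scalars only and is not PyTorch's API.

```python
class Var:
    """A scalar node in a computational graph: stores data and gradient."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        # Each parent is paired with d(self)/d(parent).
        self.parents = parents

    def __add__(self, other):
        return Var(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Var(self.data - other.data, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Var(self.data * other.data, ((self, other.data), (other, self.data)))

    def relu(self):
        return Var(max(self.data, 0.0), ((self, 1.0 if self.data > 0 else 0.0),))

    def backward(self):
        # Order the nodes so each one is visited after everything that uses it.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local_grad in node.parents:
                parent.grad += local_grad * node.grad

# One-input, one-hidden-unit version of the network above.
x, y = Var(2.0), Var(1.0)
w1, w2 = Var(0.5), Var(-3.0)

h_relu = (x * w1).relu()        # 1.0
y_pred = h_relu * w2            # -3.0
diff = y_pred - y               # -4.0
loss = diff * diff              # 16.0
loss.backward()

print(loss.data, w1.grad, w2.grad)
```

Each operation records its inputs and local derivatives as it runs, so a single `backward()` call can walk the recorded graph and fill in every gradient.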

4 — Torch (NYU / Facebook)

Let’s talk about Torch a bit. It is Facebook’s open-source machine learning library, scientific computing framework, and scripting language based on the Lua programming language. It provides a wide range of algorithms for deep learning, and has been adopted by Facebook, IBM, Yandex, and others to solve hardware problems for data flows.

As the direct ancestor of PyTorch, Torch shares a lot of its C backend. Unlike PyTorch, which has 3 levels of abstraction, Torch only has 2: Tensor and Module. Let’s explore a code example that trains a 2-layer net using Torch’s nn modules and the optim package:

require 'torch'
require 'nn'
require 'optim'

-- Build a model as a sequence of layers, and a loss function
local model = nn.Sequential()
model:add(nn.Linear(500, 50))
model:add(nn.ReLU())
model:add(nn.Linear(50, 5))
local loss_fn = nn.CrossEntropyCriterion()

local x = torch.randn(32, 500)
local y = torch.Tensor(32):random(5)
local weights, grad_weights = model:getParameters()

-- Define callback
local function f(w)
  assert(w == weights)
  -- Forward Pass
  local scores = model:forward(x)
  local loss = loss_fn:forward(scores, y)
  
  -- Backward Pass
  grad_weights:zero()
  local grad_scores = loss_fn:backward(scores,y)
  local grad_x = model:backward(x, grad_scores)
  
  return loss, grad_weights
end

-- Pass callback to optimizer
local state = {learningRate=1e-3}
for t = 1, 25 do
  optim.adam(f, weights, state)
end

Initially, I build a model as a sequence of layers, as well as a loss function. Next, I define a callback function that takes the weights as input and returns the loss and the gradient on the weights. Inside the function, I compute predictions and loss in the forward pass, as well as gradients in the backward pass. Lastly, I pass this callback function to the optimizer over and over again.
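
The callback pattern may look unusual if you are used to Keras-style `fit` calls, so here is the same idea sketched in Python. Note the assumptions: I swap in plain gradient descent instead of Adam to keep it short, the objective is a toy quadratic rather than a network, and the `sgd` function is a hypothetical helper of mine, not part of Torch's optim package.

```python
def sgd(closure, weights, state):
    """Take one gradient step; `closure` returns (loss, gradient) at `weights`."""
    loss, grad = closure(weights)
    for i in range(len(weights)):
        weights[i] -= state["learning_rate"] * grad[i]
    return loss

# Toy objective: squared distance from a fixed target vector.
target = [3.0, -1.0]
weights = [0.0, 0.0]

def f(w):
    loss = sum((wi - ti) ** 2 for wi, ti in zip(w, target))
    grad = [2.0 * (wi - ti) for wi, ti in zip(w, target)]
    return loss, grad

state = {"learning_rate": 0.1}
losses = [sgd(f, weights, state) for _ in range(25)]
print(losses[0], losses[-1], weights)
```

The optimizer knows nothing about the model. It only needs a function that reports the loss and gradient at the current weights, which is why the same optimizer code can drive any model.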

5 — Caffe (UC Berkeley)

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It was developed by the Berkeley AI Research group and the Berkeley Vision and Learning Center. While its core is written in C++, Caffe has Python and Matlab bindings. It’s very good for training or fine-tuning feedforward classification models. While it’s not used as much in research, it’s still popular for deploying models as evidenced by the community contributors.

In order to train and fine-tune neural networks using Caffe, you’ll need to go through 4 steps:

Convert the data: We read the data files, then clean and store them in a format that can be used by Caffe. We’ll write a Python script that handles both data pre-processing and storage.
Define the model: A model defines the structure of a neural network. We choose a CNN architecture and define its parameters in a configuration file with extension .prototxt.
Define the solver: The solver is responsible for model optimization by defining all information about how gradient descent will be conducted. We define the solver parameters in a configuration file with extension .prototxt.
Train the model: Once we have the model and solver ready, we train the model by calling the caffe binary from the terminal. After training the model, we’ll get the trained model in a file with extension .caffemodel.

I won’t do a code walk-through for Caffe, but you can check out a tutorial on Caffe’s main page. Overall, Caffe is really good for feedforward networks and for fine-tuning existing networks. You can easily train models without writing any code. Its Python interface is quite useful, as you can deploy the model without Python code. On the negative side, you need to write core C++ code (under Caffe) for every new GPU layer. Thus, it’s very cumbersome for big networks (AlexNet, VGG, GoogLeNet, ResNet etc.).

Which Deep Learning Frameworks Should You Use?

With Theano no longer being developed, Torch written in the unfamiliar language Lua, and Caffe being in its precocious state, TensorFlow and PyTorch emerge as the preferred frameworks of most deep learning practitioners. While both frameworks use Python, there are a couple of differences between them:

PyTorch has a cleaner interface and is easier to use, especially for beginners. Writing code (for the most part) feels intuitive, instead of fighting against the library. TensorFlow, on the other hand, is much more cumbersome with so many small, obscure libraries.
However, TensorFlow comes with much more support and a very large, vibrant, and helpful community. This means that there are more online courses, code tutorials, docs, and blog posts for TensorFlow than for PyTorch.
That said, PyTorch, as a newer platform, still has many interesting features that haven’t been implemented yet. It’s amazing how much PyTorch has achieved in little more than a year.
TensorFlow is more scalable and is very compatible with distributed execution. It supports everything from single GPUs to massive systems which involve heavy distributed reinforcement learning with real-time trials and errors.

Most importantly, TensorFlow is “Define-and-Run”: you define conditions and iterations in the graph structure, then run it. PyTorch, on the other hand, is “Define-by-Run”: the graph structure is defined on the fly during forward computation. In other words, TensorFlow uses a static computational graph, while PyTorch uses a dynamic computational graph.

The dynamic graph-based approach gives easier debuggability and more processing power for complex architectures, such as dynamic neural networks. The static graph-based approach gives easier deployment to mobile, easier deployment to more exotic architectures, and the ability to apply compiler techniques ahead of time.
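
Here is a small pure-Python sketch of what "defined on the fly" means in practice. The `Tracer` class and `forward` function are hypothetical names I made up; they only log operations and are not how any framework records its graph.

```python
class Tracer:
    """Records each operation as it executes, building the graph on the fly."""
    def __init__(self):
        self.ops = []

    def mul(self, a, b):
        self.ops.append("mul")
        return a * b

    def add(self, a, b):
        self.ops.append("add")
        return a + b

def forward(tracer, sequence, w):
    """Accumulate a running state over a variable-length input."""
    state = 0.0
    for x in sequence:                 # ordinary Python loop
        state = tracer.add(tracer.mul(state, w), x)
        if state > 10.0:               # ordinary Python branch
            break
    return state

short, long_ = Tracer(), Tracer()
out_short = forward(short, [1.0, 2.0], w=0.5)
out_long = forward(long_, [1.0, 2.0, 3.0, 4.0], w=0.5)

print(len(short.ops), len(long_.ops))
```

The two calls record graphs of different sizes because the loop length depends on the input. In a static-graph framework, the loop and the branch would have to be expressed as graph operations up front instead of as normal Python control flow.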

For that reason, PyTorch is better for rapid prototyping for hobbyists and small-scale projects, while TensorFlow is better for large-scale deployments, especially when cross-platform and embedded deployment are considerations.

TensorFlow has stood the test of time and is still more widely used. It has more capabilities and better scalability for larger projects. PyTorch is gaining momentum as it’s easier to learn, but it doesn’t have the same integration. It’s very good for small projects that need to be done quickly, but is not optimal for product deployment.

Final Takeaways

The list above is just the cream of the crop, mostly supporting Python. There have been multiple new deep learning frameworks released in the last few years, such as DeepLearning4j (Java), Apache’s MXNet (R, Python, Julia), Microsoft CNTK (C++, Python), and Intel’s Neon (Python).

Every framework is different, as they’re developed by different people and for different purposes. Having a broad overview of this landscape will certainly give you a good sense of what’s out there for you to solve your next deep learning challenge.

Ease of use (in terms of architecture and speed), GPU support, availability of tutorials and training materials, neural network modeling capability, and languages supported are all important considerations when choosing which one is best for you.

You can find more of my writing and projects at https://jameskle.com/.