Computer Vision and Deep Learning: From Image to Video Analysis

Computer vision, at its core, is about understanding images. The field has seen rapid growth over the last few years, especially due to deep learning and the ability to detect obstacles, segment images, or extract relevant context from a given scene.

Using computer vision, we can build autonomous cars, smart buildings, fashion recommender systems, augmented reality tools…the possibilities are endless.

One area in particular is starting to garner more attention: video. Most applications of computer vision today center on single images, and fewer focus on sequences of images (i.e., video frames).

Video allows for deeper situational understanding, because sequences of images provide new information about action. For example, we can track an obstacle through a sequence of images and understand its behavior to predict the next move. We can track a human pose, and understand the action taken with action classification.

When analyzing videos, we create new use cases and move from “this image contains 3 people” to “this image contains 3 people playing X”.

Video Analysis Algorithms

Obstacle tracking & video analysis — An active area of research

Whether for surveillance camera systems or football analysis, the next generation of computer vision algorithms will include time.

The task of video surveillance involves two kinds of algorithms:

Object tracking
Action classification

1. Object Tracking

A video is a set of frames. When studying a video, we can either study a video stream (live image feed) or a video sequence (fixed-length video).

In a video stream, we consider the current image and the previous ones.
In a video sequence, we have access to the full video, from the first image to the last.
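
The difference between the two settings can be sketched in a few lines. This is only an illustration: the function names and the `history` and `radius` parameters are made up for this example, and the frames are placeholder strings instead of real images.

```python
from collections import deque

def stream_windows(frames, history=2):
    """Yield (current, previous_frames) the way a live stream would see them."""
    past = deque(maxlen=history)
    for frame in frames:
        yield frame, list(past)
        past.append(frame)

def sequence_windows(frames, radius=1):
    """Yield (current, neighbours) with access to both past and future frames."""
    for i, frame in enumerate(frames):
        lo = max(0, i - radius)
        hi = min(len(frames), i + radius + 1)
        yield frame, frames[lo:i] + frames[i + 1:hi]

frames = ["f0", "f1", "f2", "f3"]
stream_view = list(stream_windows(frames))      # only looks backwards
sequence_view = list(sequence_windows(frames))  # can also look ahead
```

In the stream case, frame "f1" only knows about "f0". In the sequence case, it also sees "f2".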

Videos take up a lot of storage space and usually arrive without any AI-generated annotations. This means that, with video, we simply have raw image data to work with.

But there is a key difference: motion. Motion is the only difference between an image and a video. It’s a powerful thing to track and can lead to action understanding, pose estimation, or movement tracking.

Optical Flow

Optical flow describes the movement between frame 1 and frame 2 as a field of motion vectors, typically one per pixel. It looks like this:

A lot of existing datasets address the optical flow problem, such as KITTI Vision Benchmark Suite or MPI Sintel. They both contain ground truth optical flow data, which is generally hard to get from more popular datasets.

To solve the optical flow problem, convolutional neural networks can help.
FlowNet is an example of a CNN designed for optical flow tasks, and it can output the optical flow from two frames.

The input of the network is a set of two RGB images; thus it has a depth of 6.
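
To see where the depth of 6 comes from, here is a minimal sketch that stacks two RGB frames along the channel axis. It uses plain nested lists so it runs anywhere. The `stack_frames` helper and the pixel values are invented for illustration and are not part of FlowNet itself.

```python
def stack_frames(frame_a, frame_b):
    """Concatenate two H x W x 3 images along the channel axis -> H x W x 6."""
    if len(frame_a) != len(frame_b) or len(frame_a[0]) != len(frame_b[0]):
        raise ValueError("both frames must have the same height and width")
    return [
        [pixel_a + pixel_b for pixel_a, pixel_b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]

# Two tiny 2 x 2 RGB frames (values are made up for illustration).
frame_1 = [[[255, 0, 0], [0, 255, 0]], [[0, 0, 255], [10, 10, 10]]]
frame_2 = [[[250, 5, 0], [0, 250, 5]], [[5, 0, 250], [12, 12, 12]]]

stacked = stack_frames(frame_1, frame_2)
depth = len(stacked[0][0])  # 3 channels from each frame
```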
Optical flow is often represented by colors.
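
One common way to build such a color image is to map the direction of each flow vector to a hue and its length to a saturation. The sketch below assumes that convention. The `flow_to_rgb` name, the `max_magnitude` cap, and the choice of which direction gets which hue are all assumptions, and other tools may use a different color wheel.

```python
import colorsys
import math

def flow_to_rgb(dx, dy, max_magnitude=10.0):
    """Map one flow vector to an RGB colour: hue = direction, saturation = speed."""
    angle = math.atan2(dy, dx)                       # range [-pi, pi]
    hue = (angle + math.pi) / (2 * math.pi) % 1.0    # range [0, 1)
    saturation = min(math.hypot(dx, dy) / max_magnitude, 1.0)
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, 1.0)
    return round(r * 255), round(g * 255), round(b * 255)

still = flow_to_rgb(0, 0)          # no motion -> white
moving_right = flow_to_rgb(10, 0)  # fast motion -> fully saturated colour
moving_left = flow_to_rgb(-10, 0)  # opposite direction -> opposite hue
```

With this convention, static regions come out white and fast-moving regions come out as vivid colors that depend on their direction.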

Visual Object Tracking (VOT)

First, we can simply track a single object. Visual Object Tracking (VOT) is the science of tracking an object given its position in frame 1. We are not using a detection algorithm here, so we’re model-free. In other words, we don’t know what we are tracking. We are simply given a starter bounding box and are asked to keep track of this object all along.

Tracking is performed by computing the similarity between frames 0 and 1. We check what’s in the bounding box and try to retrieve it in the next frame.
We can then move the bounding box a bit and track our obstacle.
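
As a rough sketch of this idea, the code below uses the content of the bounding box as a template and searches a small neighborhood in the next frame for the best match. The similarity measure (sum of squared differences), the `radius` parameter, and the synthetic grayscale frames are assumptions chosen to keep the example small. Real trackers use richer features.

```python
def crop(frame, x, y, w, h):
    return [row[x:x + w] for row in frame[y:y + h]]

def ssd(patch_a, patch_b):
    """Sum of squared differences: 0 means identical patches."""
    return sum(
        (a - b) ** 2
        for row_a, row_b in zip(patch_a, patch_b)
        for a, b in zip(row_a, row_b)
    )

def track_step(prev_frame, next_frame, box, radius=2):
    """Find the box position in next_frame that best matches the old content."""
    x, y, w, h = box
    template = crop(prev_frame, x, y, w, h)
    height, width = len(next_frame), len(next_frame[0])
    best_score, best_box = None, box
    for ny in range(max(0, y - radius), min(height - h, y + radius) + 1):
        for nx in range(max(0, x - radius), min(width - w, x + radius) + 1):
            score = ssd(template, crop(next_frame, nx, ny, w, h))
            if best_score is None or score < best_score:
                best_score, best_box = score, (nx, ny, w, h)
    return best_box

def make_frame(obj_x, obj_y, size=8):
    """Synthetic frame: dark background with a bright 2 x 2 object."""
    frame = [[0] * size for _ in range(size)]
    for dy in range(2):
        for dx in range(2):
            frame[obj_y + dy][obj_x + dx] = 200 + 10 * dy + dx
    return frame

frame_0 = make_frame(2, 3)
frame_1 = make_frame(3, 4)  # object moved one pixel right and one pixel down
new_box = track_step(frame_0, frame_1, (2, 3, 2, 2))
```

Here the bounding box follows the object to its new position, (3, 4), without ever knowing what the object is.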

Other features, such as color, can also be used to track the objects. Here, we compute the color of the given object, then find the part of the background whose color is closest to the object’s. We then remove that background color from the original image so that the object stands out and can be tracked.

This is a very powerful technique, and it only uses classical computer vision. We don’t need a single neural network to do this. To summarize this process:

We receive the initial object to track using a bounding box
We compute a color histogram of this object
We compute the color of the background (near the object)
We remove the object color from the total image
We now have a color-based obstacle tracker
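
Here is one possible reading of these steps as code, not a reference implementation. It assumes each pixel has already been quantized to a color bin (a small integer), and it down-weights colors that also appear in a ring around the object instead of literally deleting them. All function names and the `margin` parameter are hypothetical.

```python
from collections import Counter

def histogram(frame, x0, y0, x1, y1):
    """Count colour bins inside the rectangle [x0, x1) x [y0, y1)."""
    return Counter(v for row in frame[max(0, y0):y1] for v in row[max(0, x0):x1])

def colour_weights(frame, box, margin=2):
    """Score each object colour by how specific it is to the object."""
    x, y, w, h = box
    obj = histogram(frame, x, y, x + w, y + h)
    outer = histogram(frame, x - margin, y - margin, x + w + margin, y + h + margin)
    ring = outer - obj  # colours of the background near the object
    return {c: n / (n + ring[c]) for c, n in obj.items()}

def locate(frame, weights, w, h):
    """Slide a w x h window and return the box with the highest summed weight."""
    best_score, best_box = -1.0, None
    for y in range(len(frame) - h + 1):
        for x in range(len(frame[0]) - w + 1):
            score = sum(weights.get(v, 0.0)
                        for row in frame[y:y + h] for v in row[x:x + w])
            if score > best_score:
                best_score, best_box = score, (x, y, w, h)
    return best_box

def make_frame(obj_x, obj_y, size=6):
    """Synthetic frame: background bin 0 with a 2 x 2 object in colour bin 5."""
    frame = [[0] * size for _ in range(size)]
    for dy in range(2):
        for dx in range(2):
            frame[obj_y + dy][obj_x + dx] = 5
    return frame

first = make_frame(1, 1)
weights = colour_weights(first, (1, 1, 2, 2))  # learned from the starter box
second = make_frame(3, 2)
found = locate(second, weights, 2, 2)
```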

The next step is to apply CNNs for this task

We must distinguish two main models here: MDNet and GOTURN.

An MDNet (Multi-Domain Net) tracker trains a neural network to distinguish between an object and the background.

The architecture looks like a VGG model—in the end, we have domain-specific layers (object vs background classifier).

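GOTURN (Generic Object Tracking Using Regression Networks) works by feeding two crops into a neural network: the object from the previous frame and the region to search in the current frame. The network then regresses the new bounding box. It can work at over 100 FPS, which is amazing for the task of video tracking.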

Multiple Object Tracking (MOT)

The last family of trackers is referred to as multiple object tracking. Here’s a look at MOT in practice:

Unlike the other family of trackers (VOT), MOT is more long-term.

We distinguish two variants:

Detection-Based Tracking
Detection-Free Tracking

Let’s consider Detection-Based Tracking. We have two tasks here:

Object detection
Object association

Object association means that we have to associate detections from time t to detections from time t+1. It relies heavily on the quality of the detector.
A bad detector will render the tracker not functional.

A good tracker should handle a few frames with no detections.
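
A simple way to get this behavior is to let a track "coast" for a few frames before deleting it. The sketch below assumes a hypothetical `Track` class and a `max_misses` threshold. Both are illustrative choices, not part of any specific tracker.

```python
class Track:
    def __init__(self, track_id, box):
        self.track_id = track_id
        self.box = box
        self.misses = 0  # consecutive frames without a matched detection

def update_tracks(tracks, matches, max_misses=3):
    """matches maps track_id -> new box; unmatched tracks coast, then expire."""
    survivors = []
    for track in tracks:
        if track.track_id in matches:
            track.box = matches[track.track_id]
            track.misses = 0
        else:
            track.misses += 1
        if track.misses <= max_misses:
            survivors.append(track)
    return survivors

tracks = [Track(1, (0, 0, 10, 10))]
# Two frames with no detection, then the object is detected again.
for frame_matches in [{}, {}, {1: (2, 0, 12, 10)}]:
    tracks = update_tracks(tracks, frame_matches)
ids_after_gap = [t.track_id for t in tracks]

# Four frames in a row with nothing: the track is finally dropped.
for _ in range(4):
    tracks = update_tracks(tracks, {})
ids_after_loss = [t.track_id for t in tracks]
```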

We can also distinguish between online and offline tracking. Online tracking means that we are on a live feed. Offline tracking is working on a full video, and we have future frames available.

For online tracking, we’re tracking the bounding boxes detected by the CNN. We can use:

A CNN for the detection
A Kalman Filter to predict the position at time t from the position at time t-1
The Hungarian Algorithm to associate detections across frames
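
To make the Kalman Filter step more concrete, here is a minimal constant-velocity filter for a single coordinate, for example the x position of a bounding box center. This is a sketch under simplifying assumptions: the class name, the noise values `process_var` and `measurement_var`, and the initial covariance are picked for illustration only. A real tracker would typically filter the full box state.

```python
class KalmanFilter1D:
    """Constant-velocity Kalman filter for one coordinate (e.g. a box centre x)."""

    def __init__(self, position, process_var=1e-2, measurement_var=1.0):
        self.p, self.v = float(position), 0.0
        self.P = [[1.0, 0.0], [0.0, 10.0]]  # very unsure about the initial velocity
        self.q, self.r = process_var, measurement_var

    def predict(self, dt=1.0):
        """Move the state forward: position at t from the state at t-1."""
        (a, b), (c, d) = self.P
        self.p += dt * self.v
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.p

    def update(self, z):
        """Correct the prediction with a measured position z."""
        (a, b), (c, d) = self.P
        k0, k1 = a / (a + self.r), c / (a + self.r)  # Kalman gain
        residual = z - self.p
        self.p += k0 * residual
        self.v += k1 * residual
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]

kf = KalmanFilter1D(position=0.0)
for z in range(2, 22, 2):  # the object moves 2 pixels per frame: 2, 4, ..., 20
    kf.predict()
    kf.update(z)
next_position = kf.predict()  # where we expect the object in the next frame
```

After a few frames, the filter has learned the velocity and its prediction lands close to the true next position of 22.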

The matching metric for the Hungarian algorithm can be IOU (Intersection Over Union) or deep convolutional features. Using deep convolutional features allows for re-identification after occlusion but slows down the tracker.
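
The sketch below shows the IOU metric and a tiny association step. Note the swap: instead of the Hungarian algorithm, it brute-forces every possible assignment, which gives the same optimal matching but only scales to a handful of boxes. In practice, a solver such as SciPy's `linear_sum_assignment` is a common choice. The `associate` function and the `min_iou` threshold are hypothetical names for this example.

```python
from itertools import permutations

def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2). Returns intersection area / union area."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, min_iou=0.3):
    """Return (track_index, detection_index) pairs maximising total IOU.

    Brute force over permutations, as a stand-in for the Hungarian algorithm.
    Assumes len(tracks) <= len(detections) to keep the sketch short.
    """
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(len(detections)), len(tracks)):
        pairs = list(enumerate(perm))
        total = sum(iou(tracks[t], detections[d]) for t, d in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    # Discard pairs that overlap too little to be the same object.
    return [(t, d) for t, d in best_pairs if iou(tracks[t], detections[d]) >= min_iou]

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 20, 31, 30), (1, 0, 11, 10), (50, 50, 60, 60)]
pairs = associate(tracks, detections)
```

The third detection overlaps no existing track, so it stays unmatched and could start a new track.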

Multi-object tracking extends obstacle detection to new applications such as game analysis or behavioral prediction.

For more on this, check out my article Computer Vision for Tracking!

2. Action Classification

Action classification is the second family of tasks involved in building computer vision-based surveillance systems. Once we know how many people we have in the store, and once we know where they’ve been, we must analyze their actions.

Action classification depends directly on object detection and tracking—this is because we first need to understand a given situation or scene. Once we have that understanding, we can classify the actions inside the bounding box.

First, we must choose the camera that sees them with the best angle. Some angles might be better than others. If we choose the correct camera every time—for example, the camera that shows a face—then we can be sure we have a workable image.

Actions can be really simple, like walking, running, clapping, or waving. They can also be more complex, like making a sandwich, which involves a series of multiple actions (cutting bread, washing tomatoes, etc.).

Datasets

Labeling is much easier for classification than for tracking—we can simply assign a label to a set of images.

The KTH Actions Dataset is good for gathering videos and associated labels. The UCF Sport Actions is a dataset that’s sports-oriented, but it includes useful samples.

More recently, datasets like Hollywood2 for movie scenes, HMDB, or UCF 101 have been released.

Optical flow

Since optical flow gives a motion vector field between two frames, it can be used as an input for a classification neural network.

Action Classification with Machine Learning (End-To-End)

The more recent and modern solution would be to use CNNs.

Action happens in a video, not an image. This means that we must send multiple frames to the CNN, which must then perform a classification task on what’s called a space-time volume.
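
A space-time volume is simply several consecutive frames stacked along a time axis. The sketch below cuts a video into overlapping clips and reports the shape of one clip. The `clip_length` and `stride` values, and the dummy frames, are assumptions made for illustration.

```python
def make_clips(frames, clip_length=4, stride=2):
    """Group consecutive frames into overlapping clips (space-time volumes)."""
    return [
        frames[start:start + clip_length]
        for start in range(0, len(frames) - clip_length + 1, stride)
    ]

def volume_shape(clip):
    """Shape as (time, height, width, channels) for a clip of H x W x C frames."""
    return (len(clip), len(clip[0]), len(clip[0][0]), len(clip[0][0][0]))

# Ten dummy 4 x 6 RGB frames; frame t is filled with the value t.
video = [[[[t, t, t] for _ in range(6)] for _ in range(4)] for t in range(10)]
clips = make_clips(video)
shape = volume_shape(clips[0])
```

Each clip is what the CNN receives as one input, instead of a single image.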

With an image, it’s hard enough to do object detection or classification due to the image size, its rotation, etc. In a video, it’s even more difficult.
Here’s an example of a two-stream model working to classify actions from image streams.

This network takes two inputs and outputs an action. The spatial stream works on a single image; it is paired with the temporal stream, which works on optical flow. A linear classifier is then applied to the combined outputs.
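
To illustrate how two streams can be combined, here is a minimal late-fusion sketch. It swaps the linear classifier for a simpler weighted average of the two streams' class probabilities, and the scores are made-up numbers standing in for real network outputs. The `fuse` function and `temporal_weight` parameter are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(spatial_scores, temporal_scores, temporal_weight=0.5):
    """Late fusion: weighted average of the two streams' class probabilities."""
    spatial = softmax(spatial_scores)
    temporal = softmax(temporal_scores)
    return [
        (1 - temporal_weight) * s + temporal_weight * t
        for s, t in zip(spatial, temporal)
    ]

ACTIONS = ["walking", "running", "waving"]
spatial_scores = [2.0, 1.9, 0.1]   # appearance alone: walking vs running is unclear
temporal_scores = [0.5, 3.0, 0.2]  # motion clearly suggests running
fused = fuse(spatial_scores, temporal_scores)
prediction = ACTIONS[fused.index(max(fused))]
```

The single image cannot separate walking from running, but the motion stream tips the decision.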

Many action classification networks already exist, yet it remains a hard problem to solve.

Pose Estimation

Finally, know that pose estimation is another deep learning technique used as a means of action classification.

The process of pose estimation includes:

Detecting keypoints (similar to facial landmarks)
Tracking these keypoints
Classifying the keypoints’ movement
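
The last step can be sketched with a toy, rule-based classifier. It assumes the first two steps already produced a track of named keypoints per frame. The keypoint names, the thresholds, and the three action labels are all invented for this example. A real system would learn this mapping from data.

```python
def direction_changes(values, min_step=1.0):
    """Count sign flips in the frame-to-frame movement of one coordinate."""
    steps = [b - a for a, b in zip(values, values[1:]) if abs(b - a) >= min_step]
    return sum(1 for a, b in zip(steps, steps[1:]) if a * b < 0)

def classify_motion(keypoint_track):
    """keypoint_track: list of per-frame dicts mapping keypoint name -> (x, y)."""
    wrist_x = [frame["wrist"][0] for frame in keypoint_track]
    hip_x = [frame["hip"][0] for frame in keypoint_track]
    body_shift = abs(hip_x[-1] - hip_x[0])
    if direction_changes(wrist_x) >= 2 and body_shift < 5:
        return "waving"    # hand oscillates while the body stays in place
    if body_shift >= 5:
        return "walking"   # the whole body moves across the image
    return "standing"

waving = [{"wrist": (10 + 4 * (t % 2), 50), "hip": (0, 100)} for t in range(6)]
walking = [{"wrist": (2 + 3 * t, 50), "hip": (3 * t, 100)} for t in range(5)]
standing = [{"wrist": (10, 50), "hip": (0, 100)} for _ in range(4)]

labels = [classify_motion(track) for track in (waving, walking, standing)]
```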

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.
