Hello readers! Today I'll be discussing some cool stuff. So put on your dark goggles and follow along.
Of the modern world's technological advancements, I'd argue the most elegant are the strides we've made in giving computers the power of human-like perception.
Yes, this brilliant idea of training a computer so that it learns, behaves, and acts like a human always seemed distant. But now, through advances in neural networks and computational power, the dream is real.
Computers gain visual intelligence through something called CNNs. These Convolutional Neural Networks are elegant computational mechanisms that run on images or video to extract information from them. The extracted information enables machine learning tasks like image classification and object localization (i.e., object detection).
Object detection is used to locate object(s) in an image or a video frame by drawing a bounding box around each one. We can think of people tracking as a form of object detection in which the target objects are people! Before we jump in, let me give a brief overview of how people tracking works.
- Bounding boxes around each person are detected in the first frame of the video. Each box's image region is used to generate a 128-dimensional vector. You can think of this step as encoding the information of that part of the image into a 128-dimensional vector.
- Vectors are generated for all the people in the frame, along with their bounding box coordinates. These vectors are stored, and the same vector-generation step is run on the next frame of the video.
- All the vectors are then compared; similar vectors across frames are matched, and the bounding boxes are labeled accordingly.
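The three steps above can be sketched as a simple frame-by-frame loop. This is my own simplified illustration, not the paper's actual code: `detect_and_embed` is a hypothetical stand-in for the real network, and the matching here is greedy nearest-embedding rather than the proper assignment the paper uses.

```python
import numpy as np

def track(frames, detect_and_embed, sim_threshold=0.8):
    """Assign a persistent ID to each detection by matching its embedding
    against embeddings from the previous frame (cosine similarity)."""
    next_id = 0
    prev = []      # (track_id, embedding) pairs from the previous frame
    results = []   # per frame: list of (track_id, box)
    for frame in frames:
        boxes, embs = detect_and_embed(frame)  # hypothetical detector + embedder
        frame_out, current = [], []
        for box, emb in zip(boxes, embs):
            emb = emb / np.linalg.norm(emb)    # unit-normalize for cosine similarity
            best_id, best_sim = None, sim_threshold
            for tid, pemb in prev:
                sim = float(emb @ pemb)
                if sim > best_sim:
                    best_id, best_sim = tid, sim
            if best_id is None:                # no good match: start a new track
                best_id = next_id
                next_id += 1
            frame_out.append((best_id, box))
            current.append((best_id, emb))
        prev = current
        results.append(frame_out)
    return results
```

Greedy matching like this can assign the same ID twice in crowded scenes; the real pipeline resolves conflicts with a global assignment step, which we'll get to later.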
Now, let’s dive a little deeper.
The paper that I’m going to be digging into here is: A Simple Baseline for Multi-Object Tracking.
This brilliant paper has changed the way we tackle the object tracking problem. Earlier approaches used a two-stage detector (TSD) to solve similar problems, but this paper adapts a single-shot detection (SSD) technique, which increases both inference speed and accuracy.
Some of you might be scratching your heads over the distinction between TSD and SSD approaches. I could explain it in detail, but that's not the point of this blog. For now, here's a quick summary: a two-stage detector first proposes candidate regions and then classifies and refines each one, whereas a single-shot detector predicts all boxes and classes in a single pass over the image.
So that was a bird's-eye-view definition of those heavy terms. You can check out this helpful discussion to learn a bit more about them.
The frame we run inference on is passed through the SSD network. This network produces outputs from different heads; some we use for our task of finding a bounding box around each person, and the other output is used to identify that person, in the form of a 128-dimensional vector.
Again, there are some terms you might not be familiar with, so I will give a brief overview of them.
Traditionally, bounding boxes are found using a concept called anchor boxes. Anchor boxes are a topic for a whole blog of their own, which you can find on the internet pretty easily. I'll just try to quickly provide a layman's understanding of them.
Anchor boxes are sets of predefined numbers (four per box), similar to bounding box coordinates, that we re-scale or shift so that we can get closer to the actual bounding box in an image.
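To make "re-scale or shift" concrete, here's a sketch of how anchor-based detectors typically decode a predicted box from an anchor. The function name and the exact parameterization are illustrative (this is the common Faster R-CNN-style scheme, not code from our paper, which avoids anchors altogether):

```python
import numpy as np

def decode_anchor(anchor, deltas):
    """Shift and re-scale a predefined anchor (cx, cy, w, h) using
    predicted offsets (dx, dy, dw, dh) to produce a refined box."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = deltas
    # shift the center proportionally to the anchor's size
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    # re-scale width/height exponentially so they stay positive
    new_w = w * np.exp(dw)
    new_h = h * np.exp(dh)
    return (new_cx, new_cy, new_w, new_h)
```

With zero deltas the anchor is returned unchanged; the network learns deltas that nudge each anchor toward the nearest ground-truth box.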
Anchor boxes look something like the image above when zoomed in, but they exist in huge numbers and cover almost the entire image.
If you want to know more about anchor boxes, check out this article, where they're explained beautifully.
The authors of our primary paper are not using this concept, however, as they see a flaw in this kind of proposal for object tracking purposes—again, the technical details aren't the focus of this blog post. For those, maybe I'll write another post for the seekers.
So if they aren't using the technique that every state-of-the-art object detection model uses… then what are they using?
Next, I’ll explain their solution and implementation.
There are lots of details involved in the steps they use to get the task done. I will explain them as simply as I can.
The architecture is divided into two branches:
- Object Detection Branch
- Identity Embedding Branch
Object Detection Branch:
As the name suggests, this part of the model is responsible for detecting the objects in the image. This branch spits out three sets of numbers, which are combined to detect the objects present in the image.
These three sets of numbers represent:
1. Heatmap information: This head of the branch is responsible for generating numbers that represent a heatmap of object centers. The target heatmap is generated by the formula:
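(The formula appeared as an image in the original post. For reference, it is the CenterNet-style Gaussian heatmap, which to my understanding is what this paper uses:)

```latex
\hat{M}_{xy} \;=\; \sum_{i=1}^{N} \exp\!\left(-\frac{(x-\tilde{c}^{\,i}_x)^2 + (y-\tilde{c}^{\,i}_y)^2}{2\sigma_c^2}\right)
```

Here N is the number of objects in the image, (c̃ᵢₓ, c̃ᵢᵧ) is the down-sampled center of the i-th ground-truth box, and σ_c is a standard deviation that adapts to the box size.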
This formula might be tricky to understand, but believe me, we don't need to understand the underlying math completely. For simplicity, assume we create a target heatmap based on the centers of all the ground-truth boxes: the center of each ground-truth box has a high value that decays exponentially as we move away. It will look similar to the image below. Not exactly the same, but enough to give you an idea.
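The decaying-center heatmap described above can be sketched in a few lines of NumPy. This is purely illustrative: I fix sigma here, whereas the paper adapts it to each box's size, and I take a per-pixel maximum rather than a sum to keep peaks at exactly 1.0.

```python
import numpy as np

def gaussian_heatmap(height, width, centers, sigma=2.0):
    """Build a target heatmap: value 1.0 at each ground-truth center (cx, cy),
    decaying exponentially with distance from it."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width))
    for cx, cy in centers:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # keep the strongest response per pixel
    return heatmap

# two people: one centered at (16, 16), one at (48, 40)
hm = gaussian_heatmap(64, 64, [(16, 16), (48, 40)])
```

Plotting `hm` gives exactly the kind of bright-spots-on-black image the post is describing.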
So this is the target image, and there will be a corresponding predicted image (the set of numbers from the heatmap head) that we'll use to compute a loss. That loss is optimized using a variation of stochastic gradient descent (SGD), which is the crux of training a neural network.
If you want to know a bit more about SGD, you can refer here. Basically, it's an optimization technique that helps train our networks so that the actual output becomes similar to the target output.
2. Center offset information: This head of the branch is responsible for refining the center of each bounding box. Basically, it outputs numbers that pin down where the center of the target bounding box really is. This head exists to correct quantization errors introduced by downsampling, which I'll explain in another blog where I dive deeper into the original research.
3. Box size information: This head of the branch is responsible for estimating the size of the box. By combining the heatmap peaks with the numbers from the center offset head and the box size head, we can generate a predicted bounding box. This predicted bounding box is then compared against the target bounding box, and the loss is calculated and optimized using SGD, as previously mentioned.
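To see how the three heads combine, here's a deliberately simplified decoding sketch of my own. Real implementations pick peaks with max-pooling-based NMS and multiply by the network's output stride; here I just threshold the heatmap directly.

```python
import numpy as np

def decode_detections(heatmap, offset, size, threshold=0.5):
    """Combine the three detection heads into boxes:
    heatmap peaks give coarse centers, the offset head refines them,
    and the size head supplies width and height.
    offset and size are (2, H, W) arrays; heatmap is (H, W)."""
    boxes = []
    ys, xs = np.where(heatmap > threshold)  # simplistic peak picking
    for y, x in zip(ys, xs):
        cx = x + offset[0, y, x]            # refine the quantized center
        cy = y + offset[1, y, x]
        w, h = size[0, y, x], size[1, y, x]
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes
```

One heatmap peak plus its offset and size entries yields one (x1, y1, x2, y2) box, which is exactly the bounding box the post has been talking about.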
Identity Embedding Branch:
As the name suggests, this branch is responsible for generating the vector representation of the image patch that corresponds to the predicted bounding box. Basically, it’s encoding the information from the patch of the image into a 128 dimensional vector.
Again, this 128 dimensional vector is nothing but a set of numbers generated by this branch of the model for each bounding box it predicted. This vector representation is the key feature for tracking the person in corresponding frames.
Now we’ve looked at each piece of the person tracking puzzle, and we have a baseline understanding of these pieces. Let’s go ahead and connect them to each other to get a better picture.
So let’s start with the first frame of a video. We pass it to the network (a CNN) and obtain four outputs:
- Heatmap output
- Center offset output
- Box size output
- Re-ID (128 dimensional Re-Identification feature vector)
The first three outputs are responsible for producing the bounding boxes of the objects in the image, which are obtained via some mathematical formulae (not important here).
The fourth output identifies the objects represented by the bounding boxes generated from the first three outputs. For clarity: if n bounding boxes are predicted from the first three outputs, there will be n 128-dimensional identity vectors, one per box.
Now we have n boxes predicted from the first frame and their corresponding Re-IDs. We then generate the Re-IDs and bounding boxes for the next frame and compare all the Re-IDs with some similarity function; if the similarity is high, we label the box as the same person from the previous frame.
In this manner, we can proceed through the whole video, operating frame by frame.
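One way to perform that comparison (my sketch, not necessarily the repository's exact code) is cosine similarity between all pairs of Re-ID vectors, followed by Hungarian assignment so that each previous track claims at most one new detection:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_reids(prev_embs, curr_embs, min_sim=0.6):
    """Match current-frame Re-ID vectors to previous-frame ones.
    prev_embs, curr_embs: (n, d) arrays of embeddings.
    Returns (prev_idx, curr_idx) pairs; anything left unmatched
    would become a new track."""
    prev = prev_embs / np.linalg.norm(prev_embs, axis=1, keepdims=True)
    curr = curr_embs / np.linalg.norm(curr_embs, axis=1, keepdims=True)
    sim = prev @ curr.T                       # cosine similarity matrix
    rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= min_sim]
```

The `min_sim` cutoff keeps a person who left the scene from being force-matched to a stranger; its value here is my own illustrative choice.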
How it’s trained
Multiple losses are combined to train the network, using a variation of SGD. I'll give the high-level details of some difficult formulae that are combined into the loss function used to train the model.
The different losses are (taken from original paper):
You don’t have to be terrified by all these equations; we don’t need to understand them completely. They're included just as a reference from the paper. For the seekers, I'll explain them in future blogs.
So basically, we combine all these losses and optimize them and get a trained network. YAY!!
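For a flavor of how the combination works: the paper balances the detection and Re-ID losses with two learnable weights, an uncertainty-style multi-task weighting. A plain-Python sketch of that weighting (treat the exact constants as an approximation of the paper's formula, not gospel):

```python
import math

def total_loss(det_loss, id_loss, w1, w2):
    """Combine detection and identity losses with learnable balance
    parameters w1, w2. exp(-w) scales each task's loss down as its
    uncertainty grows, while the +w terms stop the weights from
    collapsing to negative infinity."""
    return 0.5 * (math.exp(-w1) * det_loss + math.exp(-w2) * id_loss + w1 + w2)
```

During training, w1 and w2 are updated by SGD along with the network weights, so the model itself learns how much to trust each task.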
Wrapping up the theory
In conclusion, we can say that we use vectors generated from the network for all the bounding boxes detected, and we match those vectors in the next frame with all the vectors and filter on the basis of high similarity to track the same people across multiple frames.
People Tracking — Code Implementation
One final turn on this ride, but an exciting one. Here I'll give a high-level overview of how to use this on your own custom video. There's a great GitHub repository with a trained model ready for us to use; I'll be using it for our demo purposes.
To generate tracking output for a video of your choice, you have to perform the following five easy steps:
Clone the git repository
Start by cloning this git repository (https://github.com/harsh2912/people-tracking) into a folder of your choice. This repository is a clone of the original code provided here, which implements the tracking algorithm discussed above. I've made a few changes on top of it to make it suitable for our task of generating output from a video.
Download the model
Next, download the pre-trained model from here, which will help us generate the desired output. Put the downloaded model in the models/ folder.
Next, we have to install the required dependencies so that our script can run. Please make sure you have a CUDA-enabled GPU so that this process works. These steps are already listed in the repository; I'll copy/paste them here so you can follow along:
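The setup usually looks like the following (the exact PyTorch version and the DCNv2 path are assumptions on my part; double-check them against the repository's README):

```shell
conda create -n FairMOT python=3.7
conda activate FairMOT
conda install pytorch torchvision -c pytorch
pip install -r requirements.txt
cd src/lib/models/networks/DCNv2
sh make.sh
```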
Here you’re simply creating a conda environment called FairMOT, in which you'll install all the Python dependencies using the pip and conda package managers. requirements.txt lists all the required libraries, which the pip installer takes care of.
Do not forget to run the make file of DCNv2.
Now that we have everything set up, we're good to go. We just have to run the Python script, which works on a given video, generates the output, and saves it to the specified path. I've named the Python script script.py, but you can call it whatever makes sense to you.
This script takes three input arguments: model_path (mp), video_path (vp), and output_directory (od). As the names suggest, model_path is where the model is kept, video_path is where the video is loaded from, and output_directory is where you want the output saved. The output will be written to output_directory with the name “output.avi”.
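Putting it together, the invocation looks something like this (the model and video filenames are placeholders; substitute your own):

```shell
cd src
python script.py --mp ../models/model.pth --vp ../videos/demo.mp4 --od ../output
```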
Go to the src folder and run the above command, where mp, vp, and od are the argument names.
Go take a snack break
In the time it takes to grab a quick snack, the output should be saved to your chosen directory.
I've also uploaded a Jupyter notebook named Experiment.ipynb, where you can tweak different parameters and play around more. It's in the src folder of the primary repo. To learn what the different parameters do, please go through the original repository here; most of the parameters are self-explanatory.
Sorry for the long article. Here’s a Smart Potato