Real-time Object Detection using SSD MobileNet V2 on Video Streams

An easy workflow for implementing pre-trained object detection architectures on video streams

In this article, we’ll be learning the following:

  1. What object detection is
  2. Various TensorFlow models for object detection
  3. Implementing MobileNetV2 on video streams
  4. Conclusion
  5. References

What is Object Detection?

Object detection can be defined as a branch of computer vision which deals with the localization and the identification of an object. Object localization and identification are two different tasks that are put together to achieve this singular goal of object detection.

Object localization deals with specifying the location of an object in an image or a video stream, while object identification deals with assigning the object to a specific label, class, or description. With computer vision, developers can build things like surveillance tracking systems for security enhancement, real-time crop prediction, and real-time disease identification and tracking in human cells.

TensorFlow Model Zoo for Object Detection

The TensorFlow Model Zoo is a collection of object detection architectures pre-trained on the COCO dataset, where they have performed well. The model zoo can be found here. The model architectures include:

  1. CenterNet
  2. EfficientDet
  3. MobileNet
  4. RetinaNet (ResNet backbone)
  5. R-CNN
  6. ExtremeNet

CenterNet (2019) is an object detection architecture based on a deep convolutional neural network trained to detect each object as a triplet (rather than a pair) of keypoints, so as to improve both precision and recall. More information about this architecture can be found here.

EfficientDet (2019) is an object detection architecture built to scale up model efficiency in computer vision. This architecture achieves much better efficiency than prior architectures across a wide spectrum of resource constraints. More information about this architecture can be found here.

MobileNet, released in 2017, is an efficient CNN architecture designed for mobile and embedded vision applications; in the model zoo it serves as the backbone of SSD detectors. It uses depth-wise separable convolutions to build lightweight deep neural networks. More information about the architecture can be found here.
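To see why depth-wise separable convolutions keep the network lightweight, it helps to count weights. The sketch below compares a standard convolution with its depth-wise separable counterpart. The function names and the example layer sizes are illustrative assumptions, and biases are ignored.

```python
def standard_conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a standard k x k convolution (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a depth-wise separable convolution (biases ignored).

    One k x k depth-wise filter per input channel, followed by a
    1 x 1 point-wise convolution that mixes the channels.
    """
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 32 input channels, 64 output channels.
standard = standard_conv_params(3, 32, 64)
separable = separable_conv_params(3, 32, 64)
print(standard, separable, round(standard / separable, 2))
```

For this hypothetical layer, the separable version needs 2,336 weights instead of 18,432, roughly an 8x reduction.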

RetinaNet is an architecture developed by the Facebook research team in 2018. RetinaNet uses a Feature Pyramid Network (FPN) backbone on top of a feed-forward ResNet architecture to generate a rich, multi-scale convolutional feature pyramid. It is a one-stage detector (that is, a single network, unlike R-CNN, which is 2-stage). More information about the architecture can be found here.

R-CNN (2014) is a 2-stage object detection architecture. Its Faster R-CNN variant, a region-based CNN, uses a Region Proposal Network to generate regions of interest in the first stage, and then sends the region proposals down the pipeline for object classification and bounding box regression. More information about the architecture can be found here.

ExtremeNet (2019) is a bottom-up object detection framework that detects the four extreme points (top-most, left-most, bottom-most, right-most) of an object by predicting four multi-peak heatmaps for each object category. More information about the architecture can be found here.
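As a small illustration of why four extreme points are enough to describe an object's extent, the sketch below derives a bounding box from them. It only shows the geometry. It does not reproduce ExtremeNet's heatmap prediction or point grouping, and the names Point and box_from_extreme_points are hypothetical.

```python
from typing import NamedTuple, Tuple


class Point(NamedTuple):
    x: float
    y: float


def box_from_extreme_points(
    top: Point, left: Point, bottom: Point, right: Point
) -> Tuple[float, float, float, float]:
    """Return (xmin, ymin, xmax, ymax) spanned by four extreme points.

    Image coordinates are assumed: x grows to the right, y grows downward.
    """
    return (left.x, top.y, right.x, bottom.y)


box = box_from_extreme_points(
    top=Point(50, 10), left=Point(20, 40), bottom=Point(55, 90), right=Point(80, 45))
print(box)  # (20, 10, 80, 90)
```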

Let’s go ahead and try out one of these model architectures on a typical video stream.

Implementation of MobileNetV2 on video streams

The following steps will help us achieve our object detection goal:

  1. Install the TensorFlow Object Detection API.
  2. Download the model file from the TensorFlow model zoo.
  3. Set up the configuration file and model pipeline.
  4. Create a script to put them together.

Installing TensorFlow Object Detection API

To get this done, refer to this blog:

Downloading the model file from the TensorFlow model zoo

To download the network architecture, you can follow the process below:

  1. Download the MobileNetV2 pre-trained model to your machine.
  2. Move it to the object detection folder.
  3. Create a Python script to run the real-time program.
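The first two steps can also be scripted. The following is a minimal sketch rather than the author's procedure: the base URL is an assumption based on the download links published in the model zoo, the model name is taken from the config path used later in this article, and the helper names are hypothetical.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumption: archive name and date folder follow the model zoo's TF2 download links.
MODEL_NAME = "ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8"
BASE_URL = "http://download.tensorflow.org/models/object_detection/tf2/20200711/"


def model_url(model_name: str = MODEL_NAME) -> str:
    """Build the download URL for a model archive."""
    return f"{BASE_URL}{model_name}.tar.gz"


def download_and_extract(model_name: str = MODEL_NAME, dest: str = ".") -> Path:
    """Download the archive into dest, unpack it, and return the model folder."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"{model_name}.tar.gz"
    if not archive.exists():
        # Skip the download when the archive is already present
        urllib.request.urlretrieve(model_url(model_name), str(archive))
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest_dir)
    return dest_dir / model_name


# Usage (needs network access), run from the object detection folder:
# model_folder = download_and_extract()
```

Passing the object detection folder as dest covers the "move" step, since the archive is unpacked directly where the script expects it.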

Having installed the TensorFlow Object Detection API, the next step is to import all the libraries, as the code below illustrates. Note that we also need other packages: SciPy and NumPy for numerical computation, and PIL and Matplotlib for image processing and visualization:

import matplotlib
import matplotlib.pyplot as plt

import warnings

import io
import glob
import scipy.misc
import numpy as np
from six import BytesIO
from PIL import Image, ImageDraw, ImageFont

import tensorflow as tf

import os, sys
# os.environ['PYTHONPATH'] += "./models"

# import sys
# sys.path.append("./models")

from object_detection.utils import label_map_util
from object_detection.utils import config_util
from object_detection.utils import visualization_utils as viz_utils
from object_detection.builders import model_builder

#%matplotlib inline

Having imported all the needed libraries, the next step is to write a simple Python function that converts images or real-time video frames into NumPy arrays. The code below does this:

def load_image_into_numpy_array(image):
  """Convert a video frame into a uint8 numpy array.

  Puts image into numpy array to feed into tensorflow graph.
  Note that by convention we put it into a numpy array with shape
  (height, width, channels), where channels=3 for RGB.

  Args:
    image: a frame read from the video stream, as a numpy array

  Returns:
    uint8 numpy array with shape (img_height, img_width, 3)
  """
  return image.astype(np.uint8)

Setting up the configuration file and model pipeline

Now that we can efficiently convert video frames into arrays, let’s go ahead and set up the configuration file and model pipeline. To do this, we can follow these steps:

  1. Identify the path to the pipeline config of our MobileNetV2 model. This configuration file defines the model architecture and parameters.
  2. Specify the checkpoint file of the model to be used (model_dir).
  3. Build the model for prediction by passing in the model's config.
  4. Use TensorFlow to restore the model’s last checkpoint by specifying the checkpoint path.

All of the above is completed in the gist below:

# Recover our saved model
pipeline_config = './ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.config'
# Generally you want to put the last ckpt from training in here
model_dir = 'ckpt-0'
configs = config_util.get_configs_from_pipeline_file(pipeline_config)
model_config = configs['model']
# Build the detection model in inference mode
detection_model = model_builder.build(
      model_config=model_config, is_training=False)

# Restore checkpoint
ckpt = tf.compat.v2.train.Checkpoint(
      model=detection_model)
ckpt.restore(model_dir).expect_partial()
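The comment in the snippet above suggests pointing model_dir at the last checkpoint from training. As a hedged sketch, the helper below (a hypothetical name, not part of the Object Detection API) scans a folder for files named ckpt-<N>.index and returns the prefix with the highest number. It assumes the checkpoint files follow that naming pattern.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint_prefix(checkpoint_dir: str) -> Optional[str]:
    """Return the prefix (e.g. '.../ckpt-3') of the highest-numbered checkpoint.

    Looks for files named 'ckpt-<N>.index'; returns None when there are none.
    """
    pattern = re.compile(r"^ckpt-(\d+)\.index$")
    best_step, best_prefix = -1, None
    for entry in Path(checkpoint_dir).iterdir():
        match = pattern.match(entry.name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            # Drop the '.index' suffix to get the checkpoint prefix
            best_prefix = str(entry.with_suffix(""))
    return best_prefix
```

The returned prefix could then be used as model_dir. TensorFlow also provides tf.train.latest_checkpoint, which reads the checkpoint state file in the folder instead of scanning file names.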

Next, we need to assemble the full pipeline that detects objects in each frame and assigns labels from the pre-trained model. The get_model_detection_function in the gist below helps to:

  1. Pre-process the image.
  2. Assign a target label to each object in the image.
  3. Predict the probability of the target label for each object in the frame.

The script below helps us complete this process:

def get_model_detection_function(model):
  """Get a tf.function for detection."""

  @tf.function
  def detect_fn(image):
    """Detect objects in image."""

    image, shapes = model.preprocess(image)
    prediction_dict = model.predict(image, shapes)
    detections = model.postprocess(prediction_dict, shapes)

    return detections, prediction_dict, tf.reshape(shapes, [-1])

  return detect_fn

detect_fn = get_model_detection_function(detection_model)
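The detections returned by detect_fn include normalized boxes in [ymin, xmin, ymax, xmax] order along with a score and a class index for each box. The sketch below shows one way to keep only the confident detections and scale their boxes to pixel coordinates. The helper name, the 0.5 threshold, and the sample values are assumptions for illustration.

```python
import numpy as np


def filter_detections(boxes, scores, classes, image_height, image_width, min_score=0.5):
    """Keep detections scoring at least min_score and scale boxes to pixels.

    boxes: float array [N, 4] of normalized [ymin, xmin, ymax, xmax]
    scores: float array [N]
    classes: array [N]
    Returns (pixel_boxes, kept_scores, kept_classes).
    """
    boxes = np.asarray(boxes, dtype=np.float32)
    scores = np.asarray(scores, dtype=np.float32)
    classes = np.asarray(classes)
    keep = scores >= min_score
    scale = np.array(
        [image_height, image_width, image_height, image_width], dtype=np.float32)
    pixel_boxes = np.rint(boxes[keep] * scale).astype(int)
    return pixel_boxes, scores[keep], classes[keep]


# Hypothetical values standing in for detections['detection_boxes'][0].numpy(), etc.
boxes = [[0.1, 0.2, 0.5, 0.6], [0.0, 0.0, 1.0, 1.0]]
scores = [0.9, 0.2]
classes = [0, 2]
print(filter_detections(boxes, scores, classes, image_height=480, image_width=640))
```

Only the first box survives the threshold here, and it maps to pixel coordinates [48, 128, 240, 384] on a 640x480 frame.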

Last but not least, we need to initialize the label mapping. The label map contains the target labels of the pre-trained classes. It is used to look up the label name of every object identified in the frame. The gist below specifies the path to the label map and loads all labels with their associated values:

#map labels for inference decoding
label_map_path = configs['eval_input_config'].label_map_path
label_map = label_map_util.load_labelmap(label_map_path)
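categories = label_map_util.convert_label_map_to_categories(
    label_map,
    max_num_classes=label_map_util.get_max_label_map_index(label_map),
    use_display_name=True)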
category_index = label_map_util.create_category_index(categories)
label_map_dict = label_map_util.get_label_map_dict(label_map, use_display_name=True)
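To make the role of category_index concrete, here is a small sketch of the lookup it enables. The dictionary below is a hand-written stand-in with three COCO entries, not the output of the code above, and class_name is a hypothetical helper. The offset of 1 mirrors the label_id_offset used later in the detection loop.

```python
# Assumption: a tiny stand-in for the category_index built from the COCO label map.
category_index = {
    1: {"id": 1, "name": "person"},
    2: {"id": 2, "name": "bicycle"},
    3: {"id": 3, "name": "car"},
}

# The model's class indices start at 0, while label map ids start at 1.
LABEL_ID_OFFSET = 1


def class_name(raw_class: int, index: dict, offset: int = LABEL_ID_OFFSET) -> str:
    """Translate a raw class index from the model into a display name."""
    entry = index.get(int(raw_class) + offset)
    return entry["name"] if entry else "unknown"


print(class_name(0, category_index))  # person
print(class_name(2, category_index))  # car
```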

Now that all this is set, the next step is to initialize the video stream with OpenCV, and then initialize a video writer. The code below helps us to get this done:

import random
import numpy as np
import cv2
import tensorflow as tf

cap = cv2.VideoCapture(0)
# Define the codec and create VideoWriter object
fourcc = cv2.VideoWriter_fourcc(*'XVID')
out = cv2.VideoWriter('stb_out.avi',fourcc, 20.0, (640,480))

Putting it all together

Now that we have the video stream and the writer in place, the next step is to keep the video stream live and perform real-time object detection by looping through the frames captured from the video stream. As long as this keeps running, we can see the object detection result displayed on our screen.

Finally, once the stream stops, the video writer saves all the frames captured so far as a video (with the real-time object detection result). The code below does this from end to end.

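while True:
    # Capture frame-by-frame
    ret, image_np = cap.read()
    if not ret:
        break
    image_np = load_image_into_numpy_array(image_np)

    input_tensor = tf.convert_to_tensor(
        np.expand_dims(image_np, 0), dtype=tf.float32)

    detections, predictions_dict, shapes = detect_fn(input_tensor)

    label_id_offset = 1
    image_np_with_detections = image_np.copy()

    # Draw boxes and labels; score threshold left at the library default (an assumption)
    viz_utils.visualize_boxes_and_labels_on_image_array(
          image_np_with_detections,
          detections['detection_boxes'][0].numpy(),
          (detections['detection_classes'][0].numpy() + label_id_offset).astype(int),
          detections['detection_scores'][0].numpy(),
          category_index,
          use_normalized_coordinates=True)

    # Save and display the resulting frame
    out.write(image_np_with_detections)
    cv2.imshow('frame', image_np_with_detections)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# When everything done, release the capture and the writer
cap.release()
out.release()
cv2.destroyAllWindows()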

Result and conclusion

The video above demonstrates everything we have covered. It was recorded in BGR format, the channel order OpenCV uses by default, but you can always convert frames to RGB when trying out your own real-time object detector with the MobileNet V2 architecture.

Lastly, in the video, it took a while before the architecture could identify people at the rear, as well as a few close by. This doesn’t mean the architecture isn’t capable of detecting them. A likely reason is that the frames were fed in BGR format, while the network was trained on RGB images.

Nonetheless, recording in BGR is not an unavoidable cause of the missed detections, since BGR frames can always be converted to RGB before they are fed into the network for real-time prediction.
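A minimal sketch of that conversion is shown below, using NumPy to reverse the channel axis of a frame. The helper name bgr_to_rgb is an assumption. With OpenCV, cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) achieves the same result.

```python
import numpy as np


def bgr_to_rgb(frame: np.ndarray) -> np.ndarray:
    """Reverse the channel axis of a (height, width, 3) frame."""
    return np.ascontiguousarray(frame[..., ::-1])


# A single blue pixel in BGR order becomes (0, 0, 255) in RGB order.
bgr_pixel = np.array([[[255, 0, 0]]], dtype=np.uint8)
print(bgr_to_rgb(bgr_pixel)[0, 0].tolist())  # [0, 0, 255]
```

In the detection loop, the converted frame would be the one passed to the model, while the original BGR frame remains the right input for cv2.imshow and the video writer, which both expect BGR.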

The model performs well on unseen data (the video frames): it carried its COCO pre-trained performance over to a video stream it was never trained on. To further improve the model's performance on frames like the above, we’d need to retrain the architecture on more data, particularly ground-truth data that represents the core problem we’re trying to solve.

I do hope you’ve learned a lot from this tutorial. If so, do share with friends and colleagues.

Thank you.




