Object detection has always been one of the most extensively studied problems in computer vision. As one of the most fundamental perception tasks, it underpins many commercial computer vision applications.
Recently, however, attention has turned to a long-standing limitation, opening up an entirely new set of problems in this relatively mature field.
Specifically, most research to date has focused exclusively on 2D object detection, meaning the bounding boxes drawn around detected objects are only two-dimensional. Work in this area has largely aimed to increase detection accuracy, improve detection speed, or introduce new methods and architectures for detecting and tracking objects.
Today, however, with new and evolving applications, there is a need for methods that detect and track objects more accurately in real-world environments.
To tackle this problem, Google AI has released MediaPipe Objectron, a mobile, real-time 3D object detection pipeline for everyday objects that estimates 3D bounding boxes from 2D images.
Implemented in MediaPipe, an open-source, cross-platform framework for building pipelines that process multimodal perceptual data, Objectron computes 3D bounding boxes around objects and orients them in real time on mobile devices.
Why do we need 3D bounding boxes?
One could argue that 2D bounding boxes have met every challenge put before them, and that 3D boxes are little more than a research curiosity.
But although 2D object detection is used widely in industry, 3D object detection is now making its mark in the market: a number of industries are proposing applications that require a three-dimensional view of an object and make use of its spatial attributes.
This approach could open up an entirely new set of tasks. These include but are not limited to robotics, image retrieval, augmented reality, and self-driving vehicles.
Augmented reality and 3D modeling have especially captured interest, and more recently, Snapchat’s Ground Transformation became an instant hit in the AR world.
This wasn’t the first time a company achieved a show-stopping application with augmented reality, and it certainly won’t be the last. As augmented reality advances, more and more emphasis will be placed on such technologies for branding and advertising.
A good example is IKEA’s furniture display app, which uses AR to place products in real-world environments. The field still has limitations and bottlenecks, however, owing to the difficulty of scaling digital objects so that they match the real-world environment exactly.
3D modeling would allow users to accurately capture an object’s size, orientation, and position, which could pave the way for further advancements in virtual product placement and advertising tools.
By introducing MediaPipe’s Objectron, Google has also hinted that there will be a significant shift in AR rendering in the years to come.
But what is MediaPipe’s Objectron?
In the simplest terms, the tech giant has built a mobile, real-time 3D object detection pipeline for objects used in our day-to-day lives.
Implemented in MediaPipe (hence the name), an open-source, cross-platform framework for building and maintaining processing pipelines, Objectron essentially gives the user a 3D perspective on how object detection works in everyday, real-world environments.
How does it work?
For the more curious among us, there’s always an urge to know how things work before we can try them out. If you’re one of those people who like to know exactly what they’re trying out, this section is for you!
MediaPipe’s Objectron uses what is essentially an ML pipeline to accomplish its main task: a single-stage model that predicts the pose as well as the physical size and shape of an object from a single RGB image.
Its architecture is primarily an encoder-decoder built on MobileNetV2. It takes a multi-task learning approach, meaning multiple learning tasks are trained at the same time, exploiting the similarities and differences between them.
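The multi-task idea can be boiled down to a tiny sketch: one shared backbone produces features, and each task head contributes a weighted loss to a single training objective. The function name and weights here are hypothetical, purely for illustration.

```python
def multi_task_loss(detection_loss, shape_loss, w_det=1.0, w_shape=0.5):
    """Combine per-task losses into one objective so both heads
    train jointly on the shared encoder-decoder features.
    (Hypothetical weights; not Objectron's actual values.)"""
    return w_det * detection_loss + w_shape * shape_loss

total = multi_task_loss(detection_loss=0.8, shape_loss=0.4)
# total = 1.0 * 0.8 + 0.5 * 0.4 = 1.0
```

Gradients from both losses flow back through the shared backbone, which is what lets the tasks benefit from each other.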
The shape task predicts the object’s shape signals from whatever ground-truth annotation is available. For the detection task, the annotated bounding boxes are fitted with a Gaussian distribution, and the model is trained to predict this distribution, with its peak representing the object’s center.
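To make the Gaussian-peak idea concrete, here is a minimal, illustrative rendering of such a target heatmap: a 2D Gaussian whose maximum sits at the annotated object center, so the argmax of the predicted map recovers the center. Objectron's actual target construction may differ in the details (e.g. how sigma is chosen).

```python
import math

def gaussian_heatmap(width, height, cx, cy, sigma):
    """Render a 2D Gaussian whose peak (value 1.0) marks the
    object's center at pixel (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

hm = gaussian_heatmap(8, 8, cx=3, cy=5, sigma=1.5)
# hm[5][3] == 1.0, the peak at the annotated center; values
# fall off smoothly with distance from (3, 5).
```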
To obtain the 3D bounding boxes, Objectron uses an established pose estimation algorithm, Efficient Perspective-n-Point (EPnP) estimation, which can recover the 3D bounding box of an object without prior knowledge of the object’s dimensions.
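PnP is easiest to understand through the forward problem it inverts. A pinhole camera projects each 3D box vertex to a 2D pixel; given the 2D projections predicted by the network plus the camera intrinsics, PnP solves backwards for the 3D pose. The sketch below shows only the forward projection, with illustrative intrinsics (fx, fy, cx, cy), not Objectron's.

```python
def project(point3d, fx, fy, cx, cy):
    """Pinhole projection of a 3D point in camera coordinates
    (X, Y, Z) to pixel coordinates (u, v)."""
    X, Y, Z = point3d
    return (fx * X / Z + cx, fy * Y / Z + cy)

# A box vertex 2 m in front of the camera, slightly right and above center:
u, v = project((0.2, -0.1, 2.0), fx=500, fy=500, cx=320, cy=240)
# → (370.0, 215.0)
```

PnP solvers such as EPnP take several of these (u, v) observations for known model points and recover the rotation and translation that best explain them.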
Detection and Tracking in MediaPipe
Let’s not forget the very important role MediaPipe plays in this project. From the text above, it may not be clear where MediaPipe comes into play: put plainly, MediaPipe is responsible for the real-time part.
Running a complex model on an edge device for every frame would likely cause jitter and lag, mainly owing to the ambiguity of the predicted 3D bounding boxes. The framework removes the need to run the model on every incoming frame, which allowed the researchers to use heavier, more accurate models without delaying the output.
To improve efficiency and temporal consistency further, the researchers run model inference only once every few frames and track the predictions over time. Each time a new prediction is made, it is consolidated with the previous one based on their area of overlap.
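The "area of overlap" check is typically an intersection-over-union (IoU) test. The sketch below shows one simple way such a consolidation step could work on 2D boxes; the function names and the 0.5 threshold are hypothetical, not MediaPipe's actual implementation.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def consolidate(tracked, detected, threshold=0.5):
    """Keep the smoothly tracked box when a fresh detection overlaps
    it enough; otherwise adopt the new detection as a new object."""
    return tracked if iou(tracked, detected) >= threshold else detected
```

Between inference frames, a lightweight tracker propagates the last consolidated box, which is what keeps the output smooth at camera frame rate.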
Where would this be used?
Now that we’re well-acquainted with how Objectron works, the obvious question arises — where would it be used?
As it stands, most are predicting that MediaPipe’s Objectron is more or less a hint from Google towards the future of AR rendering. Objectron, with its 3D object detection capabilities, would make a massive impact in the AR area.
Moreover, the company’s novel approach, AR Synthetic Data Generation, can place virtual objects into scenes that already contain AR session data, enabling seamless integration of high-quality synthetic data with real-world imagery.
With MediaPipe’s Objectron, Google can also make use of camera poses, detected planar surfaces, and rough estimates of the lighting conditions in an environment, all of which help generate placements that are not only physically plausible but also match the scene far more realistically.
After all, AR will help companies take immense advantage of the more immersive and engaging visual aspects of these elements for developing new marketing strategies.
With such promising early uses, and with continuing advances like these in edge vision for product placement, 3D modeling is expected to make virtual object placement much smarter and more engaging.
Want to try it out?
Google has released an end-to-end demo mobile application to encourage researchers (or anyone with an innovative idea!) to experiment and prototype with this technology.
After all, the best ideas can come from anywhere. The on-device ML pipeline in MediaPipe allows users to sample trained models for two categories — shoes and chairs. The application allows users to implement these models on their devices and view 3D bounding boxes, all in real-time using TensorFlow Lite with GPU acceleration.
To build and install the Objectron for shoes:
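The commands below reflect the Bazel target used when Objectron's Android demo was first released; the exact path may have changed since, so verify it against the current MediaPipe repository.

```shell
# Build the Android demo with the shoe detection model (the default).
# Target path is from the original release; check the current repo.
bazel build -c opt --config android_arm64 \
  mediapipe/examples/android/src/java/com/google/mediapipe/apps/objectdetection3d:objectdetection3d
```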
To build and install the Objectron for chairs:
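The chair demo was built from the same target with a build-time define selecting the chair model; the flag name below is from the original release and should be checked against the current repository.

```shell
# Build the Android demo with the chair detection model instead.
bazel build -c opt --config android_arm64 --define chair=true \
  mediapipe/examples/android/src/java/com/google/mediapipe/apps/objectdetection3d:objectdetection3d
```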
Now that the application is built, we can install it on an Android device by running the following command:
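With a device connected over USB debugging, the built APK can be installed with adb; the APK path mirrors the Bazel target above and may differ in newer versions of the repository.

```shell
# Install (or reinstall over an existing copy, via -r) the built APK.
adb install -r \
  bazel-bin/mediapipe/examples/android/src/java/com/google/mediapipe/apps/objectdetection3d/objectdetection3d.apk
```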
More information can be found on the MediaPipe GitHub page.
By sharing this with the world, Google hopes the broader research and development community will generate new and innovative ideas and use cases, and propel further efforts to improve the underlying models.
On its blog, Google AI has announced that it plans to scale its model to include many different categories and further improve the model’s on-device performance. This progress will be worth following for all developers interested in improving on-device machine learning and augmented reality experiences.