Augmented reality (AR) and artificial intelligence (AI) are two of the most promising technologies available to mobile app developers. Huge hype cycles and rapidly evolving tools, though, have blurred the lines between the two, making it difficult to tell where AI ends and AR begins. This post aims to disambiguate AR and AI. It covers how AR and AI work together, the current state of SDKs and APIs for each, and some practical ways to combine them to build incredible mobile experiences.
First, some definitions.
Augmented reality is an experience that blends physical and digital environments. Think Pokémon Go or Snapchat. Computer-generated objects coexist and interact with the real world in a single, immersive scene. This is made possible by fusing data from multiple sensors — camera(s), gyroscopes, accelerometers, GPS, etc. — to form a digital representation of the world that can be overlaid on top of the physical one.
Artificial intelligence (and more precisely, machine learning and deep learning) includes algorithms and statistical models capable of performing tasks without explicit instructions. Machine learning models are shown training data from which they learn patterns and correlations that help them achieve their goals. These models are the engines inside things like predictive keyboards and intelligent photo organizers.
Putting it all together
Augmented reality and artificial intelligence are distinct technologies, but they can be used together to create unique experiences.
In augmented reality, a 3D representation of the world must be constructed to allow digital objects to exist alongside physical ones. Visual data is fused with accelerometer and gyroscope readings to build a map of the world and track movement within it. Most of these tasks are still performed using traditional computer vision techniques that make no use of machine learning.
Independently, however, AI models have gotten incredibly good at doing many of the things required to build immersive AR experiences. Deep neural networks can detect vertical and horizontal planes, estimate depth and segment images for realistic occlusion, and even infer 3D positions of objects in real-time. Because of these abilities, AI models are replacing some of the more traditional computer vision approaches underpinning AR experiences.
AI models are also being layered on top of AR. The same segmentation models that perform people occlusion can also create effects like Black Mirror’s infamous people blocker. Faces can be swapped in real-time, changing a person’s age or gender. When done right, AI and AR work seamlessly together to create immersive mobile experiences.
Where does AI end and AR begin?
Because AI can be used above and below AR scenes, it can be difficult to know which tools provide which functionality. When building a mobile app, you’ll be switching back and forth between various APIs to build the experience you want.
Let’s run through a few of the most popular developer tools and when to apply each:
ARKit and ARCore
ARKit and ARCore are the canonical augmented reality SDKs on iOS and Android, respectively. Though they differ slightly in their APIs, they perform the same basic functions. They combine data from a device’s sensors to build the 3D world, track movement, render digital objects, and mediate interactions between digital and physical content. You’ll use them primarily to place and manipulate objects within scenes. Though they may make use of AI, those models are typically abstracted away from users, who are given access to high-level outputs (e.g. occlusion masks for people).
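As a minimal sketch of that placement workflow on iOS, the following assumes an `ARSCNView` named `sceneView` wired up in a view controller with a tap gesture recognizer attached; it enables plane detection and anchors a sphere where the user taps. (Names like `ARPlacementViewController` are illustrative, not from any shipped API.)

```swift
import UIKit
import ARKit
import SceneKit

class ARPlacementViewController: UIViewController {
    @IBOutlet var sceneView: ARSCNView!

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
        // Track the device in 6DOF and detect flat surfaces.
        let configuration = ARWorldTrackingConfiguration()
        configuration.planeDetection = [.horizontal, .vertical]
        sceneView.session.run(configuration)
    }

    @objc func handleTap(_ gesture: UITapGestureRecognizer) {
        let point = gesture.location(in: sceneView)
        // Raycast from the tap into the scene, looking for a detected plane.
        guard let query = sceneView.raycastQuery(from: point,
                                                 allowing: .existingPlaneGeometry,
                                                 alignment: .any),
              let result = sceneView.session.raycast(query).first else { return }

        // Place a small virtual sphere at the hit location in world space.
        let sphere = SCNNode(geometry: SCNSphere(radius: 0.05))
        sphere.simdTransform = result.worldTransform
        sceneView.scene.rootNode.addChildNode(sphere)
    }
}
```

The ARCore equivalent follows the same pattern in Kotlin: run a `Session`, hit-test against detected planes, and attach an `Anchor` where the ray lands.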
Core ML and TensorFlow Lite
Core ML and TensorFlow Lite are the standard frameworks for running AI models on-device: Core ML on iOS, TensorFlow Lite on Android (and beyond). They're used to execute models independent of augmented reality. These APIs provide low-level control over a model's input and output data and allow developers to insert their own custom models, trained to perform tasks specific to their applications.
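On iOS, Vision is the usual bridge between camera frames and Core ML. A hedged sketch, assuming a classifier you've added to your project (here called `MyClassifier`, an assumed name for any auto-generated model class):

```swift
import CoreML
import Vision

// Run a custom Core ML classifier on a camera frame and hand back the
// top label. `MyClassifier` stands in for your own .mlmodel's generated class.
func classify(pixelBuffer: CVPixelBuffer,
              completion: @escaping (String?) -> Void) {
    guard let coreMLModel = try? MyClassifier(configuration: MLModelConfiguration()).model,
          let visionModel = try? VNCoreMLModel(for: coreMLModel) else {
        completion(nil)
        return
    }
    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        // Vision returns classifications sorted by confidence.
        let top = (request.results as? [VNClassificationObservation])?.first
        completion(top?.identifier)
    }
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    try? handler.perform([request])
}
```

TensorFlow Lite's `Interpreter` plays the same role on Android: you allocate tensors, copy the frame into the input buffer, and invoke the model directly.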
The most common way for developers to combine AR and AI models is to take images or audio from a scene, run that data through a model, and use the model output to trigger effects within the scene. Here are a few examples:
- Image or scene labeling: A camera frame is run through an AI model that classifies an image. The classification triggers an AR label for that location.
- Object detection: A camera frame is passed to an AI model that estimates the position and extent of objects within a scene. Location information is then used to form hit boxes and colliders that facilitate interactions between physical and digital objects.
- Semantic segmentation and occlusion: While ARKit may provide generic people occlusion capabilities, a custom AI model can be used to segment and occlude cars or other objects.
- Pose estimation: An AI model infers the position of objects like hands and fingers, which are used to control AR content.
- Text recognition and translation: An AI model detects, reads, and translates text in an image. Augmented reality APIs are then used to overlay translated text back into the 3D world.
- Audio recognition: AI models listen for specific words that trigger AR effects. For example, a user says the word “Queen” and a virtual crown appears on their head.
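The first pattern above (classify a frame, then label the scene) can be sketched end-to-end by combining ARKit and Vision. This assumes a `VNCoreMLModel` you've already built around your own classifier; the function name and confidence threshold are illustrative:

```swift
import ARKit
import SceneKit
import Vision

// Classify the current camera frame and, if the model is confident,
// drop a text label into the AR scene half a meter in front of the camera.
func labelCurrentFrame(in sceneView: ARSCNView, using labelModel: VNCoreMLModel) {
    guard let frame = sceneView.session.currentFrame else { return }

    let request = VNCoreMLRequest(model: labelModel) { request, _ in
        guard let best = (request.results as? [VNClassificationObservation])?.first,
              best.confidence > 0.8 else { return }

        DispatchQueue.main.async {
            // Build a small 3D text node from the predicted label.
            let text = SCNText(string: best.identifier, extrusionDepth: 0.01)
            let node = SCNNode(geometry: text)
            node.scale = SCNVector3(0.005, 0.005, 0.005)

            // Position it 0.5 m in front of the current camera pose.
            var translation = matrix_identity_float4x4
            translation.columns.3.z = -0.5
            node.simdTransform = frame.camera.transform * translation
            sceneView.scene.rootNode.addChildNode(node)
        }
    }
    // ARFrame.capturedImage is a CVPixelBuffer that Vision accepts directly.
    let handler = VNImageRequestHandler(cvPixelBuffer: frame.capturedImage, options: [:])
    try? handler.perform([request])
}
```

The other patterns follow the same shape: swap the classification request for detection, segmentation, or pose models, and swap the text node for hit boxes, occlusion materials, or rigged content.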
Augmented reality and artificial intelligence are separate but complementary technologies. Smaller, faster, and more accurate AI models will increasingly serve as the engines of core AR functionality, given their ability to track and understand the 3D world. They'll also continue to enhance AR experiences, adding effects and interactivity to AR scenes.