With big industry players committing a lot of resources to edge ML, increased research into model compression/optimization for resource-constrained devices, and an ever-expanding landscape of use cases, we’ve seen the initial stages of what we believe is a revolution in how and where machine learning operates.
To help keep track of all the developments, we’ve put together this end-of-year roundup. In an attempt to organize all this, I’ve broken the major news items and developments into the following two sections:
- The Big Players: A look at what the likes of Apple, Google/TensorFlow, and Facebook/PyTorch have contributed to the mobile ML landscape in 2019.
- The Technology: A high-level look at how mobile ML technology (and research surrounding that technology) has evolved over the year.
I’ll then conclude with a couple honorable mentions—news that didn’t quite fit into these sections but is worth a quick note. If there’s anything we missed, or if you’d like to contribute a news item, drop us a note in the comments.
Part 1: The Big Players
To better understand the evolution of mobile/edge ML, it’s essential to start with the industry’s heavy hitters. Their platforms, tools, and dedicated user experiences pave the way for developers and ML teams to create AI-powered edge applications. In this section, we’ll review the best and brightest from some of the industry titans in 2019.
In the last several years, Apple has built out a pretty impressive ecosystem of ML tools, from model training/building tools like Create ML and Turi Create to their on-device framework Core ML and their dedicated augmented reality API, ARKit. A lot happened in 2019, so let’s take a look at the highlights.
Core ML 3
The big news of the year for Apple’s on-device ML ecosystem came in updates for Core ML 3, announced at WWDC. Core ML now includes Model Personalization (i.e. on-device model training); support for more layer types and tensor operators; support for recommenders, audio models, and linked models; and better conversion support from Keras and TensorFlow. You can read Apple’s own review of these updates here.
ARKit, Apple’s dedicated tool for creating augmented reality experiences, also saw a number of important updates and upgrades, the primary ones being announced at WWDC, as well.
For those without any formal 3D design experience, Apple announced Reality Composer and RealityKit, two tools which go hand-in-hand to allow developers to create augmented reality experiences with ease. In code, RealityKit provides — as the name suggests — a way for developers to apply effects to improve the realism of their apps.
People have always been the center of ARKit, but now, it’s more apparent than ever. Three major features in this year’s release of ARKit are people occlusion, motion capture, and the ability to track multiple faces at once. This allows developers to create very immersive experiences, since augmented reality content can now appear to move behind people, and the ability to map people’s motions to a wireframe skeleton allows for much more flexibility. Previously, face detection was limited to one face, but now, developers can detect multiple faces using the front and rear cameras simultaneously.
Among the plethora of iPhone 11 announcements, machine learning featured prominently, with the reveal of the A13 Bionic chip and its triple camera system that enables advanced computational photography powered by ML (dubbed “Deep Fusion”). The A13 has some particularly noteworthy specs, including:
- 8 core neural engine
- 6x faster matrix multiplication
- Up to 1 trillion ops per second on CPU
- Up to 40% more power efficient
Deep Fusion works by capturing eight images before the user actually presses the shutter, taking one long exposure at the moment of capture, and then processing all nine frames to enhance detail and lower noise, all powered by on-device ML.
In 2019, we saw on-device ML play a huge role in many of Google’s software and hardware layers, consumer devices, and open-source projects.
Google released the Edge TPU chip, initially announced in 2018. Meant to accelerate inference on edge devices, this chip is a specialized variant of the incredibly powerful Cloud TPU available for training models. The Edge TPU is available as part of a complete single-board computer or as a co-processor that connects to an existing board via USB-C. As of now, Edge TPUs support a limited set of models (e.g. Google’s MobileNet backbone).
Google’s Coral Dev board, which accompanied the release of the Edge TPU, was launched out of beta in October. And in August, Google released dedicated classification models for the Dev Board that leverage the Edge TPU.
GPU support for TensorFlow Lite
Early in 2019, Google announced GPU support for TensorFlow Lite models. While still in developer preview, this is a huge announcement. The 10–100X speed difference between models running on Android vs iOS can be attributed to TFLite’s lack of support for ML accelerators like GPUs and NPUs. Over time, this release should close the gap significantly and make it much easier to develop for both platforms.
TensorFlow for microcontrollers
Along with the release of the Edge TPU and Coral Dev Board, the TensorFlow team also introduced TensorFlow’s support for microcontrollers at their 2019 Dev Summit. This announcement also included the introduction of an embedded board available from SparkFun for $15.
Since this announcement, there’s been a whole lot of discussion, demo projects, and more about this rapidly emerging tech.
TensorFlow Lite 1.0
This year, the transition from TensorFlow Mobile to TensorFlow Lite became (almost) complete. This was cemented with the release of TensorFlow Lite 1.0, the first fully-fledged version of TFLite. This release was also announced at the TensorFlow Dev Summit, and its highlights included:
- Support for ML accelerators like GPUs and DSPs
- Support for TensorFlow operations with TensorFlow Select
- Model quantization and optimization toolkit
- Model pruning tools
In all, it was an incredible year of progress for TensorFlow Lite, and we’re seeing more and more progress with the recent full release of TensorFlow 2.0. For a deep dive into what TensorFlow Lite looks like as we close out the year, I’d suggest exploring their pretty exhaustive developer guide.
ML Kit’s new NLP APIs
Reaching beyond image and video processing ML features, Google’s on-device ML platform now includes APIs capable of identifying the given language of any text and automatically generating reply suggestions based on previous messages (similar to the Gmail feature). Language Identification recognizes text in 110 different languages within an average of 1–2 ms, and Smart Reply bases its suggestions on the last 10 messages in a conversation.
Full on-device speech recognition
Announced at this year’s Google I/O, the integration of a fully on-device speech recognition system into Google’s voice assistant and several of its core applications (Gboard, for instance) is incredibly impressive and important. Here are the highlights:
- Full on-device speech recognition model down to 500MB, allowing the Assistant to run up to 10x faster.
- This on-device inference speed allows for seamless interoperability across apps.
- Released on Google’s Pixel 4 this year—and has recently been expanded to Pixel 2 and 3.
Google announced their newest line of Pixel phones at their launch event in October, unveiling a slew of new features and capabilities (many of them leveraging on-device machine learning). The Pixel 4 now has Face Unlock (similar to Apple’s Face ID) and Motion Sense, which creates a 2 ft-diameter hemisphere of spatial awareness around the phone. The Pixel 4’s Neural Core chip is an improvement on the Pixel 3’s Visual Core, signifying that Google is moving beyond on-device computer vision (i.e. photo enhancement in the Pixel 3) towards speech and NLP tasks.
Google has announced a number of updates to ARCore over the course of the year. First, there was the introduction of the Augmented Faces API in February, which applies a 468-point 3D mesh upon which devs add mask-like textures, objects, and even facial retouching.
From there, another slew of improvements was announced in September. The Augmented Faces API’s creation process was streamlined with a face effects template and more robust support for iOS. The Cloud Anchors API, which allows multiple devices to access the cloud to share info about AR objects within a real-world scene, can produce a more robust 3D feature map. And with Persistent Cloud Anchors, these cloud anchors can last indefinitely, regardless of surface or distance.
And most recently, ARCore’s team introduced a new Depth API, which is in developer preview and allows developers to create a depth map using input from a single RGB camera. It works by taking multiple images from various angles and comparing them as the phone moves, and as a result can estimate the distance to every pixel. This API is a major step forward in dealing with a central problem in augmented reality tech: occlusion, or the ability for digital objects to accurately appear in front of or behind real-world objects.
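To make the occlusion idea a bit more concrete, here’s a minimal sketch in plain Python. This is not ARCore’s API; the function name `visible_pixels` and the tiny hand-written depth maps are assumptions for illustration only. The idea is simply that once you have an estimated distance for every pixel, a virtual object should only be drawn where it’s closer to the camera than the real-world surface at that pixel.

```python
def visible_pixels(scene_depth, object_depth):
    """Return a mask that is True where the virtual object should be drawn.

    scene_depth:  per-pixel distance (in meters) to the real world.
    object_depth: per-pixel distance to the virtual object, or None where
                  the object does not cover that pixel.
    Both names are hypothetical; this is not an ARCore data structure.
    """
    mask = []
    for real_row, virtual_row in zip(scene_depth, object_depth):
        mask.append([
            # Draw the object only if it exists here and is in front of
            # the real surface; otherwise the real world occludes it.
            virtual is not None and virtual < real
            for real, virtual in zip(real_row, virtual_row)
        ])
    return mask


# A 2x3 "image": a wall 2.0 m away, with a chair 0.8 m away on the right.
scene_depth = [
    [2.0, 2.0, 0.8],
    [2.0, 2.0, 0.8],
]
# A virtual object 1.5 m away covering the middle and right columns.
object_depth = [
    [None, 1.5, 1.5],
    [None, 1.5, 1.5],
]

mask = visible_pixels(scene_depth, object_depth)
```

In this toy scene, the object shows up in front of the wall but is hidden behind the chair, which is exactly the effect a per-pixel depth map makes possible.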
AutoML Vision Edge
In early October, Google announced a slew of updates to their Cloud AI offering. First, AutoML Vision Edge, which helps with training and deploying models at the edge and was released in 2018, now includes object detection in addition to image classification. Second, AutoML Video is also now capable of performing object detection, allowing devs to track the movement of multiple objects between frames. And lastly, they’ve updated their Video Intelligence API to allow for the recognition, detection, and tracking of popular business logos.
Google Lens and Search
Another big announcement that came out of Google I/O was the integration of Google Lens and Search, which can now power an impressive range of user experiences. In essence, it’s yet another example of Google making ML an implicit part of their core user experiences. Some highlights:
- On-device model is only 100 KB.
- Real-time user experiences, including: AR objects from search, restaurant bill scanning and calculation, and more.
Over the past year, we’ve seen a lot of really interesting ML research coming out of Facebook (some of which we’ll look at below), but one big release from the PyTorch team is particularly important when it comes to on-device ML.
The release of PyTorch 1.3 included experimental support for a number of mobile-focused features: a deployment pipeline, model quantization, and several front end improvements (like named tensors) that will lead to cleaner code. The experimental build of the end-to-end deployment flow works with Python, with support for both iOS and Android. There will certainly be some kinks to work out in the early days, but it’s highly encouraging that PyTorch is actively working on a solution to help developers deploy their models to mobile.
Part 2: The Technology
In 2019, we saw a whole bunch of incredible advancements in the tech geared toward mobile and edge machine learning. From dedicated AI hardware and the evolution of Swift for TensorFlow to an ever-increasing body of research around neural network compression, the trends in mobile/edge ML tech promise that in 2020 (and beyond), we’ll see machine and deep learning moving closer and closer to the edge.
Solving the problem of performing inference on resource-constrained devices necessitates a variety of solutions. One of the most pressing challenges is developing hardware capable of making predictions on these edge devices. Luckily, many of the industry’s biggest names are tackling this challenge head on. Here’s a quick look at some of the primary advances in AI-accelerated hardware from 2019.
In 2019, NVIDIA was at the forefront of developing hardware for AI at the edge. Their edge-dedicated Jetson platform saw an array of improvements and developments over the past year, including:
- Jetson Nano: In March, NVIDIA unveiled the Nano, the smallest edge AI device in the Jetson family of devices. This release came on the heels of Google announcing its own Edge TPU accelerators. The Jetson Nano features a 128-core Maxwell GPU and a quad-core ARM Cortex CPU. The Nano starts at just $99 with a “production ready” unit available for $129.
- Jetson Xavier NX: The latest module in the Jetson line hasn’t been released yet (availability is expected in March 2020), but its unveiling in November included some pretty impressive specs: it’s the size of a credit card and can achieve 21 TOPS while consuming as little as 10 watts of power.
- EGX platform: In May, NVIDIA announced EGX, their dedicated platform for edge computing. It includes a wide range of offerings: a cloud-native GPU Operator for automated deployment, a range of GPU-accelerated software, and an architecture that is optimized for GPU workloads.
When it comes to neural processing units (NPUs — the processors that power ML on smartphones), Arm is one of the major players. Their team and leadership have openly discussed their intentions to take a bigger piece of the mobile chip market, and their advances in 2019 do nothing to push back against that claim. Here are a few notable releases:
- Cortex-A77: This CPU was released less than a year after its predecessor, the Cortex-A76 (which powered Samsung’s Galaxy S10 phones). Compared to the A76, the A77 provides 20% higher single-thread processing power and can perform floating point ops 35% faster.
- Mali-G77: This is Arm’s most recent mobile-first GPU, and benchmarks have shown no less than 40% higher peak performance than Arm’s previous GPU generation (Mali-G76), while consuming 30% less power.
- Arm ML Processor (2nd gen): Born from Arm’s “Project Trillium”, the 2nd iteration of their AI-dedicated processor allows manufacturing partners to include up to 8 processing cores in a single chip, for a max speed of 32 TOPS, i.e. 32 trillion operations per second (~8x increase from previous gen).
Qualcomm is regularly at the bleeding edge of AI-accelerating hardware, and 2019 was no different. Their Snapdragon chip series is built for mobile processing, and we’re starting to see performance benchmarks that rival Apple’s A13 Bionic chip, which has been the best in class for a while now.
- Snapdragon 865: The latest mobile-first engine from Qualcomm. The 865 was unveiled in December and is purportedly built for 5G smartphones. Qualcomm is claiming a 25% performance boost in core compute and a 4x increase in TOPS (as well as higher efficiency). Perhaps unsurprisingly, the 865 includes a quad-core Cortex-A77 cluster from Arm, as well as a block of Cortex-A55s (lower-power Arm cores).
- QCS400: In its effort to speed up the evolution of smart audio systems, Qualcomm released their new QCS400 SoC Series in March. It supports a smarter UI, AI-based local automatic speech recognition, AI acceleration for lower-latency audio distribution, and more. Qualcomm also released their Smart Audio Platform 400 to accompany the new SoC, making it easier for smart audio and voice assistant developers to work with the new chipset.
Of course, those mentioned above (plus Apple’s A13 Bionic and Google’s Edge TPU, which we discussed previously) are the primary players in the AI-first hardware marketplace. But there are a number of up-and-coming hardware companies that are attempting to capitalize on the momentum behind edge AI.
- Hailo: Hailo-8 is purportedly “smaller than a penny” and is capable of 26 TOPS. And though this AI-dedicated chip isn’t generally available yet, internal benchmarks claim that it outperforms NVIDIA’s Xavier AGX (nothing on how it compares to the NX yet), running inference at a faster FPS while using markedly less power (1.67 total watts to NVIDIA’s 32).
- Xnor.ai: In February, edge computing firm Xnor.ai announced that it’s made a solar power-enabled device that can deploy SOTA deep learning architectures without requiring any other power source/battery. It comes equipped with a camera mounted to a field programmable gate array (FPGA). Until recently, Xnor.ai had powered the Wyze home camera system, which featured on-device person detection.
- AIStorm: Also in February, AIStorm raised a little more than $13 million in a Series A round for its own edge AI chips. In terms of performance, AIStorm claims their chip can deliver 10 TOPS per watt, which would mean significantly lower power usage than the average GPU-based system. One unique feature of AIStorm’s solution is the use of what’s called “switched charge processing”, which allows the chip to ingest and process data without first digitizing it.
- Flex Logix: In April, Flex Logix launched the InferX X1, a coprocessor built for edge AI inference. Like the others, the claimed performance is impressive, with the chip delivering 10x the throughput of other edge inference chips (the source of this comparison remains unclear). The InferX X1 also works particularly well with low batch sizes, which are necessary in edge apps where you’re typically working with one camera or sensor.
Federated Learning (FL) was introduced in research from Google in 2017, but in 2019 we saw some very important advances in this unique training approach. In Google’s own words, FL is “a distributed machine learning approach which enables model training on a large corpus of decentralized data.”
This decentralized data often comes from a network of mobile devices, meaning that model training doesn’t have to happen on a single machine or datacenter. As with many centralized systems, traditional single-machine training systems threaten user privacy, particularly on mobile devices. In a centralized training system for mobile models, users have to agree to send their data to these centralized systems (often the cloud) owned by AI companies.
FL enables mobile phones spread across a range of geographical areas to collaboratively and privately learn ML models. So in theory, FL can embed higher-performing ML models on-device with increased data privacy. A win-win for mobile ML, certainly.
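To give a rough sense of how this collaborative learning can work, here’s a minimal sketch of the aggregation step used in federated averaging, written in plain Python. It isn’t taken from any particular FL library; the function name `federated_average` and the toy client data are assumptions for illustration. Each device trains locally and sends back only its model weights (never its raw data), and the server combines them, giving more influence to devices that trained on more examples.

```python
def federated_average(client_updates):
    """Combine locally trained weights into one global model.

    client_updates: list of (weights, num_examples) pairs, where `weights`
    is a flat list of floats from one device. This structure is a
    simplification assumed for this sketch.
    """
    total_examples = sum(n for _, n in client_updates)
    num_weights = len(client_updates[0][0])
    return [
        # Weighted mean of the i-th weight across all clients.
        sum(weights[i] * n for weights, n in client_updates) / total_examples
        for i in range(num_weights)
    ]


# Two phones: one trained on 10 local examples, the other on 30.
client_updates = [
    ([1.0, 2.0], 10),
    ([3.0, 4.0], 30),
]
global_weights = federated_average(client_updates)
```

Real systems layer a lot on top of this (client sampling, secure aggregation, multiple training rounds), but the core privacy benefit is already visible: the server only ever sees weights, not user data.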
In 2019, we saw a huge development in FL, as Google released their first production-level FL platform within TensorFlow:
- TensorFlow Federated (TFF): TFF is an open source implementation of FL. This implementation allows developers to express ML architectures of their choice using a dedicated TFF API.
Since this announcement in March at the TensorFlow Dev Summit, we’ve seen an increase in research efforts centered on FL.
Swift for TensorFlow
For years, Python has been the dominant language of the machine learning world. However, that’s changing, as the advent of Swift for TensorFlow has exposed Swift to the ML world.
Our own ML engineer and lifelong Python-er Jameson Toole had the following to say about Swift as a machine learning language:
The first inklings of TensorFlow’s support for Swift came at their Dev Summit in 2018, and at the 2019 event, TensorFlow announced further support, including:
- Support for Swift for TensorFlow in Google Colab notebooks.
- Ability to seamlessly load and mix C and Python code with Swift.
- A partnership with fast.ai to help develop new units and course materials focused on using Swift for machine learning in the TensorFlow ecosystem.
Multi-Level Intermediate Representation
One of the biggest challenges in machine learning is how difficult it is to optimize models for multiple platforms. Models are being deployed from the tiniest microcontrollers to the largest TPU clusters. Getting things running quickly and efficiently on every chipset and platform is a tough job. To help, the TensorFlow team released MLIR, a new intermediate representation and compiler framework designed to take a single TensorFlow graph and generate compiled code optimized for any hardware. The project is just in its infancy, but we think it will be incredibly important down the road.
The TensorFlow team did a nice job of explaining on their blog who specifically stands to benefit:
- Compiler researchers and implementers looking to optimize performance and memory consumption of machine learning models
- Hardware makers looking for a way to connect their hardware to TensorFlow, such as TPUs, portable neural hardware in phones, and other custom ASICs
- People writing language bindings that want to take advantage of optimizing compilers and hardware acceleration.
Neural Network Compression, Distillation, and Optimization
As software and hardware dedicated to ML at the edge expand, so too does the amount of research we see in optimizing ML models for deployment on those edge devices. Our own Jameson Toole wrote a thoughtful piece addressing the need for more research in these fields, which I’d encourage you to check out:
While the meat of the research is quite a bit over my head, there are a few basic areas in which we’ve seen significant progress over the past year:
- Quantization
- Model Pruning
- Knowledge Distillation
Neural network weights have traditionally been stored as 64- or 32-bit floating point numbers. Quantization is a process that maps each of these floating point weights to an integer with a lower bit width. So, for example, if we quantize a model from 32-bit floating point weights to 8-bit integers, we reduce the storage needed for each weight by a factor of 4. In turn, this reduces the overall size of our model, and an increasing body of research suggests that there is often little-to-no effect on accuracy. Quantization is now baked into both Core ML and TensorFlow Lite, and it should continue to be a key part of model compression efforts moving forward.
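Here’s a minimal sketch of that 32-bit to 8-bit mapping in plain Python. It uses a simple linear (affine) scheme, which is only one of several ways to quantize and isn’t meant to mirror what Core ML or TensorFlow Lite do internally; the function names and the toy weights are assumptions for illustration.

```python
import struct


def quantize_uint8(weights):
    """Map float weights onto the 256 levels of an unsigned 8-bit integer."""
    lo, hi = min(weights), max(weights)
    # Step size between adjacent integer levels; fall back to 1.0 if all
    # weights are identical so we never divide by zero.
    scale = (hi - lo) / 255 or 1.0
    quantized = [round((w - lo) / scale) for w in weights]
    return quantized, scale, lo


def dequantize(quantized, scale, lo):
    """Recover approximate float weights from the stored integers."""
    return [q * scale + lo for q in quantized]


weights = [-1.2, -0.4, 0.0, 0.31, 0.75, 1.5]
quantized, scale, lo = quantize_uint8(weights)
restored = dequantize(quantized, scale, lo)

# Storage: 4 bytes per float32 weight vs. 1 byte per 8-bit weight.
float32_bytes = len(struct.pack(f"{len(weights)}f", *weights))
uint8_bytes = len(bytes(quantized))
compression = float32_bytes / uint8_bytes

# The price we pay: each weight is off by at most half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

For this toy example the weights shrink from 24 bytes to 6, and the worst-case rounding error stays within half of `scale`. Note that the scale and offset also need to be stored alongside the integers, which is negligible for real models with millions of weights.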
In 2019, there were a number of research efforts around quantization, but none bigger than some very important model compression research from Facebook AI:
With almost all neural networks, we can assume that many or even most of their weights aren’t particularly useful in optimizing network performance. One method, then, of compressing neural networks to fit on the edge is through model pruning—i.e. removing unnecessary values in the weight tensor. This can be done at the weight level, layer level, or block level.
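As a rough illustration of weight-level pruning, here’s a minimal sketch in plain Python. It uses magnitude as the criterion for “unnecessary” (zeroing out the weights closest to zero), which is one common heuristic among many and is my assumption here, not a method drawn from any specific paper; the function name `magnitude_prune` is hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


weights = [0.9, -0.02, 0.4, 0.01, -0.7, 0.05, 0.3, -0.001]
pruned = magnitude_prune(weights, sparsity=0.5)
```

Zeroed weights only save space or compute if the storage format or runtime can take advantage of the sparsity, which is part of why pruning at the layer or block level can be attractive on-device.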
2019 saw a lot of excellent research around pruning in a bunch of different contexts and for numerous tasks. One of our contributors Derrick Mwiti wrote this excellent research guide, which covers many of the important research papers centered on model pruning over the past year:
Knowledge distillation is a model compression technique in which a smaller “student” neural network is essentially taught by a larger, already-trained neural network. This kind of learning is another important optimization method that helps fit powerful neural networks on-device, helping them shrink in size while maintaining accuracy.
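To sketch what “taught by” can mean in practice, here’s a minimal plain-Python example of one common formulation: the student is penalized for diverging from the teacher’s softened output probabilities. The temperature value, the toy logits, and the function names are all assumptions for illustration, and real training setups typically combine this term with a standard loss on the true labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature = softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the teacher's soft targets to the student's outputs."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )
    # Scaling by T^2 keeps the loss magnitude comparable across temperatures.
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.2]
good_student = [3.5, 1.2, 0.1]   # roughly agrees with the teacher
bad_student = [0.1, 3.0, 1.0]    # prefers a different class

loss_good = distillation_loss(good_student, teacher)
loss_bad = distillation_loss(bad_student, teacher)
```

The student that mirrors the teacher’s relative confidence across classes gets the lower loss, so minimizing this quantity nudges the small network toward the large one’s behavior, not just its top prediction.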
As with model pruning, Derrick Mwiti also wrote a research guide on knowledge distillation, with many of the papers covered published and/or presented in 2019:
One particular area where there’s been a lot of research around knowledge distillation is in language models and transformers. Take, for example, DistilBERT from our friends at Hugging Face (paper included in Derrick’s review). Their accompanying blog post is also an excellent resource for better understanding how and why knowledge distillation works:
One last news story
To conclude, I wanted to add an industry tidbit that we think is particularly relevant but didn’t quite fit into either of the two sections above.
The Data Privacy Debate
One of the hottest of hot-button issues in the machine learning world—and in several cases the mobile ML world—centered on data privacy. From reckoning with facial recognition’s invasive evolution to continued debates about working with highly sensitive data (medical data, for instance), users are beginning to care more and more about how machine learning applications (and the companies behind them) are using their data.
I wrote about this controversy in more detail here, but the takeaway from that article, at its core, is simple:
And that’s one of the core arguments behind embedding ML models on-device, and a primary reason why we believe the future of machine learning will be on-device.
*Takes deep breath*
Wow. It’s been a busy year in mobile and edge AI, and I honestly didn’t expect to cover this much ground when I started writing this. I know there are things I didn’t cover here (TensorFlow.js + React Native, burgeoning use cases like cashier-less stores and beautification, etc.), but hopefully this is a helpful place to start as we head into a new decade and a new frontier of ubiquitous on-device ML.
As always, drop a note if you have any industry updates to add, perspective to share, etc. We’d love to hear from you.