MACE: Deep learning optimized for mobile and edge devices

As edge computing matures, the demand for machine learning on mobile and edge devices is increasing rapidly.

ML-enabled services such as recommendation engines, image and speech recognition, and natural language processing on the edge (to name a few) are growing, as is the need for processing large amounts of data with reduced latency.

To support real-time inference when connectivity is unreliable, when latency matters, or when security and privacy are a concern, machine learning frameworks optimized for mobile and edge devices are proving to be a lifeline.

In recent years, big players such as Google, Apple, and Facebook have all released frameworks that aim to bridge the gap between training a model in the cloud and deploying it for inference on mobile devices.

Examples of such frameworks, whose purpose is to integrate machine learning models with mobile applications, include Core ML, PyTorch Mobile, TensorFlow Lite, and Firebase’s ML Kit.

Not that long ago, Chinese tech giant Xiaomi, known for its budget smartphones and edge devices, particularly in the Asian market, also jumped on the bandwagon by supporting AI integration in smartphones with its open source deep learning framework, MACE (Mobile AI Compute Engine).

Introduction

Xiaomi initially unveiled MACE back in December 2017 as an open source project to support developers in their endeavors to add AI-based features to their apps. Designed to optimize built-in chip accelerators to support AI-based tasks (in particular photography), MACE also provides support for micro-controllers, primarily used in IoT (Internet of Things) devices with low power consumption that perform edge computing.

In particular, MACE addresses the following issues:

Greater compatibility with edge computing by being able to extend support to multiple core architectures, thus maximizing performance.
Decreased loss of performance that arises when switching between different systems.
Additional support for CMake, an open source, compiler-independent tool for managing the software build process.

MACE is released under the Apache License 2.0 and draws inspiration from several open source AI projects, in particular the Qualcomm Hexagon NN Offload Framework, TensorFlow, Caffe, ONNX, and the ARM ComputeLibrary.

Architecture

The MACE documentation showcases the basic architecture that lies underneath the inference engine. The MACE model is defined as a customized model format, similar to Caffe2. The model can be converted from exported models by TensorFlow, Caffe, or ONNX.

The MACE Model Zoo is an open source project that hosts models commonly used in everyday AI tasks, such as ResNet, MobileNet, FastStyleTransfer, and Inception. The repository contains several common neural networks and models, which are built and benchmarked against a list of mobile phones.

The MACE interpreter parses the neural network and manages the tensors in the graph, while the CPU, GPU, and DSP runtimes correspond to the ops implemented for the different devices.

With MACE, you can either build and run a model provided in the Model Zoo, or use your own already-trained model. The process flow that occurs within the MACE infrastructure is as follows:

Write a model deployment configuration file (.yml) that describes the model and the library. MACE builds the library based on this file.
Use the .yml file to build the MACE libraries. The build can be dynamic or static.
Convert a TensorFlow, Caffe, or ONNX model to a MACE model.
Integrate the MACE library into your application and run the app with the MACE API.
After deploying the application, run it from the command line. MACE provides a mace_run command line tool, which can be used to run a model and validate its correctness against the original TensorFlow or Caffe results.
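The validation mentioned in the last step can be pictured as running the same input through the original and the converted model and comparing the two outputs. The sketch below uses cosine similarity for that comparison. The metric, the 0.99 threshold, and the function names are illustrative assumptions, not a description of what mace_run does internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized output vectors."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat two all-zero outputs as identical, otherwise as unrelated.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)

def outputs_match(reference, converted, threshold=0.99):
    """Hypothetical pass/fail check; the threshold is an assumed value."""
    return cosine_similarity(reference, converted) >= threshold

# Made-up class probabilities from the original and the converted model.
reference_output = [0.10, 0.70, 0.20]
converted_output = [0.11, 0.69, 0.20]
print(outputs_match(reference_output, converted_output))
```

A scale-invariant metric such as this one tolerates small numeric drift from a different runtime while still flagging outputs that point in a different direction.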

Design Decisions

Like other deep learning frameworks for mobile devices, MACE was designed to be specifically optimized for the characteristics of such devices, thereby ensuring that offline deep learning applications are able to perform more smoothly.

High speed

Deep neural networks such as CNNs take GPUs days of compute time to train on large datasets. Several applications of such networks require that inference be performed with little to no delay.

For example, pedestrian detection for self-driving cars necessitates very low latency. On the other hand, image recognition for mobile phones is constrained by limited processing resources.

The success of convolutional neural networks in these situations is limited by how fast we can compute them. MACE optimizes runtime with NEON, OpenCL, and Hexagon.

The initialization is also optimized to be faster. At the bottom of the framework, NEON instruction-level optimization was implemented for ARM CPUs, and efficient OpenCL kernel code was implemented for mobile GPUs.

For Qualcomm DSPs, the nnlib library is integrated for HVX acceleration. In addition, the Winograd algorithm is used to accelerate convolution. It computes the convolution over small tiles with a minimal number of multiplications, which makes it fast for small filters and small batch sizes.
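To make the Winograd idea concrete, here is a minimal one-dimensional sketch of the F(2,3) case: two outputs of a 3-tap filter computed with four multiplications instead of six. It illustrates the general technique only. The tile size and the function names are assumptions, and the source does not say which Winograd variants MACE implements.

```python
def direct_conv(d, g):
    """Two outputs of a 3-tap filter over four inputs: 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]

def winograd_f23(d, g):
    """The same two outputs with 4 multiplications (Winograd F(2,3))."""
    # Filter transform: computed once per filter and reusable across tiles.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Four element-wise multiplications on the transformed input.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]

tile = [1.0, 2.0, 3.0, 4.0]
taps = [0.5, 1.0, -1.0]
print(direct_conv(tile, taps))
print(winograd_f23(tile, taps))
```

Both functions print the same pair of values. The saving comes from trading multiplications for cheaper additions, and it grows when the same transformed filter is reused over many tiles.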

Power consumption

Energy savings and limiting power consumption are paramount when carrying out deep learning tasks on mobile and edge devices. MACE combines high performance and low power consumption by utilizing chip-dependent power options like big.LITTLE scheduling.

The big.LITTLE architecture in ARM chips is specifically designed to address the conflicting requirements of powerful performance and energy-efficient processing. A big.LITTLE processor works by assigning each task to a core according to the nature of the task.

If a task consumes a proportionally high share of CPU resources, the scheduler assigns it to a big core; otherwise, the task is assigned to a LITTLE core. Likewise, different power performance options are offered for Adreno GPUs.
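The assignment rule described above can be sketched as a toy function. This is a deliberately simplified illustration: the Task type, the load estimate, and the 0.5 threshold are all assumptions, and real schedulers live in the operating system and use much richer signals.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    cpu_load: float  # estimated share of one core, between 0.0 and 1.0

def assign_cluster(task, threshold=0.5):
    """Pick a cluster for a task; the 0.5 threshold is an assumed value."""
    return "big" if task.cpu_load >= threshold else "LITTLE"

tasks = [Task("conv2d inference", 0.9), Task("sensor polling", 0.1)]
for task in tasks:
    print(task.name, "->", assign_cluster(task))
```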

Responsiveness

Sometimes the system UI must remain responsive while a model is running. MACE introduces mechanisms such as automatically breaking an OpenCL kernel into small units to allow better preemption by the UI rendering task.

In GPU computing mode, the bottom layer of the framework splits the OpenCL kernel adaptively so that GPU rendering tasks can preempt it more easily, which keeps the system fluid.
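The splitting idea can be sketched as dividing one large launch into chunks whose estimated run time stays within a time budget. The function name, the throughput estimate, and the budget are hypothetical; the source does not describe how MACE chooses its split sizes.

```python
def split_kernel(total_work_items, items_per_ms, budget_ms):
    """Split one large launch into (offset, count) chunks.

    Each chunk is sized so that its estimated run time stays within
    budget_ms, leaving gaps in which UI rendering can be scheduled.
    """
    if total_work_items < 0 or items_per_ms <= 0 or budget_ms <= 0:
        raise ValueError("work size must be >= 0; rate and budget must be > 0")
    chunk = max(1, int(items_per_ms * budget_ms))
    return [
        (offset, min(chunk, total_work_items - offset))
        for offset in range(0, total_work_items, chunk)
    ]

# 10,000 work items at an assumed 2,000 items/ms with a 2 ms budget.
for offset, count in split_kernel(10_000, 2_000, 2):
    print(f"launch work items {offset}..{offset + count - 1}")
```

Smaller chunks improve responsiveness at the cost of more launch overhead, so the budget is a trade-off rather than a fixed constant.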

Memory usage

MACE supports graph-level memory allocation optimization and buffer reuse. This kind of memory reuse technology was introduced through a dependency analysis of the model’s operators, which greatly reduced the memory footprint. The core library attempts to minimize external dependencies in order to keep the library footprint small.
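A minimal sketch of buffer reuse driven by dependency analysis follows. It records the last op that reads each tensor, then greedily hands freed buffers to later outputs. This greedy policy and the tuple-based graph description are assumptions made for illustration, not MACE's actual allocator.

```python
def plan_buffers(ops):
    """Assign each op's output tensor to a reusable buffer.

    ops is a list of (output_name, size_in_bytes, input_names) tuples in
    execution order. Returns (tensor -> buffer index, list of buffer sizes).
    """
    # Dependency analysis: index of the last op that reads each tensor.
    last_use = {}
    for index, (_, _, inputs) in enumerate(ops):
        for name in inputs:
            last_use[name] = index

    buffer_sizes = []   # size of every buffer allocated so far
    free = []           # indices of buffers that hold no live tensor
    assignment = {}     # tensor name -> buffer index

    for index, (output, size, inputs) in enumerate(ops):
        # Reuse the smallest free buffer that is large enough, if any.
        fitting = [b for b in free if buffer_sizes[b] >= size]
        if fitting:
            chosen = min(fitting, key=lambda b: buffer_sizes[b])
            free.remove(chosen)
        else:
            chosen = len(buffer_sizes)
            buffer_sizes.append(size)
        assignment[output] = chosen

        # Inputs read for the last time by this op release their buffers.
        for name in dict.fromkeys(inputs):
            if name in assignment and last_use[name] == index:
                free.append(assignment[name])

    return assignment, buffer_sizes

# A made-up five-op chain; sizes are in bytes.
ops = [
    ("conv1", 400, ["input"]),
    ("relu1", 400, ["conv1"]),
    ("conv2", 200, ["relu1"]),
    ("relu2", 200, ["conv2"]),
    ("fc", 10, ["relu2"]),
]
assignment, buffer_sizes = plan_buffers(ops)
print(assignment)
print(sum(buffer_sizes), "bytes instead of", sum(size for _, size, _ in ops))
```

In this chain only two tensors are ever live at the same time, so two buffers are enough for all five outputs.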

Model protection

Model protection has been a top priority of the MACE framework since the beginning of its design. The developers use various techniques to this end, such as converting models to C++ code, which makes reverse engineering considerably harder, and literal obfuscation.

Diverse platform coverage

Besides its other advantages, MACE offers good coverage of recent chips from Qualcomm, MediaTek, and Pinecone, as well as other ARM-based chips. The CPU runtime supports Android, iOS, and Linux. MACE also supports multiple model formats, including TensorFlow, Caffe, and ONNX.

Usage

The MACE documentation provides examples using pre-trained models for carrying out inference tasks with MACE. One can either build and run a pre-existing model provided in the MACE Model Zoo or alternatively, work with their own already-trained model. Here’s a test case for building on the MobileNet-v2 model from the Zoo.

Pull the MACE project.
Pull the MACE Model Zoo project.
Build a generic MACE library.
Convert the pre-trained MobileNet-v2 model to a MACE format model.
Run the model.

Deploy your model into applications

You can run a model on a CPU, GPU, or DSP (based on the runtime specified in your model deployment file). However, the devices differ in a few respects.

CPU: Almost all mobile SoCs use an ARM-based CPU architecture, so in theory your model can run on different SoCs.
GPU: Although most GPUs use the OpenCL standard, some SoCs aren’t fully compliant with it, and some GPUs are too low-end to be usable. You should therefore have a fallback strategy for when a GPU run fails.
DSP: MACE supports only Qualcomm DSPs, and you need to push the Hexagon NN library to the device.
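One possible shape for such a fallback strategy is to try devices in order of preference and move on when a run fails. The runner functions below are hypothetical stand-ins for real inference calls, and the error type is an assumption. This is not the MACE API.

```python
def run_with_fallback(model_input, runners):
    """Try each (device_name, run_function) pair in order.

    Returns (device_name, result) from the first runner that succeeds.
    """
    errors = []
    for device, run in runners:
        try:
            return device, run(model_input)
        except RuntimeError as exc:
            # Record the failure and fall through to the next device.
            errors.append(f"{device}: {exc}")
    raise RuntimeError("all devices failed: " + "; ".join(errors))

# Hypothetical stand-ins for real inference calls.
def run_on_gpu(model_input):
    raise RuntimeError("OpenCL is not available on this device")

def run_on_cpu(model_input):
    return [value * 2 for value in model_input]

device, result = run_with_fallback(
    [1, 2, 3], [("GPU", run_on_gpu), ("CPU", run_on_cpu)]
)
print(device, result)
```

Collecting the per-device error messages keeps the final failure informative when no device can run the model.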

Performance (Benchmarks)

As illustrated in the table above, MACE has quite a few benefits to offer as a deep learning inference engine optimized for mobile devices, most notably speed and security.

A benchmark test of three Android-compatible ML frameworks, namely TFLite, MACE, and SNPE (Snapdragon Neural Processing Engine), shows the average inference latency on a MobileNet-v1 model. Note that TFLite GPU employs OpenGL for the widest coverage with reasonable performance.

MACE and SNPE employ OpenCL and may outperform TFLite GPU on some mobile devices shipped with OpenCL. Since OpenCL is not a part of the standard Android distribution, apps using those frameworks may not be able to guarantee their inference performance. For example, Google Pixel devices do not support OpenCL. Also, ARM Mali GPUs are incompatible with SNPE.
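For readers who want to reproduce this kind of measurement, average inference latency is typically taken as the mean wall-clock time over repeated runs after a few untimed warm-up runs. The sketch below shows that pattern. The function names, the warm-up count, and the run count are assumptions, and fake_infer is a placeholder rather than a real model.

```python
import time

def average_latency_ms(infer, sample, warmup=3, runs=20):
    """Mean wall-clock time of infer(sample) in milliseconds."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    for _ in range(warmup):
        infer(sample)  # warm-up runs are not timed
    start = time.perf_counter()
    for _ in range(runs):
        infer(sample)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs

# Hypothetical stand-in for a model's inference call.
def fake_infer(sample):
    return sum(x * x for x in sample)

print(f"{average_latency_ms(fake_infer, list(range(1000))):.3f} ms")
```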

Conclusion

MACE is a highly optimized deep learning compute engine targeted specifically at mobile devices, and it offers a number of benefits. It’s embedded in many MIUI applications and acts as a generic feature rather than a standalone library deployed for AI-related tasks.

The MIUI 12 update boasts several MACE-powered features, in particular the camera, whose AI mode consumes a significantly lower percentage of battery than in previous versions.

“Xiaomi’s AutoML model now leads the industry by dataset performance; and MiNLP, the company’s natural language processing platform, is activated over 6 billion times on a daily basis, making Xiao AI one of the world’s busiest AI platform,” said Cui Baoqiu, Xiaomi’s Head of AI, referring to the Xiao AI 3.0, the built-in voice assistant for Mi devices.

Xiaomi has also made the open source MACE-Kit publicly available for AI development enthusiasts. MACE-Kit provides APIs for services such as face detection and recognition, person segmentation, and speech recognition.

Developers can use these APIs to deploy all sorts of models and applications with ease.