Comparing MobileNet Models in TensorFlow

In recent years, neural networks and deep learning have sparked tremendous progress in the field of natural language processing (NLP) and computer vision.

While many of the face, object, landmark, logo, and text recognition and detection technologies are provided for Internet-connected devices, we believe that the ever-increasing computational power of mobile devices can enable the delivery of these technologies into the hands of users anytime, anywhere, regardless of Internet connection.

However, computer vision for on-device and embedded applications faces many challenges — models must run quickly with high accuracy in a resource-constrained environment, making use of limited computation, power, and space.

TensorFlow offers various pre-trained models, such as drag-and-drop models, in order to identify approximately 1,000 default objects.

When compared with other similar models, such as the Inception model datasets, MobileNet works better with latency, size, and accuracy. In terms of output performance, there is a significant amount of lag with a full-fledged model.

However, the trade-off is acceptable when the model is deployable on a mobile device for real-time offline detection.

Let’s look at an example of how to use MobileNet. We will write a simple classifier to find Pikachu in an image. The following are sample pictures showing an image of Pikachu and an image without Pikachu:

Building the dataset

To build our own classifier, we need to have datasets that contain images with and without Pikachu.

Let’s start with 1,000 images on each database. You can pull such images here.

Next up, let’s create two folders named pikachu and no-pikachu and drop those images accordingly.

Another handy dataset containing images for all the generation one Pokémon can be found here.

Now we have an image folder, which is structured as follows:


Retraining Images

We can now start labeling our images. With TensorFlow, this job becomes easier. Assuming that TensorFlow is installed on the training machine already, download the following retraining script:


Next up, we’ll retrain the image with this Python script:

python \
–image_dir ~/MLmobileapps/Chapter5/dataset/ \
–learning_rate=0.0001 \
–testing_percentage=20 \
–validation_percentage=20 \
–train_batch_size=32 \
–validation_batch_size=-1 \
–eval_step_interval=100 \
–how_many_training_steps=1000 \
–flip_left_right=True \
–random_scale=30 \
–random_brightness=30 \
–architecture mobilenet_1.0_224 \
–output_graph=output_graph.pb \

Model conversion from GraphDef to TFLite

TOCO Converter is used to convert from a TensorFlow GraphDef file or SavedModel into either a TFLite FlatBuffer or graph visualization.
(TOCO stands for TensorFlow Lite Optimizing Converter.)

We need to pass the data through command-line arguments. There are a few command-line arguments that can be passed in while converting the model:

–output_file OUTPUT_FILE
Filepath of the output tflite model.–graph_def_file GRAPH_DEF_FILE
Filepath of input TensorFlow GraphDef.–saved_model_dir
Filepath of directory containing the SavedModel.–keras_model_file
Filepath of HDF5 file containing tf.Keras model.–output_format {TFLITE,GRAPHVIZ_DOT}
Output file format.–inference_type {FLOAT,QUANTIZED_UINT8}
Target data type in the output–inference_input_type {FLOAT,QUANTIZED_UINT8}
Target data type of real-number input arrays.–input_arrays INPUT_ARRAYS
Names of the input arrays, comma-separated.–input_shapes INPUT_SHAPES
Shapes corresponding to –input_arrays, colon-separated.–output_arrays OUTPUT_ARRAYS
Names of the output arrays, comma-separated.

We can now use the TOCO tool to convert the TensorFlow model into a TensorFlow Lite model:

toco \

Similarly, we can use the MobileNet model in similar applications; for example, in the next section, we’ll be looking at a gender model and an emotion model.

Gender Model

This model uses the IMDB WIKI dataset, which contains 500k+ celebrity faces. It uses the MobileNet_V1_224_0.5 version of MobileNet.

It is very rare to find public datasets with thousands of images. This dataset is built on top of a large collection of celebrity faces. There are two common places: one is IMDb and the other one is Wikipedia. More than 100K celebrities’ details were retrieved from their profiles from both sources through scripts.

Then it was organized by removing noise (irrelevant content). All the images without a timestamp were removed, assuming that images with a single photo are likely to show the person with correct birth date details. At the end, there were 460,723 faces from 20,284 celebrities from IMDb, and 62,328 from Wikipedia, for a total of 523,051.

Emotion model

This is built on the AffectNet model with more than 1 million images. It uses the MobileNet_V2_224_1.4 version of MobileNet.

The link to the data model project can be found here:

The AffectNet model is built by collecting and annotating facial images of more than 1 million faces from the Internet. The images were sourced from three search engines, using around 1,250 related keywords in six different languages.

Among the collected images, half of the images were manually annotated for the presence of seven discrete facial expressions (categorical model) and the intensity of valence and arousal (dimensional model).

Comparison of MobileNet Versions

In both of the above models, different versions of MobileNet models are used. MobileNet V2 is mostly an updated version of V1 that makes it even more efficient and powerful in terms of performance.

MACs are multiply-accumulate operations, which measure how many calculations are needed to perform inference on a single 224×224 RGB image.

From the number of MACs alone, V2 should be almost twice as fast as V1. However, it’s not just about the number of calculations. On mobile devices, memory access is much slower than computation. V2 only has 80% of the parameter count that V1 has hence making it better than V1.

By seeing the results we can assume that V2 is almost twice as fast as V1 model. On a mobile device when memory access is limited, the computational capability of V2 works very well.

In terms of accuracy:

Here MobileNet V2 is slightly, if not significantly, better than V1.


MobileNets are a family of mobile-first computer vision models for TensorFlow, designed to effectively maximize accuracy while being mindful of the restricted resources for an on-device or embedded application.

MobileNets are small, low-latency, low-power models parameterized to meet the resource constraints of a variety of use cases. They can be built upon for classification, detection, embeddings, and segmentation, similar to how other popular large scale models, such as Inception, are used.

If you want to go ahead and fuel your curiosity, a bunch of pre trained models can be found here :

Also, here’s a blog post outlining how you can build a real like Pokémon classifier using MobileNets and TensorFlow Lite:

Thanks for reading! If you enjoyed this story, please click the 👏 button and share to help others find it! Feel free to leave a comment 💬 below.

Have feedback? Let’s connect on Twitter.

If you found this article interesting, you can explore Machine Learning Projects for Mobile Applications. Written by Karthikeyan MG, an ML expert, Machine Learning Projects for Mobile Applications presents the implementation of 7 practical, real-world projects that will teach you how to leverage TensorFlow Lite and Core ML to perform efficient machine learning on a cross-platform mobile OS.

Want to start building amazing Android Apps? Check out my course on Coding Bocks:

Avatar photo


Our team has been at the forefront of Artificial Intelligence and Machine Learning research for more than 15 years and we're using our collective intelligence to help others learn, understand and grow using these new technologies in ethical and sustainable ways.

Comments 0 Responses

Leave a Reply

Your email address will not be published. Required fields are marked *