Advancements in Apple’s Vision Framework

Exploring the new goodies that pushed the Vision framework forward by leaps and bounds

In 2019 Apple introduced some really exciting features and improvements to its Vision framework. Through these changes, Apple showed us that alongside on-device machine learning, computer vision is an equally important part of their arsenal for mobile developers looking to build smart and intelligent applications.

For readers new to the Vision framework, it aims to provide a high-level API for complex computer vision algorithms, as well as act as a catalyst for Core ML models.

Let’s take a closer look at the enhancements introduced in the Vision framework over the past year. I got a chance to implement some of the new goodies that were introduced at WWDC 2019, so you’ll find relevant links to them attached in each section as you walk through this piece.

Vision’s Own Document Camera

A small new framework named VisionKit was introduced with iOS 13, which allows us to use the system’s document camera. By using the new VNDocumentCameraViewController, we can scan documents just like the “Notes” and “Files” apps do. And we can get the results back in our view controller classes using the callbacks received from the VNDocumentCameraViewControllerDelegate protocol methods.
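A minimal sketch of wiring this up might look like the following (class and method names other than the VisionKit APIs are illustrative):

```swift
import UIKit
import VisionKit

class ScannerViewController: UIViewController, VNDocumentCameraViewControllerDelegate {

    // Present the system document camera (illustrative trigger).
    func presentDocumentCamera() {
        let documentCamera = VNDocumentCameraViewController()
        documentCamera.delegate = self
        present(documentCamera, animated: true)
    }

    // Called when the user finishes scanning one or more pages.
    func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                      didFinishWith scan: VNDocumentCameraScan) {
        for pageIndex in 0..<scan.pageCount {
            let image = scan.imageOfPage(at: pageIndex)
            // Hand each scanned page off for further processing here.
            print("Scanned page \(pageIndex): \(image.size)")
        }
        controller.dismiss(animated: true)
    }

    // Called when the user taps Cancel.
    func documentCameraViewControllerDidCancel(_ controller: VNDocumentCameraViewController) {
        controller.dismiss(animated: true)
    }
}
```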

Vision’s New Text Recognition Request

In iOS 12, Vision allowed us to detect rectangles containing text, but the actual characters within those rectangles couldn't be identified by a Vision request. To recognize the actual text, we had to fall back on custom Core ML models, as done in this implementation:

Conveniently, with iOS 13 Apple gave Vision the new VNRecognizeTextRequest, a subclass of VNImageBasedRequest that lets us recognize text in images. The request returns an array of VNRecognizedTextObservations, from which we can pick the top candidate and the string it contains. The following code snippet gives a gist of how the new text recognition request in Vision works:
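Here's a hedged sketch of the request in action (the function name and the custom-word values are illustrative):

```swift
import Vision
import UIKit

func recognizeText(in image: UIImage) {
    guard let cgImage = image.cgImage else { return }

    let request = VNRecognizeTextRequest { request, error in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        for observation in observations {
            // Each observation exposes ranked candidates; take the best one.
            if let topCandidate = observation.topCandidates(1).first {
                print("\(topCandidate.string) (confidence: \(topCandidate.confidence))")
            }
        }
    }

    // Optional tuning: trade speed for accuracy, and teach the
    // recognizer out-of-dictionary words (example values).
    request.recognitionLevel = .accurate
    request.usesLanguageCorrection = true
    request.customWords = ["WWDC", "VisionKit"]

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}
```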

Additionally, with this new request we can set the accuracy level of optical character recognition (OCR) and pass our own list of custom words—quirky words or brand names that usually aren’t part of dictionaries.

Here’s a working example that uses the Document Scanner and the Text Recognition Request together to identify text from scanned images:

Vision’s Face Technology Bolstered

The Face Detectors saw a number of improvements as well. Let’s breeze through them:

  • The number of landmark points detected on the face has increased from 65 to 76.
  • Instead of a single confidence score for the whole face, each landmark point now gets its own confidence estimate.
  • Pupil detection is much improved.
  • A new animal detector classifies cats and dogs.
  • A new face capture quality request that'll soon find its way into every selfie-based app.

Among the new additions to Vision's face technology, two stood out enough for us to implement them.

Built-in Animal Detector

Vision also received a new animal classifier that identifies all kinds of cats and dogs with pretty good accuracy. VNRecognizeAnimalsRequest is the new Vision request that lets us identify cats and dogs in images and videos, returning a bounding box for each identified region to cover object detection use cases as well.

Classifying and detecting animals using Vision in iOS 13 can be done very quickly, as the following implementation showcases:
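A minimal sketch of the animal request (function name is illustrative; the returned label identifiers are "Cat" and "Dog"):

```swift
import Vision
import UIKit

func detectAnimals(in image: UIImage) {
    guard let cgImage = image.cgImage else { return }

    let request = VNRecognizeAnimalsRequest { request, error in
        guard let observations = request.results as? [VNRecognizedObjectObservation] else { return }
        for observation in observations {
            // Each observation carries classification labels and a
            // normalized bounding box for the detected animal.
            if let label = observation.labels.first {
                print("\(label.identifier) (\(label.confidence)) at \(observation.boundingBox)")
            }
        }
    }

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}
```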

Face Capture Quality Request

For me, the new Face Capture Quality Vision request is near the top of the list among the year’s improvements.

Based on quite a few metrics, such as exposure, pose, blurriness, and facial expression, VNDetectFaceCaptureQualityRequest returns VNFaceObservation instances that carry a faceCaptureQuality value. Using this, we can build quite a few interesting use cases, such as picking the best portrait photo from a bunch of images.

Here’s an implementation of the Face Capture Quality Request that determines the best frame from a Live Photo in iOS:
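The core idea can be sketched as follows: run the quality request on each candidate image and keep the one with the highest score (the function name is illustrative):

```swift
import Vision
import UIKit

// Picks the image whose face Vision scores highest, assuming each
// input contains a face; images without a usable score are skipped.
func bestFaceImage(from images: [UIImage]) -> UIImage? {
    var bestScore: Float = 0
    var bestImage: UIImage?

    for image in images {
        guard let cgImage = image.cgImage else { continue }
        let request = VNDetectFaceCaptureQualityRequest()
        let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
        try? handler.perform([request])

        // faceCaptureQuality is optional; a missing score counts as 0.
        if let observation = request.results?.first as? VNFaceObservation,
           let score = observation.faceCaptureQuality,
           score > bestScore {
            bestScore = score
            bestImage = image
        }
    }
    return bestImage
}
```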

Vision Saliency

Vision’s new Saliency request is responsible for highlighting the prominent features in images and videos.

Two new Vision requests were introduced. VNGenerateAttentionBasedSaliencyImageRequest handles attention-based saliency: roughly, what grabs the human eye's attention first in a frame.

VNGenerateObjectnessBasedSaliencyImageRequest handles objectness-based saliency, which instead looks for the prominent objects in the frame.
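A hedged sketch of the attention-based variant (function name is illustrative):

```swift
import Vision
import UIKit

func attentionSaliency(in image: UIImage) {
    guard let cgImage = image.cgImage else { return }

    let request = VNGenerateAttentionBasedSaliencyImageRequest()
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])

    guard let observation = request.results?.first as? VNSaliencyImageObservation else { return }

    // A small grayscale heat map; brighter areas are more salient.
    let heatMap: CVPixelBuffer = observation.pixelBuffer
    print("Heat map size: \(CVPixelBufferGetWidth(heatMap))x\(CVPixelBufferGetHeight(heatMap))")

    // Normalized bounding boxes around the salient regions.
    for object in observation.salientObjects ?? [] {
        print("Salient region: \(object.boundingBox)")
    }
}
```

Swapping in VNGenerateObjectnessBasedSaliencyImageRequest gives the objectness-based heat map with the same observation type.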

Saliency has a lot of interesting use cases: photo segmentation, anomaly detection, detecting blurred images (blurred images have less salient features in general), and even generating thumbnail images with automated cropping.

For an in-depth look at saliency and to learn how to crop areas of interests in an image, refer to the following piece:

Vision Image Similarity

Image Similarity uses feature prints: vector representations that describe images.

Feature prints don't rely solely on the raw pixels of an image; instead, more emphasis is given to the image's overall context, so the vector captures a kind of semantic meaning. Images with similar contexts sit at closer vector distances (they're nearest neighbors) and are considered similar.
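Comparing two images then boils down to generating a feature print for each and measuring the distance between them; smaller distances mean more similar images (function names here are illustrative):

```swift
import Vision
import UIKit

func featurePrint(for image: UIImage) -> VNFeaturePrintObservation? {
    guard let cgImage = image.cgImage else { return nil }
    let request = VNGenerateImageFeaturePrintRequest()
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
    return request.results?.first as? VNFeaturePrintObservation
}

func similarityDistance(between first: UIImage, and second: UIImage) -> Float? {
    guard let printA = featurePrint(for: first),
          let printB = featurePrint(for: second) else { return nil }

    var distance: Float = 0
    // computeDistance writes the result into the out-parameter.
    try? printA.computeDistance(&distance, to: printB)
    return distance
}
```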

Image Similarity can have plenty of use cases from grouping images to identifying duplicate images. The following piece takes an in-depth look with implementation at the new Vision Image Similarity request:

A Built-In Image Classification Request

Creating a diverse image classification model requires a lot of images and hours of training time. Luckily, Apple now provides a built-in multi-label classification model that's wrapped inside VNClassifyImageRequest. The classification request covers around 1,000 classes. To inspect the taxonomy of the classifier, simply invoke the following function to get a list of all the classes.
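A short sketch of querying the taxonomy (the wrapper function name is illustrative; knownClassifications(forRevision:) is a throwing class method):

```swift
import Vision

func printClassifierTaxonomy() {
    guard let classifications = try? VNClassifyImageRequest
        .knownClassifications(forRevision: VNClassifyImageRequestRevision1) else { return }

    print("Classifier knows \(classifications.count) classes")
    for classification in classifications {
        print(classification.identifier)
    }
}
```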

Evaluation of multi-label classifiers is less straightforward, unlike binary classifiers that return a single label as the prediction. Using accuracy as the metric to evaluate a model doesn’t always represent how good or bad the model is. For multi-label classifiers, a false prediction of a target class is not a hard-lined right or wrong. Instead, more emphasis is given to a group of classes that are predicted for the image.

  • Recall — A metric used to evaluate the overall relevancy of the model. When missing a class (a false negative) is the bigger worry, you'd prefer a model with a higher recall value.
  • Precision — A metric that measures the quality of the model's predictions. When false positives are costly, a model with high precision is preferred.

The formulas for precision and recall are:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

The terms TP and TN are straightforward, but it can be tricky to get the hang of their counterparts (FP and FN). We hope the following list clears up the terminology once and for all.

  • TP and TN — A true positive occurs when the model predicts a class label and the image really contains that class. A true negative is the opposite: the model doesn't predict a class label, and the class indeed isn't in the image.
  • FP — A false positive occurs when the model predicts a class label that the image doesn't actually contain. For example, labeling an image as containing a bike when there isn't one falls under the FP category.
  • FN — A false negative occurs when the model fails to predict a class that is actually present in the image. For example, failing to flag a spam message that really is spam falls under the FN category.
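To make the metrics concrete, here's a tiny sketch that computes precision and recall from hypothetical prediction counts for one class:

```swift
// Hypothetical counts, purely for illustration.
let truePositives = 8.0
let falsePositives = 2.0
let falseNegatives = 4.0

let precision = truePositives / (truePositives + falsePositives) // 8 / 10 = 0.8
let recall = truePositives / (truePositives + falseNegatives)    // 8 / 12 ≈ 0.667

print("Precision: \(precision), Recall: \(recall)")
```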

The new VNClassifyImageRequest’s VNClassificationObservation possesses an API to filter the results by precision or recall, whichever suits your use cases.

The following code showcases a way to filter by setting a specific recall value on the precision-recall curve.
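A hedged sketch of such a filter (the 0.5 recall and 0.7 precision thresholds are illustrative, and the function name is made up):

```swift
import Vision
import UIKit

func classify(image: UIImage) {
    guard let cgImage = image.cgImage else { return }

    let request = VNClassifyImageRequest()
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])

    guard let observations = request.results as? [VNClassificationObservation] else { return }

    // Keep only observations that achieve at least 50% recall
    // while operating at 70% precision on the PR curve.
    let relevant = observations.filter {
        $0.hasMinimumRecall(0.5, forPrecision: 0.7)
    }
    for observation in relevant {
        print("\(observation.identifier): \(observation.confidence)")
    }
}
```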

The equivalent call for specifying a minimum precision at a given recall value is hasMinimumPrecision(_:forRecall:).

When the minimum recall (or precision, depending on which filter you use) we specify is greater than zero, the corresponding precision (or recall) value must fall within the valid region of the model's precision-recall curve for the filter to pass.

What’s Next?

We walked through the improved APIs and changes introduced in Apple’s Vision framework during 2019 and highlighted some example implementations to help you start working with the framework.

From improved face detectors to new features like saliency and a built-in classification request, the Vision framework has (nearly) everything covered to let mobile developers easily use machine learning classifier models and other computer vision algorithms in their applications.

Moving forward into a new decade, we'd love to see a lot more interesting features and improvements in the Vision framework, such as:

  • Handling trajectory detection, a really popular computer vision use case.
  • Customizing and extending Vision’s current classifier models with our own data sets.
  • Extending face capture quality requests to a more generalized image quality Vision request.
  • Movement tracking. Matching and comparing the movements of two people would be really interesting.
  • Enhanced object detection that detects movements and determines the distance of the objects from the camera could be really useful, especially in the automotive industry (pedestrian detection).

Here's hoping some of these wishes come true in 2020. Please do post your wishlists in the comments section below. That's a wrap for this one. I hope you enjoyed reading.
