How to Capture the Best Frame in an iOS Image Processing App

If you’ve ever developed an iOS Vision app that processes frames from a video buffer, you know that you need to be careful with your resources. You shouldn’t process every frame; for example, frames captured while the user is just moving the camera around add no value.

In order to classify an image with high accuracy, you’ll need to capture a stable scene. This is crucial for apps that use Vision. In this tutorial, I’ll be diving into this problem and the solution Apple suggests.

Basically, there are two options for capturing an image:

  1. Let the user capture the image with a button.
  2. Capture the best scene (frame) programmatically from the continuous video buffer.

With the first option, you let the user capture the stable scene. Remember the infamous hot dog app from Silicon Valley: in that app, the user takes the picture with a button, so capturing the best scene is left in the user’s hands.

In the second option, your app should take care of capturing the best scene whenever the user is moving the camera around.

That’s where we start digging:

What Is Image Registration?

Image registration is the determination of a geometrical transformation that aligns points in one view of an object with corresponding points in another view of that object (or another object).

Registration works like this: we’re aligning two images with each other, and the algorithm tells you, “Okay, if you shift it by this many pixels, this is how they would actually match.”

As Frank Doepke stated at WWDC 18, this is a pretty cheap and fast algorithm, and it tells you whether the camera is being held still or whether anything is moving in front of it. Vision apps could hypothetically make a classification request on every frame buffer, but classification is a computationally expensive operation, so attempting this could result in delays and poor UI performance.

So Apple suggests classifying the scene in a frame only if the registration algorithm determines that the scene and camera are still, indicating the user’s intent to classify an object.

How Do You Measure the Relative Distance Between Images?

VNTranslationalImageRegistrationRequest allows developers to check whether the current image from a video buffer is worth spending Vision resources on. iOS camera-accessing apps use the captureOutput:didOutputSampleBuffer:fromConnection: delegate method in order to process video frames.

In this delegate method, we’ll call the registration request, as shown in the code below. This request is an image analysis request that determines the affine transform needed to align the content of two images:

let registrationRequest = VNTranslationalImageRegistrationRequest(targetedCVPixelBuffer: pixelBuffer)
do {
    try sequenceRequestHandler.perform([registrationRequest], on: previousPixelBuffer!)
} catch let error as NSError {
    print("Failed to process request: \(error.localizedDescription).")
}

Here we use sequenceRequestHandler (a VNSequenceRequestHandler) with VNTranslationalImageRegistrationRequest objects to compare consecutive frames, keeping a history of the last 15 frames. VNSequenceRequestHandler is an object that processes image analysis requests for each frame in a sequence (15 frames in this case).

The algorithm accepts a scene as stable if the Manhattan distance of the accumulated translation across the frame history is less than 20 pixels:

/// - Tag: CheckSceneStability
    fileprivate func sceneStabilityAchieved() -> Bool {
        // Determine if we have enough evidence of stability.
        if transpositionHistoryPoints.count == maximumHistoryLength {
            // Calculate the moving average.
            var movingAverage = CGPoint.zero
            for currentPoint in transpositionHistoryPoints {
                movingAverage.x += currentPoint.x
                movingAverage.y += currentPoint.y
            }
            let distance = abs(movingAverage.x) + abs(movingAverage.y)
            if distance < 20 {
                return true
            }
        }
        return false
    }

So what is the Manhattan distance? If x = (a, b) and y = (c, d), the Manhattan distance between x and y is |a − c| + |b − d|.
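To make this concrete, here’s a minimal pure-Swift sketch of the same distance calculation the stability check performs (the manhattanDistance helper is a hypothetical name for illustration, not part of Apple’s sample code):

```swift
import Foundation

/// Manhattan (L1) distance between two points: |a − c| + |b − d|.
/// (`manhattanDistance` is a hypothetical helper, not from Apple's sample.)
func manhattanDistance(_ x: CGPoint, _ y: CGPoint) -> CGFloat {
    return abs(x.x - y.x) + abs(x.y - y.y)
}

// A shift of (3, -4) pixels relative to the origin:
let d = manhattanDistance(CGPoint(x: 3, y: -4), .zero)  // 3 + 4 = 7
```

Because it’s just two subtractions, two absolute values, and an addition, this metric is far cheaper than a Euclidean distance (no square root), which is one reason it suits a per-frame check.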

Next, we read the results of the image registration request. Each result is a VNImageTranslationAlignmentObservation, and its alignmentTransform property tells us how far the current frame is shifted relative to the previous one. The recordTransposition function pushes this translation onto the transpositionHistoryPoints array to record the results of the last 15 frames.
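The history-recording step is essentially a fixed-size buffer. Here’s a minimal sketch of that idea, assuming a maximumHistoryLength of 15 as in the article (the exact implementation in Apple’s sample may differ):

```swift
import Foundation

// Sketch of the history-recording logic, assuming maximumHistoryLength = 15.
let maximumHistoryLength = 15
var transpositionHistoryPoints: [CGPoint] = []

func recordTransposition(_ point: CGPoint) {
    transpositionHistoryPoints.append(point)
    // Keep only the most recent 15 translations.
    if transpositionHistoryPoints.count > maximumHistoryLength {
        transpositionHistoryPoints.removeFirst()
    }
}

// After 20 recorded frames, only the last 15 remain:
for i in 0..<20 {
    recordTransposition(CGPoint(x: CGFloat(i), y: 0))
}
// transpositionHistoryPoints.count == 15
```

Capping the history this way means the stability check always reflects the most recent moment of camera motion, rather than everything since the session started.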

if let results = registrationRequest.results {
    if let alignmentObservation = results.first as? VNImageTranslationAlignmentObservation {
        let alignmentTransform = alignmentObservation.alignmentTransform
        self.recordTransposition(CGPoint(x: alignmentTransform.tx, y: alignmentTransform.ty))
    }
}

The sceneStabilityAchieved function checks the results of the last 15 frames to detect stability. If the scene has been stable across these frames, we can analyze the current frame and pass it to Core ML for the best classification results.

if self.sceneStabilityAchieved() {
    if currentlyAnalyzedPixelBuffer == nil {
        // Retain the image buffer for Vision processing.
        currentlyAnalyzedPixelBuffer = pixelBuffer
    }
}

In this tutorial, we learned how to use image registration and measure the Manhattan distance between images in order to achieve scene stability. This will allow our Core ML Vision apps to work better and use fewer resources.

Find the full code of the view controller here.

Thanks for reading!

If you liked this story, you can follow me on Medium and Twitter. You can contact me via e-mail.


For more on image registration, see Fitzpatrick, Hill, and Maurer.
