If you got past the jargon of the title, you probably have at least a passing interest in computer vision. However, fear not! This is going to be a fairly gentle walk-through of some of my projects at the intersection of Machine Learning and Augmented Reality. They all share a common denominator: feature point extraction. InstaSaber, Say BARK!, and the puppet videos you see below are a few examples:
If my face in the video doesn’t give it away, I only work on SERIOUS projects. The dots overlaid at the joints of the hand are where everything begins. These come from the 2D coordinate predictions of a Convolutional Neural Network (CNN). The CNN model takes in an image (cropped at the blue square) and outputs an array of normalized (X,Y) coordinates.
If you’re familiar with a traditional classification model (e.g. is this image a 🐈 or a 🐕), writing a 2D point extraction model can be surprisingly similar, just without an activation layer at the end. For example, you might simply remove the softmax layer from the end of a Keras classification model.
Instead of outputting N class scores, your N output values are interpreted as N/2 coordinate pairs. For example, if you wanted to get 3 points as your output, your output layer would produce (X1, Y1, X2, Y2, X3, Y3). The only real differences from a classification model are that the labels are normalized image coordinates instead of classes, and the output isn’t being “squashed” into a probability vector. Of course, to do any of this, you need some labeled data!
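To make that concrete, here’s a minimal sketch of what such a model could look like in Keras. This is not my actual model; the layer sizes and 96×96 input are placeholder choices, and the key detail is the final `Dense` layer: 6 linear outputs (3 points × 2 coordinates) with no softmax, trained with a regression loss.

```python
import numpy as np
from tensorflow.keras import layers, models

NUM_POINTS = 3  # -> 6 outputs: (x1, y1, x2, y2, x3, y3)

model = models.Sequential([
    layers.Conv2D(16, 3, activation="relu", input_shape=(96, 96, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    # Linear output layer: no activation, so the predictions are raw
    # normalized coordinates rather than a probability vector.
    layers.Dense(NUM_POINTS * 2),
])

# Mean squared error against the normalized (x, y) labels,
# instead of cross-entropy against class labels.
model.compile(optimizer="adam", loss="mse")

preds = model.predict(np.zeros((1, 96, 96, 3)))
print(preds.shape)  # one row of 6 coordinate values
```

Training then works exactly like any other Keras model: your labels are just arrays of 6 floats in [0, 1] instead of one-hot class vectors.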
I’ve always created and annotated my own datasets, either manually from real-world images or synthetically, like the disturbing video above. That approach is usually far too time-consuming and costly for most ML practitioners, though, so I recommend starting with an existing dataset like this one.
2D points are great, but what about 3D?
Turns out, you can use a little trick called SolvePnP to estimate the 3D pose of an object. If you know where those 2D points would project onto a 3D model of the object you are tracking, you can pass all those points into an OpenCV function. As you can see in the video to the left, I’m estimating the 3D position and orientation of the piece of paper.
Since I know it to be roughly 8.5 inches long, and 1–2 inches in diameter (cue Michael Scott), I have known 3D model points. My predictions give me the corresponding 2D points, and I can now solve with the SolvePnP function.
Note this only works for rigid 3D shapes. For example, this would be difficult to do with the first video of the hand points, because a hand can deform into over a billion shapes. There are other ways to estimate 3D coordinates, but in terms of implementation, this is probably the simplest method.
Latency and the need for speed
Ultimately, none of this really matters if you can only get 10 frames per second of inference on your phone. Users expect apps to run fluidly and not turn their phone into a space heater. Optimizing for mobile has two key components:
- Efficient models
- Efficient GPU/CPU synchronization
In terms of the former, there’s no shortage of mobile-optimized models out there to work from (like Google’s MobileNets). For the latter (at least on iOS), Matthijs Hollemans has done quite a bit of work in this arena, and I highly recommend looking at his work with Metal Performance Shaders.
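On the “efficient models” side, one common pattern (a sketch, not how my models are necessarily built) is to bolt a small regression head onto a mobile-optimized backbone like MobileNetV2 in Keras:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

# MobileNetV2 backbone without its classification head.
# weights=None here just to avoid a download; in practice you'd
# likely start from "imagenet" weights and fine-tune.
backbone = MobileNetV2(input_shape=(96, 96, 3),
                       include_top=False, weights=None)

x = layers.GlobalAveragePooling2D()(backbone.output)
outputs = layers.Dense(6)(x)  # 3 (x, y) pairs, linear activation

model = models.Model(backbone.input, outputs)
```

From there, a converter like Core ML Tools can get the model onto the phone, and the GPU/CPU synchronization work Matthijs covers determines how smoothly it actually runs.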
Where do I go from here?
Start small and work your way up. There are a lot of moving pieces to everything I’ve covered, and you will end up frustrated if you set overly ambitious goals. If you’ve got any questions, feel free to email me at [email protected].
Thanks for reading and may the force be with you!