CVPR 2020: Research with Mobile ML Implications

There was no shortage of excellent computer vision research presented at the recently concluded CVPR 2020 conference. In this article, we’ll take a focused look at a handful of papers that touch on, or have implications for, mobile and edge-related tasks.

While not all of these papers connect directly to mobile-first applications, their implications for mobile ML are significant. They advance ML tasks commonly performed on mobile and edge devices, so their progress is crucial to moving the industry forward.

Perceptual Quality Assessment of Smartphone Photography

Authors of this paper performed an in-depth study of the perceptual quality assessment of smartphone photography. They also introduced the Smartphone Photography Attribute and Quality (SPAQ) database. The database contains 11,125 pictures captured by 66 smartphones. Each of the images has rich annotations.

The authors also collected human opinions for each image. The information collected includes image quality, image attributes, and scene category labels. For deeper analysis, they also recorded the exchangeable image file format (EXIF) data for each image.

They then used the database to train blind image quality assessment (BIQA) models constructed by baseline and multi-task deep neural networks. The results obtained give insights into:

how EXIF data, image attributes, and high-level semantics interact with image quality
how next-generation BIQA models can be designed
how better computational photography systems can be optimized on mobile devices

The BIQA models and the database can be found here:

Here’s a comparison of image quality assessment — IQA — databases of camera distortions.

Category-Level Articulated Object Pose Estimation

In this paper, the authors tackled the task of category-level pose estimation for articulated objects from a single depth image. They proposed a category-level approach that handles object instances previously unseen during training. They introduced a canonical representation for different articulated objects in a given category: the Articulation-aware Normalized Coordinate Space Hierarchy (ANCSH).

The representation builds a canonical object space and a set of canonical part spaces in order to attain intra-category generalization. Further, the authors developed a deep network based on PointNet++.
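To make the idea of a normalized coordinate space more concrete, here is a minimal sketch of mapping a part's 3D points into a unit cube. It is not the authors' implementation: the function name, the choice to scale by the longest bounding-box side, and the assumption that the points are already in a canonical orientation are all illustrative.

```python
def normalize_to_canonical(points):
    """Map 3D points into a unit cube centred at (0.5, 0.5, 0.5).

    Hypothetical helper: assumes the points are already oriented
    canonically, so only translation and uniform scale are removed.
    Returns the normalized points and the scale factor that was removed.
    """
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = [(lo + hi) / 2 for lo, hi in zip(mins, maxs)]

    # Uniform scale: the longest bounding-box side maps to length 1,
    # which preserves the aspect ratio of the part.
    scale = max(hi - lo for lo, hi in zip(mins, maxs))
    if scale == 0:
        raise ValueError("degenerate point set: all points coincide")

    normalized = [
        tuple((p[i] - center[i]) / scale + 0.5 for i in range(3))
        for p in points
    ]
    return normalized, scale


# Two instances of the "same" part at different sizes and positions
small = [(0, 0, 0), (2, 1, 1)]
large = [(10, 10, 10), (14, 12, 12)]
norm_small, scale_small = normalize_to_canonical(small)
norm_large, scale_large = normalize_to_canonical(large)
```

Both instances land on the same normalized coordinates, and only the recovered scale differs. That shared space is what lets a single model generalize across instances within a category.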

The network predicts ANCSH from a single depth point cloud. Using canonical joints enabled the authors to achieve:

improved performance in part pose and scale estimations using induced kinematic constraints from joints
high accuracy for joint parameter estimation in camera space

The figure below shows a comparison of joint parameter estimation.

KFNet: Learning Temporal Camera Relocalization Using Kalman Filtering

Unlike one-shot relocalization that estimates pose by focusing on a still image, temporal camera relocalization estimates the pose with respect to each video frame in the sequence. In this work, the authors aim at improving the pose estimation capacity of temporal relocalization methods.

They do so by using a network architecture that incorporates Kalman filtering (KFNet) for online camera re-localization. The network design and loss are based on Kalman filtering in the context of Bayesian learning.
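For readers unfamiliar with Kalman filtering, the following is a minimal scalar sketch of the predict/update cycle that the filter is built on. It is not KFNet itself, which learns its components with neural networks. The constant-state motion model, the function names, and the noise values are assumptions chosen for illustration.

```python
def kalman_step(x, p, z, q, r):
    """One predict/update cycle of a scalar Kalman filter.

    x, p: previous state estimate and its variance
    z:    new measurement
    q, r: process and measurement noise variances (assumed known)
    """
    # Predict: constant-state model, so the estimate carries over
    # and its uncertainty grows by the process noise.
    x_pred = x
    p_pred = p + q

    # Update: blend prediction and measurement using the Kalman gain.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1 - k) * p_pred
    return x_new, p_new


def filter_sequence(measurements, x0=0.0, p0=1.0, q=1e-3, r=0.1):
    """Run the filter over a sequence, returning one estimate per frame."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        x, p = kalman_step(x, p, z, q, r)
        estimates.append(x)
    return estimates


# Noisy per-frame readings of a quantity whose true value is near 2.0
estimates = filter_sequence([2.1, 1.9, 2.05, 1.95, 2.0])
```

Each estimate depends on all earlier frames through the carried state and variance. This is the temporal information a one-shot method does not use.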

KFNet bridges the performance gap between temporal and one-shot relocalization methods. The authors also propose a statistical assessment tool to enable KFNet to self-inspect the potential outlier predictions on the fly.

Below is a visual of the median translation and rotation errors of different relocalization methods.

Here’s a comparison of the accuracy of one-shot and temporal relocalization methods.

EventCap: Monocular 3D Capture of High-Speed Human Motions Using an Event Camera

This paper proposes EventCap, an approach for 3D capture of human motions at high speed using a single event camera. High-frequency motion details are captured using model-based optimization and CNN-based human pose estimation.

The authors were able to capture fast motions at millisecond resolution, with higher data efficiency compared to high frame rate videos.

The authors design a motion capture algorithm that leverages the event stream and the low frame rate intensity image stream from the event camera in a joint optimization framework. This happens in the following stages:

Tracking events in 2D space asynchronously and reconstructing the continuous spatio-temporal event trajectories between each pair of adjacent intensity image frames.
Estimating the 3D motion of the human actor using a batch-based optimization algorithm.
Refining the captured high-speed motion based on the boundary information received from the asynchronous event stream.
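As a rough illustration of how an asynchronous event stream relates to the low frame rate intensity images, the sketch below groups events into the intervals between adjacent frames. This is only the bookkeeping that precedes the first stage above, not the authors' tracking algorithm. The `(t, x, y, polarity)` tuple layout and the function name are assumptions.

```python
from bisect import bisect_right


def group_events_by_frame_interval(events, frame_times):
    """Bucket events by the pair of adjacent intensity frames they fall between.

    events:      iterable of (t, x, y, polarity) tuples (assumed layout)
    frame_times: sorted timestamps of the intensity frames
    Bucket i holds events with frame_times[i] <= t < frame_times[i + 1];
    events outside the covered time range are dropped.
    """
    buckets = [[] for _ in range(len(frame_times) - 1)]
    for event in events:
        # Index of the last frame at or before this event's timestamp
        i = bisect_right(frame_times, event[0]) - 1
        if 0 <= i < len(buckets):
            buckets[i].append(event)
    return buckets


# Three intensity frames 40 ms apart, with events arriving in between
frame_times = [0.0, 0.04, 0.08]
events = [
    (0.001, 10, 12, 1),
    (0.039, 11, 12, -1),
    (0.040, 11, 13, 1),
    (0.079, 12, 13, 1),
    (0.080, 12, 14, 1),   # at the last frame: outside every interval
    (-0.010, 9, 12, 1),   # before the first frame: dropped
]
buckets = group_events_by_frame_interval(events, frame_times)
```

Each bucket can then be processed on its own, with the intensity frames at either end acting as anchors for the trajectories in between.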

Drifting is solved by leveraging the tracked event trajectories and the CNN-based 2D and 3D pose estimation from the intensity images.

The authors also propose an evaluation dataset for event camera-based fast human motion capture.

Here’s a comparison of several methods in terms of tracking accuracy and data throughput as compared to the method proposed in this paper.

RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild

The authors present RetinaFace, a single-shot, multi-level face localization method. They also manually annotate five facial landmarks on the WIDER FACE dataset and employ a semi-automatic pipeline to generate 3D vertices for face images from the WIDER FACE, AFLW, and FDDB datasets.

The authors combine face bounding box predictions, 2D facial landmark localization, and 3D vertices regression in order to achieve point regression on the image plane.

They also propose a learning strategy to train a unified, multi-level face localization method that predicts face bounding boxes, five 2D facial landmarks, and 1K 3D vertices, all at the same time.

FReeNet: Multi-Identity Face Reenactment

This paper proposes FReeNet, a multi-identity face re-enactment framework. It transfers facial expressions from a source face to another face with a shared model. The proposed framework is made up of two parts:

A Unified Landmark Converter (ULC) that takes the architecture of an encoder-decoder to convert expressions in a latent landmark space. This reduces the gap of the face contour between the source and the target image.
A Geometry-aware Generator (GAG) that uses the converted landmark to re-enact the photo-realistic image, given the image of a target human face as a reference.

The authors also propose a novel triplet perceptual loss that forces the GAG module to learn appearance and geometry information at the same time. This enriches the facial details of the re-enacted images. The loss combines the triplet loss and the perceptual loss.
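To give a feel for the triplet component, here is a minimal sketch of a standard margin-based triplet loss computed on feature vectors. It does not reproduce the paper's exact formulation. The plain lists stand in for perceptual features that would normally come from a pretrained network, and the function names and margin value are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss on feature vectors.

    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin`, and grows linearly otherwise.
    """
    return max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )


# Placeholder feature vectors (hypothetical values)
anchor_features = [0.0, 0.0]
matching_features = [0.0, 0.0]   # should end up close to the anchor
differing_features = [3.0, 4.0]  # should end up far from the anchor

well_separated = triplet_loss(anchor_features, matching_features, differing_features)
badly_separated = triplet_loss(anchor_features, differing_features, matching_features)
```

Measuring the distances in a perceptual feature space rather than in pixel space is one way that a triplet objective and a perceptual objective could be combined.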

There are two main challenges with face re-enactment. The first is converting multi-identity facial expressions with a unified network. The second is re-enacting photo-realistic, identity-consistent target faces while maintaining pose, hue, and illumination.

Gaps in facial contours between the source and the target face also pose serious challenges. In order to address these challenges, the authors use a landmark detector to encode faces into latent landmark space.

A landmark converter module is used to convert the source face's expression to the target person in latent landmark space. The geometry-aware generator extracts geometry information from the converted landmark and appearance information from the reference person to re-enact the target face.

Here are some results obtained from the proposed method:

Final Thought

If these papers caught your interest, then you’re in luck! There’s plenty more where this came from, in case you’d like to explore further: