Deep Video Portraits

Synthesizing and editing video portraits—i.e., videos framed to show a person’s head and upper body—is an important problem in computer graphics, with applications in video editing and movie postproduction, visual effects, visual dubbing, virtual reality, and telepresence, among others.

Synthesizing a photo-realistic video portrait of a target actor that mimics the actions of a source actor, especially when the source and target actors are different subjects, is still an open problem.

No previous approach gave full control over the rigid head pose, facial expressions, and eye motion of the target actor, let alone the ability to modify the face identity to some extent. Until now.

In this post, I’m going to review “Deep Video Portraits”, which presents a novel approach that enables photo-realistic re-animation of portrait videos using only an input video.

I’ll cover two things: first, a short definition of a DeepFake; second, an overview of the paper in the words of the authors.

1. Defining DeepFakes

The word DeepFake combines the terms “deep learning” and “fake”, and refers to manipulated videos or other digital representations that produce fabricated images and sounds that appear to be real but have in fact been generated by deep neural networks.

2. Deep Video Portraits

2.1 Overview

The core method presented in the paper provides full control over the head of a target actor by transferring the rigid head pose, facial expressions, and eye motion of a source actor, while preserving the target’s identity and appearance.

On top of that, full video of the target is synthesized, including consistent upper body posture, hair, and background.

The overall architecture of the paper’s framework is illustrated below in Figure 2.

First, the source and target actors are tracked using a state-of-the-art approach for face reconstruction from a single image, and a 3D morphable model (3DMM) is fitted to each actor.

The resulting sequence of low-dimensional parameter vectors represents the actor’s identity, head pose, expression, eye gaze, and the scene lighting for every video frame.

Then, the head pose, expressions and/or eye gaze parameters from the source are taken and mixed with the illumination and identity parameters of the target. This allows the network to generate a full-head reenactment while preserving the actor’s identity and look.
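The mixing step can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `FaceParams` class, its field names, and the `mix_parameters` helper are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class FaceParams:
    """Per-frame face parameters (hypothetical container, names are mine)."""
    identity: Vector
    illumination: Vector
    head_pose: Vector
    expression: Vector
    eye_gaze: Vector


# Fields that may be taken from the source actor; the rest stay with the target.
TRANSFERABLE = ("head_pose", "expression", "eye_gaze")


def mix_parameters(source: FaceParams, target: FaceParams,
                   transfer: Iterable[str] = TRANSFERABLE) -> FaceParams:
    """Copy the `transfer` fields from source and keep all others from target."""
    names = tuple(transfer)
    unknown = set(names) - set(TRANSFERABLE)
    if unknown:
        raise ValueError(f"cannot transfer: {sorted(unknown)}")
    return replace(target, **{name: getattr(source, name) for name in names})
```

Passing `transfer=("expression",)` would leave the target's head pose and gaze untouched, which corresponds to transferring "expressions and/or eye gaze" selectively.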

Next, new synthetic renderings of the target actor are generated based on the mixed parameters. These renderings are the input to the paper’s novel “rendering-to-video translation network”, which is trained to convert the synthetic input into photo-realistic output.

2.2 Face Reconstruction from a single image

3D morphable models are used for face analysis because the intrinsic properties of 3D faces provide a representation that is robust to intra-personal variations, such as pose and illumination. Given a single facial input image, a 3DMM can recover the 3D face (shape and texture) and the scene properties (pose and illumination) via a fitting process.

The authors employ a state-of-the-art dense face reconstruction approach that fits a parametric model of the face and illumination to each video frame. It obtains a meaningful parametric face representation for both the source and the target, given an input video sequence.

This parametric face representation consists of a set of parameters P, i.e., the per-frame parameter sequence that fully describes the source or target facial performance.

The set of reconstructed parameters P encodes the rigid head pose, facial identity coefficients, expression coefficients, gaze direction for both eyes, and spherical harmonics illumination coefficients. Overall, the face reconstruction process estimates 261 parameters per video frame.
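To make the number 261 concrete, here is one way the per-frame vector could be laid out. The split into block sizes below is my recollection of the paper and should be treated as an assumption to check against it; the names `PARAM_LAYOUT` and `param_slices` are mine.

```python
# Assumed block sizes (not stated in this post; verify against the paper).
PARAM_LAYOUT = {
    "rotation": 3,               # rigid head rotation
    "translation": 3,            # rigid head translation
    "identity_geometry": 80,     # identity coefficients for face shape
    "identity_reflectance": 80,  # identity coefficients for skin reflectance
    "expression": 64,            # expression coefficients
    "eye_gaze": 4,               # gaze direction for both eyes
    "illumination": 27,          # 9 spherical harmonics coefficients x 3 color channels
}


def param_slices(layout):
    """Return a slice per block plus the total vector length."""
    slices, start = {}, 0
    for name, size in layout.items():
        slices[name] = slice(start, start + size)
        start += size
    return slices, start


SLICES, TOTAL = param_slices(PARAM_LAYOUT)
```

Under this assumed layout the blocks sum to 261, matching the count quoted above.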

Below are more details on the parametric face representation and the fitting process.

The paper represents the space of facial identity with a parametric head model and the space of facial expressions with an affine model. Mathematically, geometry variation is modeled through an affine model v ∈ R^(3N) that stacks per-vertex deformations of the underlying template mesh with N vertices: the average facial geometry a_{geo} ∈ R^(3N) plus a weighted sum of geometry basis vectors b_k and a weighted sum of expression basis vectors b_k.

The b_k for geometry were computed by applying principal component analysis (PCA) to 200 high-quality face scans, and the b_k for expressions were obtained in the same manner from blendshapes.
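Written out, the affine model described above plausibly takes the following form. The coefficient symbols α_k and δ_k, the basis counts N_α and N_δ, and the superscripts that distinguish the two bases are my notation; the equation is a reconstruction from the surrounding definitions, not a verbatim copy.

```latex
v(\alpha, \delta) = a_{geo}
  + \sum_{k=1}^{N_\alpha} \alpha_k \, b_k^{geo}
  + \sum_{k=1}^{N_\delta} \delta_k \, b_k^{exp}
```

Here the first sum moves the mean face toward a particular identity and the second sum adds the expression on top.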

To render synthetic head images, a full perspective camera is assumed that maps model-space 3D points v via camera space to 2D points on the image plane. The perspective mapping Π contains the multiplication with the camera intrinsics and the perspective division.
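A minimal sketch of such a projection, assuming a simple pinhole camera with a single focal length and a principal point (the function name and parameters are my own, not the paper's):

```python
import numpy as np


def project_points(points_model, rotation, translation, focal, principal_point):
    """Map Nx3 model-space points to Nx2 image-plane points (pinhole camera)."""
    # Rigid transform: model space -> camera space.
    points_cam = points_model @ rotation.T + translation
    # Camera intrinsics (assumed form: one focal length, no skew).
    cx, cy = principal_point
    intrinsics = np.array([[focal, 0.0, cx],
                           [0.0, focal, cy],
                           [0.0, 0.0, 1.0]])
    homogeneous = points_cam @ intrinsics.T
    # Perspective division by the depth component.
    return homogeneous[:, :2] / homogeneous[:, 2:3]
```

The two steps inside the function mirror the description of Π: a multiplication with the intrinsics followed by the perspective division.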

In addition, based on a distant illumination assumption, spherical harmonics basis functions are used to approximate the incoming radiance from the environment.

The shading of the i-th vertex is then determined by the number of spherical harmonics bands B, the spherical harmonics coefficients γ_b, and the reflectance r_i and unit normal vector n_i of that vertex.
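In the usual formulation, the shaded color of vertex i is c_i = r_i · Σ_b γ_b Y_b(n_i), where b runs over the B² basis functions Y_b. The sketch below evaluates this for B = 3 bands (9 basis functions). The symbol c_i, the choice of three bands, the RGB layout of `gamma`, and the sign convention of the basis are assumptions on my side.

```python
import numpy as np


def sh_basis_3_bands(normals):
    """Evaluate the 9 real spherical harmonics (bands 0..2) at Nx3 unit normals."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([
        0.282095 * np.ones_like(x),          # band 0
        0.488603 * y,                        # band 1
        0.488603 * z,
        0.488603 * x,
        1.092548 * x * y,                    # band 2
        1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ], axis=1)


def shade_vertices(reflectance, normals, gamma):
    """reflectance: Nx3 RGB, normals: Nx3 unit vectors, gamma: 9x3 SH coefficients."""
    irradiance = sh_basis_3_bands(normals) @ gamma  # Nx3, one column per channel
    return reflectance * irradiance
```

With only the constant term set, every vertex receives the same light regardless of its normal; the higher bands add directional variation.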

2.3 Synthetic Conditioning Input

Using the face reconstruction approach described above, a face is reconstructed in each frame of the source and target video. Next, the rigid head pose, expression, and eye gaze of the target actor are modified. All parameters are copied in a relative manner from the source to the target.
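The post does not spell out what "relative" means here. One plausible reading, shown below for a flat list of coefficients such as expressions, is that the source's change with respect to a reference frame is added to the target's reference values. Both this interpretation and the helper name `transfer_relative` are assumptions; rotations in particular would need a proper composition rather than a plain difference.

```python
def transfer_relative(source_frame, source_reference, target_reference):
    """Apply the source's offset from its reference frame to the target's reference."""
    if not (len(source_frame) == len(source_reference) == len(target_reference)):
        raise ValueError("all coefficient vectors must have the same length")
    return [
        t_ref + (s - s_ref)
        for s, s_ref, t_ref in zip(source_frame, source_reference, target_reference)
    ]
```

The point of a relative copy is that a neutral source face maps to a neutral target face even when the two actors' resting coefficients differ.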

Then the authors render synthetic conditioning images of the target actor’s face model under the modified parameters using hardware rasterization.

For each frame, three different conditioning inputs are generated: a color rendering, a correspondence image, and an eye gaze image.

The color rendering shows the modified target actor model under the estimated target illumination, while keeping the target identity (geometry and skin reflectance) fixed. This image provides a good starting point for the following rendering-to-video translation, since in the face region only the delta to a real image has to be learned.

A correspondence image encoding the index of the parametric model’s vertex that projects into each pixel is also rendered to keep the 3D information.

Finally, a gaze map supplies information about the eye gaze direction and blinking.

All of the images are stacked to obtain the input to the rendering-to-video translation network.
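As a rough sketch of the stacking, assume each of the three conditioning inputs is a 3-channel image and that several consecutive frames are combined along the channel axis. The channel counts, the window length, and the function name are my assumptions for illustration.

```python
import numpy as np


def stack_conditioning(color, correspondence, gaze):
    """Stack per-frame conditioning images into one network input.

    Each argument has shape (T, H, W, 3) for a window of T frames.
    Returns an array of shape (H, W, 9 * T), ordered frame by frame.
    """
    if not (color.shape == correspondence.shape == gaze.shape):
        raise ValueError("all conditioning inputs must share one shape")
    per_frame = np.concatenate([color, correspondence, gaze], axis=-1)  # (T, H, W, 9)
    t, h, w, c = per_frame.shape
    return per_frame.transpose(1, 2, 0, 3).reshape(h, w, t * c)
```

For a window of 11 frames this would give 99 input channels per pixel; the exact window size used by the authors should be taken from the paper.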

2.4 Rendering-To-Video Translation

The stacked space-time conditioning images are the input to the rendering-to-video translation network.

The network learns to convert the synthetic input into full frames of a photo-realistic target video, in which the target actor now mimics the head motion, facial expression, and eye gaze of the synthetic input.

The network learns to synthesize the entire actor in the foreground, i.e., the face for which conditioning input exists, but also all other parts of the actor, such as hair and body, so that they comply with the target head pose.

It also synthesizes the appropriately modified and filled-in background, even including some consistent lighting effects between the foreground and background.

The network shown in Figure 4 follows an encoder-decoder architecture and is trained in an adversarial manner.

The training objective function comprises a conditioned adversarial loss and L1 photometric loss.

During adversarial training, the discriminator D tries to get better at classifying given images as real or synthetic, while the transformation network T tries to get better at fooling the discriminator. The L1 loss penalizes the distance between the synthesized image T(x) and the ground-truth image Y, which encourages sharpness in the synthesized output.
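In the standard conditional-GAN formulation, which I assume the paper follows, the two terms combine into the objective below. The weight λ and the names E_{cGAN} and E_{\ell_1} are my notation, and the exact form should be checked against the paper.

```latex
\begin{aligned}
T^{*} &= \arg\min_{T} \max_{D} \; E_{cGAN}(T, D) + \lambda \, E_{\ell_1}(T) \\
E_{cGAN}(T, D) &= \mathbb{E}_{x, Y}\big[\log D(x, Y)\big]
  + \mathbb{E}_{x}\big[\log\big(1 - D(x, T(x))\big)\big] \\
E_{\ell_1}(T) &= \mathbb{E}_{x, Y}\big[\lVert Y - T(x) \rVert_1\big]
\end{aligned}
```

Here x is the conditioning input, Y the ground-truth frame, and λ balances realism (the adversarial term) against fidelity to the ground truth (the L1 term).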

3. Experiments & Results

This approach enables us to take full control of the rigid head pose, facial expression, and eye motion of a target actor in a video portrait, thus opening up a wide range of video rewrite applications.

3.1 Reenactment under full head control

This approach is the first that can photo-realistically transfer the full 3D head pose (spatial position and rotation), facial expression, as well as eye gaze and eye blinking of a captured source actor to a target actor video.

Figure 5 shows some examples of full-head reenactment between different source and target actors. Here, the authors use the full target video for training and the source video as the driving sequence.

As can be seen, the output of their approach achieves a high level of realism and faithfully mimics the driving sequence, while still retaining the mannerisms of the original target actor.

3.2 Facial Reenactment and Video Dubbing

Besides full-head reenactment, the approach also enables facial reenactment. In this experiment, the authors replaced the expression coefficients of the target actor with those of the source actor before synthesizing the conditioning input to the rendering-to-video translation network.

Here, the head pose and position and eye gaze remain unchanged. Figure 6 shows facial reenactment results.

The approach could also be applied to video dubbing, by modifying the facial motion of actors who originally speak another language to match an English translation spoken by a professional dubbing actor in a dubbing studio.

More precisely, the captured facial expressions of the dubbing actor could be transferred to the target actor, while leaving the original target gaze and eye blinks intact.

4. Discussion

In this post, I presented Deep Video Portraits, a novel approach that enables photo-realistic re-animation of portrait videos using only an input video.

In contrast to existing approaches that are restricted to manipulations of facial expressions only, the authors are the first to transfer the full 3D head position, head rotation, face expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor.

The authors have shown, through experiments and a user study, that their method outperforms prior work, both in terms of model performance and expanded capabilities. This opens doors to many applications, like video reenactment for virtual reality and telepresence, interactive video editing, and visual dubbing.

5. Conclusions

As always, if you have any questions or comments feel free to leave your feedback below or you can always reach me on LinkedIn.

Till then, see you in the next post! 😄

For the enthusiastic reader:
For more details on “Deep Video Portraits”, check out the official project page or the authors’ video demo.