Semantic Image Inpainting with Context Encoders

In this post, we look at the concepts behind semantic image inpainting and at the techniques that have been developed for it.

The focus is on one effective and elegant state-of-the-art approach: context encoders, which combine an encoder-decoder network with a generative model.

Let’s get going!

The model and architecture discussed in this post are taken from the research paper linked below. Please refer to the paper for the complete implementation:

Research Paper Link

What is Semantic Image Inpainting?

Imagine having an old but precious photograph and finding out that a few parts of it have been corrupted. You want to cherish the moment captured by the photograph forever and recover the image. Image inpainting can save the day!

Given a corrupted image with parts of the image missing or distorted, semantic image inpainting refers to the filling up of these regions with the help of the available visual data. Semantic image inpainting, as its name suggests, also takes into account the context of the image and pixels surrounding the missing regions while filling these regions.

A few real world applications of semantic image inpainting:

Restoring History: Semantic image inpainting can help restore historical artifacts and give them new life. It can also help recover ancient hieroglyphics, books, and images of high heritage value with far less manual effort from subject-matter experts.
Reviving Deteriorating Art: Semantic image inpainting can prove to be a great tool for museums to revive their art. Restoring art requires great expertise and time if performed manually by artists.

Techniques Used for Semantic Image Inpainting

Various techniques have emerged over time to fulfill the need of reviving distorted images and providing satisfactory results close to the ground truth.

A few of them are discussed below:

Computer-Vision Based Techniques

Most of the earlier computer vision techniques were based on scene completion. The model had a huge database of millions of images. When it encountered a corrupt image, it matched the intact part of that image against the database and, upon finding a highly similar image, copied a patch from it into the corrupted region. This technique only works for small regions.

Pixel-Based Neighborhood Techniques

Given an image with a few regions missing, pixel-based techniques at their core apply KNN (K-nearest neighbors) to fill these corrupted pixels with the pixel having the highest vote. A point to be noted here is that only pixels close to the corrupted region are given a vote, since it is highly likely that the pixels close to the region are more similar in context than the entire image. These non-semantic techniques, however, do not work when an entire scene or object is missing.
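The neighbourhood vote described above can be sketched in a few lines. This is a simplified illustration, not a full KNN implementation: each missing pixel takes the most common value among the known pixels around it. The function name and the `radius` parameter are hypothetical choices made for this example.

```python
from collections import Counter

def fill_by_neighbor_vote(image, mask, radius=1):
    """Fill missing pixels with the most common value among nearby known pixels.

    image:  2D list of hashable pixel values (e.g. quantized grey levels).
    mask:   2D list of bools, True where the pixel is missing.
    radius: neighbourhood half-width (a hypothetical parameter for this sketch).
    """
    height, width = len(image), len(image[0])
    filled = [row[:] for row in image]
    for r in range(height):
        for c in range(width):
            if not mask[r][c]:
                continue
            votes = Counter()
            # Only pixels close to the missing one are allowed to vote.
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < height and 0 <= cc < width
                    if inside and not mask[rr][cc]:
                        votes[image[rr][cc]] += 1
            if votes:  # leave the pixel untouched if no known neighbour exists
                filled[r][c] = votes.most_common(1)[0][0]
    return filled
```

Because every vote comes from a small window, a hole larger than the window has no known pixels to draw on, which is why this family of methods breaks down when an entire object is missing.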

Context-Based Machine Learning Techniques

Context-based techniques form the base of the technique discussed in the following sections of this post. While trying to perform semantic image inpainting, let us think about how our brain might perceive the same task. When we look at a picture and its distorted regions, we take into account the context of the image and the visual information surrounding the missing region and try to recreate it in our brain. The approach to semantic image inpainting that we look at here follows a similar idea.

For reconstructing an image, we need two types of information:

Perceptual Information: Understanding the content of the image.
Contextual Information: Identifying and relating contents in the image as a complete picture.

Below we are going to focus on a technique which involves a combination of context-based techniques and auto-encoder networks.

Auto-Encoders

Auto-encoders are neural networks which take an input image and try to learn its feature representation — they try to retain all possible information about an image by passing it through a low-dimensional “bottleneck” layer so as to gain a compact representation of the image and the scene presented in the image. These networks are made up of:

Encoder: Downsamples the image by passing it through convolutional filters and ends in a bottleneck layer, which provides the compact feature representation of the image.

Decoder: Takes the representation provided by the encoder as input and tries to reconstruct the image from it.

Context Encoders

The model used in the above research paper is known as a context encoder, since the network consists of two parts: an encoder that captures the context of an image in a feature representation, and a decoder that uses this representation to fill in the missing content.

Issue with Conventional Auto-Encoders

Conventional auto-encoders are not fit for the problem of image inpainting because they essentially just compress and decompress the image without learning any semantic information about the image. The network is not conditioned to gain an understanding about the context of the image.

Dataset

The images for training the context encoder network are taken from the ImageNet dataset and the Paris Street View dataset. Each image is masked by setting the pixels of the dropped-out (corrupt) region to zero. The masks can be of arbitrary shape and size, but three different strategies were used:

Central Region: A block covering 1/4th of the image is removed from the centre.

Random Block: Several possibly overlapping blocks at arbitrary positions, together covering 1/4th of the image, are removed.

Random Region: Arbitrary shapes covering 1/4th of the image are removed. This approach seemed to work well with the model.
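The first two masking strategies can be sketched with NumPy as follows. This is a minimal illustration rather than the paper's code: the function names, the 32-pixel block size, and the stopping rule are assumptions, and the image is assumed to be at least as large as one block. The random-region strategy (arbitrary shapes) is not covered here.

```python
import numpy as np

def central_mask(height, width):
    """Binary mask: 0 inside a centred block covering 1/4 of the area, 1 elsewhere."""
    mask = np.ones((height, width), dtype=np.float32)
    h, w = height // 2, width // 2  # half of each side gives a quarter of the area
    top, left = (height - h) // 2, (width - w) // 2
    mask[top:top + h, left:left + w] = 0.0
    return mask

def random_block_mask(height, width, rng, block=32, target=0.25):
    """Drop possibly overlapping square blocks until about `target` of the pixels are gone.

    `block` and `target` are illustrative defaults, not values quoted from the paper.
    """
    mask = np.ones((height, width), dtype=np.float32)
    while 1.0 - mask.mean() < target:
        top = rng.integers(0, height - block + 1)
        left = rng.integers(0, width - block + 1)
        mask[top:top + block, left:left + block] = 0.0
    return mask

def apply_mask(image, mask):
    """Set dropped pixels to zero; `image` has shape (H, W, C)."""
    return image * mask[..., None]
```

A typical call would be `apply_mask(image, random_block_mask(128, 128, np.random.default_rng(0)))`, which yields the zeroed-out input that is fed to the encoder.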

Network Architecture

Encoder

The encoder is derived from the AlexNet architecture: it uses the five convolutional layers and the max pooling layer that follows them. Given an image of size 227×227, the encoder produces a 6×6×256 dimensional feature representation. The weights of the model are randomly initialized.
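To see where the 6×6 spatial size comes from, the sizes can be traced layer by layer. The sketch below assumes the standard AlexNet kernel sizes, strides, and paddings; these hyperparameters are not listed in this post, so treat them as an assumption.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding), assuming the standard AlexNet feature extractor.
ALEXNET_LAYERS = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]

def trace_sizes(input_size=227):
    """Return the spatial size after every layer, starting from the input."""
    sizes = [("input", input_size)]
    size = input_size
    for name, kernel, stride, padding in ALEXNET_LAYERS:
        size = out_size(size, kernel, stride, padding)
        sizes.append((name, size))
    return sizes

for name, size in trace_sizes():
    print(f"{name}: {size}x{size}")
```

Under these assumptions the trace runs 227, 55, 27, 27, 13, 13, 13, 13, 6, and the last convolutional layer has 256 channels, which gives the 6×6×256 representation.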

Note: For the purpose of using an adversarial loss, a discriminator network is also used. To learn more about generative models, refer to this series.

Channel-wise fully-connected layer

The channel-wise fully-connected layer joins the output of the encoder to the input of the decoder and handles the propagation of information within the feature representation. It is essentially a fully-connected layer with groups: every node is connected to every other node within the same feature map, but there are no connections between different feature maps. If the input has m feature maps of size n×n, this layer outputs m feature maps of size n×n, and it is followed by a stride 1 convolution that propagates information across channels. Being fully connected only within groups, it needs far fewer parameters (mn⁴) than a conventional fully-connected layer (m²n⁴) and thus saves computation time.

This layer allows each unit of the decoder to perceive the complete image content.
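A small NumPy sketch makes the parameter saving concrete. Each feature map gets its own dense weight matrix of size n²×n², and nothing connects one map to another. The function names are illustrative, biases are ignored, and the stride 1 convolution that follows the layer is left out.

```python
import numpy as np

def channelwise_fc(features, weights):
    """Apply an independent dense layer to every feature map.

    features: array of shape (m, n, n)
    weights:  array of shape (m, n*n, n*n), one weight matrix per feature map
    """
    m, n, _ = features.shape
    flat = features.reshape(m, n * n)
    # For each map m: out[m] = weights[m] @ flat[m]; no mixing across maps.
    out = np.einsum("mij,mj->mi", weights, flat)
    return out.reshape(m, n, n)

def parameter_counts(m, n):
    """Weights in a channel-wise layer versus a conventional fully-connected layer."""
    channelwise = m * n ** 4
    full = m ** 2 * n ** 4
    return channelwise, full

print(parameter_counts(256, 6))  # (331776, 84934656)
```

For the 6×6×256 encoder output this is roughly 0.33 million weights instead of about 85 million, a reduction by a factor of m = 256.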

Decoder

The output from the channel-wise fully-connected layer is fed to the decoder, which reconstructs the image and the missing content. The decoder has five up-convolution layers, each followed by a ReLU activation function, which together bring the features up to the target image size. An up-convolution can be thought of as upsampling followed by a convolution, which results in an image of higher resolution.
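The "upsample, then convolve" view of an up-convolution can be illustrated on a single channel with NumPy. This is only a conceptual sketch: real decoders use learned filters over many channels, and the nearest-neighbour upsampling, the odd square kernel, and the function names here are assumptions made for the example.

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Repeat every pixel `factor` times along both spatial axes."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def conv2d_same(x, kernel):
    """Single-channel 2D convolution (cross-correlation) with zero padding.

    Assumes a square kernel with odd side length so the output keeps the input size.
    """
    k = kernel.shape[0]
    padded = np.pad(x, k // 2)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

def up_convolution(x, kernel, factor=2):
    """Upsample, convolve, then apply ReLU, as one decoder stage would."""
    return np.maximum(conv2d_same(upsample_nearest(x, factor), kernel), 0.0)
```

With `factor=2`, a 6×6 feature map becomes 12×12 after one stage, so stacking several such stages grows the bottleneck features toward the output resolution.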

Loss Function

To understand how well or poorly our model is working, we monitor the value of the loss function over several iterations. A loss function measures the discrepancy between the predicted and expected values, so it gives us an idea of how well the model fits the data. To learn more about loss functions, refer to this article.

Designing the loss function for semantic image inpainting is a tricky task. There are two factors around which the loss function should be constructed:

The overall image should be as realistic as possible after filling the missing region.
The context and the overall structure of the missing content must be captured.

Thus, there is a need for a loss function which is a combination of two losses, known as the joint loss:

Re-Construction Loss

The re-construction loss is essentially a masked and normalized L2 loss, that is, the sum of the squared differences between the predicted and expected pixel values over the missing region.
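Written out with the mask convention used in this post (M is 0 on the missing pixels and 1 elsewhere), the masked L2 loss can be expressed as below. This is a reconstruction from the description, with ⊙ denoting element-wise multiplication, and the normalization constant is left implicit.

```latex
\mathcal{L}_{rec}(x) = \left\lVert (1 - M) \odot \bigl( x - F(M \odot x) \bigr) \right\rVert_2^2
```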

Here, x is the image, F is the context encoder, and M is a binary mask that is 0 for corrupt regions and 1 for the rest of the image. The re-construction loss is able to capture a rough outline of the missing object but does not recover fine details. The reason is that L2 works on pixel-wise differences, and minimizing it favors an average of the plausible fillings, which leads to overly smooth textures that are undesirable. Thus, another loss function is needed in addition to the re-construction loss.
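The masked L2 term can also be sketched numerically. The snippet below assumes NumPy arrays and the mask convention stated above (0 on missing pixels); dividing by the number of missing values is one possible normalization, chosen here as an assumption.

```python
import numpy as np

def reconstruction_loss(image, prediction, mask):
    """Mean squared error computed over the missing pixels only.

    image, prediction: arrays of shape (H, W, C)
    mask: array of shape (H, W) or (H, W, C); 0 on missing pixels, 1 elsewhere
    """
    missing = 1.0 - mask
    if missing.ndim == image.ndim - 1:
        missing = missing[..., None]  # broadcast an (H, W) mask over the channels
    missing = np.broadcast_to(missing, image.shape)
    squared_error = missing * (image - prediction) ** 2
    # Normalize by the number of missing values (an assumed choice of normalization).
    return squared_error.sum() / max(missing.sum(), 1.0)
```

Errors outside the hole are multiplied by zero, so the network is only penalized for what it predicts inside the missing region.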

Adversarial Loss

This loss is derived from generative adversarial networks (GANs). The learning procedure involves two networks. An adversarial discriminator D is provided with real images and with samples produced by a second network, the generator G, and tries to distinguish the real ones from the fake ones. G, in turn, tries to fool the discriminator by generating images that appear realistic.

For more information on GANs, refer to this series.

To accommodate the adversarial loss, the context encoder F is taken as G. The loss function encourages the entire output image, not only the missing content, to look realistic. Thus, the adversarial loss keeps a check on the images generated by the decoder and favors the generation of sharper and more realistic images.
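With the context encoder F in the role of the generator, and using this post's mask convention (M is 0 on the missing pixels, so M ⊙ x is the corrupted input), the adversarial loss can be written as follows. The notation is reconstructed from the description and should be checked against the paper.

```latex
\mathcal{L}_{adv} = \max_{D} \; \mathbb{E}_{x \in \mathcal{X}} \Bigl[ \log D(x) + \log \bigl( 1 - D\bigl( F(M \odot x) \bigr) \bigr) \Bigr]
```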

In the above formula, we try to increase the probability of the discriminator being right, so that the generator is compelled to perform better with time and generates realistic images to try and fool the discriminator.

Training Parameters

The model was developed in Caffe and Torch and was trained to minimize the joint loss discussed in the previous section. Results are slightly better if the images are resized to 128×128. The network uses Adam for optimization. The coefficient values for the combined loss function are λrec = 0.999 and λadv = 0.001. A learning rate of 10⁻⁴ was used for the images with random region corruption. The network was trained for almost 100k iterations, which took about 14 hours on a Titan X GPU.
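With these coefficients, the joint loss is simply a weighted sum of the two terms:

```latex
\mathcal{L} = \lambda_{rec} \, \mathcal{L}_{rec} + \lambda_{adv} \, \mathcal{L}_{adv},
\qquad \lambda_{rec} = 0.999, \quad \lambda_{adv} = 0.001
```

The re-construction term dominates, and the small adversarial weight acts as a nudge toward sharper, more realistic outputs.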

Note: For sharper inpainting, the pooling layers of the entire network were replaced with convolutions of the same kernel size and stride, resulting in a pool-free encoder. This suggests that pooling is best avoided for re-construction based problems.

Results

The context-encoder model produced noticeably sharper results when filling in the missing regions from the contextual information available to it, and its results compared favorably with those of other image inpainting models. A few glimpses of the results are shown below:

Conclusion

In this post, we introduced semantic image inpainting as a real-world problem and discussed a state-of-the-art approach at length. Image inpainting is an interesting and widely popular problem, and context-based auto-encoders seem to handle it satisfactorily.

All feedback is welcome and appreciated — I’d love to hear what you think of this article.