Hello readers! In this article I’ll be discussing another cool trick we can do with deep learning. Specifically, we’ll see how to change the background in an image or in a video, just like they do in news rooms and in movies.
In those cases they use (expensive and bulky) green screens to achieve this task.
We’ll simply be using the magic of deep learning to achieve that feat.
I have explained a bit about how we can make a computer intelligent and make it do some unbelievable stuff on images or videos using convolutional neural networks (CNNs) in my previous blog, where I gave an overview about how we can make use of these CNNs to track people in a video:
In this effort to change image/video frame backgrounds, we’ll be using image segmentation and image matting. I’ll provide a brief overview of both tasks, and then I’ll explain how to combine them.
Overview: Image Segmentation
Image segmentation is, essentially, a classification task in which we classify each pixel as belonging to one of the target classes. So when you pass an image through a segmentation model, it assigns one label to each pixel present in the image.
If we then color each pixel based on the class that pixel belongs to, we’ll be able to easily locate objects and their boundaries. Here’s what this looks like in practice:
The image on the left is the input image, and the one on the right is the output. As you can see, each pixel belonging to a particular target class is a different color. In this case, pixels belonging to houses are red, and pixels belonging to the non-road ground are blue.
This way, we can get the location of objects and all the pixels belonging to that object in a given image. This gives us a fine-grain understanding of an image.
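To make the pixel-labeling idea concrete, here’s a minimal NumPy sketch of how a segmentation output is colorized. The class indices and palette colors are hypothetical, chosen to mirror the houses-red, ground-blue figure above.

```python
import numpy as np

# Hypothetical classes: 0 = road, 1 = house, 2 = non-road ground.
# The palette assigns each class a display color.
palette = np.array([
    [128, 128, 128],  # class 0: road  -> gray
    [255, 0, 0],      # class 1: house -> red
    [0, 0, 255],      # class 2: non-road ground -> blue
], dtype=np.uint8)

# A toy per-pixel class map, as a segmentation model would output.
mask = np.array([
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [2, 2, 2, 2],
])

# Index the palette with the class map to get an HxWx3 color image.
colored = palette[mask]
print(colored.shape)  # (3, 4, 3)
```

One line of fancy indexing (`palette[mask]`) is all it takes to turn a label map into the kind of overlay shown in the figure.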
Now that we understand the “what” of segmentation, let’s discuss the “how”.
Broadly, segmentation can be divided into two categories:
- Instance Segmentation
- Semantic Segmentation
Again, it’s my duty to provide some explanation of both of these categories; otherwise this wouldn’t be a good article.
Semantic segmentation involves detecting objects in an image, where similar objects are labeled as one predefined class. So if there are 4 people in an image, all 4 will be labeled as one class. The meaning can also be derived from the name: all objects having the same semantics are labeled as one class. Fig 1 (above) is an example of semantic segmentation.
Instance segmentation takes things a step further. Each pixel is classified based on the object instance it belongs to. So if there are 4 persons in an image, then each of those 4 people will be classified as a different entity. An example below:
How to perform semantic segmentation
A widely used model architecture for semantic segmentation is called U-Net. It’s a stack of CNN layers arranged such that the output of each layer first gets smaller, and then increases back to the same size as the input image. The architecture is visualized in the image below. It has a U shape overall (hence, ‘U’-Net).
As you can see, the input is first made smaller and smaller to learn the features using convolutional kernels, and then the output is increased in size to match the size of the input image.
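The encoder-then-decoder idea can be sketched as a toy U-Net in PyTorch. This is a minimal sketch, not the full architecture from the U-Net paper: one downsampling step, one upsampling step, and one skip connection carrying fine detail across the “U”.

```python
import torch
import torch.nn as nn

def block(in_ch, out_ch):
    # Two 3x3 convolutions, the basic building block of U-Net.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
    )

class TinyUNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.enc1 = block(3, 16)
        self.enc2 = block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = block(32, 16)          # 32 = 16 (skip) + 16 (upsampled)
        self.head = nn.Conv2d(16, n_classes, 1)

    def forward(self, x):
        s1 = self.enc1(x)                          # full-resolution features
        s2 = self.enc2(self.pool(s1))              # half resolution: the bottom of the "U"
        u = self.up(s2)                            # back up to full resolution
        d = self.dec1(torch.cat([u, s1], dim=1))   # skip connection from the encoder
        return self.head(d)                        # one score per class, per pixel

x = torch.randn(1, 3, 64, 64)
out = TinyUNet()(x)
print(out.shape)  # torch.Size([1, 2, 64, 64])
```

Note that the output has the same spatial size as the input, with one channel of scores per class, which is exactly what per-pixel classification needs.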
We have a labeled target for each input image while training, where a label is assigned to each pixel according to which class it belongs. So when the image is passed through the above-mentioned model architecture, we obtain an output that we can compare to the target image.
Typically, categorical cross-entropy loss is calculated for each pixel, and the average is taken over all the pixels. That loss obtained is then backpropagated and weights of the kernels are adjusted accordingly, using some variant of SGD.
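The per-pixel loss described above is a one-liner in PyTorch: `CrossEntropyLoss` applied to an (N, C, H, W) score map and an (N, H, W) integer label map averages the loss over every pixel. The shapes below are a toy example.

```python
import torch
import torch.nn as nn

n_classes, h, w = 3, 4, 4
logits = torch.randn(1, n_classes, h, w, requires_grad=True)  # model output (scores per pixel)
target = torch.randint(0, n_classes, (1, h, w))               # per-pixel class labels

# Cross-entropy per pixel, averaged over all pixels by default.
loss = nn.CrossEntropyLoss()(logits, target)
loss.backward()  # gradients flow back so SGD can adjust the kernel weights
```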
When training is done, we obtain a label for each pixel of any given input image, and we can overlay the predictions accordingly, just like in the example images above.
If you want to dive deeper into U-Net architecture, check out the following article, which explains the architecture in detail:
How to do Instance Segmentation
Instance segmentation is usually done using the Mask R-CNN architecture. This also has a bunch of CNNs, but the difference is in the way they are arranged. Mask R-CNN is a two-stage detector, which I explained in my previous article.
There’s a convolutional backbone that extracts the features from the image and sends them to the first stage.
In the first stage, a number of regions are proposed by a region proposal network. These regions are the regions of interest in the image. You can think about them as a bunch of rectangular cutouts from the image, which can be of different sizes and used for further classification.
These regions are then sent to the next stage, which again uses CNN kernels to classify them pixel-wise and determine where in each region the object is located, giving us a bounding box around that object. Each region is proposed by predicting the four coordinates of its rectangular cutout.
In the second stage, after obtaining the proposed regions, we use two heads called the mask head and box head. These heads take all the proposed regions and predict new and refined bounding boxes around the object (done via the box head) and perform pixel-wise classification to obtain a segmentation output (done via the mask head).
The final output is a set of bounding boxes and segmentation masks for each object in the image. There is a target box location and target segmentation mask, against which the output is compared and loss is calculated. The error is then backpropagated through the whole network.
The error that’s backpropagated adjusts the weights in all the layers: in the region proposal network (RPN) and in both heads. That means the weights are adjusted throughout the network. If you connect the dots, you can see that the bounding box acts as a signal to enhance the segmentation output, and vice versa.
Also, both are acting as a signal for the RPN layer to adjust the weights, such that the initial proposals are also correct.
To get a deeper understanding of how Mask RCNN actually works, check out this deep dive:
Now that we have a basic understanding of the what and how of image segmentation, we’ll focus on instance segmentation for our specific task (changing backgrounds). Next, we’ll explore image matting.
Overview: Image Matting
Image matting is a task in which an object is cut out from an image and its opacity is estimated. This is also done using deep learning. An image is passed through a CNN architecture, from which a foreground and background are predicted. We can define the foreground as the object of interest and the background as the rest of the image. In our case (and in many cases) the person is the foreground.
Along with foreground and background, one more thing is predicted—the alpha matte. This alpha matte defines the opacity value in the image per pixel. The value lies between 0–1, with 1 being completely opaque and 0 being completely transparent. As it is defined for each pixel, it has to be of the same size as that of the image.
To help better explain this concept, let’s consider an example.
Given an input image, we’ll have a target object, which we’ll consider as the foreground, and the rest of the image as background. The alpha values for the pixels corresponding to the foreground will be greater than 0 and will depend upon the transparency of the object. So we can multiply the alpha values of the corresponding object’s pixels with the RGB values of those pixels.
Typically, the image is described by the following compositing formula:

C_i = α_i F_i + (1 − α_i) B_i

where C_i is the observed color value at pixel site i, F_i and B_i are the pixel colors in the foreground and background layers at that same site, and α_i is the level of mixing between the foreground and background.
How image matting works with segmentation
To connect image matting with the primary task at hand (segmentation), let me relate the two, and then take a look at how image matting is done using deep learning.
First, we’ll detect the person using image segmentation. The alpha values corresponding to those pixels should be 1, as a person is totally opaque, and the alpha value for the image background should be zero. Hence, we can multiply the alpha values with the image and obtain the output—which will be the person standing and a black background (because the alpha will be zero for the background and its multiplication with RGB values will result in black pixels).
So if segmentation alone can already cut out the person, why do we need matting at all? The answer lies in the performance of the image segmentation model. The model is undoubtedly very good at localizing the person, but some important details might be missing.
For example, let’s say a person is correctly segmented, but that person’s hair or part of their body outline is not clearly segmented. If you cut that remaining part out, you won’t get a smooth boundary around the person. As such, there can be cases where some background is also labeled as the person’s body. This will make our output visually unappealing, at the very least.
Therefore, we need one more refinement on top of the segmentation output, which will make the boundaries clearer. That is where image matting using deep learning comes in.
You can see the output for a test image below. Notice how the model is able to predict the hair as foreground with great detail. (Left to right) The first is the actual image, second is its alpha matte, and the last is the extracted foreground overlaid on another background, which is obtained using the formula above.
I will connect all the dots at the end of this post. But for now, let’s develop a basic idea about our task and learn a bit more about image matting and how it is done.
How image matting works
To perform image matting, we need two things—first, the actual image and second, the trimap of the image.
Now, the obvious question is, what is a trimap? As the name suggests, it’s a map of the original image containing three values: one for the foreground, one for the background, and one for the ambiguous region where we aren’t sure whether a pixel belongs to the foreground or background.
Let’s keep this task related to our final goal. We’ll define foreground pixels as those pixels in the middle of the objects, and the ambiguous area will be the boundary of the object. It will also include just a bit of the background area. The rest of the pixels will be labeled as background pixels.
For example, in the above image, you can see that the areas we’re sure about are shown as white pixels, and the areas around the hair and boundaries with intricate details are labeled as ambiguous regions. The rest is black, which is the clearly-defined background.
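One common way to build such a trimap from a binary mask (an assumption here, not necessarily how the figure above was made) is morphological erosion and dilation: erode the mask to keep only confident foreground, dilate it to grab a little background, and label the band in between as ambiguous.

```python
import numpy as np
from scipy import ndimage

# A toy binary "person" mask.
mask = np.zeros((9, 9), dtype=bool)
mask[2:7, 2:7] = True

sure_fg = ndimage.binary_erosion(mask, iterations=1)   # shrink: confident foreground
maybe = ndimage.binary_dilation(mask, iterations=1)    # grow: mask plus a bit of background

trimap = np.zeros(mask.shape, dtype=np.uint8)
trimap[maybe] = 128        # ambiguous band around the boundary
trimap[sure_fg] = 255      # confident foreground
print(np.unique(trimap))   # [  0 128 255]
```

The width of the ambiguous band is controlled by the number of erosion/dilation iterations; wider bands give the matting model more room to recover fine detail like hair.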
So the trimap, along with the original image, is fed into the model to predict the alpha matte, such that the final result looks like the image below:
Here, the image on the right is the predicted alpha matte, and the image on the left is the input image. You can see how it captured the fine-grain details of the hair.
Now we know how image matting helps refine and improve our segmentation results. Next, I’ll explain a bit more about how we’re able to produce this result. I will be referring to this paper (FBA Matting), which has given us state-of-the-art performance in image matting.
In the above-mentioned paper, the authors take the original image and its corresponding trimap as input to the model and output three things:
- an RGB image of the foreground
- an RGB image of the background
- and a single channel alpha matte.
The network architecture they’re using is similar to the U-Net architecture that I explained above.
A bit more detail
The image and the trimap are concatenated and then fed into the model. Actually, instead of concatenating just the raw trimap, the authors do a bit of fancy work with it.
They first make two channels from the trimap—one channel contains the foreground pixels, and the other one indicates the background pixels (basically, the white pixels in the trimap are the foreground and the black pixels are the background).
They do not explicitly use the ambiguous region. These two channels are then passed through three different Gaussian blur kernels to obtain three outputs for each channel. So in total, they have 6 channels of trimap information and 3 channels from the original image (RGB channel).
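The 9-channel input described above can be sketched as follows. The trimap values (255/128/0) and the blur scales are assumptions for illustration; the paper has its own specific choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

h, w = 32, 32
image = np.random.rand(3, h, w)          # stand-in RGB image, channels first
trimap = np.zeros((h, w), dtype=np.uint8)
trimap[4:12, 4:12] = 255                 # foreground region
trimap[16:28, 16:28] = 128               # ambiguous region (not used directly)

# Split the trimap into foreground and background indicator channels.
fg = (trimap == 255).astype(np.float32)
bg = (trimap == 0).astype(np.float32)

# Blur each indicator at three (hypothetical) scales: 2 channels x 3 scales = 6.
sigmas = [1, 2, 4]
blurred = [gaussian_filter(ch, s) for ch in (fg, bg) for s in sigmas]

# Stack the 3 RGB channels with the 6 trimap channels into a 9-channel input.
model_input = np.concatenate([image, np.stack(blurred)], axis=0)
print(model_input.shape)  # (9, 32, 32)
```

The multi-scale blurs give the network soft, distance-like information about how far each pixel is from the confident foreground and background regions.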
This 9-channel image is then fed into an architecture similar to the U-Net architecture, but with some minor changes. These changes include a 7-channel output: 3 foreground channels, 3 background channels, and 1 alpha matte channel.
The network architecture is divided into two parts—namely an encoder and decoder. The left side of the U-Net architecture, which reduces the size of the image, is typically called the encoder, as it encodes the information in a lower dimensional space.
The output from this encoder is then sent to the decoder, where the output size is increased so that the final output is of same size as that of the input image.
In a simple U-Net architecture, the encoder and decoder are symmetric. In the image matting paper referenced previously, the authors made slight modifications by reducing the number of layers in the decoder compared to the encoder. The decoder layers are still constructed such that the output has the same spatial size as the input, but with 7 channels, as explained above.
Now that we have an architecture, we need a loss function with which to train our network.
In this paper, they are using a combination of 8 different types of losses. Their details don’t matter too much, but I’ll give a brief overview of them.
We know that we have 3 different outputs, so we also need to have targets for these outputs from which the loss can be calculated, the error can be backpropagated, and the weights can be adjusted.
In the paper, the authors have mentioned that for a given image, they have a target foreground image (image where the foreground is there and the other pixels are black), a target background image (where the foreground object is removed), and an alpha map.
So with these target images, they’re using 4 different types of losses:
- L1 loss
- Composition loss
- Gradient loss
- Laplacian Pyramid loss
They’re using these losses with the alpha image and with a combination of the foreground and background—this essentially gives us 8 different losses.
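To give a flavor of these losses, here’s a sketch of two of them in PyTorch: an L1 loss on the predicted alpha, and a composition loss that composites the predicted foreground and background with the alpha and compares the result against the original image. The tensor names and the exact form are my assumptions for illustration, not the paper’s verbatim formulation.

```python
import torch

n, h, w = 1, 8, 8
image      = torch.rand(n, 3, h, w)                         # input image (target for composition)
alpha_gt   = torch.rand(n, 1, h, w)                         # ground-truth alpha matte
alpha_pred = torch.rand(n, 1, h, w, requires_grad=True)     # predicted alpha
fg_pred    = torch.rand(n, 3, h, w, requires_grad=True)     # predicted foreground
bg_pred    = torch.rand(n, 3, h, w, requires_grad=True)     # predicted background

# L1 loss: predicted alpha should match the target alpha per pixel.
l1_alpha = (alpha_pred - alpha_gt).abs().mean()

# Composition loss: alpha*F + (1-alpha)*B should reconstruct the image.
composite = alpha_pred * fg_pred + (1 - alpha_pred) * bg_pred
l_comp = (composite - image).abs().mean()

loss = l1_alpha + l_comp   # the paper sums several such terms
loss.backward()
```

The composition term is what ties the three outputs together: even if the alpha looks plausible on its own, it’s penalized unless it composites F and B back into the observed image.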
Above, you can see some very unfamiliar terms—the exact details don’t really matter, but I posted them here for those who want to do a bit more research on this.
Here’s one more dangerous-looking image that elaborates a bit more about all the losses:
On top of these losses, the authors also introduce one more thing, which they call FBA fusion, where the outputs F, B, and alpha are related mathematically using Bayesian statistics—which is even more dangerous-looking, so for those of you who are mathematics enthusiasts, you can dive deeper in the paper.
Using this combination of loss functions, the model is trained by comparing with the target, finding the error, and then backpropagating that and using some variant of SGD to update the weights. These all are basic steps. They also have some minor details of implementation, which you can read in the paper if you want.
Summing up: We give foreground and background information to the model, then the model has to figure out which pixel belongs to fg and bg in the ambiguous region by looking at the information that we provided. Hence, during training, it’s able to learn fine-grain details like hair and body outlines.
Combining Everything And Relating To The Task
Now that we know about the fundamentals of image segmentation and image matting, let’s combine them and explain how we’ll use them to extract an image’s background from its foreground.
Also, I will explain how we can actually code this so that it’s API-ready. I’ll cover this code implementation in my follow-up article, and will also show you a very cool use case!
How to Combine Segmentation and Matting
There are already pre-trained models available (which I’ll be explaining how to use in next article) that will give us the segmentation output of a person in a given image. So basically, we’ll just provide an input image, and the model will output a pixel-wise probability mask, where the probability value will tell us about the precise location of the person (i.e. the foreground target). The output will be similar to the image below:
Using these probability masks, we’ll define a threshold (let’s say 0.9), above which we’ll say a pixel definitely belongs to the person. The nearby pixels with values less than 0.9 will be classified as ambiguous regions, and the pixels with a value of zero will be the background.
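This thresholding step is just a few lines of NumPy; the probability values below are made up, and the 0.9 threshold is the one suggested above.

```python
import numpy as np

# A toy per-pixel probability mask from the segmentation model.
prob = np.array([
    [0.00, 0.10, 0.00],
    [0.30, 0.95, 0.99],
    [0.00, 0.92, 0.50],
])

trimap = np.full(prob.shape, 128, dtype=np.uint8)  # default: ambiguous region
trimap[prob > 0.9] = 255                           # definitely the person
trimap[prob == 0.0] = 0                            # definitely background
print(trimap)
```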
Here’s where image matting enters the picture. We’ll pass the original image and the trimap that we generated in the previous step to the FBA Matting model and get the refined alpha matte from the trained model.
We will then use this alpha matte and multiply it with the original image. This will result in all the people being cut out, and the background will be black, as the alpha value for background will be zero.
On top of this, we can change the background with any other background that we want, just by multiplying the alpha with the original image and (1-alpha) with the new background image. Hence, we’ll finally obtain an image with the same people but a different background!
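The final background swap described above is one line of array math. Here’s a toy sketch with solid-color stand-ins for the original photo and the new background:

```python
import numpy as np

h, w = 4, 4
image  = np.full((h, w, 3), 200.0)   # stand-in for the original photo
new_bg = np.full((h, w, 3), 10.0)    # stand-in for the new background
alpha  = np.zeros((h, w, 1))
alpha[1:3, 1:3] = 1.0                # "person" pixels (opaque)

# alpha keeps the person; (1 - alpha) lets the new background show through.
out = alpha * image + (1 - alpha) * new_bg
print(out[0, 0, 0], out[1, 1, 0])  # 10.0 200.0
```

A real alpha matte would have fractional values around hair and boundaries, which is exactly what makes the swap look natural instead of cut-and-paste.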
I’ll explain everything about how to code this and make it easy to use in my next article. So stay tuned!
In the meantime, here is a smart tomato, just for fun.
- FBA Matting paper : https://arxiv.org/abs/2003.07932
- Learn about U-Net : https://towardsdatascience.com/understanding-semantic-segmentation-with-unet-6be4f42d4b47
- Learn about Mask-RCNN : https://towardsdatascience.com/computer-vision-instance-segmentation-with-mask-r-cnn-7983502fcad1