This will be a short and concise tutorial on how to build a facial recognition system with JavaScript, using faceapi.js, which is built on top of TensorFlow.js; hence, we won’t be interacting with TensorFlow.js directly.
Goals ⛳️
- Detect faces in images
- Switch webcam on with JavaScript and recognize specific faces with it
- Add a custom image filter (like Snapchat) to your detected/recognized faces
This is what our final model output should look like
Problem
We want to build an ML system that, when given an input image, tells us whether a face in that image is similar to a face in our database. 😉
To solve this problem, there are several tasks we need to accomplish. Firstly, we must be able to detect a face.
There are different deep learning algorithms for object detection, such as YOLO, SSD, and the like. Since this tutorial isn’t explicitly focused on that task, here’s just the basic idea of what object detection does:
- Leverage training data containing images with bounding boxes framing the target object—see the right-hand side of the image above (the blue boxes).
- Pass an input image through a deep learning model, such as YOLO.
- Predict the bounding box for a detected object, and compare the predicted box with the original bounding box for that image.
Face detection: The same principles used in object detection also apply to detecting a face in an image—only this time, our object is a face rather than, say, a car. The basic steps go like this:
- First, you can decide to annotate the faces yourself—by “annotate,” I mean drawing a rectangular box over the region where the face is located. Tools such as LabelImg or VOTT can be used to annotate the images.
- Or better still, we can download an already-annotated face dataset—you can check here for examples of such datasets.
- Now we have a dataset containing the faces and, for each image, a corresponding file containing the bounding box of the face.
The file train/label/image1.txt contains the bounding box coordinates for the face in train/image1.jpg. The label file contains numbers in a format like the one shown below.
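For example, a hypothetical YOLO-style label line might look like this—the numbers are the class index followed by the box’s center x, center y, width, and height, all normalized by the image dimensions (the exact layout depends on the annotation tool you use):

0 0.48 0.37 0.22 0.31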
- Now we are set to create a model (let’s assume we’re using YOLO) that predicts exactly what’s in train/label/image1.txt when train/image1.jpg is passed as input.
Facial landmarks: These are the coordinate points at which certain facial features (such as the eyes, nose, and mouth) are located. A CNN model is then built, much like the one used for image classification—except this time it’s a regression problem, whose aim is to predict the facial feature points.
Sometimes, while drawing the bounding box for a prediction, the box might not be centered directly on the face. The detected landmark features are used to align the bounding box with the face. faceapi.js always predicts 68 landmark points.
After the face is detected and aligned, we compare it with our reference images. To do this, we pass the detected and aligned face through a facial recognition model (faceapi.js uses a ResNet-34-like architecture), which extracts descriptive features from the input face image. We do the same for the reference images—we pass each of them through this feature extractor to obtain their unique descriptors.
Once this is done, we compare the descriptor extracted from each reference image with the input image’s descriptor by calculating their Euclidean distance. We determine the right face by finding the reference image whose Euclidean distance to the input image is closest to zero, or below a chosen threshold.
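To make the comparison concrete, here’s a minimal sketch of what that distance computation looks like (faceapi.js exposes its own helper for this, so the manual version below is just for illustration):

// descriptors are 128-dimensional arrays produced by the recognition model
function euclideanDistance(desc1, desc2) {
  let sum = 0
  for (let i = 0; i < desc1.length; i++) {
    const diff = desc1[i] - desc2[i]
    sum += diff * diff
  }
  return Math.sqrt(sum)
}

// e.g. if euclideanDistance(inputDescriptor, referenceDescriptor) < 0.6,
// we treat the two faces as belonging to the same person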
Now that we’ve covered an overview of this process, let’s work through an implementation.
To the code 👊
Before we start coding, we need to organize and store our reference images—the images containing the faces we want to recognize—in a dedicated folder. For this tutorial, one of our reference images is the character Berlin from Money Heist, the Netflix series.
As a lover of the series, I decided to try to create a system that can recognize each of the main characters. That’s what we’ll work through below.
To do this, we’ll be using a JavaScript package called faceapi.js.
To get started, create a directory called Face-rec (or name it whatever you’d like). In this directory, create another directory called models. Next, create img and js directories as well, and then create an HTML file—let’s assume it’s named index.html.
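The resulting layout might look something like this (the names are simply the ones used in this tutorial):

Face-rec/
├── index.html
├── models/   (the pre-trained faceapi.js model weights)
├── img/      (input and reference images)
└── js/       (face-api.js and jquery)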
<html>
<head>
<title>Face App</title>
<style>
#overlay, .overlay {
position: absolute;
top: 0;
left: 0;
}
</style>
</head>
<body>
<img src="img/mhm.JPG" id="refimg" />
<canvas id="reflay" class="overlay"></canvas>
<script src="js/jquery-2.1.1.min.js"></script>
<script src="js/face-api.js"></script>
<script>
$(document).ready(function(){
});
</script>
</body>
</html>
The image tag <img /> is used to load the image in the browser, and we also define our canvas tag here. Canvas is used to draw in the browser using HTML and JavaScript, so we’ll be using it to draw the bounding box and the facial landmarks.
We set the canvas position to absolute so that it’s taken out of the normal document flow—this lets us lay the canvas directly over the image and draw on top of the detected faces. faceapi.js and jquery.js are the main libraries we’ll be using; jQuery will help us manipulate the DOM.
$(document).ready(…) is used to check the readiness of the page. If the document (DOM) isn’t ready, it cannot be manipulated with JavaScript.
The next step is to load the deep learning models we’ll be using. To do that, we need to download all the models required from the faceapi.js repo here; or better still, you can clone this repo for this tutorial to have access to the code and model weights.
const MODEL_URL = '/models'
await faceapi.loadSsdMobilenetv1Model(MODEL_URL)
await faceapi.loadFaceLandmarkModel(MODEL_URL)
await faceapi.loadFaceRecognitionModel(MODEL_URL)
Now we’ve loaded the models we need. loadSsdMobilenetv1Model loads the SSD object detection model, and we also load the models for the facial landmark and facial recognition tasks.
But there’s something we need to understand—since we aren’t training a model from scratch, all we need to do is use pre-trained models; hence, the pre-trained models are what’s being loaded in the script above.
So what is a pre-trained model? Once a model is trained, in order to use it for future projects, all we need to do is save the model weights after training. Hence, whenever the model is needed, we can load the weights into the model and then use it to predict.
Now let’s pass the image through all the models at once:
const img= document.getElementById('refimg')
const canvas = $('#reflay').get(0)
let fullFaceDescriptions = await faceapi.detectAllFaces(img).withFaceLandmarks().withFaceDescriptors()
Before moving to the next step, we should draw the bounding box and the landmarks for each of the faces.
faceapi.draw.drawDetections(canvas, fullFaceDescriptions)
faceapi.draw.drawFaceLandmarks(canvas, fullFaceDescriptions)
Let’s join the code together to draw the bounding boxes and landmarks:
<html>
<head>
<title>Face App</title>
<style>
#overlay, .overlay {
position: absolute;
top: 0;
left: 0;
}
</style>
</head>
<body>
<img src="img/mhm.JPG" id="refimg" />
<canvas id="reflay" class="overlay"></canvas>
<script src="js/jquery-2.1.1.min.js"></script>
<script src="js/face-api.js"></script>
<script>
$(document).ready(function(){
async function face(){
const MODEL_URL = '/models'
await faceapi.loadSsdMobilenetv1Model(MODEL_URL)
await faceapi.loadFaceLandmarkModel(MODEL_URL)
await faceapi.loadFaceRecognitionModel(MODEL_URL)
const img= document.getElementById('refimg')
let fullFaceDescriptions = await faceapi.detectAllFaces(img).withFaceLandmarks().withFaceDescriptors()
const canvas = $('#reflay').get(0)
faceapi.matchDimensions(canvas, img)
fullFaceDescriptions = faceapi.resizeResults(fullFaceDescriptions, img)
faceapi.draw.drawDetections(canvas, fullFaceDescriptions)
faceapi.draw.drawFaceLandmarks(canvas, fullFaceDescriptions)
}
face()
})
</script>
</body>
</html>
From the image above, we can see that the landmarks aren’t perfectly centered—an image with multiple detected faces may need additional CSS styling to align the overlay, but for a single face, the current CSS styling should work.
Then we move to the next step of the solution, which is to pass the reference images through the same process as the input images and gather their descriptive features.
const labels = ['prof', 'rio', 'tokyo', 'berlin', 'nairobi']
const labeledFaceDescriptors = await Promise.all(
labels.map(async label => {
// fetch image data from urls and convert blob to HTMLImage element
const imgUrl = `img/${label}.jpg`
const img = await faceapi.fetchImage(imgUrl)
// detect the face with the highest score in the image and compute its landmarks and face descriptor
const fullFaceDescription = await faceapi.detectSingleFace(img).withFaceLandmarks().withFaceDescriptor()
if (!fullFaceDescription) {
throw new Error(`no faces detected for ${label}`)
}
const faceDescriptors = [fullFaceDescription.descriptor]
return new faceapi.LabeledFaceDescriptors(label, faceDescriptors)
})
);
We name each reference image with the name of the face in question and assign it to the variable label. Next, we use faceapi.fetchImage() to load each image from its directory and pass it through the face detection, landmark detection, and descriptor extraction models. Each label is then assigned to its respective features.
And don’t forget the last step, where we match each reference descriptor against the input image’s descriptors using Euclidean distance, with a threshold we can choose.
const maxDescriptorDistance = 0.6
const faceMatcher = new faceapi.FaceMatcher(labeledFaceDescriptors, maxDescriptorDistance)
const results = fullFaceDescriptions.map(fd => faceMatcher.findBestMatch(fd.descriptor))
Here we set a threshold of 0.6—this means a reference image only counts as a match if its Euclidean distance to the input face is less than 0.6. The matched face is then labeled with the name of that reference image.
faceapi.FaceMatcher() takes in the reference feature descriptor and the threshold to initialize the object to calculate the Euclidean distance and apply the threshold. This is done via the faceMatcher.findBestMatch() function by looping through all the features detected for each face in the input image.
We can then go ahead and name the faces:
results.forEach((bestMatch, i) => {
const box = fullFaceDescriptions[i].detection.box
const text = bestMatch.toString()
const drawBox = new faceapi.draw.DrawBox(box, { label: text })
drawBox.draw(canvas)
})
In the code above, we loop through the results from the faceMatcher to get the best reference image for each of the faces in the input image. We grab the bounding box using fullFaceDescriptions[index].detection.box, and we use faceapi.draw.DrawBox(detection_box, label) to draw the labeled box on the canvas.
Let’s wrap the code up into a single file:
<html>
<head>
<title>Face App</title>
<style>
#overlay, .overlay {
position: absolute;
top: 0;
left: 0;
}
</style>
</head>
<body>
<img src="img/mhm.JPG" id="refimg" />
<canvas id="reflay" class="overlay"></canvas>
<script src="js/jquery-2.1.1.min.js"></script>
<script src="js/face-api.js"></script>
<script>
$(document).ready(function(){
async function face(){
const MODEL_URL = '/models'
await faceapi.loadSsdMobilenetv1Model(MODEL_URL)
await faceapi.loadFaceLandmarkModel(MODEL_URL)
await faceapi.loadFaceRecognitionModel(MODEL_URL)
const img= document.getElementById('refimg')
let fullFaceDescriptions = await faceapi.detectAllFaces(img).withFaceLandmarks().withFaceDescriptors()
const canvas = $('#reflay').get(0)
faceapi.matchDimensions(canvas, img)
fullFaceDescriptions = faceapi.resizeResults(fullFaceDescriptions, img)
faceapi.draw.drawDetections(canvas, fullFaceDescriptions)
faceapi.draw.drawFaceLandmarks(canvas, fullFaceDescriptions)
const labels = ['prof', 'rio', 'tokyo', 'berlin', 'nairobi']
const labeledFaceDescriptors = await Promise.all(
labels.map(async label => {
// fetch image data from urls and convert blob to HTMLImage element
const imgUrl = `img/${label}.jpg`
const img = await faceapi.fetchImage(imgUrl)
// detect the face with the highest score in the image and compute its landmarks and face descriptor
const fullFaceDescription = await faceapi.detectSingleFace(img).withFaceLandmarks().withFaceDescriptor()
if (!fullFaceDescription) {
throw new Error(`no faces detected for ${label}`)
}
const faceDescriptors = [fullFaceDescription.descriptor]
return new faceapi.LabeledFaceDescriptors(label, faceDescriptors)
})
);
const maxDescriptorDistance = 0.6
const faceMatcher = new faceapi.FaceMatcher(labeledFaceDescriptors, maxDescriptorDistance)
const results = fullFaceDescriptions.map(fd => faceMatcher.findBestMatch(fd.descriptor))
results.forEach((bestMatch, i) => {
const box = fullFaceDescriptions[i].detection.box
const text = bestMatch.toString()
const drawBox = new faceapi.draw.DrawBox(box, { label: text })
drawBox.draw(canvas)
})
}
face()
})
</script>
</body>
</html>
When we refresh the browser, our predictions should look something like this:
Detecting and recognizing faces via the webcam 🍷
In simple terms, a video is just a collection of images called frames. These frames are displayed at a certain rate per second to form a video.
Now, to run facial recognition on the video obtained from a camera, we need a way to actually grab each of those frames.
And since we need to process the video in real time, we need to define our own rate at which frames are grabbed from the video (we’ll come back to this later).
Then, for each frame we grab, we run the same code we used above for images to detect the faces in that frame.
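For intuition, here’s what grabbing a single frame from the video element could look like (video here is the <video> element we define shortly)—note that faceapi.js handles this internally when we pass it the video element, so we won’t actually need this code ourselves:

const frameCanvas = document.createElement("canvas")
frameCanvas.width = video.videoWidth
frameCanvas.height = video.videoHeight
frameCanvas.getContext("2d").drawImage(video, 0, 0)  // copies the current frame onto the canvas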
HTML5 provides a function called getUserMedia (on navigator.mediaDevices), which gives us access to the system’s camera and audio, based on the constraints and permissions you’ve defined. But since we’re focusing on the video aspect, we’ll be using the video constraint.
if (navigator.mediaDevices.getUserMedia) {
navigator.mediaDevices.getUserMedia({ video: true })
.then(function (stream) {
video.srcObject = stream;
})
.catch(function (err) {
console.log("Something went wrong!", err);
});
}
We specify the constraint that we want—.getUserMedia({video: true})—then get the stream data and load it into the video element.
But first, we need to actually define the video element:
<video autoplay="true" id="videoElement"></video>
Then, we access this video element in JavaScript and read the camera data into it:
let video = document.querySelector("#videoElement");
To learn more about getUserMedia, I’ve provided a link in the reference section below containing a detailed example.
Just like when working with static images, we need a canvas on which to draw the bounding box of each of the faces detected. But in order to do that, we must be able to get the width and height of the video frame and use that to resize the canvas:
$("#videoElement").bind("loadedmetadata", function(){
displaySize = { width:this.scrollWidth, height: this.scrollHeight }
// ... (the rest of the code goes here)
})
Since videos are always in frames, we’ll be using the same code we used for the image-based face detection—the only difference is that we’ll place the face detection and recognition code inside this function:
facedetection = setInterval(async () =>{
// face detection and recognition code
},300)
If you’ll recall, this is the point I said we’d return to—time to address it. In the code above, we used setInterval to define how often we grab a frame from the video and run the facial recognition code.
Here we set the interval to 300ms, but this can cause a minor problem with lag. For example, if a single face is detected at one 300ms tick, but the face tilts to the right before the next tick, we have to wait until the next 300ms elapses before our code can pick up the change.
You might not notice this lag, since 300ms is a fairly short interval, but if you still care about it, the time can be set to 0ms—this will grab frames continuously.
But using 0ms might be a bad decision, since it effectively becomes an infinite loop of a computationally intensive operation. Imagine what happens when you run while(true)—it will likely freeze or crash the browser.
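A possible alternative (not part of the original code) is to let the browser schedule each detection pass with requestAnimationFrame instead of a fixed timer—since each new pass is only scheduled after the previous one finishes, there’s no fixed lag and no runaway loop. A minimal sketch:

async function detectLoop() {
  // await the same face detection/recognition code here for the current frame,
  // then schedule the next pass once it has finished
  requestAnimationFrame(detectLoop)
}
requestAnimationFrame(detectLoop)

That said, the full webcam example below sticks with setInterval, as in the rest of this tutorial.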
<html>
<head>
<title>Webcam</title>
<style>
#container {
margin: 0px auto;
width: 500px;
height: 375px;
border: 10px #333 solid;
}
#videoElement {
top: 0;
left:0;
width: 500px;
height: 375px;
background-color: #666;
}
#overlay, .overlay {
position: absolute;
top: 0;
left: 0;
}
</style>
</head>
<body>
<!-- <div id="container"> -->
<canvas id="canvas" class="overlay"></canvas>
<video autoplay="true" id="videoElement"></video>
<!-- </div> -->
<script src="js/jquery-2.1.1.min.js"></script>
<script src="js/face-api.js"></script>
<script>
$(document).ready(function(){
let video = document.querySelector("#videoElement");
let currentStream;
let displaySize;
if (navigator.mediaDevices.getUserMedia) {
navigator.mediaDevices.getUserMedia({ video: true })
.then(function (stream) {
video.srcObject = stream;
})
.catch(function (err) {
console.log("Something went wrong!", err);
});
}
let temp = []
$("#videoElement").bind("loadedmetadata", function(){
displaySize = { width:this.scrollWidth, height: this.scrollHeight }
async function detect(){
const MODEL_URL = '/models'
await faceapi.loadSsdMobilenetv1Model(MODEL_URL)
await faceapi.loadFaceLandmarkModel(MODEL_URL)
await faceapi.loadFaceRecognitionModel(MODEL_URL)
let canvas = $("#canvas").get(0);
facedetection = setInterval(async () =>{
let fullFaceDescriptions = await faceapi.detectAllFaces(video).withFaceLandmarks().withFaceDescriptors()
let canvas = $("#canvas").get(0);
faceapi.matchDimensions(canvas, displaySize)
fullFaceDescriptions = faceapi.resizeResults(fullFaceDescriptions, displaySize) // use the resized results so the drawn boxes line up with the displayed video
// faceapi.draw.drawDetections(canvas, fullFaceDescriptions)
const labels = ["img/steveoni"]
const labeledFaceDescriptors = await Promise.all(
labels.map(async label => {
// fetch image data from urls and convert blob to HTMLImage element
const imgUrl = `${label}.JPG`
const img = await faceapi.fetchImage(imgUrl)
// detect the face with the highest score in the image and compute its landmarks and face descriptor
const fullFaceDescription = await faceapi.detectSingleFace(img).withFaceLandmarks().withFaceDescriptor()
if (!fullFaceDescription) {
throw new Error(`no faces detected for ${label}`)
}
const faceDescriptors = [fullFaceDescription.descriptor]
return new faceapi.LabeledFaceDescriptors(label, faceDescriptors)
})
);
const maxDescriptorDistance = 0.6
const faceMatcher = new faceapi.FaceMatcher(labeledFaceDescriptors, maxDescriptorDistance)
const results = fullFaceDescriptions.map(fd => faceMatcher.findBestMatch(fd.descriptor))
results.forEach((bestMatch, i) => {
const box = fullFaceDescriptions[i].detection.box
const text = bestMatch.toString()
const drawBox = new faceapi.draw.DrawBox(box, { label: text })
drawBox.draw(canvas)
})
},300);
console.log(displaySize)
}
detect()
});
})
</script>
</body>
</html>
And there it is! Those are the basic processes for creating a facial recognition system in JavaScript that works on both static images and webcam video feeds.
Bonus Section (Adding a Selfie Filter)
One cool use case for facial recognition involves using the landmarks for each of the facial features to build a simple selfie filter (like in Snapchat), for adding things like sunglasses or a beard to the recognized face.
The major issue with this, however, is that faceapi.js returns 68 landmark points—and for each of those points, we don’t immediately know which ones belong to the eyebrows, the mouth, the nose, or the jaw.
The challenge here is to map the filter onto the landmarks that correspond to particular facial features. To try this, I decided to focus on Rio’s face, as it would be a lot to try to add the filter to all the faces at once.
Our goal is to place a pair of sunglasses on one of the faces in the image:
The first thing we need to do is to load the image—hence, we have a .png image asset that we add to the /img folder.
var con = canvas.getContext("2d");
var imgr = new Image();
imgr.src = "/img/glass.png";
imgr.onload = () => {
  // dx, dy: where to start drawing; dw, dh: the width and height to draw at
  var dx = 471.27335262298584;
  var dy = 366.8473286212282 - 4;
  var dw = 101.75236415863037;
  var dh = 40;
  con.drawImage(imgr, dx, dy, dw, dh);
};
When drawing on a canvas in HTML, we get the canvas context and use context.drawImage() to draw the image once it has loaded (note: “draw” here simply means placing the image on the canvas). dx and dy are the x, y coordinates at which we start drawing, dw is the width of the bounding box, and dh is an estimate of how much to reduce the image’s height.
To obtain the landmarks, we use:
console.log(fullFaceDescriptions[0].landmarks)
As you can see below, we were able to successfully add the sunglasses to one of the faces (Rio). To apply it to all of the faces, we can use the forEach method to loop through the landmarks in fullFaceDescriptions.
With some trial and error on the different facial landmarks, we can define a custom function to work out which of the landmark points belong to the eyes, nose, mouth, and jaw. You can try it out—a rough starting point is sketched below.
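Here’s an untested sketch of how the glasses could be anchored using the landmark positions. The point indices follow the common 68-point convention (points 36–45 roughly cover the eyes), which is an assumption about the order in which faceapi.js returns them, so treat the numbers as starting values to tweak:

// grab the 68 landmark points for one detected face
const points = fullFaceDescriptions[0].landmarks.positions
const leftEyeOuter = points[36]   // assumed: outer corner of the left eye
const rightEyeOuter = points[45]  // assumed: outer corner of the right eye
const eyeSpan = rightEyeOuter.x - leftEyeOuter.x
const gw = eyeSpan * 1.5                    // make the glasses a bit wider than the eye span
const gx = leftEyeOuter.x - eyeSpan * 0.25  // shift left so the frame covers both eyes
const gy = Math.min(leftEyeOuter.y, rightEyeOuter.y) - gw * 0.2
con.drawImage(imgr, gx, gy, gw, gw * 0.4)   // roughly the same width-to-height ratio as before

Wrapping this in a fullFaceDescriptions.forEach(...) loop would place the glasses on every detected face.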