Building a robust facial recognition system that is free of both racial and gender bias is not an easy task. After all, algorithms don't create bias. We do.
The facial recognition technology market is growing rapidly. From airports in the United States relying on biometric data to screen international passengers, to law enforcement depending on it to catch criminals, to social media platforms using it to authenticate users, facial recognition technology is in high demand.
The last decade was full of new state-of-the-art algorithms developed by top software development companies and groundbreaking research in the field of Deep Learning. New computer vision algorithms were introduced as well. It all started in 2012, when AlexNet, a deep convolutional neural network, achieved high accuracy on the ImageNet dataset (a dataset with more than 14 million images). So what is facial recognition software, and how does it work?
Before we dive into understanding how face recognition technology works, we need to understand how we recognize faces.
How do humans recognize a face?
The recognition systems in our brains are complex. In fact, scientists are still trying to figure them out. What we can assume is that the neurons in our brain first locate the face in the scene (separating it from the person's body and the background), then extract the facial features, and store them in our own kind of database. Using our memory as that database, we can then classify the person according to their features. In effect, we have been trained on a practically unlimited dataset with an enormously extensive neural network.
Facial recognition software in machines is implemented in much the same way. First, we apply a face detection algorithm to find faces in the scene, then extract facial features from the detected faces, and finally use a classification algorithm to identify the person.
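The three stages can be sketched as a small pipeline. This is a minimal illustration, not the implementation we used: the type aliases and the `crop` and `recognize` functions are hypothetical names, and each stage is passed in as a plain callable so that any detector, feature extractor, or classifier can be plugged in.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type aliases, for illustration only.
Box = Tuple[int, int, int, int]              # (x, y, width, height)
Image = Sequence[Sequence[int]]              # rows of grayscale pixels
Detector = Callable[[Image], List[Box]]      # image -> face boxes
Embedder = Callable[[Image], List[float]]    # face crop -> feature vector
Classifier = Callable[[List[float]], str]    # feature vector -> person's name


def crop(image: Image, box: Box) -> List[List[int]]:
    """Cut the region described by box out of a row-major image."""
    x, y, w, h = box
    return [list(row[x:x + w]) for row in image[y:y + h]]


def recognize(
    image: Image,
    detect: Detector,
    embed: Embedder,
    classify: Classifier,
) -> List[Tuple[Box, str]]:
    """Run detection, feature extraction, and classification in order."""
    results = []
    for box in detect(image):
        face = crop(image, box)            # stage 1 output: a face region
        features = embed(face)             # stage 2: facial features
        results.append((box, classify(features)))  # stage 3: identity
    return results
```

Keeping the stages separate means a slow detector can be swapped for a faster one without touching the feature extraction or classification code.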
The workflow of a Facial Recognition System
Face detection is a specialized version of Object Detection in which there is only one object class to detect: the human face.
Just as there are trade-offs between computational time and space in Computer Science, there is a trade-off between inference speed and accuracy in Machine Learning algorithms. There are many object detection algorithms out there, and each strikes a different balance between speed and accuracy.
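One simple way to put numbers on this trade-off is to time a detector over a set of test images and record how often it finds a face. The sketch below is an assumption about how such a comparison could be done, not our evaluation code: `benchmark` is a hypothetical helper, `detect` is any callable that returns a list of boxes, and every test image is assumed to contain at least one face, so the hit rate is only a crude stand-in for accuracy.

```python
import time
from statistics import mean


def benchmark(detect, images, warmup=1):
    """Return (mean seconds per image, fraction of images with a detection).

    Assumes every image contains at least one face, so the second value
    is a rough proxy for accuracy rather than a proper metric.
    """
    if not images:
        raise ValueError("images must not be empty")
    for img in images[:warmup]:
        detect(img)  # warm-up runs are not timed
    timings, hits = [], 0
    for img in images:
        start = time.perf_counter()
        boxes = detect(img)
        timings.append(time.perf_counter() - start)
        if boxes:
            hits += 1
    return mean(timings), hits / len(images)
```

Running the same helper over each candidate detector gives one latency figure and one hit rate per algorithm, which makes the speed versus accuracy comparison explicit.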
We evaluated different state-of-the-art object detection algorithms:
To build a robust face detection system, we need an algorithm that is both accurate and fast enough to run in real time on a GPU as well as on a mobile device.
During real-time inference on streaming video, faces appear in different poses, may be partly occluded, and are lit in different ways. It is important to detect faces precisely across all of these poses and lighting conditions.
We started with the Haar-cascade implementation in OpenCV, an open-source computer vision library written in C++.
Pros: Since the library is written in C++, it is very fast for inference in real-time systems.
Cons: This implementation was unable to detect faces in profile and performed poorly across different poses and lighting conditions.
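For reference, Haar-cascade detection through OpenCV's Python bindings might look like the sketch below. It assumes the `opencv-python` package is installed and uses the frontal-face cascade file that ships with it. The function name `detect_faces_haar` and the parameter values are illustrative, not the exact settings we used.

```python
from typing import List, Tuple


def detect_faces_haar(
    image_path: str,
    scale_factor: float = 1.1,
    min_neighbors: int = 5,
) -> List[Tuple[int, int, int, int]]:
    """Return (x, y, width, height) boxes for faces found in an image file."""
    # Imported here so this file still loads when OpenCV is not installed.
    import cv2

    # Frontal-face cascade bundled with the opencv-python package.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    image = cv2.imread(image_path)
    if image is None:  # cv2.imread returns None when the file cannot be read
        raise FileNotFoundError(image_path)

    # Haar cascades operate on grayscale images.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(
        gray, scaleFactor=scale_factor, minNeighbors=min_neighbors
    )
    return [tuple(int(v) for v in face) for face in faces]
```

As its file name suggests, this cascade targets frontal faces, which is consistent with the weakness on side faces noted above.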
Next, we tried MTCNN, an algorithm based on Deep Learning methods. It uses deep cascaded convolutional neural networks to detect faces.
Pros: It had better accuracy than the OpenCV Haar-cascade method.
Cons: Higher run time.
YOLO (You Only Look Once) is a state-of-the-art Deep Learning algorithm for object detection. It stacks many convolutional layers to form a deep CNN model (deep here means that the network has a large number of layers).
The original YOLO model can detect 80 different object classes with high accuracy. We used the YOLO model to detect only one object class: the face.
We trained this algorithm on the WiderFace dataset (an image dataset containing 393,703 face labels).
There is also a miniature version of the YOLO algorithm available for face detection, YOLO-Tiny. YOLO-Tiny trades some accuracy for less computation time. We trained a YOLO-Tiny model on the same dataset, but its bounding box results were not consistent.
Pros: Very accurate. Faster than MTCNN.
Cons: Because the network has a very large number of layers, it needs more computational resources. As a result, it is slow on a CPU or on mobile devices, and on a GPU it takes more VRAM because of its large architecture.
SSD (Single Shot Detector), like YOLO, is a deep convolutional neural network model.
Pros: Good accuracy. It can detect faces across various poses, lighting conditions, and occlusions. Good inference speed.
Cons: Inferior to the YOLO model. Although the inference speed was good, it was still not adequate for running on a CPU, a low-end GPU, or mobile devices.
BlazeFace, as its name suggests, is a blazingly fast face detection algorithm released by Google. It accepts a 128x128 input image, and its inference time is sub-millisecond. The algorithm is optimized for use on mobile phones. The reasons it is so fast are:
Pros: Very good inference speed and accurate face detection.
Cons: This model is optimized for detecting faces in images from a mobile phone camera, so it expects the face to cover most of the image. It doesn't work well when the face is small, which means it performs poorly on CCTV camera images.
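To see why small faces are a problem, consider what happens when a frame is shrunk to the 128x128 input. The arithmetic below is a rough sketch: it assumes the whole frame is resized with its aspect ratio preserved so that the longer side becomes 128 pixels, and the frame and face sizes are made-up illustrative numbers, not measurements. The helper name `face_width_after_resize` is hypothetical.

```python
def face_width_after_resize(
    face_px: float, frame_w: int, frame_h: int, input_size: int = 128
) -> float:
    """Approximate face width in pixels once the frame fits input_size.

    Assumes an aspect-preserving resize where the longer frame side
    is scaled down to input_size pixels.
    """
    return face_px * input_size / max(frame_w, frame_h)


# Illustrative numbers only: a close-up selfie versus a distant CCTV face.
selfie = face_width_after_resize(400, 720, 1280)   # 400 * 128 / 1280
cctv = face_width_after_resize(60, 1920, 1080)     # 60 * 128 / 1920
print(selfie, cctv)  # 40.0 4.0
```

Under these assumptions, a face that fills much of a phone selfie is still about 40 pixels wide at the model input, while a distant face in a 1080p CCTV frame shrinks to about 4 pixels, leaving very little detail to detect.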
The most recent face detection algorithm we tried is Faceboxes. Like BlazeFace, it is a deep convolutional neural network with a small architecture, designed for just one class: the human face. It runs in real time on a CPU, and its accuracy is comparable to YOLO for face detection. It can detect both small and large faces in an image precisely.
Pros: Fast inference speed and good accuracy.
Cons: Evaluation is in progress.
After detecting faces in an image, we crop the faces and feed them to a feature extraction algorithm, which creates a face embedding: a multi-dimensional (usually 128- or 512-dimensional) vector representing the features of the face.
We used the FaceNet algorithm to create face-embeddings.
The embedding vectors represent the facial features of a person's face. Embedding vectors of two different images of the same person will therefore be close together, while those of different people will be farther apart. The distance between two vectors is calculated using Euclidean distance.
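As a small illustration of this comparison, the sketch below computes the Euclidean distance between embeddings and applies a cut-off to decide whether two faces match. The three-dimensional vectors are toy stand-ins for real 128- or 512-dimensional embeddings, and both the `is_same_person` helper and its `threshold` value are hypothetical: a real threshold has to be tuned on validation data for the embedding model in use.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.dist(a, b)  # raises ValueError if the lengths differ


def is_same_person(
    a: Sequence[float], b: Sequence[float], threshold: float = 1.0
) -> bool:
    """Treat two embeddings as the same person if they are close enough.

    The default threshold is an illustrative placeholder, not a tuned value.
    """
    return euclidean_distance(a, b) < threshold


# Toy embeddings: two photos of one person and one photo of someone else.
anna_1 = [0.10, 0.80, 0.30]
anna_2 = [0.12, 0.79, 0.33]
ben = [0.90, 0.10, 0.70]

print(is_same_person(anna_1, anna_2))  # True: the vectors are close
print(is_same_person(anna_1, ben))     # False: the vectors are far apart
```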
After getting the face-embedding vectors, we trained a classification algorithm, K-nearest neighbors (KNN), to identify a person from their embedding vector.
Suppose an organization has 1,000 employees. We create face embeddings for all the employees and use the embedding vectors to train a classification algorithm that accepts a face-embedding vector as input and returns the person's name.
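The idea behind KNN can be sketched in a few lines of plain Python. This is a simplified illustration rather than our production classifier: `knn_predict` is a hypothetical helper, the two-dimensional vectors stand in for real face embeddings, and the names are invented.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(
    query: Sequence[float],
    embeddings: List[Sequence[float]],
    names: List[str],
    k: int = 3,
) -> str:
    """Return the most common name among the k stored embeddings nearest to query."""
    # Sort the stored (embedding, name) pairs by Euclidean distance to the query.
    ranked = sorted(
        zip(embeddings, names),
        key=lambda pair: math.dist(query, pair[0]),
    )
    # Majority vote among the k nearest neighbours.
    votes = Counter(name for _, name in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy "employee database": two stored embeddings per person.
stored = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
labels = ["alice", "alice", "bob", "bob"]

print(knn_predict([0.05, 0.10], stored, labels))  # alice
print(knn_predict([0.95, 0.90], stored, labels))  # bob
```

With KNN, "training" amounts to storing the labelled embeddings, so adding a new employee only requires adding their vectors to the stored set.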
"A user could apply a filter that modifies specific pixels in an image before putting it on the web. These changes are imperceptible to the human eye but are very confusing for facial recognition algorithms." - ThalesGroup
New tech brings new opportunities
Facial recognition systems and computer vision have advanced in great leaps. But this is only the beginning of the technological revolution. Imagine how powerful the combination of face recognition algorithms and chatbot technology would be!
It's never too late to become a part of this movement.
Register with Engati today.