A magical face detection algorithm: S3FD
Ever since anchor-based methods were introduced, object detection has been inseparable from this magical anchor: only with its help did detection first reach real-time performance. Within general object detection, however, certain specific objects are treated separately because of their wide range of applications and their particular characteristics. The most representative of these is face detection.
A common characteristic of human faces, compared with other objects, is that they occupy very few pixels in an image. In the COCO dataset, for example, there is a "person" category, but the face is only a small part of the human body, and an even smaller fraction of the whole image. S3FD (Single Shot Scale-invariant Face Detector) [1], introduced in this article, is designed to deal with exactly this problem.
A dedicated face detection dataset - WIDER FACE
WIDER FACE is arguably the hardest of the current face detection datasets.
An image may be 1024*732 pixels while a face occupies only 10*13 pixels; the difficulty is remarkable.
Of course, this example only illustrates the problem caused by face size. Other difficulties such as occlusion, extreme pose, and rotation are not the focus of this article and will not be discussed further.
Introduction to SSD
Since the algorithm is an improvement on SSD, let us first introduce SSD [2].
As the SSD and YOLO network architectures show, they were among the first algorithms to achieve one-stage detection. SSD is a fully convolutional network that makes predictions from layers at different depths: low-level feature maps are used to detect small objects, and high-level feature maps are used to detect large objects.
Of course, SSD also has some obvious weaknesses, most notably poor recall on small objects. Part of the reason is that when shallow layers are used for prediction, the network is not yet deep enough to have extracted effective semantic information.
In short, SSD's detection speed is comparable to YOLO's and its accuracy is close to Faster R-CNN's, making it well suited as a base framework for further improvement.
Problems the traditional anchor mechanism encounters on small faces
The author raises four issues:
1. The face region itself is small. After a few strided layers, almost nothing of it is left on the feature map.
2. The face is small relative to both the receptive field and the anchor scale.
3. Under the existing anchor matching strategy, a tiny face smaller than about 10*10 pixels matches no anchor at all. The "outer face" issue (faces whose scale falls between two anchor levels) is in fact a common problem of anchor-based approaches: the larger the scale gap between adjacent anchor levels, the more severe the mismatch for faces of intermediate size.
4. Each grid cell in the diagram can be thought of as an anchor of a certain scale. For the small face on the left, the ratio of positive to negative anchors is severely imbalanced, which matters greatly during training, especially at the first (shallowest) level.
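To make issue 1 concrete, here is a quick back-of-the-envelope sketch (mine, not from the paper) of how many feature-map cells a 10-pixel face spans at each stride used later in the network:

```python
# How a tiny face shrinks on successive feature maps as stride grows.
face = 10  # side length of a "tiny" face, in input pixels

for stride in [4, 8, 16, 32, 64, 128]:
    cells = face / stride  # feature-map cells the face spans at this stride
    print(f"stride {stride:3d}: face covers {cells:.2f} cells")
```

Already at stride 16 the face covers less than one cell, so deeper levels effectively cannot see it at all.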
Network construction
1. The input size is 640*640. The prediction network has 6 levels, with feature map sizes running from 160*160 down to 5*5, and anchor scales from 16*16 up to 512*512, the exponent increasing by one at each level (a very satisfying layout for anyone with network-construction OCD).
2. Each prediction layer has only one anchor per position (a single scale, aspect ratio 1:1). Since the scenes are not distorted, a face's aspect ratio is roughly 1:1 (a few elongated faces may approach 1:2, but they are rare enough to ignore). Thus the predicted conv output has 2+4=6 channels: 2 class scores plus 4 box offsets.
3. At the lowest prediction level (the 160*160 feature map), the predicted feature dimension is Ns+4 rather than 2+4, and it is labeled "Max-out Background"; this will be explained later.
4. The middle layers conv_fc6 and conv_fc7 take their initial weights from the fc layers of VGG, reshaped into convolutional form.
5. The normalization layers are the Normalize layers from SSD's Caffe implementation; interested readers can look at weiliu89's SSD Caffe code on GitHub [2].
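The layer layout described above can be summarized in a few lines (a sketch of the configuration as stated in the text, not the authors' code). Note that at every level the anchor scale is 4x the stride:

```python
# Six prediction levels of the network described above, for a 640x640 input:
# strides double from 4 to 128, and the single 1:1 anchor at each level
# has scale equal to 4 * stride (16 up to 512).
input_size = 640
strides = [4, 8, 16, 32, 64, 128]

for stride in strides:
    fmap = input_size // stride  # feature-map side length at this level
    anchor = 4 * stride          # anchor scale at this level
    print(f"stride {stride:3d}: feature map {fmap:3d}x{fmap:<3d} anchor {anchor}x{anchor}")
```

Running this reproduces the progression in point 1: feature maps from 160*160 down to 5*5, anchors from 16*16 up to 512*512.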
How to deal with the problem
1. Adjacent anchors overlap heavily. At the first level, for example, the stride is 4 but the anchor scale is 16, so neighboring anchors share a large overlap region. This mitigates the outer-face issue mentioned above to some extent.
2. An improved anchor matching strategy.
Under SSD's matching strategy (jaccard overlap above a threshold, usually 0.5), each face matches on average only about 3 anchors, and the number of anchors a tiny face or an outer face can match is usually 0.
The authors therefore designed a new matching strategy:
Step 1: lower the threshold from 0.5 to 0.35.
Step 2: for faces that still match no anchor, lower the threshold further to 0.1, sort the candidate anchors by jaccard overlap, and select the top N. The authors set N to the average number of anchors matched per face in step 1.
Now compare the old and new matching strategies intuitively:
As you can see, both the average number of matched anchors and the hard cases have improved.
3. As mentioned earlier, small faces lead to a severe imbalance between positive and negative samples. This is especially true at the shallowest prediction layer: on the one hand, it holds the most anchors (with the construction in this article, the first level accounts for about 75% of all anchors); on the other hand, since the vast majority of those anchors are background, false positives increase significantly. To reduce false positives here, the author uses max-out background.
We saw earlier that the feature dimension predicted at the first level is Ns+4, where Ns=Nm+1. For layers that do not adopt the max-out strategy, Nm can be regarded as 1, i.e. only one background score is predicted per anchor. Here Nm is set to 3: the background score of each anchor is predicted three times, and the highest of the three scores is taken. The direct effect is to raise the probability that such an anchor is predicted as background, thereby reducing false positives.
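The two mechanisms above can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions (axis-aligned boxes in (x1, y1, x2, y2) form; all function and variable names are mine, not from the paper's implementation):

```python
import numpy as np

def iou(anchors, box):
    """Jaccard overlap between N anchors [N, 4] and one face box [4]."""
    x1 = np.maximum(anchors[:, 0], box[0])
    y1 = np.maximum(anchors[:, 1], box[1])
    x2 = np.minimum(anchors[:, 2], box[2])
    y2 = np.minimum(anchors[:, 3], box[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_b = (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area_a + area_b - inter)

def match_anchors(anchors, faces, thresh=0.35, low_thresh=0.1):
    """Two-stage scale-compensation matching.

    Stage 1: match every anchor whose overlap with a face exceeds `thresh`.
    Stage 2: for faces left unmatched, take the top-N anchors with overlap
    above `low_thresh`, where N is the average match count from stage 1.
    """
    matches = {}
    for i, face in enumerate(faces):
        ov = iou(anchors, face)
        matches[i] = list(np.flatnonzero(ov > thresh))
    counts = [len(v) for v in matches.values() if v]
    n = max(1, int(round(np.mean(counts)))) if counts else 1
    for i, face in enumerate(faces):
        if not matches[i]:
            ov = iou(anchors, face)
            cand = np.flatnonzero(ov > low_thresh)
            matches[i] = list(cand[np.argsort(-ov[cand])][:n])
    return matches

def maxout_background(scores):
    """Max-out background: `scores` ends with [Nm background scores, 1 face
    score]; keep only the max background score plus the face score."""
    bg = scores[..., :-1].max(axis=-1, keepdims=True)
    face = scores[..., -1:]
    return np.concatenate([bg, face], axis=-1)
```

For example, a small face whose best overlap is only 0.14 would be dropped by the old 0.5 threshold but is rescued in stage 2, since 0.14 is above the 0.1 floor.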