AI Deep-Learning Object Detection Algorithms
Since 2014, object detection has made major breakthroughs. The object detection algorithms currently in use in industry fall into the following groups:
1. Traditional object detection algorithms: Cascade + Haar / SVM + HOG / DPM;
2. Candidate window + deep-learning classification: extract candidate regions, then classify the corresponding regions with a deep-learning method, e.g. the R-CNN / SPP-net / Fast R-CNN / Faster R-CNN / R-FCN series of methods;
3. Deep-learning regression methods: YOLO / SSD / DenseBox and similar;
4. Methods combining other techniques, such as RRC detection (combining RNNs) and Deformable CNN (combining DPM).
This article gives a brief introduction to the current mainstream object detection methods and is divided into four parts: the first part introduces the datasets and performance metrics commonly used in object detection; the second part introduces the detection frameworks, represented by R. Girshick's R-CNN, that combine region proposals with CNN classification (R-CNN, SPP-net, Fast R-CNN, Faster R-CNN); the third part introduces the frameworks, represented by YOLO, that cast detection as a regression problem (YOLO, SSD); the fourth part introduces some of the latest advances in object detection algorithms.
1. Datasets and Performance Metrics
Commonly used object detection datasets include PASCAL VOC, ImageNet, and MS COCO. Researchers use these datasets to test algorithm performance or to hold competitions. Object detection performance metrics must account for both the localization of detected objects and the accuracy of the predicted category; below we discuss some commonly used evaluation metrics.
1.1 Datasets
PASCAL VOC (the PASCAL Visual Object Classes challenge) is a well-known dataset for object detection, classification, segmentation, and other tasks. From 2005 to 2012, eight different competitions were held. PASCAL VOC contains about 10,000 images with bounding boxes for training and validation. Although the PASCAL VOC dataset contains only 20 categories, it is still regarded as a benchmark dataset for the object detection problem.
In 2013, ImageNet released an object detection dataset with bounding-box annotations. Its training set contains around 500,000 images belonging to 200 categories of objects. Because the dataset is so large, the computation required for training is substantial, so it is rarely used; at the same time, the large number of categories makes the detection task very difficult. A comparison between the 2014 ImageNet dataset and the 2012 PASCAL VOC dataset appears in the table of mainstream datasets below.
Another famous dataset is MS COCO (Common Objects in COntext), established by Microsoft (see T.-Y. Lin et al., 2015). This dataset is used for several tasks: image captioning, object detection, keypoint detection, and object segmentation. For the object detection task, COCO contains 80 categories. Each year the competition's training and validation sets contain more than 120,000 images, plus more than 40,000 test images. The test set was recently split into two parts: a test-dev set for researchers and a test-challenge set for contestants. The test set's labels are not released, to prevent overfitting on the test set. In the COCO 2017 Detection Challenge, the Megvii (Face++) team won the championship with the Light-Head R-CNN model (AP 0.526); it appears that two-stage algorithms remain the more accurate.
Segmentation examples from the COCO dataset. Source: T.-Y. Lin et al. (2015)
Mainstream object detection datasets
1.2 Performance Metrics
The object detection problem is simultaneously a regression and a classification problem. First, to evaluate localization accuracy, the IoU (Intersection over Union, a value between 0 and 1) is computed; it measures the overlap between the predicted box and the ground-truth box, and the higher the IoU, the more accurate the predicted box's position. When evaluating a predicted box, an IoU threshold (e.g. 0.5) is usually set: if the IoU between the predicted box and a ground-truth box exceeds this threshold, the prediction counts as a True Positive (TP); otherwise it is a False Positive (FP).
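As a minimal sketch (not from the original article), the IoU between two axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates can be computed like this:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2×2 boxes offset by one unit in each direction overlap in a 1×1 region, giving IoU = 1/7.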
For binary classification, AP (Average Precision) is an important metric; it comes from information retrieval and is computed from the precision-recall curve. For object detection, the AP of each category is first computed individually, which serves as an important measure of per-class detection quality. Averaging the AP over all classes gives a single aggregate metric, mAP (mean Average Precision); mAP prevents a model from scoring well by excelling on a few categories while performing poorly on the rest.
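To make the AP computation concrete, here is a sketch of the classic PASCAL VOC 11-point interpolation scheme; note this is one common definition, and COCO as well as later VOC evaluations integrate over all recall points instead:

```python
def average_precision(recalls, precisions):
    """PASCAL VOC 11-point interpolated AP from a precision-recall curve.

    recalls and precisions are parallel lists tracing the curve.
    """
    ap = 0.0
    for t in [i / 10 for i in range(11)]:  # recall thresholds 0.0, 0.1, ..., 1.0
        # Interpolated precision: the max precision at any recall >= t.
        p = max((p for r, p in zip(recalls, precisions) if r >= t), default=0.0)
        ap += p / 11
    return ap
```

A detector that reaches recall 1.0 with precision 1.0 everywhere scores AP = 1.0; averaging such per-class APs over all classes gives mAP.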
For object detection, mAP is usually computed at a fixed IoU threshold, but different IoU values change the ratio of TPs to FPs and hence change the mAP. The COCO dataset's official evaluation metric averages the AP over a series of IoU thresholds (0.50:0.05:0.95), which removes the sensitivity of AP to any single IoU choice. The PASCAL VOC evaluation defines AP in a similar spirit, and Facebook's Detectron provides a clear and complete implementation.
Besides detection accuracy, the other important performance metric of an object detection algorithm is speed: only a fast algorithm can achieve real-time detection, which is critical for some application scenarios. A common speed measure is frames per second (FPS), the number of images processed per second; of course, FPS comparisons must be made on the same hardware. Alternatively, the time required to process one image can be reported: the shorter the time, the faster the detector.
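A simple FPS measurement can be sketched as below; `detect_fn` is a hypothetical stand-in for whatever detector is being benchmarked, and the warm-up runs are excluded so one-time setup cost does not distort the number:

```python
import time

def measure_fps(detect_fn, images, warmup=5):
    """Average frames per second of detect_fn over a list of images."""
    for img in images[:warmup]:          # warm-up runs, excluded from timing
        detect_fn(img)
    start = time.perf_counter()
    for img in images:
        detect_fn(img)
    elapsed = time.perf_counter() - start
    return len(images) / elapsed
```

Because FPS depends on the hardware, the result is only meaningful when compared against other detectors timed on the same machine.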
2. From R-CNN to Faster R-CNN
Beginning with R-CNN, Girshick brought deep learning into object detection and then, through successive work, eventually unified all the steps of object detection under a single deep-learning framework. This means the entire computation can run on the GPU, so both accuracy and speed improved greatly.
2.1 Introduction to R-CNN
R-CNN (R. Girshick et al., 2014) is the first of a series of object detection algorithms based on region proposals: it first searches for candidate regions and then classifies them. In R-CNN, the Selective Search method (J.R.R. Uijlings et al., 2012) generates the candidate regions. This is a heuristic search algorithm: it first partitions the image into many small regions with a simple segmentation algorithm, then merges them by similarity using hierarchical grouping; the regions remaining at the end are the region proposals, each of which may contain an object.
Selective Search: top, the segmentation result; bottom, the candidate boxes. Source: J.R.R. Uijlings et al. (2012)
For a given image, R-CNN generates approximately 2,000 candidate regions with Selective Search. Each candidate region is resized to a fixed size (227×227) and fed to a CNN model, yielding a 4096-dimensional feature vector. The feature vector is then passed to per-class SVM classifiers that predict the probability that the object in the candidate region belongs to each class: one SVM is trained per category, and each infers from the feature vector the probability of membership in that category. To improve localization accuracy, R-CNN finally trains a bounding-box regression model. A training sample is a pair (P, G), where P = (Px, Py, Pw, Ph) is a candidate region and G = (Gx, Gy, Gw, Gh) is the ground-truth box with the largest IoU with P (only samples with IoU greater than 0.6 are used). The regressor's targets are defined as:
tx = (Gx − Px)/Pw,  ty = (Gy − Py)/Ph
tw = log(Gw/Pw),  th = log(Gh/Ph)
At prediction time, inverting these formulas yields the corrected position of the predicted box. R-CNN trains a separate regressor for each category, using a squared-error loss.
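The target encoding above and its inverse can be sketched as follows, with boxes in center/size form (cx, cy, w, h); this is an illustrative sketch rather than the paper's exact implementation:

```python
import math

def encode(P, G):
    """Regression targets (tx, ty, tw, th) for proposal P and ground truth G,
    each given as (cx, cy, w, h)."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def decode(P, t):
    """Invert the targets: apply a predicted (tx, ty, tw, th) to proposal P
    to recover the corrected box."""
    px, py, pw, ph = P
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))
```

Encoding offsets relative to the proposal's width and height, and sizes as log-ratios, keeps the targets scale-invariant, so one regressor works for boxes of any size.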
Training the R-CNN model is a multi-stage process. The CNN is first pre-trained on the 2012 ImageNet image classification dataset. The CNN is then fine-tuned on the detection dataset: candidate regions with IoU greater than 0.5 with a ground-truth box are taken as positive samples, and the remaining candidate regions as negatives (background). Two versions were trained: the first used the 2012 PASCAL VOC dataset, and the second used the 2013 ImageNet detection dataset. Finally, an SVM classifier is trained for each category in the dataset (note that the SVM training samples differ from those used for fine-tuning the CNN: for the SVMs, only regions with IoU below 0.3 are treated as negatives).
Overall, R-CNN is very intuitive: it turns the detection problem into a classification problem and uses a CNN for the classification, and it works very well. The best R-CNN model achieved 62.4% mAP on the 2012 PASCAL VOC dataset (22 points higher than second place) and 31.4% mAP on 2013 ImageNet (7.1 points higher than second place).
Although R-CNN is straightforward, it has two serious flaws:
(1) Selective Search extracts the candidate regions on the CPU, taking a great deal of computing time.
(2) Convolutional features are computed independently for all 2,000 candidate boxes, producing a large amount of repeated computation and further increasing the cost. R. Girshick addressed these two defects in Fast R-CNN and Faster R-CNN respectively.
2.2 Fast R-CNN
2.2.1 SPP-net
Since Fast R-CNN borrows the idea of SPP-net, we first look at SPP-net.
R-CNN computes convolutional features separately for 2,000 candidate boxes, even though all 2,000 boxes come from the same image. The authors therefore propose running the convolution once over the entire image to obtain a single feature map, then extracting, for each candidate box, the convolutional features of the corresponding region of the feature map according to the box's position in the original image; the resulting feature vectors are fed to the classifier. This raises a problem: the candidate boxes have different sizes, so the extracted convolutional features have different dimensions and cannot be fed into the fully connected layer, which blocks classification. To unify the feature dimensions of all candidate boxes, the authors designed SPP-net:
The principle of the SPP layer is shown below. Suppose the feature map produced by the CNN layers is a×a (e.g. 13×13; this varies with the input image size) and a pyramid level is set to n×n bins (fixed, regardless of image size). The SPP layer then performs sliding-window pooling with window size win_size = ⌈a/n⌉ and stride = ⌊a/n⌋, using max pooling. This essentially divides the feature map into n×n sub-regions and max-pools each sub-region, so that whatever the input image size, the SPP layer outputs a fixed-size feature. Typically several pyramid levels are used, for example the three sizes 4×4, 2×2, and 1×1; the features of all levels are concatenated and fed to the fully connected layer behind them, which solves the problem of variable-size image input. SPP-net took third place in the ImageNet ILSVRC 2014 image classification competition.
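The window/stride rule above can be sketched in plain Python over a square 2-D feature map (a list of lists, one channel); the (4, 2, 1) pyramid is the example from the text, and a real implementation would of course operate on tensors:

```python
import math

def spp_pool(feature_map, levels=(4, 2, 1)):
    """Spatial pyramid pooling: max-pool each level's n*n grid of windows
    over a square a*a feature map into one fixed-length vector."""
    a = len(feature_map)
    out = []
    for n in levels:
        win = math.ceil(a / n)       # window size  = ceil(a / n)
        stride = math.floor(a / n)   # stride       = floor(a / n)
        for i in range(n):
            for j in range(n):
                r0, c0 = i * stride, j * stride
                cells = [feature_map[r][c]
                         for r in range(r0, min(r0 + win, a))
                         for c in range(c0, min(c0 + win, a))]
                out.append(max(cells))
    return out
```

Whatever the input size a, the output length is fixed at 4×4 + 2×2 + 1×1 = 21 values per channel, which is exactly what lets the fully connected layer accept variable-size images.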
Spatial Pyramid Pooling Layer in SPP-net
What is the relationship between SPP-net and R-CNN? In R-CNN, because every candidate region has a different size, each must be resized to a fixed size before being fed to the CNN; SPP-net removes that requirement. Going further, R-CNN must run the CNN once per candidate region, which is extremely time-consuming; it is far better to feed the whole image through the CNN once and then extract each candidate region's features from the shared feature map via the SPP layer, which drastically reduces computation and increases speed. An R-CNN variant built on the SPP layer gains little in accuracy, but it runs 24-102× faster than the original R-CNN. This is exactly the direction in which Fast R-CNN improved next.
2.2.2 RoI pooling layer
In Fast R-CNN, the authors use a simplified version of SPP-net: pooling is performed at only a single pyramid scale, directly downsampling each region to obtain its feature vector.
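That single-scale idea can be sketched as below: one RoI of the shared feature map is max-pooled into a fixed out_size×out_size grid. This is an illustrative sketch over a plain 2-D list (the real layer works on multi-channel tensors with sub-pixel RoI boundaries):

```python
def roi_pool(feature_map, roi, out_size=2):
    """Single-scale RoI pooling: max-pool the RoI region of a 2-D feature
    map into a fixed out_size x out_size grid.

    roi is (r1, c1, r2, c2) in feature-map coordinates, end-exclusive.
    """
    r1, c1, r2, c2 = roi
    h, w = r2 - r1, c2 - c1
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Bin boundaries split the RoI as evenly as integers allow.
            rs, re = r1 + i * h // out_size, r1 + (i + 1) * h // out_size
            cs, ce = c1 + j * w // out_size, c1 + (j + 1) * w // out_size
            row.append(max(feature_map[r][c]
                           for r in range(rs, max(re, rs + 1))
                           for c in range(cs, max(ce, cs + 1))))
        out.append(row)
    return out
```

Because every RoI, whatever its size, is pooled to the same out_size×out_size grid, the downstream fully connected layers always receive a fixed-length input.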
2.2.3 Fast R-CNN overall framework
Fast R-CNN (Fast Region-based Convolutional Network, R. Girshick 2015) is mainly intended to reduce the time spent extracting feature vectors for the candidate regions with the CNN model, and it borrows the SPP-net idea. In R-CNN, each candidate region is fed to the CNN individually to compute its feature vector, which is very time-consuming; in Fast R-CNN, the input to the CNN is the entire image, and the features for each candidate region are then extracted from the shared feature map.