Target detection algorithm based on deep learning
Target detection is a simple task for people, but a computer sees only arrays of values from 0 to 255, so it is difficult for it to directly obtain high-level semantic concepts such as "person" or "cat" in an image, or to determine which region of the image the target occupies. The target may appear at any position, its shape may vary in many ways, and image backgrounds differ widely. These factors make target detection a problem that is not easy to solve.
Thanks to deep learning, mainly convolutional neural networks (CNNs) and candidate region (region proposal) algorithms, target detection has made huge breakthroughs since 2014. This article analyzes and summarizes deep-learning-based target detection algorithms in four parts: the first part introduces the pipeline of traditional target detection; the second part introduces the detection frameworks that combine region proposals with CNNs, represented by R-CNN (R-CNN, SPP-NET, Fast R-CNN, Faster R-CNN); the third part introduces the frameworks that convert target detection into a regression problem, represented by YOLO (YOLO, SSD); the fourth part introduces some techniques and methods that can improve target detection performance.
First, the traditional target detection method
As shown in the figure above, traditional target detection is generally divided into three stages: first, select some candidate regions on the given image; then extract features from these regions; finally, classify the features with trained classifiers. We introduce each of these three stages below.
1) Region selection
This step locates the position of the target. Since the target may appear anywhere in the image, and its size and aspect ratio are also uncertain, the earliest strategy was to traverse the entire image with sliding windows, trying different scales and aspect ratios. Although this exhaustive strategy covers all possible target locations, its disadvantages are obvious: the time complexity is too high, and it produces a large number of redundant windows, which seriously affects the speed and performance of subsequent feature extraction and classification. (In practice, because of the time complexity, the aspect ratio of the sliding windows is usually fixed, so for multi-class detection where aspect ratios vary widely, even exhaustive sliding-window traversal cannot cover the target regions well.)
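As a rough illustration of why the window count explodes (a minimal sketch, not any particular detector's code; all names and parameter values here are made up for illustration):

```python
def sliding_windows(img_w, img_h, scales=(64, 128), ratios=(0.5, 1.0, 2.0), stride=32):
    """Enumerate candidate boxes (x, y, w, h) over several scales and aspect ratios."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = int(s * r ** 0.5)  # width grows with the aspect ratio r = w/h
            h = int(s / r ** 0.5)  # height shrinks correspondingly, keeping area ~ s*s
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    boxes.append((x, y, w, h))
    return boxes

# Even a small 256x256 image with a coarse stride yields hundreds of windows;
# dense strides and more scales push this into the tens of thousands.
print(len(sliding_windows(256, 256)))
```

Every one of these windows must then go through feature extraction and classification, which is exactly the redundancy the text describes.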
2) Feature extraction
Factors such as the morphological diversity of the target, the diversity of illumination variations, and the diversity of the background make it difficult to design a robust feature. However, the quality of the extracted features directly affects the accuracy of the classification. (Features commonly used at this stage are SIFT, HOG, etc.)
3) Classifier
Commonly used classifiers include SVM, AdaBoost, and so on.
Summary: Traditional target detection has two main problems: first, the sliding-window region selection strategy is untargeted, has high time complexity, and produces redundant windows; second, hand-designed features are not robust to the great diversity in target appearance.
Second, the deep learning target detection algorithm based on Region Proposal
How do we solve the two main problems of traditional target detection tasks?
For the sliding-window problem, region proposals provide a good solution. A region proposal algorithm finds, in advance, the locations in the image where targets are likely to appear. Because it exploits image cues such as texture, edges, and color, it can maintain a high recall rate while selecting far fewer windows (thousands or even hundreds). This greatly reduces the time complexity of subsequent operations, and the candidate windows obtained are of higher quality than sliding windows (which have fixed aspect ratios). Commonly used region proposal algorithms include selective search and edge boxes. For more on region proposals, see the PAMI 2015 paper "What makes for effective detection proposals?"
With candidate regions, the remaining work is essentially image classification of those regions (feature extraction + classification). For image classification, one must mention the 2012 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), where Professor Geoffrey Hinton's student Alex Krizhevsky used a convolutional neural network to reduce the top-5 error of the ILSVRC classification task to 15.3%, while the second place, using traditional methods, had a top-5 error as high as 26.2%. Since then, convolutional neural networks have dominated image classification; the top-5 errors of Microsoft's latest ResNet and Google's Inception V4 have dropped below 4%, surpassing human ability on this particular task. Therefore, using a CNN to classify the candidate regions is a natural choice.
In 2014, RBG (Ross B. Girshick) replaced the sliding window + hand-designed features of traditional target detection with region proposal + CNN, and designed the R-CNN framework, which achieved a huge breakthrough in target detection and started the wave of deep-learning-based detection.
1) R-CNN (CVPR2014, TPAMI2015)
(Region-Based Convolutional Networks for Accurate Object Detection and Segmentation)
The above frame diagram clearly shows the target detection process of R-CNN:
(1) Input a test image.
(2) Extract about 2000 region proposals from the image using the selective search algorithm.
(3) Warp each region proposal to a size of 227*227 and feed it into the CNN, taking the output of the CNN's fc7 layer as the feature.
(4) Feed the CNN feature of each region proposal into the SVMs for classification.
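The four steps above can be sketched as follows. This is only an illustrative skeleton: `selective_search`, `warp`, `cnn_fc7`, and the SVM objects are hypothetical stand-ins for the real components, injected as parameters.

```python
def rcnn_detect(image, selective_search, warp, cnn_fc7, svms):
    """R-CNN test-time pipeline: ~2000 proposals, each processed independently."""
    detections = []
    for box in selective_search(image):            # step (2): region proposals
        patch = warp(image, box, size=(227, 227))  # step (3): warp to a fixed input size
        feature = cnn_fc7(patch)                   # step (3): fc7 activations as the feature
        scores = {cls: svm.score(feature) for cls, svm in svms.items()}  # step (4)
        detections.append((box, scores))
    return detections
```

The key inefficiency is visible here: the CNN forward pass runs once per proposal, which is what SPP-NET later removes.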
Give a few explanations for the above framework:
* The frame diagram above shows the test-time flow. Before testing, we must first train the CNN model used for feature extraction and the SVMs used for classification: a model pre-trained on ImageNet (AlexNet/VGG16) is fine-tuned to obtain the feature-extraction CNN, and the SVMs are then trained on features extracted from the training set by that CNN.
* Each region proposal is scaled to the same size because the input to the CNN's fully connected layers must have a fixed dimension.
* The diagram above omits one step: after the SVM classifies each region proposal, bounding-box regression is applied. Bounding-box regression is a linear regression algorithm that corrects the region proposal so that the extracted window fits the target window more closely. Windows extracted by region proposals cannot be as accurate as manual annotations, so if a region proposal deviates too far from the target position, then even if the classification is correct, the IoU (intersection over union of the region proposal and the ground-truth window) falls below 0.5 and the target still counts as undetected.
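IoU is straightforward to compute; a minimal sketch for axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A proposal shifted by half its width overlaps half the target, but the union
# grows too, so IoU is only 1/3 -- a miss under the 0.5 threshold.
print(iou((0, 0, 100, 100), (50, 0, 150, 100)))
```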
Summary: R-CNN raised the result on PASCAL VOC2007 from 34.3% (DPM HSC) directly to 66% (mAP). Such a large improvement shows the huge advantage of region proposal + CNN.
But there are also many problems with the R-CNN framework:
(1) Training is divided into multiple stages, the steps are cumbersome: fine-tuning network + training SVM + training border regression
(2) Training is time-consuming and takes up a lot of disk space: 5000 images generate feature files of several hundred GB.
(3) Slow: Using the GPU, the VGG16 model takes 47 seconds to process an image.
For this slow problem, SPP-NET gives a good solution.
2) SPP-NET (ECCV2014, TPAMI2015)
(Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition)
Let's look at why R-CNN is so slow: 47 s per image! A closer look at the R-CNN framework shows that after region proposals are generated (about 2000 per image), each proposal is treated as a separate image for subsequent processing (CNN features + SVM classification), so a single image actually goes through roughly 2000 rounds of feature extraction and classification!
Is there a way to speed this up? There is: the 2000 region proposals are all parts of the same image, so we can extract the convolutional features of the whole image once, and then feed the convolutional features of each region proposal into the fully connected layers for the subsequent operations. (For a CNN, most of the computation is in the convolutional layers, so this saves a great deal of time.) The problem now is that the region proposals have different scales and cannot be fed into the fully connected layers directly, because fully connected layers require a fixed-length input. SPP-NET solves exactly this problem:
The figure above is the network structure diagram of SPP-NET. An image of any size is input to the CNN, and after the convolutional operations we obtain the feature maps (for example, the last convolutional layer of VGG16 is conv5_3, which produces 512 feature maps). The "window" in the figure is the region of the feature maps corresponding to a region proposal. We only need to map the features of these different-sized windows to the same dimension as the input to the fully connected layers, and then the whole image needs only one pass of feature extraction. SPP-NET does this with spatial pyramid pooling: each window is divided into 4*4, 2*2, and 1*1 blocks, and each block is max-pooled, so that after the SPP layer every window yields a feature vector of length (4*4+2*2+1)*512, which serves as the input to the fully connected layers.
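A minimal numpy sketch of the (4*4, 2*2, 1*1) pyramid over one window of feature maps (shapes follow the VGG16 example above; the block-partitioning details are a simplification of the paper's layer):

```python
import numpy as np

def spp(window, levels=(4, 2, 1)):
    """Spatial pyramid pooling: window is (C, H, W); returns a fixed-length vector."""
    c, h, w = window.shape
    pooled = []
    for n in levels:  # divide the window into an n*n grid of blocks
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                block = window[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(block.max(axis=(1, 2)))  # max-pool each block per channel
    return np.concatenate(pooled)

# Windows of different sizes map to the same (4*4 + 2*2 + 1) * 512 = 10752 dims.
print(spp(np.random.rand(512, 13, 17)).shape)
print(spp(np.random.rand(512, 6, 9)).shape)
```

This is the whole trick: the output length depends only on the pyramid levels and the channel count, never on the window size, so arbitrary proposals can share one set of fully connected layers.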
Summary: Using SPP-NET can greatly speed up target detection compared to R-CNN, but there are still many problems:
(1) Training is divided into multiple stages, the steps are cumbersome: fine-tuning network + training SVM + training border regression
(2) SPP-NET fixes the convolutional layers when fine-tuning the network and only fine-tunes the fully connected layers. For a new task, however, it is necessary to fine-tune the convolutional layers as well. (The features extracted by a classification model focus on high-level semantics, while the target detection task also needs the precise position of the target, in addition to semantic information.)
To address these two problems, RBG proposed Fast R-CNN, a streamlined and fast target detection framework.
3) Fast R-CNN (ICCV2015)
With the introduction of R-CNN and SPP-NET, let's take a look at the framework of Fast R-CNN:
Compared with the R-CNN framework, there are two main differences: one is that a ROI pooling layer is added after the last convolutional layer; the other is that a multi-task loss function is used, so bounding-box regression is trained directly inside the CNN network.
(1) The ROI pooling layer is actually a simplified version of SPP-NET: where SPP-NET pools each proposal with pyramids of several sizes, the ROI pooling layer only downsamples to a single 7*7 feature map. Since VGG16's conv5_3 produces 512 feature maps, every region proposal is mapped to a 7*7*512-dimensional feature vector as the input of the fully connected layers.
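ROI pooling is effectively a single-level spatial pyramid; a minimal numpy sketch (the exact bin-boundary rounding here is a simplification):

```python
import numpy as np

def roi_pool(feature_map, roi, size=7):
    """Max-pool the ROI of a (C, H, W) feature map down to (C, size, size)."""
    x1, y1, x2, y2 = roi  # ROI given in feature-map coordinates
    window = feature_map[:, y1:y2, x1:x2]
    c, h, w = window.shape
    ys = np.linspace(0, h, size + 1).astype(int)
    xs = np.linspace(0, w, size + 1).astype(int)
    out = np.empty((c, size, size))
    for i in range(size):
        for j in range(size):
            out[:, i, j] = window[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max(axis=(1, 2))
    return out

# Any ROI collapses to a 7*7*512 tensor, ready for the fully connected layers.
print(roi_pool(np.random.rand(512, 38, 50), (5, 3, 30, 21)).shape)
```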
(2) The R-CNN training process is divided into three stages, while Fast R-CNN trains in a single stage: it replaces the SVM classifier with softmax and adds bounding-box regression to the network through a multi-task loss function, so the whole network (excluding the region proposal extraction stage) is trained end to end.
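Fast R-CNN's multi-task loss combines a classification log loss with a smooth-L1 localization loss that is only applied to non-background boxes; a simplified, unbatched sketch:

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1 (robust to outliers)."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multitask_loss(cls_probs, true_cls, box_pred, box_target, lam=1.0):
    """L = L_cls + lambda * [true_cls > 0] * L_loc; class 0 is background."""
    l_cls = -math.log(cls_probs[true_cls])  # softmax log loss for the true class
    l_loc = sum(smooth_l1(p - t) for p, t in zip(box_pred, box_target))
    return l_cls + (lam * l_loc if true_cls > 0 else 0.0)

# A perfect box with a confident foreground class leaves only the small log loss.
print(multitask_loss([0.1, 0.9], 1, [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]))
```

Because both terms are differentiable in the network outputs, the classifier and the box regressor can share the convolutional features and be optimized jointly, which is what makes the training end to end.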
(3) During network fine-tuning, Fast R-CNN also fine-tunes some of the convolutional layers, achieving better detection results.
Summary: Fast R-CNN combines the essence of R-CNN and SPP-NET and introduces a multi-task loss function, which makes training and testing of the whole network very convenient. Trained on the PASCAL VOC2007 training set, it reaches 66.9% (mAP) on the VOC2007 test set; trained on VOC2007+2012, it reaches 70% on VOC2007 (expanding the dataset can greatly improve detection performance). With VGG16, each image takes about 3 seconds in total.
Disadvantages: region proposals are still extracted with selective search, which takes up most of the detection time (extracting proposals takes 2~3 s, while feature extraction and classification only take 0.32 s). This cannot meet real-time requirements, and the framework does not achieve truly end-to-end training and testing (region proposals must first be extracted with selective search). So, can a CNN directly generate the region proposals and classify them? Faster R-CNN is the target detection framework that meets this need.
4) Faster R-CNN (NIPS2015)
(Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks)
In region proposal+CN