Deep learning object detection methods
The task of object detection is to find all objects of interest in an image and determine their positions and sizes; it is one of the core problems in computer vision. Because objects of different kinds vary in appearance, shape, and pose, and imaging is affected by factors such as illumination and occlusion, object detection has always been among the most challenging problems in the field.
Image recognition in computer vision involves four major tasks:
Classification: answers the question "what is it?", i.e., given an image or video, determine which categories it contains.
Localization: answers the question "where is it?", i.e., find the position of the object.
Detection: answers "what is it, and where?", i.e., locate the object and identify its category.
Segmentation: divided into instance-level and scene-level segmentation, it answers the question of which object or scene each pixel belongs to.
Core problems in object detection
Beyond image classification, object detection has to deal with the following core problems:
1. An object may appear anywhere in the image.
2. Objects come in many different sizes.
3. Objects may have many different shapes.
If objects are described with rectangular bounding boxes, those rectangles have varying aspect ratios. Because of this, the classical sliding window plus image pyramid scheme is prohibitively expensive for general object detection.
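A quick back-of-the-envelope calculation shows why. The image size, stride, pyramid scales, and window shapes below are hypothetical numbers chosen only to illustrate the combinatorics, not values from any particular detector:

```python
# Rough count of windows a classical sliding-window detector must classify.
# All numbers here are illustrative assumptions: a 640x480 image, a stride
# of 8 pixels, 5 image-pyramid scales, and 3 window aspect ratios.

def count_windows(img_w, img_h, win_w, win_h, stride):
    """Number of window positions of size (win_w, win_h) in one image."""
    if img_w < win_w or img_h < win_h:
        return 0
    nx = (img_w - win_w) // stride + 1
    ny = (img_h - win_h) // stride + 1
    return nx * ny

total = 0
for scale in [1.0, 0.8, 0.64, 0.512, 0.41]:      # image pyramid
    w, h = int(640 * scale), int(480 * scale)
    for win_w, win_h in [(64, 64), (64, 128), (128, 64)]:  # aspect ratios
        total += count_windows(w, h, win_w, win_h, stride=8)

print(total)  # tens of thousands of windows, each needing a classifier pass
```

Even with these modest settings, every one of those windows would need a full classifier evaluation, which is what makes the brute-force scheme impractical for general detection.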
Applications of object detection
Object detection is needed in many fields. Widely studied topics include the detection of faces, pedestrians, vehicles, and other important targets. Face detection was briefly introduced in the SIGAI article "History of Face Recognition Algorithm Evolution"; we will publish a dedicated review article on it later.
Pedestrian detection
Pedestrian detection plays an important role in video surveillance, traffic statistics, and autonomous driving; follow-up articles will cover it as well.
Vehicle detection
Vehicle detection plays an important role in intelligent transportation, video surveillance, and autonomous driving. Traffic statistics and automatic analysis of vehicle violations both depend on it. In autonomous driving, the first problem to solve is determining where the road is and what vehicles, people, or obstacles are nearby.
Other applications
Recognizing traffic lights and traffic signs is also essential for autonomous driving: the vehicle must determine from the signal lights and signs whether it may go straight or turn, in order to decide its behavior.
Traffic sign detection
Beyond these common targets, many fields need to detect objects of their own interest, such as surface defect detection for industrial materials and printed circuit boards.
In agriculture, identifying pests and diseases on crops also relies on object detection technology:
Crop pest detection
The application of artificial intelligence in medicine is currently a hot topic. Detecting and identifying lesions, such as tumors in MRI images, is of great significance for automated diagnosis and for providing timely, high-quality treatment.
Tumor detection
Object detection algorithms
DPM algorithm
Unlike detection of a single class of objects such as faces or pedestrians, general object detection must handle multiple object classes at once, which is harder. The classic approach to this problem is DPM (Deformable Part Model), which, as its name suggests, models an object as a set of deformable parts and is a part-based detection algorithm. Felzenszwalb proposed the model in 2008 and published a series of CVPR and NIPS papers on it; DPM won the PASCAL VOC detection challenge three times, and Felzenszwalb received the PASCAL VOC "Lifetime Achievement Award" in 2010.
Before deep convolutional neural networks (DCNNs) appeared, DPM was consistently the best-performing object detection algorithm. Its basic idea is to first extract hand-crafted DPM features (shown in the figure below) and then classify with a latent SVM. This feature extraction approach has obvious limitations: first, DPM features are complex and slow to compute; second, hand-crafted features generalize poorly under rotation, stretching, and viewpoint changes. These drawbacks greatly limit the scenarios where the algorithm can be applied.
DPM object detection pipeline
AlexNet
The ideas behind modern deep neural networks were proposed by Geoffrey Hinton as early as 2006, but it was not until 2012, when Alex Krizhevsky's now-famous AlexNet convolutional network won the ILSVRC 2012 image classification competition by a wide margin, that the technology truly entered the view of mainstream academia and industry. Deep neural networks overturned the traditional paradigm of hand-crafted feature extraction: relying on their strong representational power, and given abundant training data and sufficient training, they learn useful features on their own. This is a qualitative leap over manually designing features and then building algorithms on top of them.
A convolutional neural network can learn general representations of objects at multiple levels (the principles of convolutional networks and why they work will be introduced in a later SIGAI article):
Hierarchical feature learning
OverFeat
In 2013, OverFeat, proposed by Xiang Zhang et al. from Yann LeCun's team at New York University, won several first places in the ILSVRC 2013 competition. They improved on AlexNet and proposed a method to accomplish multiple tasks with the same convolutional network. The method makes full use of the CNN's feature extraction capability: features extracted during classification are reused for tasks such as localization and detection, and only the last few layers of the network need to be changed for a different task, with no need to retrain the whole network's parameters from scratch. This demonstrates and exploits the advantage of CNN feature sharing.
The main highlights of this work are:
Shared convolutional features used for multi-task learning.
The idea of fully convolutional networks.
Sliding windows applied on the feature map, which avoids a large amount of repeated computation; this became a classic technique reused by many subsequent algorithms.
OverFeat has several obvious flaws:
Its multi-scale dense sliding-window strategy leads to a large amount of computation.
Because there was no strong backbone network at the time, the representational power of the shared feature layer was limited; with no multi-scale feature fusion, performance on small objects was poor and the overall detection results were unsatisfactory: 24.3% mAP (which can be loosely understood as detection accuracy) on the ILSVRC 2013 data set.
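For readers unfamiliar with how mAP is scored: a predicted box only counts as correct if it overlaps a ground-truth box by enough, measured by intersection over union (IoU), commonly with a threshold of 0.5. A minimal sketch of the IoU computation (box format and threshold here are illustrative):

```python
# Minimal sketch of IoU (intersection over union), the overlap measure that
# underlies mAP: a detection counts as a true positive only if its IoU with
# a ground-truth box exceeds a threshold (commonly 0.5).

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))   # 0.333... (half-overlapping boxes)
print(iou((0, 0, 10, 10), (20, 20, 30, 30))) # 0.0 (disjoint boxes)
```

mAP then averages, over all classes, the area under each class's precision-recall curve computed with this matching rule.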
One problem with classical convolutional neural networks is that they only accept fixed-size input images, because the weight matrix between the last convolutional layer and the first fully connected layer has a fixed size, while the convolutional layers themselves place no restriction on the input image size. In object detection, however, the candidate region images fed to the network do not have a fixed size.
The following example illustrates how an already-designed DCNN model can be made to accept input images of arbitrary size. One solution is the fully convolutional network (FCN), which removes all fully connected layers and replaces them with convolutional layers:
Instead of flattening a 5×5 feature map into a one-dimensional vector and then applying a fully connected layer, the FCN applies a 5×5 convolution directly to the whole image. What happens when the input is, say, a 16×16 feature map? See the diagram below:
We find that the network's final output is a 2×2 feature map. In other words, an FCN can accept input images of arbitrary size. Note that the feature map output by the network is then no longer always 1×1; its size depends on the size of the input image.
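The effect above can be checked with simple size arithmetic. The layer stack below is a hypothetical one, chosen so that the "designed" input size yields a 1×1 output while a larger input yields a larger output map, in the spirit of the OverFeat-style illustration:

```python
# Sketch of why a fully convolutional network accepts any input size: track
# the spatial size through a stack of (kernel, stride) layers with no padding.
# The layer stack is an illustrative assumption, not any published network.

def out_size(n, kernel, stride):
    """Valid (no padding) output size of one conv/pool layer."""
    return (n - kernel) // stride + 1

def run_stack(n, layers):
    for kernel, stride in layers:
        n = out_size(n, kernel, stride)
    return n

# conv 5x5/1 -> pool 2x2/2 -> "fully connected" layer rewritten as a 5x5 conv
layers = [(5, 1), (2, 2), (5, 1)]

print(run_stack(14, layers))  # 1  (the input size the network was designed for)
print(run_stack(16, layers))  # 2  (a larger input simply yields a larger map)
```

Because the final layer is a convolution rather than a flatten-plus-matrix-multiply, nothing in the network pins the input to one size; larger inputs just produce a grid of outputs, one per receptive-field position.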
OverFeat introduced many innovations, but it cannot be regarded as a typical object detection pipeline, so we have presented it separately. Starting from R-CNN, the following sections introduce the development of DCNN-based object detection.
Once convolutional neural networks were applied to object detection, progress became unstoppable: in a short period of time the accuracy of detection algorithms improved enormously, which greatly advanced the adoption of the technology.
R-CNN
Region CNN (R-CNN) was proposed by Ross Girshick (known in the community as "RBG", a student of Felzenszwalb). It is a milestone in applying deep learning to object detection and laid the foundation for this subfield. With a clear and simple idea, it dramatically improved detection accuracy (31.4% mAP on the ILSVRC 2013 data set) after years of bottleneck under DPM-style methods. RBG is a leading figure in this area, and follow-up improvements such as Fast R-CNN, Faster R-CNN, and YOLO are all connected with him.
The main steps of R-CNN detection are:
1. Use the Selective Search algorithm to extract about 2,000 candidate regions (region proposals) from the image to be detected; these proposals may contain the objects of interest.
2. Warp all candidate boxes to a fixed size (227×227 in the original paper).
3. Use a DCNN to extract features from each candidate box, obtaining a fixed-length feature vector.
4. Feed the feature vectors into SVMs for classification to obtain category labels, and into a fully connected regression network to obtain the corresponding position coordinates.
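The steps above can be sketched structurally. In the snippet below, the proposal list stands in for Selective Search, and the CNN/SVM stages are only named in comments; the one concrete part is the crop-and-warp of step 2, shown with a simple nearest-neighbor resize, to make visible how every proposal is forced to the same fixed input size:

```python
import numpy as np

# Structural sketch of the R-CNN pipeline (steps 1-4 above). The proposal
# list and downstream stages are hypothetical stand-ins; only the
# crop-and-warp step is implemented, using nearest-neighbor sampling.

def warp(image, box, size=227):
    """Crop box = (x1, y1, x2, y2) and resize it to size x size."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    ys = np.arange(size) * crop.shape[0] // size   # nearest-neighbor rows
    xs = np.arange(size) * crop.shape[1] // size   # nearest-neighbor cols
    return crop[np.ix_(ys, xs)]

image = np.random.rand(480, 640, 3)
proposals = [(0, 0, 100, 50), (300, 200, 640, 480)]  # stand-in for Selective Search

patches = [warp(image, box) for box in proposals]    # step 2: warp each proposal
print([p.shape for p in patches])                    # all (227, 227, 3)
# Steps 3-4 would feed each patch through the CNN, then SVMs + box regression.
```

Note that both proposals, despite very different shapes, come out as 227×227 patches; this forced warping is exactly the deformation problem listed as flaw 4 below.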
R-CNN does not use the sliding-window scheme because its computational cost is high, generating a huge number of windows to classify; moreover, objects of different classes have different aspect ratios, so no single window size can be used to scan the image. The convolutional network used for feature extraction has five convolutional layers and two fully connected layers; its input is a fixed-size RGB image and its output is a 4096-dimensional feature vector. Candidate regions are classified with linear support vector machines: for each image to be detected, the feature vector of every candidate region is computed and sent to the SVMs for classification, and at the same time fed into a fully connected network for bounding-box coordinate regression.
Although R-CNN is cleverly designed, it still has many flaws:
1. Repeated computation. Although R-CNN is no longer exhaustive, it still produces about 2,000 candidate boxes through the proposal stage (Selective Search), each of which must be run through the backbone network separately for feature extraction. The computational load is still large, and since the proposals overlap heavily, much of this is in fact redundant computation.
2. Training and testing are not simple. Candidate region extraction, feature extraction, classification, and regression are all separate steps, and the intermediate data must be stored separately.
3. Slow speed. The flaws above make R-CNN remarkably slow: processing one image takes more than ten seconds on a GPU and even longer on a CPU.
4. Input image patches must be forcibly scaled to a fixed size (227×227 in the original paper), which deforms the objects and degrades detection performance.
SPPNet
Later, Kaiming He et al. at MSRA proposed SPPNet on the basis of R-CNN. Although this method still depends on candidate box generation, it moves the extraction of candidate-box feature vectors onto the shared convolutional feature map, so R-CNN's many CNN forward passes become a single one, greatly reducing the amount of computation (an idea borrowed from OverFeat).
R-CNN's convolutional network only accepts fixed-size input images. To comply with this size, either an image region of that size is cropped, which may fail to cover the whole object, or the image is scaled, which distorts it. In a convolutional neural network, the convolutional layers do not require a fixed input size; only the first fully connected layer does, because the weight matrix between it and the previous layer has a fixed size. The fully connected layers after it do not constrain the image size either. If some processing is inserted between the last convolutional layer and the first fully connected layer that turns feature maps of different sizes into a fixed-size fully connected layer input, the problem is solved.
SPPNet introduces the spatial pyramid pooling (SPP) layer, which performs spatial pyramid sampling on the convolutional feature map to obtain a fixed-length output, extracting features from regions regardless of their aspect ratio and scale. The concrete approach is to divide the feature map region into a fixed number of grid cells: for feature maps of different widths and heights, the width and height of each cell vary accordingly, and pooling each cell then yields a fixed-length output. The figure below shows the SPP operation:
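The grid-pooling idea is small enough to sketch directly. The pyramid levels (1×1, 2×2, 4×4) and channel count below are illustrative choices, not necessarily the paper's exact configuration:

```python
import numpy as np

# Minimal sketch of spatial pyramid pooling: max-pool a feature map over
# 1x1, 2x2, and 4x4 grids and concatenate the results, giving a fixed-length
# vector of (1 + 4 + 16) * C values regardless of the map's width and height.
# Grid levels and channel count are illustrative assumptions.

def spp(feature_map, levels=(1, 2, 4)):
    """feature_map: (H, W, C) array -> fixed-length 1-D vector."""
    h, w, c = feature_map.shape
    pooled = []
    for n in levels:
        # split rows/cols into n roughly equal strips and max-pool each cell
        row_edges = np.linspace(0, h, n + 1).astype(int)
        col_edges = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[row_edges[i]:row_edges[i + 1],
                                   col_edges[j]:col_edges[j + 1]]
                pooled.append(cell.max(axis=(0, 1)))
    return np.concatenate(pooled)

# Feature maps of different sizes map to the same output length:
print(spp(np.random.rand(13, 13, 256)).shape)  # (5376,)
print(spp(np.random.rand(10, 7, 256)).shape)   # (5376,)
```

Because the number of cells is fixed while the cell sizes stretch with the input, the fully connected layers after the SPP layer always see a vector of the same length.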
Compared with R-CNN, SPPNet's detection speed increased by more than 30 times. The following figure is a comparison of the R-CNN and SPPNet detection processes:
The following figure is the principle of SPPNet:
SPPNet detection framework
Like R-CNN, SPPNet's training must go through several stages, and the intermediate features must also be stored; the backbone network parameters are initialized from a classification network and are not optimized for the detection problem.
Fast R-CNN
Ross Girshick proposed Fast R-CNN (FRCNN) as a further improvement on SPPNet. Its main innovation is the RoI pooling layer, which uniformly samples the convolutional feature maps of candidate boxes of different sizes into fixed-size features. The RoI pooling layer is similar to the SPP layer but uses only a single grid scale for partitioning and pooling. The layer is differentiable, so during training gradients can propagate straight through it into the backbone network for optimization. FRCNN also improved on R-CNN and SPPNet where their training was multi-stage and time-consuming: it merges the deep network and the subsequent SVM classification stage into a single network that directly performs classification and regression. Training time on Pascal VOC dropped from R-CNN's 84 hours to 9.5 hours, and detection time per image from 45 seconds to 0.32 seconds.
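RoI pooling can be sketched as a single-scale version of the SPP operation described earlier. The 2×2 output grid and feature map sizes below are illustrative (Fast R-CNN typically uses a larger grid such as 7×7):

```python
import numpy as np

# Sketch of RoI pooling: like spatial pyramid pooling but with a single grid
# scale, so candidate boxes of any size on the shared feature map come out
# as a fixed out x out grid of max-pooled values. Sizes are illustrative.

def roi_pool(feature_map, roi, out=2):
    """feature_map: (H, W, C); roi: (x1, y1, x2, y2) in feature-map coords."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w, c = region.shape
    row_edges = np.linspace(0, h, out + 1).astype(int)
    col_edges = np.linspace(0, w, out + 1).astype(int)
    cells = [region[row_edges[i]:row_edges[i + 1],
                    col_edges[j]:col_edges[j + 1]].max(axis=(0, 1))
             for i in range(out) for j in range(out)]
    return np.stack(cells).reshape(out, out, c)

fmap = np.random.rand(32, 32, 8)                 # shared conv feature map
print(roi_pool(fmap, (0, 0, 11, 5)).shape)       # (2, 2, 8)
print(roi_pool(fmap, (4, 4, 30, 28)).shape)      # (2, 2, 8)
```

Because every RoI is pooled from the one shared feature map rather than from a separately cropped image, feature extraction runs once per image instead of once per proposal, which is where Fast R-CNN's speedup comes from.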