Object detection and localization in deep learning
In computer vision, object detection centers on two questions: where is the object? (localization) and what is the object? (classification).
The classification part is relatively easy to understand: extract features -> identify features -> recognize the object as a whole (classification).
The localization part is less intuitive if you have not dealt with it before. After all, the neural network operates on pixel values; how does it recover the position of those pixels? This post was written while reading the relevant material, as a way of organizing what I learned.
In early deep-learning object detection, localization was not handed to the neural network. Instead, traditional algorithms (based on cues such as color, texture, and gradients) extract a large set of candidate regions, and filtering algorithms or rules then discard or merge most of the unwanted ones. The remaining candidate regions are resized to the network's input size and fed to the network for classification.
Extract candidate regions -> filter & merge -> neural network classification
Candidate regions are typically extracted with algorithms such as Selective Search, which rely on color, texture, and similar low-level cues and therefore have a known weakness: limited robustness.
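As a concrete illustration, here is a minimal sketch of extracting candidate regions with the Selective Search implementation shipped in opencv-contrib-python; the image path is a placeholder and the choice of the "fast" mode is just for the example.

```python
import cv2

# Selective Search from opencv-contrib-python (cv2.ximgproc); "image.jpg" is a placeholder path.
img = cv2.imread("image.jpg")
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()      # faster but coarser; switchToSelectiveSearchQuality() is the alternative
rects = ss.process()                  # N x (x, y, w, h) candidate boxes, often a few thousand per image
print(len(rects), "candidate regions")
```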
For filtering and merging, NMS-style rules are common: discard boxes whose area falls below a threshold or that extend beyond the image border, and merge (suppress) boxes whose overlap with a better box exceeds a given ratio.
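A minimal NumPy sketch of this kind of filtering and merging, assuming [x1, y1, x2, y2] boxes with confidence scores; the thresholds are illustrative, not values from the original text.

```python
import numpy as np

def filter_and_nms(boxes, scores, iou_thresh=0.5, min_area=0.0, img_w=None, img_h=None):
    """Drop tiny / out-of-image boxes, then greedily suppress boxes that overlap a kept box too much."""
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)

    valid = areas > min_area                              # discard boxes below the area threshold
    if img_w is not None and img_h is not None:           # discard boxes crossing the image border
        valid &= (x1 >= 0) & (y1 >= 0) & (x2 <= img_w) & (y2 <= img_h)

    order = scores.argsort()[::-1]                        # highest score first
    order = order[valid[order]]

    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the best remaining box against all others still in the queue
        xx1, yy1 = np.maximum(x1[i], x1[order[1:]]), np.maximum(y1[i], y1[order[1:]])
        xx2, yy2 = np.minimum(x2[i], x2[order[1:]]), np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]              # merge/suppress heavily overlapping boxes
    return keep
```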
R-CNN and related methods follow this scheme.
The later SPPNet improves on R-CNN in two main respects. The first is dropping the original warp step: because R-CNN ends in fully connected (FC) layers with a preset number of neurons, the input size must be fixed, and R-CNN's crop/warp operation deforms the objects in the image and hurts recognition accuracy.
The figure in the paper shows what the crop and warp operations do to an object.
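To make the fixed-size constraint concrete, here is a tiny PyTorch illustration (the channel count, 7x7 spatial size, and 21-way output are arbitrary choices for the example): an FC layer with a preset number of input neurons simply cannot accept a feature map of a different spatial size.

```python
import torch
import torch.nn as nn

fc = nn.Linear(256 * 7 * 7, 21)          # FC layer with a preset number of input neurons

ok = torch.randn(1, 256, 7, 7)           # the expected fixed size works
print(fc(ok.flatten(1)).shape)           # torch.Size([1, 21])

bad = torch.randn(1, 256, 10, 6)         # any other spatial size breaks the matrix multiply
try:
    fc(bad.flatten(1))
except RuntimeError as err:
    print("size mismatch:", err)
```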
SPPNet instead uses a combination of pooling operations at different scales to produce exactly the number of features the FC layer expects.
Taking the figure in the paper as an example, suppose the FC layer expects 21 features per channel. Three pooling levels are set up: one max pool over the entire region (1 bin), one that divides the region into four parts (4 bins), and one that divides it into sixteen small blocks (16 bins), giving 16 + 4 + 1 = 21 features in total. A natural question is how this pooling avoids losing the position information of the Selective Search boxes; many blog posts skip this point. The pyramid pooling is not applied to the whole feature map. Before this step, SPPNet's second improvement comes in: the (x, y, w, h) boxes computed by Selective Search on the original image are projected onto the feature maps produced by the convolution and pooling layers, relying on the fact that relative positions are preserved. The pyramid pooling described above is then applied to each projected proposal, and the resulting fixed-length vector is fed to the FC layer.
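A minimal PyTorch sketch of this step, assuming a total convolutional stride of 16 and a (1, 2, 4) pyramid: the proposal's (x, y, w, h) from the original image is projected onto the shared feature map, and adaptive max pooling turns the projected region, whatever its size, into the fixed 1 + 4 + 16 = 21 bins per channel that the FC layer expects.

```python
import torch
import torch.nn.functional as F

def spp_for_proposal(feature_map, box_xywh, stride=16, levels=(1, 2, 4)):
    """Project one proposal onto the feature map, then pyramid-pool it to a fixed-length vector.

    feature_map: (C, H, W) conv feature map of the whole image
    box_xywh:    (x, y, w, h) proposal in original-image coordinates
    stride:      assumed total downsampling factor of the conv layers
    levels:      pyramid grid sizes; (1, 2, 4) gives 1 + 4 + 16 = 21 bins per channel
    """
    x, y, w, h = box_xywh
    # Relative positions are preserved, so dividing by the stride maps the box onto the feature map.
    x1, y1 = int(x / stride), int(y / stride)
    x2 = max(x1 + 1, int((x + w) / stride))
    y2 = max(y1 + 1, int((y + h) / stride))
    region = feature_map[:, y1:y2, x1:x2].unsqueeze(0)                       # (1, C, h', w')

    # Pool the region at each pyramid level and concatenate into one fixed-length vector.
    parts = [F.adaptive_max_pool2d(region, (n, n)).flatten(1) for n in levels]
    return torch.cat(parts, dim=1)                                           # (1, C * 21) for any proposal size

# Example: a 512-channel feature map and one proposal from the original image
fmap = torch.randn(512, 38, 50)
vec = spp_for_proposal(fmap, (120, 80, 200, 160))
print(vec.shape)                                                             # torch.Size([1, 10752]) == 512 * 21
```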
With these improvements, SPPNet is many times faster. R-CNN must push each proposal through the network one by one, whereas SPPNet runs the convolutional layers once and pyramid-pools all proposals from the shared feature map. A single image may have around 2k proposals, so the gap in speed is easy to see.