A common method of image segmentation in deep learning
The success of deep learning in a range of high-level computer vision tasks, in particular the success of supervised CNNs (Convolutional Neural Networks) in image classification and object detection, has inspired researchers to explore the ability of such networks to perform pixel-level labeling tasks such as semantic segmentation. The outstanding advantage of these deep learning techniques over traditional methods is that they automatically learn feature representations appropriate to the problem at hand. Traditional methods usually rely on hand-crafted features, and adapting them to a new dataset typically requires expert experience and time spent tuning those features.
This article compares the characteristics of deep-learning-based image segmentation algorithms (see the original paper for details).
The most successful image segmentation techniques share a common pioneer: the FCN (Fully Convolutional Network). CNNs are highly efficient visual models that learn hierarchies of features.
By replacing the fully connected layers with convolutional layers, so that the network outputs spatial maps rather than class scores, well-known existing classification models such as AlexNet, VGG, GoogLeNet and ResNet can be converted into fully convolutional models.
These maps are upsampled using fractionally strided convolutions (also known as deconvolutions or transposed convolutions) to produce dense pixel-level label output.
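As a sketch of the idea (not the original FCN implementation), a 1-D fractionally strided convolution can be written in plain Python: each input sample is scattered through the kernel at stride-spaced output positions, which is equivalent to inserting zeros between the samples and then running an ordinary convolution. The function and kernel values below are illustrative.

```python
def fractionally_strided_conv1d(x, kernel, stride=2):
    """Upsample x by `stride` with a transposed (fractionally strided)
    convolution: scatter each input sample through the kernel, which is
    equivalent to zero-stuffing followed by a 'full' convolution."""
    out_len = (len(x) - 1) * stride + len(kernel)
    y = [0.0] * out_len
    for i, xi in enumerate(x):           # each input sample...
        for j, kj in enumerate(kernel):  # ...spreads over kernel taps
            y[i * stride + j] += xi * kj
    return y

# A length-3 signal upsampled with a simple triangular (linear) kernel:
print(fractionally_strided_conv1d([1.0, 2.0, 3.0], [0.5, 1.0, 0.5]))
# [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 1.5]
```

With a learned kernel in place of the fixed one, this is exactly the operation the FCN decoder uses to recover a dense, input-sized output.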
This work is considered a milestone in image segmentation: it showed how CNNs can be trained end-to-end for segmentation and how to efficiently learn dense predictions for inputs of arbitrary size. FCNs are regarded as the cornerstone of deep-learning semantic segmentation across all kinds of standard datasets.
FCN's shortcomings:
Spatial invariance: global context information is not taken into account effectively.
Instance awareness: individual objects are not modeled by default, so different instances of the same class cannot be told apart.
Efficiency: on high-resolution images the network is far from real time.
Unstructured data: it does not fully adapt to unstructured data such as 3D point clouds.
4.1 Decoder variants
Encoding: producing a low-resolution image representation or feature map; it converts the image into features.
Decoding: mapping that low-resolution representation to pixel-level labels; it converts the features back into an image of labels.
Besides the FCN, there are other approaches that adapt classification networks for semantic segmentation. The most significant differences among these methods usually lie in the decoding stage.
SegNet is a clear example.
SegNet consists of a series of upsampling and convolution layers, followed by a softmax classifier that predicts pixel-level labels at the same resolution as the input. Each upsampling layer in the decoding stage corresponds to a max-pooling layer in the encoding stage (upsampling is paired one-to-one with max pooling). These layers use the pooling indices from the corresponding encoder max-pooling step to upsample their input feature maps. The upsampled (sparse) feature maps are then convolved with a bank of trainable filters to produce dense feature maps. Once the feature maps have been restored to the original resolution, the final segmentation is produced by the softmax classifier.
FCN-based architectures instead use learned deconvolution filters to upsample the feature maps. The upsampled feature maps are then added element-wise to the corresponding feature maps produced by the encoding stage. Figure 10 shows the differences between the two approaches.
In SegNet, the locations of the maxima are recorded during max pooling; during upsampling, the same locations are used to place the values into the larger feature map, and the remaining positions are filled with zeros.
In the FCN, a deconvolution kernel [a, b, c, d] is learned; it is applied (as a transposed convolution) to [x1, x2, ..., x16], and the result is added to [y1, y2, ..., y16].
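The SegNet side of this comparison can be sketched in plain Python for the 1-D case; the function names below are illustrative, not SegNet's actual code. The encoder records where each maximum came from, and the decoder puts values back only at those positions, filling the rest with zeros.

```python
def max_pool_with_indices(x, size=2):
    """SegNet-style max pooling over non-overlapping windows.
    Returns pooled values plus the index of each maximum, which
    the decoder reuses for unpooling."""
    pooled, indices = [], []
    for start in range(0, len(x), size):
        window = x[start:start + size]
        k = max(range(len(window)), key=lambda i: window[i])
        pooled.append(window[k])
        indices.append(start + k)
    return pooled, indices

def max_unpool(pooled, indices, out_len):
    """SegNet-style unpooling: place each value back at the recorded
    location of its maximum; every other position becomes zero."""
    y = [0.0] * out_len
    for v, i in zip(pooled, indices):
        y[i] = v
    return y

x = [1.0, 3.0, 2.0, 0.5, 4.0, 4.5]
p, idx = max_pool_with_indices(x)     # p = [3.0, 2.0, 4.5]
print(max_unpool(p, idx, len(x)))     # [0.0, 3.0, 2.0, 0.0, 0.0, 4.5]
```

The sparse map produced by `max_unpool` is what SegNet then densifies with its trainable decoder filters.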
4.2 Incorporating context knowledge
Semantic segmentation requires information at several spatial scales, and it also requires a balance between local and global information.
Fine-grained, local information leads to better pixel-level accuracy.
Image-level contextual information helps resolve ambiguities in local features.
Pooling layers give the network a degree of spatial invariance, reduce the amount of computation, and help capture global context. Even in a plain CNN without pooling layers, the receptive field of each unit can only grow linearly with the number of layers.
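The linear growth can be checked with the standard receptive-field recurrence: each layer adds (k - 1) times the accumulated stride. The helper below is a generic sketch, not tied to any particular network.

```python
def receptive_field(kernel_sizes, strides=None, dilations=None):
    """Receptive field of one output unit after a stack of conv layers.
    Each layer grows the field by (k - 1) * dilation * jump, where
    `jump` is the product of the strides of the preceding layers."""
    n = len(kernel_sizes)
    strides = strides or [1] * n
    dilations = dilations or [1] * n
    rf, jump = 1, 1
    for k, s, d in zip(kernel_sizes, strides, dilations):
        rf += (k - 1) * d * jump
        jump *= s
    return rf

# Stacked 3-wide, stride-1 convolutions: the field grows only linearly.
print([receptive_field([3] * n) for n in range(1, 6)])  # [3, 5, 7, 9, 11]
```

Adding stride (or pooling, which behaves like stride here) makes `jump` grow, which is why pooling layers enlarge the receptive field so much faster than plain convolutions.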
There are many ways to make CNNs aware of global information: adding CRFs (Conditional Random Fields) as a post-processing stage, dilated convolutions, multi-scale aggregation, and context modeling with RNNs.
4.2.1 Conditional random fields
The spatial-transformation invariance of CNN architectures limits the spatial accuracy of the segmentation task.
CRFs can combine low-level image information, such as the interactions between pixels, with the output of a multi-class inference system, i.e. the label of each pixel. This combination is especially important for capturing long-range dependencies.
The DeepLab model uses a fully connected pairwise CRF as a separate post-processing stage to refine the segmentation. Each pixel is modeled as a node in the field, and a pairwise term is defined for every pair of pixels no matter how far apart they are (a dense, or fully connected, factor graph).
Considering both short- and long-range interactions helps the system recover fine details.
The fully connected model would be inefficient to optimize exactly, but it can be approximated efficiently with probabilistic (mean-field) inference.
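A toy version of the fully connected pairwise energy makes the idea concrete. Labelings are scored by per-pixel unary costs plus a Potts penalty between every pair of pixels, weighted by a Gaussian bilateral kernel so that nearby, similar-looking pixels are encouraged to agree. All parameter values and names here are illustrative, not DeepLab's.

```python
import math

def dense_crf_energy(labels, unary, positions, intensities,
                     w=1.0, theta_pos=3.0, theta_int=10.0):
    """Energy of a labeling under a toy fully connected pairwise CRF:
    unary costs plus a Potts term between EVERY pair of pixels,
    weighted by a Gaussian kernel over position and intensity."""
    n = len(labels)
    energy = sum(unary[i][labels[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] != labels[j]:  # Potts: penalize disagreement
                dp = (positions[i] - positions[j]) ** 2
                di = (intensities[i] - intensities[j]) ** 2
                energy += w * math.exp(-dp / (2 * theta_pos ** 2)
                                       - di / (2 * theta_int ** 2))
    return energy

# Three similar neighboring pixels; the middle one weakly prefers the
# "wrong" label, but the pairwise term makes the smooth labeling cheaper:
unary = [[0.0, 2.0], [1.0, 0.9], [0.0, 2.0]]
pos, inten = [0.0, 1.0, 2.0], [10.0, 10.0, 10.0]
print(dense_crf_energy([0, 0, 0], unary, pos, inten) <
      dense_crf_energy([0, 1, 0], unary, pos, inten))  # True
```

Real implementations never enumerate labelings like this; they minimize (approximately) the same kind of energy with efficient mean-field updates.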
4.2.2 Dilated convolutions
Dilated convolutions are a generalization of Kronecker-factored convolutional filters, and they can grow the receptive field exponentially without losing resolution.
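A quick sketch shows the exponential growth when the dilation rate doubles from layer to layer, as in Yu and Koltun's context module; the helper below is illustrative only. Each k-wide layer with dilation d adds (k - 1) * d to the receptive field, and no stride or pooling is involved, so resolution is preserved.

```python
def dilated_receptive_field(num_layers, k=3):
    """Receptive field after stacking `num_layers` k-wide dilated
    convolutions (stride 1) whose dilation rates double: 1, 2, 4, ...
    Each layer adds (k - 1) * dilation to the field."""
    rf = 1
    for layer in range(num_layers):
        rf += (k - 1) * (2 ** layer)
    return rf

# The field grows exponentially with depth while resolution is kept:
print([dilated_receptive_field(n) for n in range(1, 6)])  # [3, 7, 15, 31, 63]
```

Compare this with plain stride-1 convolutions, where five 3-wide layers only reach a receptive field of 11.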