Image Semantic Segmentation Technology in AI Deep Learning
Image semantic segmentation
Image semantic segmentation means that the machine automatically partitions an image and recognizes what is in it. For example, given a photo of a person riding a motorcycle, the machine should be able to produce the segmentation on the right after making its judgment: red marks the person, green marks the vehicle, and black marks the background.
[Figure: segmentation result — person in red, vehicle in green, background in black]
So image segmentation is to image understanding what punctuating the sentences is to reading an ancient text: you first have to break the whole into meaningful pieces.
Before deep learning took off, many techniques for image segmentation already existed. The most famous is a graph-partitioning method called "Normalized cut," or "N-cut" for short.
The N-cut computation involves formulas for the connection weights, which we won't go into here. The main idea is to consider the relationship weight between every pair of pixels and, given a threshold, divide the image into two parts.
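As a rough illustration of that idea, here is a minimal NumPy sketch of an N-cut-style bipartition in the spirit of Shi & Malik's original paper; the affinity parameters sigma_i and sigma_x and the toy image are my own assumptions, not anything from this article:

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(img, sigma_i=0.1, sigma_x=4.0, thresh=0.0):
    """Minimal sketch of a normalized-cut bipartition (after Shi & Malik).

    Pixels are graph nodes; edge weights combine intensity similarity
    and spatial closeness. The sign of the second-smallest generalized
    eigenvector of the graph Laplacian splits the image in two.
    """
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    vals = img.ravel().astype(float)

    # Affinity: close in brightness AND close in space -> strong weight.
    d_val = (vals[:, None] - vals[None, :]) ** 2
    d_pos = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d_val / sigma_i**2) * np.exp(-d_pos / sigma_x**2)

    D = np.diag(W.sum(axis=1))
    # Relaxed N-cut: generalized eigenproblem (D - W) y = t D y.
    evals, evecs = eigh(D - W, D)
    fiedler = evecs[:, 1]                  # second-smallest eigenvector
    return (fiedler > thresh).reshape(h, w)

# Toy usage: a bright square on a dark background splits cleanly.
img = np.zeros((10, 10)); img[2:6, 2:6] = 1.0
mask = ncut_bipartition(img)
```

Thresholding the second-smallest eigenvector gives exactly one two-way split, which matches the observation below that each run of N-cut cuts the image only once.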
The following figure is an example in which the relationship between pixels is described simply as a distance, and the image is divided according to differences in that distance:
[Figure: pixels grouped according to distance-based affinity]
In practice, each run of N-cut cuts the image only once, so to segment multiple objects in an image you must run it multiple times. The figure below shows the results of running N-cut 7 times on the original image (a).
[Figure: original image (a) and the results of seven successive N-cuts, (b) through (h)]
However, it is clear that this simple, brute-force segmentation is not accurate: the athlete's leg is cut away in image (b) and his arm is cut away in image (h), both obviously wrong.
The flaws of N-cut are obvious, so an improved method followed. To avoid failures like the one above, where a large color difference between clothing and limbs causes a segmentation error, it introduces human-computer interaction: manual intervention is required during segmentation.
This interactive technique is called GrabCut.
[Knock on the blackboard]~~ Note that this technique is what Photoshop (PS) uses.
The technique works like this: given a picture, you manually draw a red box around the region you want to extract (that is, the segmentation we are talking about). The machine then performs a "subject calculation" on the content slightly inside the box. ("Subject calculation" is my own term, to spare you the complex design and formulas behind it.) Because the machine assumes the red box frames the result the user wants, it takes the middle of the box as the main reference, eliminates whatever differs from that subject, and keeps the rest as the result.
[Figure: GrabCut — a red box drawn around the subject to be extracted]
In this technique, the part that is kept is called the "foreground," and the part that is eliminated is called the "background."
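OpenCV ships a GrabCut implementation, so a minimal sketch of the red-box workflow might look like the following; the file name and box coordinates are placeholders, not values from this article:

```python
import cv2
import numpy as np

img = cv2.imread("photo.jpg")            # hypothetical input image
mask = np.zeros(img.shape[:2], np.uint8)

# The user-drawn "red box" around the subject: (x, y, width, height).
rect = (50, 30, 300, 400)                # assumed coordinates

# GrabCut's internal model state; OpenCV requires shape (1, 65).
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)

cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5,
            cv2.GC_INIT_WITH_RECT)

# Definite/probable background becomes 0; everything else is foreground.
fg = np.where((mask == cv2.GC_BGD) | (mask == cv2.GC_PR_BGD), 0, 1)
result = img * fg[:, :, None].astype(np.uint8)
```

The manual marking described a few paragraphs below corresponds to painting foreground/background labels into `mask` by hand and rerunning with `cv2.GC_INIT_WITH_MASK`.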
Sometimes it works quite well, but when the scene gets a bit more complicated, problems appear. For example, in the picture of the helmet-wearing soldier below, the color of the helmet is very similar to the color of the rock behind it, so the machine removes the helmet as background while keeping the rock near the neck as foreground.
[Figure: soldier in a helmet — the helmet is wrongly removed and nearby rock wrongly kept]
At this point manual intervention is required: you mark the image by hand, with white strokes indicating the foreground you want to keep and red strokes indicating the background, to help the machine judge. After another run, a result much closer to what you expect is obtained.
Although the results GrabCut gives look decent, its disadvantages are obvious. First, like N-cut, it can only do two-class segmentation: in plain words, it separates a single class at a time, foreground or background, so an image with multiple targets must be processed multiple times. Second, it requires manual intervention, a weakness that is simply a dead end for its future.
But human ingenuity is endless, and deep learning finally began to flourish.
Deep learning
Deep learning is a branch of machine learning. It mainly refers to deep neural network algorithms. Deep neural networks have more layers than ordinary neural networks, can better capture deep relationships in data, and yield more accurate models; they are mainly used for feature learning.
Don't rush to faint; let's first look at how a neural network works.
A neural network is an artificial system modeled after biological neurons. Each neuron has multiple inputs and a single output, and that output serves as an input to the next neuron. (Please recall the spindly, many-armed neuron cell from biology class. What? You were a liberal arts student? Dragged out~~)
The figure below shows a single neuron:
[Figure: a single neuron]
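As a minimal sketch of what that figure shows (the sigmoid activation is my own choice; the article doesn't name one):

```python
import numpy as np

def neuron(inputs, weights, bias):
    """A single artificial neuron: a weighted sum of the inputs plus a
    bias, passed through a nonlinearity, yielding one output."""
    z = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation

# Multiple inputs in, a single output out; that output can then feed
# the next neuron in the network.
out = neuron(np.array([0.5, -1.0, 2.0]),
             np.array([0.8, 0.2, -0.5]),
             bias=0.1)
print(out)
```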
Organizing these individual neurons together forms a neural network. The picture below is a three-layer neural network structure:
[Figure: a three-layer neural network]
The leftmost raw input information in the above figure is called the input layer, the rightmost neuron is called the output layer (the output layer has only one neuron in the above figure), and the middle is called the hidden layer.
A deep neural network has relatively many layers, often reaching 8-10 (an ordinary neural network usually has 3-4).
Among the image recognition algorithms in use, the mainstream technique has been the Convolutional Neural Network (CNN), which is a kind of deep neural network.
But at CVPR 2015, a seriously impressive paper appeared (Passerby A: What on earth is CVPR? Answer: CVPR can be simply understood as the most important conference in this field, the international conference on computer vision and pattern recognition), proposing the FCN: Fully Convolutional Networks.
Why is this FCN paper such a big deal? Its name differs from CNN by only one letter; what could be so special?
Well, I have to say it really is a case of "off by an inch, off by a thousand miles."
First, let me help you review convolution.
I have consulted plenty of books; convolution has many formulas and many derivations, but to keep this article readable I will skip straight to its physical meaning and not fuss over the formulas. The physical meaning of convolution is, in essence, "weighted superposition."
When convolving an image, depending on the size of the convolution kernel, the input and output will also differ in scale.
Let's take a look at an animation (just as an example):
[Animation: a 3*3 convolution kernel sliding over a 5*5 input]
Treat the 5*5 grid on the left of the figure as the image input. The moving yellow 3*3 grid, with the numbers (*1/*0) inside it, is the convolution kernel. The kernel slides over the input with a stride of 1, from the upper-left corner to the lower-right corner, moving 9 times in total.
Each of the nine positions corresponds to one cell of the 3*3 grid on the right, and the number in that cell is the convolution value (here, the sum of the element-wise products over the area covered by the kernel).
After the 9th move, the new 3*3 matrix on the right is the output of this convolution layer.
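That sliding computation is easy to spell out in code. A minimal NumPy sketch, with input and kernel values assumed for illustration (the idea, not the exact numbers, is what matters):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image; each output cell is the sum of
    element-wise products over the window (the 'weighted superposition')."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.array([[1, 1, 1, 0, 0],       # a 5*5 input like the figure's
                  [0, 1, 1, 1, 0],       # (values assumed)
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],            # a 3*3 kernel of *1/*0 weights
                   [0, 1, 0],
                   [1, 0, 1]])
print(conv2d(image, kernel))             # -> a 3*3 result, 9 positions
```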
If this is still not easy to grasp, no matter: I have a more intuitive illustration ^_^.
In the actual computation, the input is a raw image and a filter (a set of fixed weights, which is what the convolution kernel above really is); taking their inner product yields new two-dimensional data.
Different filters produce different output data, such as contours or color depth. To extract different features from the image, you use different filters to pull out the specific information you want.
[Figure: two different filters applied to the same input, producing two different outputs]
The figure above shows one convolution step inside a convolutional layer. Note that the contents of the top and bottom convolution kernels differ, so two different results are obtained.
The new two-dimensional output to the right of the equals sign becomes the input to the next convolutional layer in the CNN; that is, when the next convolutional layer computes, the image on the right serves as its input.
In the CNN network, a total of 5 convolutional layer calculations are performed.
Passerby A: And what on earth do you get in the end?
Shen MM: Ahem. After the five consecutive convolution computations come three fully connected layers.
Passerby A: What is a fully connected layer?
Shen MM: A fully connected layer is no longer a two-dimensional image but a one-dimensional vector.
Passerby A has run off to cry in the bathroom.
Of these three one-dimensional vectors, the first two have length 4096 and the last has length 1000.
Why 1000?
Because this CNN distinguishes 1000 classes: the 1000 elements of the final vector give the probability that the original input image is each of the 1000 categories of object.
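The article doesn't say how the raw scores become probabilities; the usual choice is a softmax over the final layer, so here is a hedged sketch with random stand-in weights:

```python
import numpy as np

def softmax(z):
    """Turn 1000 raw scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())              # subtract max for stability
    return e / e.sum()

# Stand-in shapes matching the text: 4096 features in, 1000 classes out.
x = np.random.randn(4096)                # output of the previous FC layer
W = np.random.randn(1000, 4096) * 0.01   # assumed random weights
b = np.zeros(1000)

probs = softmax(W @ x + b)               # one probability per category
print(probs.argmax(), probs.max())       # the most likely class
```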
Yes: after all that work on a picture, what you finally get is the answer to "what is this a picture of."
[Figure: CNN pipeline ending in a 1000-way output — this image is identified as a car]
The green arrow on the right indicates the vector of the last fully connected layer, holding the probabilities of the various objects; the image above is identified as a car.
Well, that is the processing pipeline of the convolutional neural network, CNN (trust me, I have simplified it).
So what about a fully convolutional network?
Note that a CNN's input is an image, while its output is a single result: a value, a probability.
What FCN proposes is that the input is a picture and the output is also a picture, learning a pixel-to-pixel mapping.
[Figure: CNN (top) versus FCN (bottom)]
The upper part of the above figure is the CNN network, and the lower part is the FCN network.
So where is the “full convolution”?
In a CNN, the last three layers are one-dimensional vectors and are no longer computed by convolution, so the two-dimensional spatial information is lost. In an FCN, those three layers are all converted into convolutional layers of 1*1 kernels, with channel counts matching the original vector lengths, so the last three layers are convolutional too. The whole model then contains only convolutional layers, with no vectors, hence "fully convolutional."
The FCN converts layers 6 and 7 from vectors of length 4096 into 4096-channel convolutional layers, and layer 8 into a 21-channel convolutional layer. The 8th layer shrinks from 1000 to 21 because the recognition dataset FCN uses is PASCAL VOC, which has 20 object classes plus one background class. (See the appendix for PASCAL VOC.)
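To make the 1*1-kernel conversion concrete, here is a minimal sketch of applying a fully connected layer's weights at every spatial position of a feature map; the 7*7 grid and the random weights are assumptions for illustration:

```python
import numpy as np

def fc_as_1x1_conv(feat, W, b):
    """Apply a fully connected layer's weights as a 1*1 convolution.

    feat: (C_in, H, W) feature map; W: (C_out, C_in); b: (C_out,).
    The same FC computation runs at every spatial position, so the
    output keeps a spatial layout: (C_out, H, W).
    """
    return np.einsum('oc,chw->ohw', W, feat) + b[:, None, None]

# Hypothetical sizes echoing the text: 4096 channels in, 21 classes out.
feat = np.random.randn(4096, 7, 7)       # assumed 7*7 spatial grid
W = np.random.randn(21, 4096) * 0.01
b = np.zeros(21)
scores = fc_as_1x1_conv(feat, W, b)      # (21, 7, 7): per-position scores
```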
To put it another way, the different colors in the segmentation images used below (and throughout the text) represent different object categories, 21 colors in total:
[Figure: the 21 PASCAL VOC category colors]
CNN recognition is image-level recognition: from an image to a single result. FCN recognition is pixel-level recognition: every pixel of the input image gets a corresponding label on the output, indicating which object/category that pixel most likely belongs to.
It is important to point out that in an actual image semantic segmentation test, the input is an H*W*3 three-channel color image and the output is an H*W matrix.
You can think of each pixel as carrying multi-dimensional information; for color, it splits into three layers corresponding to the R, G, and B values. (If you don't know what RGB is, I hereby identify you as a liberal arts student; please evacuate quickly, thank you.)
So when convolution is performed, each channel is computed independently, and the per-channel results are then superimposed to obtain the final output of the convolutional layer.
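A minimal sketch of that per-channel-then-sum computation (toy sizes and random values, assumed for illustration):

```python
import numpy as np

def conv2d_multichannel(image, kernels):
    """Convolve each channel with its own kernel, then sum the results,
    as described above. image: (H, W, 3); kernels: (3, kh, kw)."""
    kh, kw = kernels.shape[1:]
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for c in range(image.shape[2]):      # R, G, B computed independently
        for i in range(oh):
            for j in range(ow):
                out[i, j] += np.sum(image[i:i+kh, j:j+kw, c] * kernels[c])
    return out                           # channel results superimposed

rgb = np.random.rand(5, 5, 3)            # toy H*W*3 input
kernels = np.random.rand(3, 3, 3)        # one 3*3 kernel per channel
print(conv2d_multichannel(rgb, kernels).shape)  # -> (3, 3)
```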
If the convolution stride is 1, the convolution is computed pixel by pixel, and you can imagine how large the computation gets. In practice, though, adjacent pixels often belong to the same class, so computing per pixel is redundant; the output is therefore put through a pooling step after convolution.
So what is pooling?
Come, let's take a look at an animation:
[Animation: pooling over an input image]
Pooling simply partitions the input image. Most of the time we choose non-overlapping regions: if the pooling window is h*h and the stride is j, then h = j gives the non-overlapping case in the figure above; if you want overlap, you just need h > j.
The whole image is partitioned, then the mean or the maximum of the values in each region is taken as the new value representing that region and placed into the pooled two-dimensional map. The resulting map is the pooling output.
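In code, with the window h and stride j named as above (a minimal sketch):

```python
import numpy as np

def pool2d(image, h=2, j=2, mode="max"):
    """Cut the image into h*h windows taken every j pixels (h == j means
    non-overlapping); each window is replaced by its max or mean."""
    oh = (image.shape[0] - h) // j + 1
    ow = (image.shape[1] - h) // j + 1
    out = np.zeros((oh, ow))
    reduce = np.max if mode == "max" else np.mean
    for r in range(oh):
        for c in range(ow):
            out[r, c] = reduce(image[r*j:r*j+h, c*j:c*j+h])
    return out

x = np.arange(16).reshape(4, 4)
print(pool2d(x))          # 2*2 max pooling halves each side: 4*4 -> 2*2
```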
In both the CNN and FCN models, each convolutional stage includes [convolution + pooling] processing, the legendary "downsampling." The consequence is that the image's spatial information shrinks: each stage's output is 1/2 the size of the previous one per side, so by the fifth stage the image is 1/32 the size of the original.
In the CNN algorithm this doesn't matter much, because the CNN outputs only one result, "what this picture is." But FCN is different: FCN does pixel-level recognition, so however many pixels go in, that many pixels must come out, fully mapped, with information on the output image indicating what object/category each pixel might be.
So this 1/32-size image has to be restored.
Here a purely mathematical technique called "deconvolution" is used. Deconvolving the 5th layer's output expands the image back to its original size (strictly speaking, approximately the original size: generally a bit larger, then cropped; why it comes out larger is a bit complicated, so I won't go into it here and will cover it in a later article).
- This "deconvolution" is called "upsampling" (corresponding to downsampling).
[Figure: upsampling (deconvolution) restoring the downsampled feature map]
Technically, we can deconvolve the output of any convolutional layer to get a final image: for example the third layer (8s, 8x magnification), the fourth layer (16s, 16x magnification), or the fifth layer (32s, 32x magnification), each yielding its own segmentation result.
[Figure: the FCN-32s, FCN-16s, and FCN-8s upsampling paths]
Let's look at a comparison of the results restored from the different layers:
[Figure: comparison of 32x, 16x, and 8x restored segmentation results]
The comparison makes it obvious that the 16x and 8x restorations show much better detail. In the 32x restored map, although the general meaning comes through, the detailed parts (the edges) are really rough; you can barely make out the object's shape.
Why is this so?
This involves the concept of the receptive field. A shallower (earlier) convolutional layer has a smaller receptive field and learns local details, while a deeper layer has a larger receptive field and learns more abstract, global features; that is why upsampling from a shallower layer preserves finer edges.
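As a hedged illustration of how the receptive field grows with depth (the layer sizes below are assumptions, not this article's exact architecture):

```python
def receptive_field(layers):
    """Track how the receptive field grows layer by layer.
    layers: list of (kernel_size, stride) tuples."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump             # each layer widens the view by
        jump *= s                        # its kernel, scaled by the
    return rf                            # accumulated stride ("jump")

# Toy stack (sizes assumed): five [3*3 conv + 2*2 pool] stages.
stack = [(3, 1), (2, 2)] * 5
print(receptive_field(stack))            # deeper -> a far larger field
```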