Deep Learning: Fully Convolutional Networks (FCN) Explained
Background
A CNN can classify a whole picture, but how can we identify specific objects within a picture? Before 2015 this was still an open problem. Then Jonathan Long published "Fully Convolutional Networks for Semantic Segmentation," opening up the field of semantic segmentation of images, and countless researchers have followed him into it since.
[figure]
Fully Convolutional Networks (FCN)
CNN and FCN
Usually, a CNN appends several fully connected layers after the convolutional layers, mapping the feature maps produced by convolution into a fixed-length feature vector. Classic CNN architectures such as AlexNet are suited to image-level classification and regression tasks, because they ultimately produce a single numerical description (a probability distribution) of the entire input image; for example, AlexNet's ImageNet model outputs a 1000-dimensional vector representing the probability of the input image belonging to each class (softmax normalized).
For example: feeding the image below into AlexNet yields a 1000-dimensional output vector that represents the probability of the input image belonging to each class, with the "tabby cat" class receiving the highest probability.
[figure]
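The image-level pipeline just described can be sketched in a few lines of Python. This is a toy 5-class score vector standing in for AlexNet's 1000-dimensional output; the scores are made up for illustration.

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 5-class score vector standing in for AlexNet's 1000-dim output layer.
scores = [2.0, 1.0, 0.1, -1.0, 3.5]
probs = softmax(scores)              # softmax-normalized class probabilities
predicted = probs.index(max(probs))  # the single image-level label
```

The whole image gets exactly one label here, which is precisely the limitation the FCN addresses below.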
An FCN instead classifies the image at the pixel level, thereby solving segmentation at the semantic level. Unlike the classical CNN, which appends fully connected layers after the convolutional layers to obtain a fixed-length feature vector for classification (fully connected layers + softmax output), an FCN accepts an input image of arbitrary size. It uses a deconvolution layer to upsample the feature map of the last convolutional layer, restoring it to the same size as the input image, so that a prediction can be produced for every pixel while the spatial information of the original input is preserved; classification is then performed pixel by pixel on the upsampled feature map.
Finally, the softmax classification loss is computed pixel by pixel, which is equivalent to treating each pixel as one training sample. The following figure shows the structure of the fully convolutional network (FCN) that Long et al. proposed for semantic segmentation:
[figure]
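The deconvolution (transposed convolution) used for upsampling can be sketched in plain NumPy. This is a minimal illustration, not the FCN's learned layer: a 2x2 kernel of ones with stride 2, which simply replicates each input value into a 2x2 block.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride):
    # Minimal stride-s transposed convolution (no padding) on a 2-D map.
    H, W = x.shape
    kH, kW = kernel.shape
    out = np.zeros((stride * (H - 1) + kH, stride * (W - 1) + kW))
    for i in range(H):
        for j in range(W):
            # Each input pixel "stamps" a scaled copy of the kernel.
            out[i*stride:i*stride+kH, j*stride:j*stride+kW] += x[i, j] * kernel
    return out

feat = np.array([[1.0, 2.0],
                 [3.0, 4.0]])
k = np.ones((2, 2))                         # 2x2 kernel of ones
up = transposed_conv2d(feat, k, stride=2)   # 2x2 map -> 4x4 map
```

In the real FCN the kernel weights are learned, so the upsampling does more than replicate values, but the shape arithmetic is the same.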
Simply put, the difference between an FCN and a CNN is that the CNN's final fully connected layers are replaced with convolutional layers, so the output is a labeled image instead of a class score.
[figure]
In fact, a CNN's strength lies in its multi-layer structure, which learns features automatically and at multiple levels of abstraction: shallower convolutional layers have smaller receptive fields and learn features of local regions, while deeper convolutional layers have larger receptive fields and learn more abstract features. These abstract features are less sensitive to the size, position, and orientation of objects, which helps improve recognition performance. The following figure shows a CNN classification network:
[figure]
These abstract features are very helpful for classification and can reliably determine which categories of objects an image contains. However, because details of the objects are lost, they cannot delineate an object's precise outline or indicate which object each pixel belongs to, so accurate pixel-level segmentation remains very difficult.
Traditional CNN-based segmentation: to classify a pixel, an image patch around that pixel is fed into the CNN for training and prediction. This approach has several drawbacks. First, the storage cost is high: if, for example, a 15x15 patch is used for each pixel, and the window is slid continuously with every position fed to the CNN for classification, the required storage grows sharply with the number and size of the sliding windows. Second, computation is inefficient: adjacent patches overlap almost entirely, so computing the convolution patch by patch repeats much of the work. Third, the patch size limits the receptive field: a patch is usually much smaller than the whole image, so only local features can be extracted, which limits classification performance.
A fully convolutional network (FCN) instead recovers the category of every pixel from these abstract features; that is, it extends classification from the image level to the pixel level.
Fully connected -> Convolutional
The only difference between a fully connected layer and a convolutional layer is that the neurons in a convolutional layer connect only to a local region of the input, and neurons within a convolutional column share parameters. In both layer types, however, the neurons compute dot products, so their functional form is identical. The two can therefore be converted into each other:
For any convolutional layer, there is a fully connected layer that implements the same forward pass. Its weight matrix is a large matrix that is zero everywhere except in certain blocks (because of local connectivity), and within many of the blocks the elements are equal (because of parameter sharing).
Conversely, any fully connected layer can be converted into a convolutional layer. For example, a fully connected layer with K=4096 units looking at an input volume of size 7x7x512 can be equivalently expressed as a convolutional layer with F=7 and K=4096 filters: the filter size matches the input volume exactly, so the output is 1x1x4096, identical to the fully connected layer's output.
Converting fully connected layers into convolutional layers: of these two conversions, converting a fully connected layer into a convolutional layer is the more useful in practice. Suppose a convolutional network takes a 224x224 image and, through a series of convolution and downsampling layers, reduces it to an activation volume of size 7x7x512. The three fully connected layers that follow can then be converted as follows:
For the first fully connected layer, which looks at the [7x7x512] volume, set the filter size to F=7, so that the output volume is [1x1x4096].
For the second fully connected layer, set the filter size to F=1, so the output volume is [1x1x4096].
Do the same for the last fully connected layer, with F=1, giving a final output of [1x1x1000].
In practice, each of these conversions reshapes the weight matrix of a fully connected layer into the filters of a convolutional layer. What does this conversion buy us? Efficiency, in the following situation: sliding the convolutional network over a larger input image to obtain multiple outputs. The conversion lets us do this in a single forward pass.
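The weight-reshaping equivalence can be verified numerically. The sketch below uses shrunken, made-up sizes (8 channels and 16 units instead of 512 and 4096) but the same 7x7 spatial extent; it checks that a fully connected layer and its converted F=7 convolutional layer produce identical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the [7x7x512] volume -> 4096-unit FC layer.
C, H, W, N = 8, 7, 7, 16
feat = rng.standard_normal((C, H, W))
fc_weights = rng.standard_normal((N, C * H * W))

# Fully connected forward pass: flatten the volume, then matrix-multiply.
fc_out = fc_weights @ feat.reshape(-1)

# Equivalent convolution: reshape each FC row into a (C, 7, 7) filter and
# take its dot product with the whole volume -- an F=7 conv with no sliding.
conv_filters = fc_weights.reshape(N, C, H, W)
conv_out = np.array([(f * feat).sum() for f in conv_filters])
```

Both paths multiply the same weights against the same activations in the same order, so the outputs match exactly.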
An example: suppose we want to slide a 224×224 window over a 384×384 image with a stride of 32, feeding each position into the network, and finally obtain class scores at a 6×6 grid of positions. Converting the fully connected layers into convolutional layers, as above, makes this straightforward. If a 224×224 input yields a [7x7x512] volume after the convolution and downsampling layers, then a 384×384 image passed through the same convolution and downsampling layers yields a [12x12x512] volume. Passing that through the three convolutional layers converted from the three fully connected layers above gives a [6x6x1000] output ((12 − 7)/1 + 1 = 6). This result is exactly the score grid for the 6×6 window positions in the original image!
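The size arithmetic in this example follows from the standard convolution output-size formula; a quick check:

```python
def conv_output_size(n, f, stride=1, pad=0):
    # Standard convolution output-size formula: (N - F + 2P) / S + 1
    return (n - f + 2 * pad) // stride + 1

# A 224x224 crop reduces to 7x7x512, so the converted fc6 uses F=7:
assert conv_output_size(7, 7) == 1      # [1x1x4096] on the small input

# A 384x384 image reduces to 12x12x512 through the same layers:
assert conv_output_size(12, 7) == 6     # [6x6x...] on the large input

# That 6x6 grid matches sliding a 224 window over 384 with stride 32:
assert (384 - 224) // 32 + 1 == 6
```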
In other words, evaluating the original ConvNet (with FC layers) independently across 224×224 crops of the 384×384 image in strides of 32 pixels gives an identical result to forwarding the converted ConvNet one time.
As shown in the figure below, the FCN converts the fully connected layers of the traditional CNN into convolutional layers; concretely, the FCN transforms the CNN's last three fully connected layers into three convolutional layers. In the traditional CNN structure, the first five layers are convolutional layers, the 6th and 7th layers are one-dimensional vectors of length 4096, and the 8th layer is a one-dimensional vector of length 1000 corresponding to the probabilities of 1000 categories. The FCN expresses these three layers as convolutional layers whose filter sizes (channels, width, height) are (4096, 1, 1), (4096, 1, 1), and (1000, 1, 1) respectively. The numbers look unchanged, but convolution and fully connected computation are different concepts and calculation processes. The converted layers reuse the weights and biases that the CNN has already learned, the difference being that each weight and bias now has its own local scope, belonging to a particular convolution kernel. Since every layer in the network is now a convolutional layer, it is called a fully convolutional network.
[figure]
The figure below shows a fully convolutional network; its input image size differs from the figure above. In the CNN, the input image is first resized to 227x227; the feature map is 55x55 after the first layer's pooling, 27x27 after the second pooling, and 13x13 after the fifth layer's pooling. In the FCN, the input image has size H×W: in the figure, the map becomes 1/4 of the original size after the first pooling stage, 1/8 after the second, 1/16 after the fifth, and 1/32 at the eighth layer (correction: in the actual code the first pooling stage gives 1/2, and so on, halving at each stage).
[figure]
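The halving-per-pool behaviour noted in the correction can be tabulated with a short helper (assuming, for illustration, a 224-pixel input and five pooling stages that each halve the spatial size):

```python
def downsample_chain(size, stages=5):
    # Each pooling stage halves the spatial size (as in the actual FCN code).
    sizes = [size]
    for _ in range(stages):
        sizes.append(sizes[-1] // 2)
    return sizes

# A 224-pixel input: 1/2 after pool1, 1/4 after pool2, ..., 1/32 after pool5.
sizes = downsample_chain(224)   # [224, 112, 56, 28, 14, 7]
```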
After repeated convolution and pooling, the feature maps become smaller and smaller and their resolution lower and lower. When the map reaches H/32×W/32, the smallest stage, the resulting maps are the heatmaps discussed below.
[figure]
The final output, after upsampling to the original image size, is a set of 1000 heatmaps. To obtain a classification label for each pixel and thus the final semantically segmented image, there is a simple trick: go pixel by pixel and take the class whose heatmap has the maximum value (probability) at that pixel as the pixel's label. This yields a classified image, such as the segmented image of the dog and cat on the right of the figure below.
[figure]
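The pixel-wise argmax trick is a one-liner in NumPy. The sketch uses a made-up stack of 5 random score maps on a 4x4 grid in place of the 1000 upsampled heatmaps:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up stack of per-class score maps: 5 classes instead of 1000, 4x4 image.
num_classes, H, W = 5, 4, 4
heatmaps = rng.standard_normal((num_classes, H, W))

# Pixel-wise argmax over the class axis yields the segmentation label map.
label_map = heatmaps.argmax(axis=0)   # shape (4, 4), values in 0..4
```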
Upsampling
Compared with iterating the original, pre-conversion convolutional network over all 36 positions, a single forward pass of the converted network is far more efficient, because the 36 evaluations share computation. This trick is often used in practice to get better results: for example, an image is commonly resized to be larger, the converted network is used to evaluate class scores at many spatial positions, and these scores are then averaged.
Finally, what if we want to use a sliding window with a stride smaller than 32? This can be handled with multiple forward passes. For example, to obtain a stride of 16: first run the original image through the converted convolutional network; then translate the original image by 16 pixels along the width, then along the height, and finally along both width and height, and feed each translated image through the network as well.
[figure]
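That the shifted passes together cover every stride-16 window position can be checked with simple arithmetic (reusing the 224-window / 384-image numbers from earlier, and considering only the width dimension):

```python
# Window positions for a 224 window on a 384 image, stride 32:
positions_a = [32 * i for i in range((384 - 224) // 32 + 1)]        # 0..160
# Shift the image by 16 px first; the same stride-32 pass hits the gaps:
positions_b = [16 + 32 * i for i in range((384 - 16 - 224) // 32 + 1)]
merged = sorted(positions_a + positions_b)

# Together the two passes cover every stride-16 position:
assert merged == [16 * i for i in range((384 - 224) // 16 + 1)]
```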
As shown in the figure below, as a picture passes through the network and becomes smaller, its features become more salient, as the colors in the figure suggest. Of course, the last image in the figure is not actually a single-pixel image but an H/32×W/32 map of the original image; it is drawn as one pixel here for simplicity.
[figure]
As shown in the figure below, the original image first undergoes convolution conv1 and pooling pool1, shrinking to 1/2 of the original size; the second convolution conv2 and pooling pool2 shrink it to 1/4; the third convolution conv3 and pooling pool3 shrink it to 1/8, and the pool3 featureMap is saved; the fourth convolution conv4 and pooling pool4 shrink it to 1/16, and the pool4 featureMap is saved; finally, the fifth convolution conv5 and pooling pool5 shrink it to 1/32. The fully connected layers of the original CNN are then replaced by the convolution operations conv6 and conv7; these change the number of feature maps, but the spatial size remains 1/32 of the original image, and the resulting maps are no longer called featureMaps but heatMaps.
Now we have a 1/32-size heatMap, a 1/16-size featureMap, and a 1/8-size featureMap. If the 1/32 heatMap alone is upsampled, the restored image contains only the features captured by the conv5 kernels, and because of this limited precision the details of the image cannot be recovered well. So the network reaches back to earlier layers: the pool4 features are combined with the deconvolved (upsampled) map to add detail (similar to an interpolation step), and then the pool3 features are combined with a further upsampled map to add detail once more; this finally completes the restoration of the whole image.
[figure]
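The fusion of pool3, pool4, and the heatMap can be sketched in NumPy. This is only a shape-level illustration under simplifying assumptions: nearest-neighbour upsampling stands in for learned deconvolution, constant-valued single-channel maps stand in for real feature maps, and the score convolutions applied to pool3/pool4 in the actual network are omitted.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling, standing in for learned deconvolution.
    return x.repeat(2, axis=0).repeat(2, axis=1)

H = W = 32                                  # toy "original image" size
pool3 = np.full((H // 8,  W // 8),  3.0)    # saved 1/8-scale featureMap
pool4 = np.full((H // 16, W // 16), 4.0)    # saved 1/16-scale featureMap
heat  = np.full((H // 32, W // 32), 5.0)    # 1/32-scale heatMap after conv7

fused16 = upsample2x(heat) + pool4          # add pool4 detail (FCN-16s style)
fused8  = upsample2x(fused16) + pool3       # add pool3 detail (FCN-8s style)

out = fused8                                # final 8x upsample back to H x W
for _ in range(3):
    out = upsample2x(out)
```

The point of the sketch is the shape bookkeeping: each fusion step doubles the resolution and adds a saved feature map of matching size, and a final 8x upsample restores the input resolution.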
Drawbacks
Note the drawbacks of FCN:
The results are still not fine-grained enough. Upsampling by 8x is already much better than upsampling by 32x, but the upsampled output is still blurry and overly smooth, and insensitive to details in the image.
It classifies each pixel independently and does not fully account for the relationships between pixels. It ignores the spatial regularization step used in common pixel-level segmentation methods, so the results lack spatial consistency.