Deep learning convolutional neural network
A deep learning convolutional neural network (CNN) is a kind of artificial neural network that has become a research hotspot in speech analysis and image recognition. Its weight-sharing network structure makes it more similar to a biological neural network, reducing the complexity of the network model and the number of weights. This advantage is especially evident when the network input is a multi-dimensional image: the image can be fed directly into the network, avoiding the complicated feature extraction and data reconstruction required by traditional recognition algorithms. A convolutional network is a multi-layer perceptron specially designed to recognize two-dimensional shapes, and its structure is highly invariant to translation, scaling, tilting, and other common forms of deformation.
CNNs were influenced by the earlier time-delay neural network (TDNN), which reduces learning complexity by sharing weights along the time dimension and is well suited to processing speech and other time-series signals.
The CNN was the first learning algorithm to successfully train a multi-layer network structure. It exploits spatial relationships to reduce the number of parameters that must be learned, improving on the training performance of the ordinary feed-forward BP algorithm. CNNs were proposed as a deep learning architecture to minimize the preprocessing required of the data. In a CNN, a small portion of the image (the local receptive region) serves as the input to the lowest layer of the hierarchy, and information is passed upward through the layers, each of which applies digital filters to extract the most salient features of the observed data. Because the local receptive region gives each neuron or processing unit access to elementary features such as oriented edges or corner points, this approach can acquire salient features of the observed data that are invariant to translation, scaling, and rotation.
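To make the "digital filter" idea concrete, here is a minimal sketch (not from the original article) of a single kernel sliding over an image; the 3x3 vertical-edge kernel and all names are illustrative choices:

```python
# Illustrative sketch: one small kernel slides over the image and
# responds to one local feature (here, a vertical edge).
import numpy as np

def conv2d_valid(image, kernel):
    """Valid 2D convolution: each output neuron sees only a local
    receptive field of the input, with the same kernel everywhere."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.zeros((8, 8))
image[:, 4:] = 1.0                          # step edge in the middle
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])  # responds to vertical edges
print(conv2d_valid(image, edge_kernel))     # strong response at the edge
```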
1) History of deep learning convolutional neural networks
In 1962, Hubel and Wiesel proposed the concept of the receptive field by studying the visual cortical cells of cats. In 1984, the Japanese scholar Fukushima proposed the neocognitron based on the receptive-field concept; it can be regarded as the first implementation of a convolutional neural network and was the first application of the receptive-field concept in artificial neural networks. The neocognitron decomposes a visual pattern into a number of sub-patterns (features) and processes them in hierarchically connected feature planes. It attempts to model the visual system so that recognition can still be completed even when the object is displaced or slightly deformed.
A neocognitron typically contains two types of neurons: S-cells, which perform feature extraction, and C-cells, which provide tolerance to deformation. An S-cell involves two important parameters, the receptive field and a threshold: the former determines the number of input connections, while the latter controls the degree of response to its feature sub-pattern. Many scholars have worked on improving the neocognitron. In the traditional neocognitron, the visual blur that the C-cells induce over the photosensitive region of each S-cell is normally distributed. If the blur produced at the edge of the photosensitive region is larger than at the center, the S-cell gains greater tolerance to the deformations that this non-normal blur represents. What we hope to obtain is that the difference in effect between the training pattern and a deformed stimulus pattern grows larger at the edge of the receptive field than at its center. To form such a non-normal blur effectively, Fukushima proposed an improved neocognitron with a double C-cell layer.
Van Ooyen and Niehuis introduced a new parameter to improve the neocognitron's discriminative ability. This parameter effectively acts as an inhibitory signal that suppresses neuron excitation caused by repeatedly presented features. Most neural networks memorize training information in their weights; according to the Hebbian learning rule, the more often a feature is trained, the more easily it is detected during later recognition. Some scholars have also combined evolutionary computation with the neocognitron: by weakening the training of repeatedly presented features, the network is made to attend to distinct features, which helps improve its discriminative ability. All of the above traces the development of the neocognitron; a deep learning convolutional neural network can be regarded as a generalized form of the neocognitron, and the neocognitron as a special case of the convolutional neural network.
2) Network structure of deep learning convolutional neural network
A deep learning convolutional neural network is a multi-layer neural network in which each layer consists of multiple two-dimensional planes, and each plane consists of multiple independent neurons.
Figure: Conceptual demonstration of a deep learning convolutional neural network. The input image is convolved with three trainable filters and an additive bias; after convolution, three feature maps are produced at the C1 layer. Then each group of four pixels in a feature map is summed, weighted, and offset, and the three S2-layer feature maps are obtained through a sigmoid function. These maps are filtered again to obtain the C3 layer, and the same hierarchy produces S4 in the same way as S2. Finally, the pixel values are rasterized and concatenated into a vector that is fed into a traditional neural network to produce the output.
In general, a C layer is a feature extraction layer: the input of each neuron is connected to a local receptive field of the previous layer to extract a local feature, and once the local feature is extracted, its positional relationship to the other features is determined as well. An S layer is a feature mapping layer: each computational layer of the network is composed of multiple feature maps, each feature is mapped to a plane, and all neurons on the plane share equal weights. The feature mapping structure uses the sigmoid function, which has a small influence-function kernel, as the activation function of the convolutional network, so that the feature maps are displacement-invariant.
In addition, since the neurons on a mapping plane share weights, the number of free parameters in the network is reduced, which in turn reduces the complexity of network parameter selection. Each feature extraction layer (C layer) in a deep learning convolutional neural network is followed by a computational layer (S layer) that performs local averaging and secondary extraction. This distinctive two-stage feature extraction structure gives the network high tolerance to distortion in the input samples during recognition.
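As a rough sketch of the C/S pair just described (convolution, then a 2x2 sum scaled by one trainable coefficient, offset by one trainable bias, and squashed by a sigmoid), the following assumes NumPy; the shapes and all names are purely illustrative:

```python
# Minimal sketch of a C layer followed by an S layer, as described above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def c_layer(image, kernels, biases):
    """Convolution layer: one feature map per kernel (valid convolution)."""
    kh, kw = kernels[0].shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    maps = []
    for k, b in zip(kernels, biases):
        fmap = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                fmap[i, j] = np.sum(image[i:i+kh, j:j+kw] * k) + b
        maps.append(fmap)
    return maps

def s_layer(maps, coeffs, biases):
    """Subsampling layer: non-overlapping 2x2 sum -> w * sum + b -> sigmoid."""
    out = []
    for fmap, w, b in zip(maps, coeffs, biases):
        h, half_w = fmap.shape[0] // 2, fmap.shape[1] // 2
        pooled = fmap[:h*2, :half_w*2].reshape(h, 2, half_w, 2).sum(axis=(1, 3))
        out.append(sigmoid(w * pooled + b))
    return out

rng = np.random.default_rng(0)
img = rng.random((28, 28))
kernels = [rng.standard_normal((5, 5)) for _ in range(6)]
c1 = c_layer(img, kernels, biases=np.zeros(6))           # six 24x24 maps
s2 = s_layer(c1, coeffs=np.ones(6), biases=np.zeros(6))  # six 12x12 maps
print(s2[0].shape)                                       # (12, 12)
```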
3) About parameter reduction and weight sharing
As mentioned above, the power of a CNN seems to lie in reducing the number of parameters the neural network must train, through receptive fields and weight sharing. What does that mean exactly?
In the figure's left panel: if we have a 1000x1000-pixel image and 1 million hidden-layer neurons, then fully connecting them (every hidden neuron connected to every pixel of the image) gives 1000x1000x1000000 = 10^12 connections, i.e., 10^12 weight parameters. However, the spatial structure of an image is local: just as a human perceives the outside world through local receptive fields, each neuron does not need to see the whole image. Each neuron senses only a local image region, and at a higher level these neurons with different local views can be combined to obtain global information. In this way we reduce the number of connections, that is, the number of weight parameters the network must train. As shown in the following figure: if the local receptive field is 10x10, each hidden neuron only needs to connect to a 10x10 image patch, so 1 million hidden neurons give only 100 million connections, i.e., 10^8 parameters. That is four orders of magnitude fewer than before, so training is much less laborious, but it still feels like a lot. Is there anything more we can do?
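The two counts above can be checked with a few lines of arithmetic (illustrative only):

```python
# Sanity check of the fully connected vs. locally connected counts.
image_pixels = 1000 * 1000       # 10^6 input pixels
hidden_neurons = 1_000_000       # 10^6 hidden neurons

fully_connected = image_pixels * hidden_neurons
print(f"fully connected: {fully_connected:.0e} weights")      # 1e+12

receptive_field = 10 * 10        # each neuron sees a 10x10 patch
locally_connected = receptive_field * hidden_neurons
print(f"locally connected: {locally_connected:.0e} weights")  # 1e+08
```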
We know that each hidden neuron connects to a 10x10 image region, i.e., each neuron has 10x10 = 100 connection weights. What if all neurons shared the same 100 parameters? That is, every neuron convolves the image with the same convolution kernel. How many parameters would we have then? Only 100! No matter how many neurons the hidden layer has, the connection between the two layers needs only 100 parameters. This is weight sharing, and it is the main selling point of deep learning convolutional neural networks. You may ask: is this reliable, and why does it work? That is something we will study together.
Ok, you will think: isn't this a poor way to extract features, since it extracts only one feature? Exactly right, we need to extract multiple features. One filter, i.e., one convolution kernel, extracts one feature of the image, such as an edge in a certain direction. To extract different features, we simply add more filters. So suppose we add up to 100 filters, each with different parameters, each responding to a different feature of the input image, such as edges in different directions. Convolving the image with each filter yields a map of one feature of the image, which we call a Feature Map. So 100 convolution kernels produce 100 Feature Maps, and these 100 Feature Maps make up one layer of neurons. Now it is clear how many parameters this layer has: 100 convolution kernels x 100 shared parameters per kernel = 100x100 = 10,000 parameters. Only 10,000! See the following figure: different colors represent different filters.
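Again as a quick illustrative check of the shared-weight count:

```python
# Weight sharing with multiple filters (illustrative arithmetic).
receptive_field = 10 * 10   # weights per kernel, shared by all neurons
num_filters = 100           # one Feature Map per filter

weights = num_filters * receptive_field
print(weights)              # 10000 shared weights
# Plus one shared bias per filter (see the note on offsets below): +100.
```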
Oh, I skipped a question. We said the number of parameters in the hidden layer is independent of the number of neurons in that layer and depends only on the size and number of the filters. So what determines the number of hidden-layer neurons? It depends on the size of the original image (the number of input neurons), the size of the filter, and the stride with which the filter slides across the image. For example, if my image is 1000x1000 pixels and the filter is 10x10, and the filters do not overlap (i.e., the stride is 10), then the number of hidden neurons is (1000x1000) / (10x10) = 100x100 neurons. Now suppose the stride is 8, so neighboring kernel positions overlap by two pixels; I won't grind through the arithmetic here (the short sketch below works it out), the idea should be clear. Note that this is for one filter, i.e., the neuron count of one Feature Map; with 100 Feature Maps it is 100 times as many. Evidently, the larger the image, the larger the gap between the number of neurons and the number of weight parameters that need to be trained.
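For the overlapping case left unfinished above, the usual output-size formula (assuming no padding) fills the gap; the function name is ours:

```python
# Output positions per dimension: floor((n - k) / stride) + 1.
def output_side(n, k, stride):
    return (n - k) // stride + 1

print(output_side(1000, 10, 10))  # 100 -> 100x100 neurons per Feature Map
print(output_side(1000, 10, 8))   # 124 -> the overlapping-stride case above
```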
Note that the discussion above ignores the bias term of each neuron. With the bias, the number of weights per filter increases by one, and this bias, too, is shared by all neurons using the same filter.
In short, the core idea of the convolutional network is to combine three structural ideas, local receptive fields, weight sharing (weight replication), and temporal or spatial subsampling, to obtain some degree of invariance to displacement, scale, and deformation.
4) A typical example description
A typical convolutional network for digit recognition is LeNet-5 (see its published results and paper). Most banks in the United States used it to recognize handwritten digits on checks. Reaching that commercial level says enough about its accuracy, especially given how contested the combination of academia and industry is these days.
Let's use this example to illustrate below.
LeNet-5 has 7 layers, not counting the input, and each layer contains trainable parameters (connection weights). The input image is 32*32, which is larger than the largest character in the MNIST database (a well-known handwritten digit database). The reason is the hope that potential distinctive features, such as stroke endpoints or corner points, can appear at the center of the receptive field of the highest-level feature detectors.
Let's first make one thing clear: each layer has multiple Feature Maps, each Feature Map extracts one feature of the input through a convolution filter, and each Feature Map contains multiple neurons.
The C1 layer is a convolutional layer (why convolution? An important property of the convolution operation is that it can enhance the original signal's features and reduce noise), consisting of six feature maps. Each neuron in a feature map is connected to a 5*5 neighborhood of the input. The feature maps are 28*28, which keeps the input connections from falling outside the boundary (so that BP feedback can be computed without gradient loss; a personal opinion). C1 has 156 trainable parameters (each filter has 5*5 = 25 unit parameters and one bias, and there are 6 filters, for (5*5+1)*6 = 156 parameters in total) and 156*(28*28) = 122,304 connections.
The S2 layer is a downsampling layer (why downsample? By the principle of local image correlation, subsampling an image reduces the amount of data to process while retaining useful information). It has six 14*14 feature maps. Each unit in a feature map is connected to a 2*2 neighborhood of the corresponding feature map in C1. The four inputs of each S2 unit are summed, multiplied by a trainable coefficient, and added to a trainable bias; the result is passed through the sigmoid function. The trainable coefficient and bias control the degree of nonlinearity of the sigmoid. If the coefficient is small, the unit operates nearly linearly, and the subsampling is equivalent to blurring the image; if the coefficient is large, the subsampling can be viewed as a noisy OR or a noisy AND operation depending on the bias. The 2*2 receptive fields of the units do not overlap, so each feature map in S2 is 1/4 the size of the corresponding map in C1 (halved in each row and column). The S2 layer has 12 trainable parameters and 5,880 connections.
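The C1 and S2 figures quoted above can be verified with a short illustrative computation:

```python
# Check of the LeNet-5 C1/S2 parameter and connection counts quoted above.
k, n_maps, out_hw = 5, 6, 28

c1_params = (k * k + 1) * n_maps                 # 25 weights + 1 bias per filter
c1_connections = c1_params * out_hw * out_hw
print(c1_params, c1_connections)                 # 156 122304

s2_params = 2 * n_maps                           # one coefficient + one bias per map
s2_out = out_hw // 2                             # 14
s2_connections = (2 * 2 + 1) * n_maps * s2_out * s2_out  # 4 inputs + bias per unit
print(s2_params, s2_connections)                 # 12 5880
```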
Figure: The convolution and subsampling process. Convolution: the input image is convolved with a trainable filter fx (at the first stage the input is the image; at later stages it is a convolutional feature map) and a bias bx is added, producing the convolutional layer Cx. Subsampling: the four pixels of each neighborhood are summed into one pixel, weighted by a scalar Wx+1, offset by a bias bx+1, and passed through a sigmoid activation function, producing a feature map Sx+1 that is roughly four times smaller.
Therefore, the mapping from one plane to the next can be regarded as a convolution operation, and the S layer can be regarded as a blurring filter that performs secondary feature extraction. From hidden layer to hidden layer, the spatial resolution decreases while the number of planes per layer increases, which allows more feature information to be detected.
The C3 layer is also a convolutional layer. It again convolves layer S2 with 5x5 convolution kernels, so each resulting feature map has only 10x10 neurons, but there are 16 different kinds of them.