An Introduction to Artificial Intelligence: Convolutional Neural Networks
What convolution is, and the motivation for using it
Convolution operation
Convolution is a special kind of linear operation: a mathematical operation on two real-valued functions, usually denoted by the symbol ∗. Let us use an example in the spirit of Kalman filtering to introduce one-dimensional discrete convolution:
Suppose a reusable spacecraft is landing, and its sensors measure its altitude as time passes. We use h(i) to denote the altitude measurement at time i. The measurements are taken at a fixed frequency (one sample per time step), so h(i) is discrete. Because the sensor is noisy, the measurements are not exact, and we smooth them with a simple weighted average. Specifically, we assume that measurements taken closer to time i say more about the true altitude at time i, so we estimate s(i) = w_i h(i) + w_{i−1} h(i−1) + w_{i−2} h(i−2) + …, where the weights satisfy w_i > w_{i−1} > w_{i−2} > …. This is a one-dimensional discrete convolution. Since we cannot obtain "future measurements" in this example, it only covers half of the general case. The full formula for one-dimensional discrete convolution is:
s(i) = (h∗w)(i) = ∑_{j=−∞}^{∞} h(j) w(i−j)
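The formula above can be sketched directly in plain Python. This is a minimal illustration, not production code; the function and variable names are ours, not the author's:

```python
def conv1d(h, w):
    """Full 1-D discrete convolution: s(i) = sum_j h(j) * w(i - j)."""
    n = len(h) + len(w) - 1            # length of the full convolution
    s = [0.0] * n
    for i in range(n):
        for j in range(len(h)):
            if 0 <= i - j < len(w):    # keep only terms where w is defined
                s[i] += h[j] * w[i - j]
    return s

# Hypothetical altitude readings h(i) and decaying weights w_i, w_{i-1}, ...
heights = [10.0, 9.0, 8.5, 8.0]
weights = [0.5, 0.3, 0.2]
smoothed = conv1d(heights, weights)
```

Each output sample is a weighted average of recent inputs, exactly the smoothing described above.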
Here i denotes the state we are computing (a time, a position), and j indexes the offset from state i (a time difference, a spatial interval, etc.); h and w are two real-valued functions. In the terminology of convolutional neural networks, the first function h is called the input, the second function w is called the kernel function, and the output s is called the feature map. Of course, in practical examples j is generally neither negative nor large; it usually ranges over a small interval. In deep learning applications the input is usually a high-dimensional array (such as an image), and the kernel is likewise a high-dimensional array of parameters learned by an algorithm such as stochastic gradient descent. If the input is a 2-D image I, we also need a two-dimensional kernel K, and the two-dimensional convolution can be written as:
S(m, n) = (I∗K)(m, n) = ∑_i ∑_j I(i, j) K(m−i, n−j)
Here (m, n) is the pixel position being computed, and (i, j) ranges over the positions being considered. More intuitively, the two-dimensional convolution slides the kernel window across the image, multiplying and accumulating at each position.
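A minimal pure-Python sketch of the 2-D formula, implemented the usual way as cross-correlation with a flipped kernel (the names are ours, for illustration only):

```python
def conv2d(image, kernel):
    """Valid 2-D convolution: S(m,n) = sum_{i,j} I(i,j) K(m-i, n-j).

    Flipping the kernel 180 degrees turns the formula into a simple
    sliding multiply-accumulate over the image.
    """
    kh, kw = len(kernel), len(kernel[0])
    flipped = [row[::-1] for row in kernel[::-1]]   # rotate kernel 180°
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for m in range(out_h):
        for n in range(out_w):
            for i in range(kh):
                for j in range(kw):
                    out[m][n] += image[m + i][n + j] * flipped[i][j]
    return out
```

For example, a 1×1 kernel `[[2]]` simply doubles every pixel, while larger kernels produce a smaller "valid" output, as discussed under padding later in this article.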
Motivation for convolution
Having answered what convolution is, let us see why we use this particular linear operation. First, the definition of a convolutional neural network:
A convolutional neural network is a neural network that uses at least one layer of convolution operations in place of ordinary matrix multiplication operations in the network.
We know that a fully connected layer essentially multiplies its inputs by weights and accumulates the results, i.e., it is a matrix multiplication. The convolutional layer simply replaces that matrix multiplication with the convolution operation. The point of doing so is to improve a machine learning system through three ideas:
Sparse interactions
Parameter sharing
Equivariant representations
Sparse interactions
In an ordinary fully connected network, the nodes of adjacent layers are fully connected:
But in a convolutional network, each node of the next layer is connected only to the nodes covered by its convolution kernel:
An intuitive benefit of sparse connections is that the network has fewer parameters. Take a 200×200 grayscale image as an example. Fed into a fully connected neural network, it looks like this:
Assuming the first hidden layer of this network has 40,000 neurons (40,000 hidden nodes is a reasonable scale when the input is 40,000-dimensional), this single layer alone has 40,000 × 40,000 = 1.6 billion weight parameters. Training such a model is computationally expensive and requires a huge amount of storage.
Regarding the convolutional network, the situation is as follows:
Here we still use 40,000 hidden-layer neurons, and our convolution kernel (also called a filter) has size 10×10. Counting the sparse connections, such a convolutional layer has only about 40,000 × 100 = 4,000,000 parameters, far fewer than the fully connected network.
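The arithmetic behind this comparison can be checked directly. This is a back-of-the-envelope sketch: biases are ignored, and the second count reflects sparse connectivity alone, before parameter sharing (discussed next) shrinks it further:

```python
# Fully connected layer: every one of the 40,000 input pixels connects to
# every one of the 40,000 hidden units.
inputs = 200 * 200              # 200x200 grayscale image -> 40,000 pixels
hidden = 40_000
fc_params = inputs * hidden     # 1,600,000,000 weights

# Sparsely connected layer: each hidden unit sees only one 10x10 patch.
# With parameter sharing this would drop all the way to 10 * 10 = 100.
conv_params = hidden * (10 * 10)  # 4,000,000 weights

print(f"fully connected: {fc_params:,}")
print(f"sparse 10x10:    {conv_params:,}")
```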
Readers may wonder: the output of a convolution depends only on part of the input. If some feature is not local but depends on the whole input, isn't the convolutional representation inadequate? Not really. Modern convolutional networks stack multiple convolutional layers. The direct connections in a convolutional network are sparse, but units in deeper layers can be indirectly connected to all or most of the input image, as shown below:
Tip: in the convolutional-network literature, the terms kernel and filter both refer to the same thing, the kernel function. In this article we refer to it uniformly as the convolution kernel.
Parameter sharing
The convolution kernel holds the actual parameters of a convolutional network. The kernel slides as a window over the input image, which means all pixels of the input image share the same set of parameters, as shown in the following figure:
Parameter sharing in a convolutional network means we only need to learn one set of parameters, rather than a separate set for each pixel position, which greatly reduces the storage the model requires.
Equivariant representation
Because the whole input image shares one set of parameters, the model is equivariant to certain translations of features in the image. So what is equivariance?
Suppose the functions f(x) and g(x) satisfy:
f(g(x)) = g(f(x))
Then we say the function f is equivariant to the transformation g. Here, translation plays the role of g: if we translate the input, the representation in the output translates by the same amount. This property is useful for detecting common structures (such as edges) in the input, especially in the first few layers of a convolutional neural network (near the input layer).
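The equivariance property can be checked numerically. The sketch below uses circular shifts and circular convolution, an assumption we add so the identity holds exactly at the borders (ordinary zero-padded convolution is only equivariant away from the edges):

```python
def roll(x, k):
    """Circularly shift a 1-D signal right by k positions."""
    k %= len(x)
    return x[-k:] + x[:-k]

def circ_conv(x, w):
    """Circular 1-D convolution; commutes exactly with roll()."""
    n = len(x)
    return [sum(x[(i - j) % n] * w[j] for j in range(len(w)))
            for i in range(n)]

x = [1, 2, 3, 4, 5]
w = [0.5, 0.5]

# f(g(x)) == g(f(x)): shifting then filtering equals filtering then shifting
a = circ_conv(roll(x, 2), w)
b = roll(circ_conv(x, w), 2)
print(a == b)  # True
```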
Convolutional neural network
The figure below shows a typical convolutional neural network layer (the convolutional layer for short). A traditional convolutional layer contains the following three components:
Convolution operation
Activation function (nonlinear transformation)
Pooling
The activation function plays the same role here as in a fully connected network; ReLU is the most commonly used. Let us discuss pooling in more detail.
Pooling
Pooling is usually described by a pooling function, which is defined as follows: it replaces the value at a location with a summary statistic of the neighboring locations. The idea is somewhat similar to a sliding window in time-series problems. The figure below shows one pooling approach, max pooling:
The figure above shows 2×2 max pooling with a stride of 2: a 2×2 window slides over the input image in steps of 2, and at each position the maximum of the elements inside the window is output. It is easy to see that such a pooling function "tightens" the input size without introducing any extra parameters. Because pooling shrinks the input, the subsequent convolutional layers need fewer parameters, so the total parameter count of the network drops further after pooling. The formulas for the output size of pooling are as follows.
Assume the input size is w × h × d, the stride is s, and the window size is f × f. Then the width, height, and depth of the output are:
Wout = (w − f)/s + 1
Hout = (h − f)/s + 1
Dout = d
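A minimal pure-Python sketch of max pooling, matching the output-size formulas above (the function name is ours, for illustration):

```python
def max_pool2d(image, f=2, stride=2):
    """Max pooling with an f x f window and the given stride."""
    out_h = (len(image) - f) // stride + 1      # Hout = (h - f)/s + 1
    out_w = (len(image[0]) - f) // stride + 1   # Wout = (w - f)/s + 1
    return [[max(image[m * stride + i][n * stride + j]
                 for i in range(f) for j in range(f))
             for n in range(out_w)]
            for m in range(out_h)]

grid = [[1, 3, 2, 4],
        [5, 6, 7, 8],
        [3, 2, 1, 0],
        [1, 2, 3, 4]]
print(max_pool2d(grid))  # [[6, 8], [3, 4]]
```

A 4×4 input becomes a 2×2 output: each non-overlapping 2×2 window is replaced by its maximum, and no parameters are introduced.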
The commonly used pooling functions are max pooling and average pooling, which take the maximum or the mean of the adjacent region, respectively. Regardless of which is used, pooling is invariant to small translations of the target in the input image: if the target object shifts by a small amount, the output of the pooling function does not change. When we apply pooling to the output of a convolution, since each convolution learns a separate feature (the lower layers learn various edge features, for instance), adding a pooling function lets the network further learn which transformations (translation, rotation, etc.) its representation should be invariant to.
Some details of the convolution
We now have a rough understanding of what convolution is. In convolutional neural networks, the convolution computation has a few details worth considering.
Padding and input/output sizes
First, the size transformation from input to output. As with pooling above, assume the input size is w×h×d, the convolution stride is s, and the kernel size is f×f. Convolutional networks commonly use a technique called padding. If we do not want the convolution kernel to slide past the border of the image, we use "valid" padding: letting p be the number of pixels of padding, valid padding means p = 0 when handling borders. But in the first few layers of a convolutional network we want to preserve as much of the raw input as possible so that we can extract low-order features; that is, we want to apply a convolutional layer while keeping the output the same width and height as the input. To achieve this, we pad the border of the input with rings of zeros so that the output of the convolution has the same width and height as the input; this is called "same" padding. The width, height, and depth of the output are computed as:
Wout = (w − f + 2p)/s + 1
Hout = (h − f + 2p)/s + 1
Dout = k
where k denotes the depth of the convolution kernel, discussed next.
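The three formulas can be bundled into a small helper (the name is illustrative; integer division assumes (w − f + 2p) divides evenly by s):

```python
def conv_output_size(w, h, f, s, p, k):
    """Output (width, height, depth) of a conv layer per the formulas above."""
    w_out = (w - f + 2 * p) // s + 1
    h_out = (h - f + 2 * p) // s + 1
    return (w_out, h_out, k)

# Same padding: with stride 1, choosing p = (f - 1) // 2 preserves width/height.
print(conv_output_size(32, 32, 3, 1, 1, 16))  # (32, 32, 16)

# Valid padding (p = 0): the output shrinks by f - 1 in each dimension.
print(conv_output_size(32, 32, 5, 1, 0, 6))   # (28, 28, 6)
```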
Depth of convolution kernel
In general, we use multiple convolution kernels, as shown in the following figure:
Different kernels learn different features: some may learn color features, others edge and shape features. The figure below visualizes the kernels of one layer of a trained convolutional neural network (Krizhevsky et al.).
The number of convolution kernels is called the depth of the convolution kernel.
LeNet
The figure below shows LeNet, a convolutional network proposed by LeCun et al. in 1998 for handwriting recognition. Its overall structure is as follows:
We start from LeNet to understand the design of convolutional networks. As the figure shows, a convolutional network usually has a pyramidal structure: as the number of layers increases, the depth of the output gradually increases, while pooling, valid padding, and large strides are used to reduce the width and height of the output. Kernel sizes are also chosen deliberately. In general, the convolutional layers near the input use larger kernels (such as 7×7) to reduce the size of the output, while later convolutional layers use smaller kernels (such as 3×3) to refine the feature representation.
The tail of a convolutional network resembles a feedforward neural network: we flatten the output of the last convolutional layer into a vector and feed it into a multilayer perceptron. For classification problems we still use cross-entropy as the loss function, and train the parameters of the entire network with an algorithm such as stochastic gradient descent.