Deep neural network compression and acceleration in deep learning
1. Motivation
For deep neural networks, the number of parameters, the computational complexity, the data storage requirements, and the network depth and width all limit the deployment of deep neural networks on embedded and portable devices.
2. Acceleration and compression tasks
2.1 Convolutional layers: computation is time-consuming. Network storage can be compressed through weight-sharing strategies, and these layers are the main target for accelerating the network's computation.
2.2 Fully connected layers: because every neuron is connected to all neurons in the adjacent layer, the number of parameters is very large, so these layers are the main target for compressing the memory footprint of the model.
These two tasks involve many disciplines: machine learning, parameter optimization, computer architecture, data compression, indexing, and hardware acceleration.
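To make the convolutional-versus-fully-connected trade-off above concrete, here is a minimal back-of-the-envelope sketch in plain Python, with made-up layer sizes; it is illustrative only and ignores biases.

```python
# Rough parameter and multiply-accumulate (MAC) counts for a conv layer
# versus a fully connected layer, illustrating why conv layers dominate
# computation while FC layers dominate storage. Layer sizes are made up.

def conv_costs(c_in, c_out, k, h_out, w_out):
    params = c_out * c_in * k * k        # ignoring bias
    macs = params * h_out * w_out        # the same kernel is reused at every output position
    return params, macs

def fc_costs(n_in, n_out):
    params = n_in * n_out                # ignoring bias
    macs = params                        # each weight is used exactly once
    return params, macs

# A mid-network conv layer: 256 -> 256 channels, 3x3 kernel, 28x28 output.
print(conv_costs(256, 256, 3, 28, 28))  # ~0.59M params, ~462M MACs
# A classifier FC layer: 4096 -> 4096 neurons.
print(fc_costs(4096, 4096))             # ~16.8M params, ~16.8M MACs
```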
3. Mainstream approaches
The main methods for compressing and accelerating deep neural networks are: parameter pruning, parameter sharing, low-rank decomposition, compact convolution kernel design, and knowledge distillation.
Parameter pruning: designs a criterion for judging whether each parameter is important and removes the redundant ones. It can be applied to both convolutional and fully connected layers, and the network is compressed layer by layer as required.
Parameter sharing: exploits the redundancy among model parameters and uses techniques such as hashing or quantization to compress the weights. It can be applied to both convolutional and fully connected layers, the network is compressed layer by layer as required, and it supports either retraining from scratch or starting from a pre-trained network.
Low-rank decomposition: matrix or tensor decomposition techniques are used to estimate and factorize the original convolution kernels of the deep model. It can be applied to both convolutional and fully connected layers, supports end-to-end training on CPU/GPU, and can either be retrained from scratch or start from a pre-trained network.
Compact convolution kernels: specially structured convolution kernels or compact convolution computation units are designed to reduce the storage and computational cost of the model. This method applies only to convolutional layers. End-to-end training is possible on CPU/GPU, but only training from scratch is supported.
Knowledge distillation: a large network is trained first, and its knowledge is then transferred to a small, distilled model. It can be applied to convolutional and fully connected layers, but only training from scratch is supported.
4. Compression and acceleration algorithms for deep neural networks
4.1 Parameter pruning
4.1.1 Unstructured pruning
Network/parameter pruning removes redundant, uninformative weights from an existing well-trained deep network model, reducing the number of parameters and thereby accelerating computation and shrinking the storage footprint of the model. Pruning the network in this way can also alleviate over-fitting. Depending on whether an entire node or filter can be deleted at once, parameter pruning is subdivided into unstructured pruning and structured pruning. Unstructured pruning considers every element of every filter individually and sets redundant elements to 0, while structured pruning deletes entire filters, together with their structural information, directly.
LeCun et al. proposed optimal brain damage (OBD). This learning model mimics the biological learning process in mammals: it looks for the least-activated synaptic connections and then, in analogy to synaptic pruning, greatly reduces the number of connections.
Hassibi and Stork proposed the optimal brain surgeon (OBS) pruning strategy, which uses the second-order partial-derivative information of the weights (the Hessian matrix) obtained through back-propagation to construct a saliency score for each weight, and then removes the weights with low saliency.
Srinivas et al. proposed data-free pruning: a saliency measure is constructed and sorted directly from the weights, without relying on training data or back-propagation, and insignificant, redundant nodes are removed. Because it depends on neither training data nor back-propagated gradient information, the pruning process is fast.
Han Song et al. proposed a low-weight connection pruning strategy consisting of three stages: training the connectivity, pruning the connections, and retraining the weights. In the first stage the network is trained normally to learn which connections are important; in the second stage the magnitudes of the weights are examined and connections whose weights fall below a specified threshold are removed, turning the originally dense network into a sparse one; in the final stage the sparse network is retrained to recover the recognition accuracy of the network.
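As a rough illustration of the thresholding step in this three-stage pipeline, the PyTorch sketch below zeroes out low-magnitude weights and returns a binary mask; the threshold value and tensor shape are made up, and in practice the mask would be re-applied after every update during the retraining stage. This is a minimal sketch, not Han et al.'s actual implementation.

```python
import torch

def magnitude_prune(weight: torch.Tensor, threshold: float):
    """Zero out weights whose absolute value falls below the threshold.

    Returns the pruned weight and a binary mask; during retraining the
    mask is re-applied after each update so pruned weights stay at zero.
    """
    mask = (weight.abs() >= threshold).float()
    return weight * mask, mask

# Illustrative usage on a randomly initialized layer.
w = torch.randn(256, 512)
pruned, mask = magnitude_prune(w, threshold=0.5)
sparsity = 1.0 - mask.mean().item()
print(f"sparsity after pruning: {sparsity:.2%}")
```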
The pruning methods above usually introduce unstructured sparse connectivity, which causes irregular memory access during computation and hurts the computational efficiency of the network.
4.1.2 Structured pruning
In recent years, deep-network compression methods based on structured pruning have been proposed one after another to address the lack of acceleration caused by unstructured sparse connectivity. Their central idea is a filter-saliency criterion (i.e., a rule for identifying the least important filters), so that low-saliency filters can be removed directly and the computation of the network is sped up.
In 2016, Lebedev et al. proposed adding a structured sparsity term to the loss function of a conventional deep model, using stochastic gradient descent to optimize this structured-sparsity loss and driving filters below a given threshold to 0, so that entire zero-valued convolution filters can be deleted directly at test time.
Wen Wei et al. added regularization constraints on the filters, channels, filter shapes, and layer depth of the deep neural network to the loss function, applying structured sparsity learning to learn structured, compact convolution filters.
Zhou et al. added structured sparsity constraints to the objective function and used the forward-backward splitting method to solve the resulting sparsity-constrained optimization problem, determining the number of network nodes directly during training and removing the redundant ones.
In addition, criteria that judge filter saliency directly from the filter values themselves have also been proposed in recent years. For example, the filters of the current layer with the smallest L1 norms can be deleted directly, which removes the corresponding feature maps, and the number of input channels of the convolution filters in the next layer is reduced accordingly; the recognition accuracy of the model is then recovered by retraining. Since mainstream deep networks use a large number of ReLU nonlinear activation functions, the output feature maps are highly sparse.
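A minimal sketch of the L1-norm filter criterion just described, assuming a PyTorch-style weight tensor of shape (out_channels, in_channels, kH, kW); the keep ratio and layer size are illustrative, and pruning the next layer's input channels and retraining are left out.

```python
import torch

def l1_filter_ranking(conv_weight: torch.Tensor, keep_ratio: float = 0.7):
    """Rank the filters of a conv layer (out_ch, in_ch, kH, kW) by L1 norm
    and return the indices of the filters to keep.

    Dropping a filter removes the corresponding output feature map, so the
    next layer's input channels must be reduced to match before retraining.
    """
    norms = conv_weight.abs().sum(dim=(1, 2, 3))   # one L1 norm per filter
    n_keep = max(1, int(keep_ratio * conv_weight.size(0)))
    keep_idx = torch.argsort(norms, descending=True)[:n_keep]
    return torch.sort(keep_idx).values

w = torch.randn(64, 32, 3, 3)   # illustrative 64-filter conv layer
print(l1_filter_ranking(w, keep_ratio=0.5))
```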
Hu et al. exploited the sparsity of these post-ReLU outputs and used the non-zero ratio of the output feature map associated with each filter as a criterion for judging filter importance. Molchanov et al. at NVIDIA proposed a strategy based on a global search for salient filters: a candidate filter is set to 0, a Taylor expansion of the objective function is performed, and the filters whose removal changes the objective function the least are judged to have low saliency. From the way convolution is computed, a correspondence can be established between the filters of the current layer and the input channels of the convolution filters in the next layer, and this property can be exploited.
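The sketch below illustrates a zero-ratio criterion in the spirit of Hu et al.'s approach, computed on stand-in post-ReLU activations; the tensor shapes are made up, and accumulating the statistics over a real validation set is omitted.

```python
import torch

def zero_ratio_per_filter(feature_maps: torch.Tensor) -> torch.Tensor:
    """Given post-ReLU activations of shape (batch, channels, H, W),
    return the fraction of zero entries for each channel.

    Channels with a high zero ratio contribute little to later layers and
    are candidates for removal under a zero-ratio saliency criterion.
    """
    zeros = (feature_maps == 0).float()
    return zeros.mean(dim=(0, 2, 3))   # average over batch and spatial dims

acts = torch.relu(torch.randn(8, 64, 28, 28))   # stand-in activations
ratios = zero_ratio_per_filter(acts)
print(ratios.topk(5).indices)                   # the 5 most "dead" filters
```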
Luo et al. instead examined the importance of the input channels of the next layer's convolution kernels, rather than directly considering the current layer's filters, and established an effective channel-selection optimization function to remove redundant channels together with the corresponding filters of the current layer. These structured-pruning methods for deep networks delete entire filters of a convolutional layer and do not introduce any additional data formats for storage, so they compress the network directly while accelerating the computation of the whole network.
The drawback of parameter pruning is that simple unstructured pruning cannot accelerate the computation of sparse matrices. Although dedicated software and hardware have been developed in recent years to accelerate such computation, unstructured pruning schemes that depend on specialized hardware and software still cannot be used in all deep-learning frameworks, and the hardware dependence raises the cost of deploying the model. Structured pruning does not rely on special hardware or software support and can be embedded well into current mainstream deep-learning frameworks, but the fixed layer-by-layer manner of pruning leads to low adaptivity, efficiency, and effectiveness of network compression. In addition, the pruning strategies above require manually judging the sensitivity of each layer, which demands a great deal of analysis and layer-by-layer fine-tuning.
4.2 Parameter sharing
Parameter sharing designs a mapping so that multiple parameters share the same value. In recent years, quantization has been the most direct form of parameter sharing and has been widely used; in addition, hash functions and structured linear mappings can also serve as forms of parameter sharing. The following table will be updated from time to time.
Table: parameter-sharing algorithms (proposer, algorithm); entries omitted here.
The author has summarized, from a large body of literature, the development of the parameter-sharing idea and its remaining shortcomings; these are described in the full text.
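As one concrete (and simplified) instance of quantization-based parameter sharing, the sketch below clusters a layer's weights into a small codebook with scikit-learn's KMeans so that all weights in a cluster share one value; the layer size and cluster count are illustrative, and codebook fine-tuning is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_share_weights(weight: np.ndarray, n_clusters: int = 16):
    """Cluster a layer's weights into a small codebook so that all weights
    in a cluster share a single value (stored as an index plus the codebook).

    With 16 clusters each weight needs only a 4-bit index plus a tiny shared
    codebook, instead of a full 32-bit float.
    """
    flat = weight.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(flat)
    codebook = km.cluster_centers_.ravel()
    indices = km.labels_.reshape(weight.shape)
    shared = codebook[indices]          # reconstructed, weight-shared layer
    return shared, codebook, indices

w = np.random.randn(128, 256).astype(np.float32)   # illustrative FC weight
shared, codebook, idx = kmeans_share_weights(w, n_clusters=16)
print(codebook.shape, idx.shape)
```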
4.3 Low-rank decomposition
The core idea is to use matrix or tensor decomposition techniques to estimate and factorize the original convolution kernels of the deep model. The author has summarized, from a large body of literature, the development of low-rank decomposition and its remaining shortcomings; these are described in the full text and omitted here.
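A minimal NumPy sketch of the idea for a fully connected layer: the weight matrix is approximated by a truncated SVD so one layer becomes two thinner layers; the matrix size and rank are made up, and fine-tuning after the factorization is omitted.

```python
import numpy as np

def svd_factorize(weight: np.ndarray, rank: int):
    """Approximate an (m, n) weight matrix by the product of an (m, rank)
    and a (rank, n) matrix via truncated SVD.

    The single layer W is replaced by two thinner layers; storage and MACs
    drop from m*n to rank*(m + n) when rank << min(m, n).
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (m, rank)
    b = vt[:rank, :]             # (rank, n)
    return a, b

w = np.random.randn(1024, 4096).astype(np.float32)   # illustrative FC weight
a, b = svd_factorize(w, rank=128)
err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
print(a.shape, b.shape, f"relative error {err:.3f}")
```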
4.4 Compact convolution kernels
The standard convolution kernels of the deep neural network are replaced by compact filters to compress the deep network effectively.
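One widely used compact design is the depthwise separable convolution, sketched below in PyTorch; the channel counts are illustrative, and this is a generic MobileNet-style block rather than any specific paper's architecture.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A standard k x k convolution replaced by a depthwise convolution
    (one k x k filter per input channel) followed by a 1 x 1 pointwise
    convolution.

    Parameter count drops from c_in*c_out*k*k to c_in*k*k + c_in*c_out.
    """
    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)   # torch.Size([1, 64, 56, 56])
```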
4.5 Knowledge distillation
A softened softmax transform is used to learn the class distribution output by the teacher network, distilling the knowledge of the large teacher model into a smaller student model.
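A minimal sketch of a Hinton-style distillation loss in PyTorch: the student is trained against the teacher's temperature-softened output distribution plus the hard labels; the temperature, weighting, and logit shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Temperature-softened KL term against the teacher's output distribution
    plus a standard cross-entropy term on the hard labels.

    T and alpha are tunable hyperparameters; T*T rescales the gradient of
    the softened term.
    """
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

s = torch.randn(8, 10)              # student logits (stand-ins)
t = torch.randn(8, 10)              # teacher logits (stand-ins)
y = torch.randint(0, 10, (8,))
print(distillation_loss(s, t, y).item())
```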
4.6 Other methods
Global average pooling replaces the traditional three fully connected layers (see the sketch after this list)
Fast-Fourier-transform (FFT) based convolution
Fast convolution with the Winograd algorithm
Stochastic spatial sampling pooling
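As an illustration of the first item in this list, the sketch below replaces a large fully connected classifier head with a 1x1 projection followed by global average pooling; the channel count and number of classes are made up.

```python
import torch
import torch.nn as nn

# Replacing a large FC classifier head with global average pooling: the
# per-class score comes from averaging each feature map instead of
# flattening into a huge fully connected layer.
head = nn.Sequential(
    nn.Conv2d(512, 10, kernel_size=1),   # project features to one map per class
    nn.AdaptiveAvgPool2d(1),             # global average pooling
    nn.Flatten(),                        # (batch, 10) class scores
)
print(head(torch.randn(2, 512, 7, 7)).shape)   # torch.Size([2, 10])
```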