A Survey of Model Compression Methods in AI Deep Learning
Foreword
At present, work in deep learning falls into two camps. One is academic: it explores powerful, complex network architectures and experimental methods in pursuit of higher accuracy. The other is engineering: it aims to make algorithms more stable and efficient, with efficiency on the target hardware platform as the primary goal. Although complex models perform better, their large storage footprint and heavy computational cost are the main reasons they are hard to deploy effectively on many hardware platforms.
Recently I have been following methods for compressing deep learning models, and I have noticed more and more discussion of model compression, from theoretical analysis to platform-level implementation, with substantial progress on both fronts.
In 2015, Han published Deep Compression, a paper on model compression that combined pruning, weight sharing with quantization, and coding. It achieved excellent results, won a best paper award at ICLR 2016, and set off a wave of research into model compression methods. In fact, model compression can be traced back to 1989, when LeCun proposed Optimal Brain Damage (OBD) to remove unimportant parameters from a network and thereby shrink its size. It is remarkable to consider that at a time when deep networks could hardly be trained and the field was nowhere near as flourishing as today, LeCun was already thinking about how to prune networks; that was real foresight. Many of today's pruning schemes are still based on his OBD method.
At present, research on compressing deep learning models falls mainly into the following directions:
More compact model design. Many modern networks are modular in design and large in both depth and width, which produces considerable parameter redundancy, so there has been much work on model design itself, such as SqueezeNet and MobileNet. With more careful and efficient model design, model size can be greatly reduced while still achieving good performance (a sketch of MobileNet's basic building block follows this list).
Model pruning. A complex network performs well, but its parameters are redundant. So, for a network that has already been trained, one can look for an effective criterion to judge importance and prune unimportant connections or filters to reduce the model's redundancy.
Kernel sparsification. During training, the weight updates are regularized so that the weights become sparser. A sparse matrix admits more compact storage formats such as CSC (compressed sparse column), but sparse matrix operations are not very efficient on hardware platforms and are easily bandwidth-bound, so the speedup is not obvious.
In addition, quantization, low-rank decomposition, transfer learning, and other methods have also been studied and have proved very effective for model compression.
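To make the first direction concrete, here is a minimal PyTorch sketch of the depthwise separable convolution that MobileNet is built from. The class name and channel sizes are illustrative assumptions, not taken from any of the papers discussed here.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """MobileNet-style block: a per-channel (depthwise) 3x3 convolution
    followed by a 1x1 (pointwise) convolution, which together need far
    fewer parameters than one standard 3x3 convolution."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

# Parameter comparison with a standard 3x3 convolution:
std = nn.Conv2d(128, 256, 3, padding=1, bias=False)  # 128*256*9 = 294,912 weights
dws = DepthwiseSeparableConv(128, 256)               # ~34k convolution weights
print(sum(p.numel() for p in std.parameters()),
      sum(p.numel() for p in dws.parameters()))
```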
Kernel sparsification
Kernel sparsification applies regularization to the weights during training so that they become sparser, driving most of the weights to zero. Sparsification methods divide into structured (regular) and unstructured (irregular). Structured sparsity is easier to prune afterwards; in particular, for the matrix produced by im2col, sparsifying whole rows or columns gives higher efficiency. Unstructured sparsity requires a special storage format for the parameters, or support for a sparse matrix operation library on the target platform.
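Before looking at the papers, here is a minimal sketch of the basic idea: adding an L1 penalty to the task loss during training to push weights toward zero. The model, data, and λ value are placeholders for illustration.

```python
import torch
import torch.nn.functional as F

def l1_penalty(model, lam=1e-4):
    """Sum of absolute weight values; added to the task loss, this
    drives many weights toward exactly zero during training."""
    return lam * sum(p.abs().sum() for p in model.parameters())

model = torch.nn.Linear(512, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 512), torch.randint(0, 10, (64,))
for step in range(200):
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y) + l1_penalty(model)
    loss.backward()
    opt.step()
print("near-zero weights:", (model.weight.abs() < 1e-3).float().mean().item())
```

The following papers can be consulted: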
Learning Structured Sparsity in Deep Neural Networks
The authors of this paper propose Structured Sparsity Learning (SSL), which learns a sparse structure to reduce computational cost; the learned structured sparsity can be exploited effectively for hardware acceleration. Traditional unstructured random sparsity leads to irregular memory access and therefore cannot be accelerated effectively on hardware platforms such as GPUs. The authors add a group lasso regularization term to the network's objective function, which can achieve filter-level, channel-level, and shape-level sparsity. All sparsification is driven by the loss function below, where R_g is the group lasso term:
$$E(W) = E_D(W) + \lambda R(W) + \lambda_g \sum_{l=1}^{L} R_g\!\left(W^{(l)}\right), \qquad R_g(w) = \sum_{g=1}^{G} \left\lVert w^{(g)} \right\rVert_2$$

Then the filter-wise and channel-wise form:

$$E(W) = E_D(W) + \lambda_n \sum_{l=1}^{L} \sum_{n_l=1}^{N_l} \left\lVert W^{(l)}_{n_l,:,:,:} \right\rVert_2 + \lambda_c \sum_{l=1}^{L} \sum_{c_l=1}^{C_l} \left\lVert W^{(l)}_{:,c_l,:,:} \right\rVert_2$$

And the shape-wise form:

$$E(W) = E_D(W) + \lambda_s \sum_{l=1}^{L} \sum_{c_l=1}^{C_l} \sum_{m_l=1}^{M_l} \sum_{k_l=1}^{K_l} \left\lVert W^{(l)}_{:,c_l,m_l,k_l} \right\rVert_2$$
Since in GEMM the weight tensor is flattened into a matrix (via im2col), combining filter-level and shape-level sparsity amounts to sparsifying the rows and columns of this 2-D matrix; rows and columns whose entries are all zero can then be removed, shrinking the matrix dimensions and improving the model's computational efficiency. This is a structured method with coarse pruning granularity, and it works with many off-the-shelf algorithm libraries, but the added regularization makes training harder to optimize and convergence less certain.
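As a rough illustration of how such group terms look in code, here is a minimal PyTorch sketch of filter-wise and channel-wise group lasso penalties; the layer sizes and λ are assumptions, not values from the paper:

```python
import torch

def group_lasso_filters(w, lam=1e-3):
    """Filter-wise group lasso: one L2 group per output filter.
    w has shape (N, C, H, W); summing the L2 norms of the N filters
    pushes whole filters toward zero."""
    return lam * w.reshape(w.shape[0], -1).norm(dim=1).sum()

def group_lasso_channels(w, lam=1e-3):
    """Channel-wise group lasso: one L2 group per input channel."""
    return lam * w.transpose(0, 1).reshape(w.shape[1], -1).norm(dim=1).sum()

conv = torch.nn.Conv2d(64, 128, 3)
penalty = group_lasso_filters(conv.weight) + group_lasso_channels(conv.weight)
# add `penalty` to the task loss before calling backward()
```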
Dynamic Network Surgery for Efficient DNNs
The authors propose a dynamic model pruning method consisting of two operations: pruning and splicing. Pruning cuts away weights that are not needed, but it is often impossible to judge directly which weights are important, so a splicing operation is added that restores important weights that were wrongly pruned, much like a surgical procedure that patches important structures back in. The algorithm works as follows:
$$T_k^{(i,j)} = h_k\!\left(W_k^{(i,j)}\right) = \begin{cases} 0, & a_k > \left|W_k^{(i,j)}\right| \\ T_k^{(i,j)}, & a_k \le \left|W_k^{(i,j)}\right| < b_k \\ 1, & b_k \le \left|W_k^{(i,j)}\right| \end{cases} \qquad W_k \leftarrow W_k - \beta \frac{\partial L(W_k \odot T_k)}{\partial (W_k \odot T_k)}$$
The authors implement this by attaching a mask T to W. T is a binary matrix that acts as a mask: where an entry is 1 the corresponding weight is kept, and where it is 0 the weight is pruned. During training, the learned mask removes the genuinely unimportant values from the weights, making them sparse. Because deleting some connections changes the importance of the other connections in the network, it is better to retrain the pruned network by minimizing the loss function.
The algorithm separates pruning from splicing and carries out compression in step with training. By introducing the splicing operation, it avoids the performance loss caused by incorrect pruning and, in practice, comes closer to the theoretical limit of network compression. It is an unstructured method; moreover, suitable values of a_k and b_k differ across models and layers, and the result is easily limited by sparse matrix libraries and memory bandwidth.
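Below is a minimal sketch of the pruning-and-splicing loop in the spirit of this method; the thresholds a and b, the layer, the data, and the plain SGD update are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def update_mask(weight, mask, a=0.05, b=0.10):
    """Prune entries with |w| < a; splice back entries with |w| >= b;
    leave the mask unchanged in between, like the a_k/b_k thresholds."""
    w = weight.detach().abs()
    mask[w < a] = 0.0
    mask[w >= b] = 1.0

layer = torch.nn.Linear(256, 256)
mask = torch.ones_like(layer.weight)
x, target = torch.randn(8, 256), torch.randn(8, 256)

for step in range(100):
    update_mask(layer.weight, mask)
    # forward pass uses the masked product W * T as a leaf variable
    wm = (layer.weight * mask).detach().requires_grad_(True)
    loss = F.mse_loss(x @ wm.t() + layer.bias, target)
    loss.backward()
    with torch.no_grad():
        # W is updated with the gradient w.r.t. the masked product, so
        # pruned weights keep receiving gradient and can grow back above
        # b, at which point they are spliced back in
        layer.weight -= 0.01 * wm.grad
        layer.bias -= 0.01 * layer.bias.grad
        layer.bias.grad = None
print("pruned fraction:", (mask == 0).float().mean().item())
```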
Training Skinny Deep Neural Networks with Iterative Hard Thresholding Methods
The authors want to train a sparse network to reduce the model's computational cost. Adding an L0 norm on W to the network's loss function would reduce the density of W, but the L0 norm leads to an NP-hard problem that is difficult to optimize, so the authors train this sparse network by a different route. The algorithm proceeds as follows:
First train the network normally for s1 rounds. Then O_k(W) selects the k largest-magnitude values in W and sets the rest to 0, and supp(W, k) denotes the indices of the k largest values in W. Training continues for s2 rounds, updating only the non-zero entries of W; then the entries previously set to 0 are released and allowed to update again, training continues for another s1 rounds, and this alternation repeats until training ends. This is again a way of constraining the parameters, pruning while training: values judged unimportant are cut first, and important but mistakenly pruned parameters are recovered in the release phase. It is likewise an unstructured, train-and-prune approach; the performance is good, but the degree of compression is hard to guarantee.
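A minimal sketch of the hard-thresholding step and the alternating schedule follows; the keep ratio and round counts are illustrative, and the training loops themselves are elided:

```python
import torch

def hard_threshold(weight, k):
    """O_k(W): keep the k largest-magnitude entries of W and zero the
    rest; the returned mask plays the role of supp(W, k)."""
    flat = weight.detach().abs().flatten()
    mask = torch.zeros_like(flat)
    mask[flat.topk(k).indices] = 1.0
    mask = mask.view_as(weight)
    with torch.no_grad():
        weight *= mask
    return mask

layer = torch.nn.Linear(128, 128)
k = int(0.1 * layer.weight.numel())  # keep 10% of the weights
for cycle in range(3):
    # ... train all weights normally for s1 rounds ...
    mask = hard_threshold(layer.weight, k)
    # ... train for s2 rounds, re-applying `layer.weight *= mask` (under
    #     torch.no_grad) after each step so only non-zero entries update,
    #     then release the mask and repeat ...
```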
Summary
The three papers above are all kernel sparsification methods: they all constrain parameter updates during training to push the weights toward sparsity, or prune unimportant connections during training. The first paper produces structured sparsity that GEMM matrix operations can exploit directly for acceleration. The second paper likewise constrains the weights as they are updated. Although constraining the weight updates achieves sparsity well, it makes training harder to optimize and weakens the model's convergence. In addition, the second and third papers produce unstructured sparsity, which is easily limited by sparse matrix libraries and memory bandwidth; both use a surgery-like recovery step after truncation to reduce the risk of pruning important parameters. Other model compression methods will be introduced in subsequent posts.