Regularization in deep learning and machine learning
Regularization is one of the most effective techniques for suppressing over-fitting in machine learning and deep learning. A good algorithm should generalize well, that is, it should perform well not only on the training data but also on unseen test data. Regularization is a general term for strategies that explicitly reduce the generalization error in order to improve the versatility of an algorithm. Because deep learning models have many hidden units and a very large number of parameters, regularization becomes even more important. This article discusses the concept from two aspects: the definition of regularization and the classification of regularization strategies.
First, the definition of regularization:
Regularization is defined as a modification of a learning algorithm whose purpose is to reduce the generalization error. Generally speaking, a reduction in generalization error comes at the cost of an increase in training error, but better algorithms can balance the two and achieve good performance on both the generalization error and the training error.
Regularization can be seen as an application of Occam's razor to learning algorithms. Occam's razor states that when two hypotheses have the same explanatory and predictive power, the simpler hypothesis should be preferred. In machine learning, regularization correspondingly biases the learning process toward the simpler model.
From the perspective of probability theory, many regularization techniques correspond to imposing a prior distribution on the model parameters, which changes the composition of the generalization error. Regularization is a compromise between under-fitting and over-fitting: it significantly reduces variance without excessively increasing bias. Regularization changes the distribution learned by the model so that it matches the true data-generating process as closely as possible.
The task of machine learning is to fit a mapping from input x to output y, and the fitting process is the process of minimizing a risk function. With regularization, the function to be minimized contains not only the empirical error but also an artificially introduced regularization term. Since the expected risk over the unknown data distribution cannot be computed directly, a training data set is introduced: the expected risk is approximated by the empirical risk computed on the training set, and it is minimized by minimizing that empirical risk.
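As a rough illustration of this idea, the sketch below minimizes an empirical risk (mean squared error on a training set) plus an artificially introduced regularization term. The linear model, the squared-error loss, and the coefficient `lam` are assumptions chosen for simplicity, not something specified in the text.

```python
import numpy as np

def regularized_risk(w, X, y, lam):
    """Empirical risk (MSE on the training set) plus an L2 regularization term."""
    empirical_risk = np.mean((X @ w - y) ** 2)
    penalty = lam * np.sum(w ** 2)
    return empirical_risk + penalty

def fit(X, y, lam=0.1, lr=0.01, steps=1000):
    """Approximate the expected risk by the empirical risk and minimize it by gradient descent."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / n + 2 * lam * w
        w -= lr * grad
    return w
```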
The above is the overall flow of a learning algorithm, and it is also where regularization plays its central role. Regularization strategies can be classified according to which component of the learning algorithm they act on.
Second, the classification of regularization strategies:
1. Regularization based on training data: The quality of the trained model depends largely on the training data. In addition to selecting training data with less noise, regularization can be used to improve the quality of the training data. One purpose of regularizing the training data is to perform pre-processing and feature extraction, modifying the feature space or the data distribution; another purpose is to create an augmented data set with a larger, or even unlimited, capacity by generating new samples. These two purposes are independent of each other and can therefore be pursued separately. Regularization based on training data includes the following two common methods:
Dataset augmentation: This form of regularization applies a transformation to the training data set to generate a new training data set. The transformation is a function of the original sample together with a random variable drawn from some probability distribution; the simplest example is adding random Gaussian noise to the data. Since the most straightforward way to make a machine learning model generalize better is to train it on more data, transformations with random parameters can be used to generate "false" data, and this approach is called dataset augmentation.
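A minimal sketch of this kind of augmentation, assuming NumPy arrays and a hypothetical noise scale `sigma`: each call draws fresh random noise, so the augmented set can in principle be made as large as desired.

```python
import numpy as np

def augment_with_noise(X, y, copies=4, sigma=0.05, rng=None):
    """Generate 'false' training samples by adding random Gaussian noise to the inputs."""
    rng = np.random.default_rng() if rng is None else rng
    X_aug, y_aug = [X], [y]
    for _ in range(copies):
        X_aug.append(X + rng.normal(scale=sigma, size=X.shape))
        y_aug.append(y)  # labels are left unchanged by the small input perturbation
    return np.concatenate(X_aug), np.concatenate(y_aug)
```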
Dropout: Dropout is an ensemble method that integrates multiple models to reduce the generalization error. It counts as regularization based on training data because it trains different models on different data sets, each obtained by sampling from the original data set. The key idea of Dropout is to randomly drop neurons and their connections from the neural network during training, producing a thinned network. At test time, a single network with scaled-down weights approximates averaging the predictions of all these thinned networks. Its advantages are that the computation is simple and that the method applies broadly across different models and training procedures. However, Dropout demands a relatively large training set; with only a small number of training samples its benefits do not show.
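The sketch below uses the standard "inverted dropout" formulation, which is an assumption on my part (the text does not specify a variant): units are dropped at random during training and the surviving activations are rescaled, so that at test time the full network can be used unchanged as an approximation to the ensemble average.

```python
import numpy as np

def dropout_forward(h, p_drop=0.5, training=True, rng=None):
    """Randomly drop units during training; use the full network unchanged at test time."""
    if not training or p_drop == 0.0:
        return h  # test time: the single full network approximates the ensemble average
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * mask / (1.0 - p_drop)  # inverted dropout: rescale so the expected activation is unchanged
```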
2. Regularization based on network architecture: To fit the data well, the mapping from input to output must have certain properties, and making assumptions about the input-output mapping corresponds to choosing the network architecture. This gives rise to regularization approaches based on the network architecture. The assumptions about the mapping may concern the detailed operations at different levels of the deep network as well as the connections between layers. Regularization based on network architecture usually simplifies the assumptions about the mapping and then makes the network architecture approach this simplified mapping. This limits the search space of the model and makes it possible to find a better solution.
Parameter sharing: Parameter sharing is a regularization method that reuses parameters. By forcing certain parameters to be equal, different models (or parts of a model) can share a single set of parameters, so that they produce similar outputs for similar inputs. If the condition of parameter sharing is relaxed so that the parameters are not required to be equal but only close to each other, the corresponding technique is to add a regularization term on the norm of the parameter difference, as in the sketch below. A common application is to regularize the parameters of a supervised learning model to stay close to those of another, unsupervised learning model, whose parameters can be matched with those of the supervised model.
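A hedged sketch of this "soft" sharing, where two sets of parameters are encouraged to stay close rather than forced to be equal; the two weight matrices and the coefficient `lam` are illustrative assumptions. Hard sharing would simply reuse the same parameter array in both models.

```python
import numpy as np

def soft_sharing_penalty(w_supervised, w_unsupervised, lam=0.1):
    """Regularization term that pushes the parameters of two models toward each other
    without forcing them to be exactly equal (hard parameter sharing)."""
    return lam * np.sum((w_supervised - w_unsupervised) ** 2)
```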
Regularization of transfer functions: Some transfer (activation) functions are designed specifically with regularization in mind, such as the maxout unit used together with Dropout, which more accurately approximates the ensemble average of the model predictions at test time. After adding noise, the original transfer function can also be generalized into a stochastic model, and its distributional properties can be exploited.
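As a rough illustration of a maxout unit (an assumed formulation; the text only names the unit), each output takes the maximum over k affine pieces of the input, which gives a piecewise-linear activation that is learned rather than fixed.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: x has shape (d_in,), W has shape (d_in, d_out, k), b has shape (d_out, k).
    Each output is the maximum over k affine pieces of the input."""
    z = np.einsum('i,ijk->jk', x, W) + b  # all k affine pieces, shape (d_out, k)
    return z.max(axis=1)                  # elementwise max over the pieces, shape (d_out,)
```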
3. Regularization based on the error function and regularization based on the regularization term:
Regularization based on the error function and regularization based on the regularization term can be discussed together. Ideally, the error function should properly reflect the performance of the algorithm and capture some characteristics of the data distribution (for example, mean squared error or cross entropy). Regularizing the error function is equivalent to adding an additional learning task, which changes the optimization objective and manifests itself as an additional regularization term in the error function. Therefore, in most cases, the discussion of regularization based on a regularization term also covers regularization based on the error function.
The regularization term is also called a penalty term. Unlike the error function, the regularization term is independent of the target output; instead, it expresses other properties that the desired model should have. The error function represents the discrepancy between the algorithm's output and the true target output, while the regularization term represents additional assumptions about the mapping. This property means that the value of the regularization term can be computed from unlabeled test samples, so that test data can be used to improve the learning model.
A commonly used regularization term is weight decay. The parameters in deep learning include the weight coefficients and the biases of each neuron. Since each weight specifies how two variables interact, fitting the weights well requires much more data than fitting the biases. In contrast, each bias controls only a single variable: leaving it unregularized does not produce too much variance, and not regularizing it does not increase the bias of the algorithm either. This is why the regularized parameters usually include only the weights and not the biases.
In weight decay, the regularization term is expressed as a norm. The commonly used norms are the L1 norm and the L2 norm, corresponding to LASSO regression and ridge regression respectively.
When the L2 norm is used as the regularization term, its effect is to pull the weight coefficients toward the origin. After weight decay is introduced, the weight vector is multiplicatively shrunk before each gradient update step. Overall, weights along directions that significantly reduce the objective function are kept relatively intact, while weights along directions that contribute little to reducing the objective function are gradually attenuated by the regularization. From the perspective of the generalization error, the L2 norm makes the algorithm perceive the inputs as having higher variance, so the weights on features whose covariance with the output target is low are shrunk.
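The per-step shrinkage described above can be written as the following update, a minimal sketch assuming plain gradient descent with a hypothetical learning rate `lr` and decay coefficient `lam`.

```python
def sgd_step_with_weight_decay(w, grad, lr=0.01, lam=1e-4):
    """L2 weight decay: shrink the weight vector toward the origin, then apply the gradient step."""
    w = (1.0 - lr * lam) * w   # multiplicative shrinkage contributed by the L2 penalty
    return w - lr * grad       # ordinary gradient update on the unregularized objective
```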
In contrast, the L1 norm behaves quite differently from the L2 norm. L1 regularization yields a sparse solution, directly setting a portion of the weights to zero. This amounts to selecting a meaningful subset of the available features, which simplifies the learning problem.
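To make the sparsity concrete, here is a hedged sketch of the soft-thresholding (proximal) step associated with an L1 penalty, which zeroes out small weights outright; the specific optimizer and threshold are assumptions for illustration.

```python
import numpy as np

def soft_threshold(w, thresh):
    """Proximal step for the L1 penalty: weights smaller in magnitude than the threshold
    are set exactly to zero, which is what makes the L1 solution sparse
    (in contrast to the smooth multiplicative shrinkage of L2)."""
    return np.sign(w) * np.maximum(np.abs(w) - thresh, 0.0)
```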