Gradient Descent Optimization Algorithms in Deep Learning
1 Introduction
Gradient descent is one of the most popular optimization algorithms and is by far the most common way to optimize neural networks. At the same time, every state-of-the-art deep learning library contains implementations of various gradient descent optimization algorithms (see, for example, lasagne, caffe, and keras). However, these algorithms are often used as black-box optimizers, and practical explanations of their strengths and weaknesses are therefore hard to come by.
The purpose of this article is to provide the reader with an intuitive understanding of the behavior of different algorithms for optimizing gradient descent, to help put them to use. In Section 2, we first introduce the different variants of gradient descent. In Section 3, we briefly summarize the challenges faced during training. Subsequently, in Section 4, we introduce the most common optimization algorithms, including the motivation behind each one in addressing these challenges and the derivation of its update rule. In Section 5, we briefly discuss algorithms and architectures for optimizing gradient descent in parallel and distributed settings. Finally, in Section 6, we consider additional strategies that are helpful for optimizing gradient descent.
Gradient descent is a way of minimizing an objective function J(θ) parameterized by the model's parameters θ: it updates the parameters in the direction opposite to the gradient of the objective function with respect to the parameters. The learning rate η determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the objective function downhill until we reach the bottom of a valley. If you are unfamiliar with gradient descent, you can find a good introduction to optimizing neural networks at http://cs231n.github.io/optimization-1/.
2 Gradient descent variants
There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.
2.1 Batch gradient descent
Vanilla gradient descent, also known as batch gradient descent, computes the gradient of the loss function with respect to the parameters over the entire training data set:
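In standard notation, with θ the parameters, η the learning rate, and J(θ) the objective, this update is:

θ = θ − η · ∇_θ J(θ)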
Since we need to compute the gradients over the whole data set to perform a single update, batch gradient descent can be very slow and cannot handle data sets that do not fit in memory. Batch gradient descent also does not allow us to update the model online, i.e. new examples cannot be added on the fly.
The code for the batch gradient descent method is as follows:
for i in range(nb_epochs):
  # compute the gradient of the loss w.r.t. the full training set
  params_grad = evaluate_gradient(loss_function, data, params)
  # step in the direction opposite to the gradient
  params = params - learning_rate * params_grad
For a pre-defined number of epochs, we first compute the gradient vector params_grad of the loss function with respect to the parameter vector params over the whole data set. Note that state-of-the-art deep learning libraries provide automatic differentiation, which computes the gradient with respect to the parameters efficiently. If you derive the gradients yourself, gradient checking is a good idea (see http://cs231n.github.io/neural-networks-3/ for some tips on how to check gradients properly).
We then update the parameters in the direction opposite to the gradient, with the learning rate determining how big a step we take. Batch gradient descent is guaranteed to converge to the global minimum for convex error functions and to a local minimum for non-convex ones.
2.2 Stochastic gradient descent
In contrast, stochastic gradient descent (SGD) performs a parameter update for each individual training example and label:
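In the same notation, with (x^(i), y^(i)) the i-th training example and its label, the update is:

θ = θ − η · ∇_θ J(θ; x^(i), y^(i))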
On large data sets, batch gradient descent performs redundant computations, since it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used for online learning. SGD performs frequent updates with high variance, which causes the objective function to fluctuate heavily, as shown in Figure 1.
Figure 1: SGD fluctuation (Source: Wikipedia)
While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD's fluctuation, on the one hand, enables it to jump to new and potentially better local minima. On the other hand, this complicates convergence to an exact minimum, as SGD keeps overshooting. However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behavior as batch gradient descent, converging to a local minimum for non-convex optimization and to the global minimum for convex optimization. Compared with the batch gradient descent code, the SGD code fragment simply adds a loop over the training examples and evaluates the gradient with respect to each example. Note that we shuffle the training data at every epoch, as explained in Section 6.1.
for i in range(nb_epochs):
  np.random.shuffle(data)   # shuffle the training data at every epoch
  for example in data:
    params_grad = evaluate_gradient(loss_function, example, params)
    params = params - learning_rate * params_grad
2.3 Mini-batch gradient descent
Mini-batch gradient descent finally takes the best of both worlds, performing an update for every mini-batch of n training examples:
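In the same notation, with n the mini-batch size, the update is:

θ = θ − η · ∇_θ J(θ; x^(i:i+n), y^(i:i+n))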
This way, it a) reduces the variance of the parameter updates, which can lead to more stable convergence; and b) can make use of the highly optimized matrix operations found in state-of-the-art deep learning libraries to compute the gradient with respect to a mini-batch very efficiently. Common mini-batch sizes range between 50 and 256, but can vary for different applications. Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD is usually also employed when mini-batches are used. Note: in the modifications of SGD below, we leave out the mini-batch arguments for simplicity.
In code, instead of iterating over all examples, we now iterate over mini-batches of size 50:
for i in range(nb_epochs):
  np.random.shuffle(data)
  for batch in get_batches(data, batch_size=50):
    params_grad = evaluate_gradient(loss_function, batch, params)
    params = params - learning_rate * params_grad
3 Challenges
Vanilla mini-batch gradient descent, however, does not guarantee good convergence, and it leaves us with a number of challenges that need to be addressed:
Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while one that is too large can hinder convergence, causing the loss function to fluctuate around the minimum or even to diverge.
Learning rate schedules [17] try to adjust the learning rate during training, e.g. by annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the decrease in the objective between epochs falls below a threshold (a minimal sketch of such a schedule appears at the end of this section). These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset's characteristics [4].
Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all parameters to the same extent, but rather perform a larger update for rarely occurring features.
Another key challenge when minimizing the highly non-convex error functions common to neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [5] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where the surface slopes up in one dimension and down in another. These saddle points are usually surrounded by a plateau of points with the same error, and since the gradient is close to zero in all dimensions, it is notoriously hard for SGD to escape from them.
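To make the annealing idea mentioned above concrete, here is a minimal sketch of a pre-defined step-decay schedule in the style of the earlier code fragments; the starting rate and decay factor are illustrative assumptions, not prescribed values:

initial_lr = 0.1                    # assumed starting learning rate
for epoch in range(nb_epochs):
  # halve the learning rate every 10 epochs, per a fixed schedule
  learning_rate = initial_lr * 0.5 ** (epoch // 10)
  np.random.shuffle(data)
  for batch in get_batches(data, batch_size=50):
    params_grad = evaluate_gradient(loss_function, batch, params)
    params = params - learning_rate * params_grad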
4 Gradient descent optimization algorithms
Below, we outline some algorithms that are widely used by the deep learning community to deal with the aforementioned challenges. We will not discuss algorithms that are infeasible to compute in practice for high-dimensional data sets, e.g. second-order methods such as Newton's method.
4.1 Momentum method
SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another [19], which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum. This process is shown in Figure 2a.
Figure 2: SGD without momentum (a) and with momentum (b) (Source: Genevieve B. Orr)
As shown in Figure 2b, momentum [16] is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction of the update vector of the past time step to the current update vector (note: some implementations exchange the signs in the equations):
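In standard notation, with γ the momentum term and v_t the update vector at time step t:

v_t = γ v_{t−1} + η ∇_θ J(θ)
θ = θ − v_t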
The momentum term γ is usually set to 0.9 or a similar value.
Essentially, when using momentum, it is as if we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster (until it reaches its terminal velocity, if there is air resistance, i.e. γ < 1). The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same direction and decreases for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.
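To make this concrete, here is a minimal sketch of the momentum update in the style of the earlier code fragments, reusing the evaluate_gradient and get_batches helpers and assuming params is a NumPy array:

gamma = 0.9                       # momentum term
v = np.zeros_like(params)         # update (velocity) vector, initially zero
for i in range(nb_epochs):
  np.random.shuffle(data)
  for batch in get_batches(data, batch_size=50):
    params_grad = evaluate_gradient(loss_function, batch, params)
    v = gamma * v + learning_rate * params_grad  # accumulate a fraction of past updates
    params = params - v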
4.2 Nesterov accelerated gradient descent method
However, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We would like a smarter ball, one that has a notion of where it is going, so that it knows to slow down before the slope rises again.
Nesterov accelerated gradient (NAG) [13] is a way to give our momentum term this kind of prescience. We know that we will use the momentum term γ v_{t−1} to move the parameters θ. Computing θ − γ v_{t−1} thus gives us an approximation of the next position of the parameters (the gradient is missing for the full update), a rough idea of where our parameters are going to be. We can now effectively look ahead by calculating the gradient not with respect to our current parameters but with respect to the approximate future position of our parameters:
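In the same notation, the look-ahead update is:

v_t = γ v_{t−1} + η ∇_θ J(θ − γ v_{t−1})
θ = θ − v_t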
Again, we set the momentum term γ to a value of around 0.9. While momentum first computes the current gradient (small blue vector in Figure 3) and then takes a big jump in the direction of the updated accumulated gradient (big blue vector), NAG first makes a big jump in the direction of the previously accumulated gradient (brown vector), measures the gradient, and then makes a correction (green vector). This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly improved the performance of RNNs on a number of tasks [2].
Figure 3: Nesterov update (Source: Lecture 6c by G. Hinton)
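As an illustration, here is a minimal sketch of the NAG update, again reusing the helpers from the earlier fragments; note that the gradient is evaluated at the look-ahead position rather than at the current parameters:

gamma = 0.9
v = np.zeros_like(params)
for i in range(nb_epochs):
  np.random.shuffle(data)
  for batch in get_batches(data, batch_size=50):
    lookahead = params - gamma * v  # approximate future position of the parameters
    params_grad = evaluate_gradient(loss_function, batch, lookahead)
    v = gamma * v + learning_rate * params_grad
    params = params - v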
Another intuitive explanation of NAG can be found at http://cs231n.github.io/neural-networks-3/, while Ilya Sutskever gives a more detailed overview in his doctoral thesis [18].
Now that we are able to adapt our updates to the slope of the error function and speed up SGD in turn, we would also like to adapt our updates to each individual parameter, performing larger or smaller updates depending on its importance.
4.3 Adagrad
Adagrad [7] is a gradient-based optimization algorithm that does just this: it adapts the learning rate to the parameters, using larger learning rates for features that occur infrequently and smaller learning rates for those that occur often. For this reason, Adagrad is well suited to dealing with sparse data. Dean et al. [6] found that Adagrad greatly improved the robustness of SGD and used it for training large-scale neural networks at Google, which, among other things, learned to recognize cats in YouTube videos. Moreover, Pennington et al. [15] used Adagrad to train GloVe word embeddings, as infrequent words require much larger updates than frequent ones.
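In standard notation, with g_{t,i} the gradient of the objective with respect to parameter θ_i at time step t, G_t a diagonal matrix whose entry G_{t,ii} accumulates the squared gradients of θ_i up to time step t, and ε a smoothing term, the per-parameter update is:

θ_{t+1,i} = θ_{t,i} − η / √(G_{t,ii} + ε) · g_{t,i}

A minimal sketch in the style of the earlier fragments (assuming, as before, NumPy arrays and the evaluate_gradient and get_batches helpers):

eps = 1e-8                         # smoothing term to avoid division by zero
grad_sq_sum = np.zeros_like(params)  # running sum of squared gradients (diagonal of G_t)
for i in range(nb_epochs):
  np.random.shuffle(data)
  for batch in get_batches(data, batch_size=50):
    params_grad = evaluate_gradient(loss_function, batch, params)
    grad_sq_sum = grad_sq_sum + params_grad ** 2
    # per-parameter learning rate shrinks as squared gradients accumulate
    params = params - learning_rate / np.sqrt(grad_sq_sum + eps) * params_grad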