The perceptron: a basic algorithm of artificial intelligence (AI)
1. Introduction
The perceptron is a relatively simple binary classifier. By learning from a training sample set, it obtains the weights of a discriminant function, producing a discriminant function for linearly separable samples. The algorithm is nonparametric: its advantage is that it needs no assumptions about the statistical properties of the sample classes, and it belongs to the deterministic methods.
Although it is simple, the perceptron algorithm is the foundation of the support vector machine (SVM) and of neural network algorithms, so it must be understood thoroughly in order to learn those methods well later. Let us now introduce the perceptron algorithm.
2. Perceptron classification model
Given a training data set $T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$,
the classification model is
$$h_\theta(x) = g(\theta^T x + b)$$
where $\theta$ is the weight vector, $x$ is the input vector, and $b$ is the bias. The activation $g$ is the sign function:
$$g(z) = \begin{cases} 1, & z > 0 \\ -1, & z < 0 \end{cases}$$
Intuitively, an object is described by many features, and these features matter to different degrees for the classification problem: the greater a feature's weight, the stronger the correlation between that feature and the problem.
The bias $b$ exists to account for the features that have been neglected;
from a geometric point of view, the bias $b$ allows the linear equation $\theta^T x + b = 0$ to represent every possible separating surface (with two input features the boundary is a straight line, with three it is a plane, and with more than three it is a hyperplane).
So if the features are well chosen, $b$ will be very small, because the selected features already describe the problem well; otherwise $b$ will be very large, indicating that the chosen features are poor.
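To make the model concrete, here is a minimal sketch of the classifier in Python with NumPy. The function names are our own, and so is the convention of mapping $z = 0$ to $+1$, since the text leaves $g(0)$ undefined:

```python
import numpy as np

def g(z):
    # Sign activation: +1 for z > 0, -1 for z < 0.
    # The text leaves g(0) undefined; we map 0 to +1 by convention.
    return np.where(z >= 0, 1, -1)

def predict(theta, b, x):
    # Perceptron model: h_theta(x) = g(theta^T x + b)
    return g(np.dot(theta, x) + b)

# Example: theta = (1, -1), b = 0.5 applied to the point x = (2, 1)
theta = np.array([1.0, -1.0])
b = 0.5
x = np.array([2.0, 1.0])
print(predict(theta, b, x))  # theta^T x + b = 1.5 > 0, so the output is +1
```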
3. Weight update rule
First, we need to define a loss function and obtain the weight vector $\theta$ and bias $b$ by minimizing it. Because this is a classification problem, we naturally cannot update based on the raw error between the output and the true value: for example, the separating-line equation may output 0.1 for one point and 10000 for another, yet after passing through $g(z)$ both outputs are 1, even though the raw values differ enormously.
Updating the weights based on the distance from a sample point to the separating plane therefore seems like a good choice. The distance from a point $x_i$ to the separating plane is:
$$L_1 = \frac{|\theta^T x_i + b|}{\|\theta\|}$$
But note that if the weight vector $\theta$ and the bias $b$ are scaled by the same factor at the same time, the separating plane does not change, yet the computed distance does. We therefore usually impose $\|\theta\| = 1$, so that the weights cannot be scaled arbitrarily, and the distance simplifies to:
$$L_2 = |\theta^T x_i + b|$$
How do we eliminate the absolute value? Correctly classified sample points need no weight update, so only the misclassified points drive updates. For a misclassified point, $g(z)$ is not equal to the true label $y_i$; since both take values in $\{-1, 1\}$, we get $-g(z) \cdot y_i > 0$, that is, $-(\theta^T x_i + b) \cdot y_i > 0$.
$L_2$ can therefore be replaced by $-y_i(\theta^T x_i + b)$.
So the loss function can be defined as
$$L(\theta, b) = -\sum_{x_i \in M} y_i (\theta^T x_i + b)$$
where $M$ is the set of misclassified points.
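As a sketch, this loss over the misclassified set $M$ could be computed like so (vectorized with NumPy; the names are our own, with X assumed to hold one sample per row and y the corresponding $\pm 1$ labels):

```python
import numpy as np

def perceptron_loss(theta, b, X, y):
    # L(theta, b) = -sum_{x_i in M} y_i * (theta^T x_i + b),
    # where M is the set of misclassified points.
    scores = X @ theta + b      # theta^T x_i + b for every sample
    in_M = y * scores < 0       # misclassified: y_i * (theta^T x_i + b) < 0
    return -np.sum(y[in_M] * scores[in_M])
```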
With the loss function defined, all that remains is to minimize it.
Here we can use gradient descent (for reasons of space it is not introduced in detail; I will write a dedicated article on gradient descent, so interested readers please stay tuned). Differentiating the loss function $L(\theta, b)$ with respect to $\theta$ and $b$ gives:
$$\nabla_\theta L(\theta, b) = -\sum_{x_i \in M} y_i x_i, \qquad \nabla_b L(\theta, b) = -\sum_{x_i \in M} y_i$$
So for a misclassified point $(x_i, y_i)$, $\theta$ and $b$ are updated as:
$$\theta \leftarrow \theta + \eta y_i x_i, \qquad b \leftarrow b + \eta y_i$$
Here $\eta$ is called the learning rate. Figuratively speaking, the derivative only gives a direction; without a step size in that direction, there is no way to actually move.
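As a small worked example (with numbers of our own choosing): let $\eta = 1$, $\theta = (1, -1)^T$, $b = 0$, and take the point $x_i = (1, 2)^T$ with label $y_i = 1$. It is misclassified, since $y_i(\theta^T x_i + b) = 1 \cdot (1 - 2 + 0) = -1 < 0$. One update gives
$$\theta \leftarrow \theta + \eta y_i x_i = (2, 1)^T, \qquad b \leftarrow b + \eta y_i = 1,$$
after which $y_i(\theta^T x_i + b) = 2 + 2 + 1 = 5 > 0$, so this point is now classified correctly.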
4. Perceptron algorithm training steps
Input: training data set $T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$,
where $x_i \in \mathbb{R}^n$, $y_i \in \{-1, +1\}$, $i = 1, 2, \dots, N$; learning rate $\eta$ with $0 < \eta \leq 1$.
Output: $\theta$, $b$; the perceptron model $h_\theta(x) = g(\theta^T x + b)$.
(1) Choose initial values $\theta_0$, $b_0$.
(2) Select a data point $(x_i, y_i)$ from the training set.
(3) If $y_i(\theta^T x_i + b) < 0$, update
$$\theta \leftarrow \theta + \eta y_i x_i, \qquad b \leftarrow b + \eta y_i$$
(4) Go back to (2), until the training set contains no misclassified points.
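Putting steps (1) through (4) together, here is a minimal sketch of the training loop in Python with NumPy. The names are ours; we initialize $\theta_0 = 0$, $b_0 = 0$, use the non-strict condition $y_i(\theta^T x_i + b) \leq 0$ so that the zero initialization triggers updates, and add a maximum epoch count as a safeguard, since the loop terminates on its own only for linearly separable data:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    # X: (N, n) array of samples; y: length-N array of labels in {-1, +1}.
    theta = np.zeros(X.shape[1])   # (1) initial values theta_0 = 0, b_0 = 0
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):                    # (2) walk through the training set
            if y_i * (np.dot(theta, x_i) + b) <= 0:   # (3) misclassified point
                theta += eta * y_i * x_i              #     theta <- theta + eta * y_i * x_i
                b += eta * y_i                        #     b <- b + eta * y_i
                errors += 1
        if errors == 0:                               # (4) stop: no misclassified points left
            break
    return theta, b

# Example on a tiny linearly separable set
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
theta, b = train_perceptron(X, y)
print(theta, b)  # with eta = 1 this converges to theta = [1. 1.], b = -3.0
```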