About machine learning in artificial intelligence
1 Definition and focus of machine learning
Machine learning studies how to use experience to improve a system's own performance.
Experience - data
Computation - machine learning algorithm
Feed empirical data to a machine learning algorithm, and you obtain a model.
Authoritative definition:
A computer program is said to learn from experience E with respect to some task T and performance measure P if its performance on T, as measured by P, improves with experience E.
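As a minimal sketch of this "experience in, model out" loop (the iris dataset and the decision tree are arbitrary illustrative choices, not part of the original text):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                    # experience E: labeled samples
model = DecisionTreeClassifier(random_state=0).fit(X, y)  # computation: the learning algorithm
print(model.predict(X[:3]))                          # the fitted model maps inputs to predictions
```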
2 Terms
Sample
Feature
Feature value
Hypothesis (hypothesis space)
Ground truth
Training sample & test sample
Classification & regression & clustering
Supervised learning & unsupervised learning
Generalization: the performance of a learned model on data it has never seen before. The better the generalization performance, the more broadly applicable the model.
Prerequisite: the training samples we obtain are only a small part of the sample space, yet we hope the final model performs well across the entire sample space. This hope rests on an assumption: the samples are independent and identically distributed (i.i.d.), i.e., every sample in the sample space is drawn from the same distribution, independently of the others. Under this assumption, even if only a small subset of samples is used as the training set, a learner that captures the underlying regularities can model the whole sample space reasonably well, yielding a model with good generalization ability. The i.i.d. assumption is a strong one: in many cases the training data we can obtain is so limited that we see only a part of the picture, not the whole.
Hypothesis Space & Version Space: All possible situations constitute a hypothesis space, such as a watermelon classification problem, which can be solved by logistic regression, or can be solved by decision tree, SVM, etc. Each algorithm can have more than N parameter combinations, although these The model is correct, there is something wrong, but these assumptions are all possible, then they all belong to the hypothesis space. In other words, it is assumed to be bold. But blindly, it’s a coward. What are we going to do next? We must be careful to verify and do it. First, we must remove the assumptions that are inconsistent with the actual situation from the hypothesis space. What the actual situation means is that we have taken it at hand. After the training data is removed and the assumptions that do not match the reality are removed, the rest are more reliable assumptions that make up the version space.
Inductive bias: a learning algorithm's preference for certain types of hypotheses during learning. Every machine learning algorithm has its own bias. Inductive bias can be seen as the heuristic or "values" by which the algorithm chooses a hypothesis from a potentially huge version space; it is a form of prior information. Occam's razor is a commonly used principle: "if several hypotheses are consistent with the observed training samples, choose the simplest one." Occam's razor is not the only principle for inductive bias, and even the "simplicity" it refers to can be interpreted in many ways. In practice, whether an algorithm's inductive bias matches the problem at hand often directly determines whether the algorithm performs well.
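A toy sketch of the razor as a tie-breaker, assuming four hypothetical data points that actually lie on a line: a degree-1 and a degree-3 polynomial both fit the training points exactly, and the razor says keep the line.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1                      # data generated by a line (an assumption for illustration)
line = np.polyfit(x, y, 1)         # simple hypothesis: degree 1
cubic = np.polyfit(x, y, 3)        # complex hypothesis: degree 3
# Both hypotheses interpolate all four training points; Occam's razor
# prefers the simpler one. Here they even agree on a new input:
print(np.poly1d(line)(4.0), np.poly1d(cubic)(4.0))   # both ~9.0
```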
No free lunch: no algorithm outperforms another on all tasks. Talking about which algorithm is "better" divorced from a concrete problem is meaningless; when comparing algorithms, analyze the specific problem. Whether a learning algorithm's inductive bias matches the problem often plays the decisive role.
3 Model evaluation and selection
3.1 Empirical error and overfitting
The deviation between a model's predictions and the actual values is called the error.
The error on the training data is called the "training error" or "empirical error"; the error on new samples is called the "generalization error".
The ultimate goal is a model with a small generalization error. During training, however, we do not know what new samples will look like; all we can do is work on the training set and try to obtain a model with small error there. One may reasonably ask whether performing well on the training set implies performing well on new samples. Later we will see a proof that, with high probability, the difference between the training error and the generalization error is small. That is, if a model's training error is small, then with high probability it will also do reasonably well on new samples. However, if the performance on the training set is too good, something is usually amiss, and the model often does poorly on new samples.
Put intuitively, we hope the model learns the "universal laws" behind the training data while ignoring whatever is unique and idiosyncratic to the training set. That is a demanding requirement: we are asking the model to be wise, to grasp the general principle, and not to be misled by incidental details.
The terminology for this problem is "underfitting" and "overfitting". Underfitting means the model's learning capacity is insufficient and it fails to learn the general law behind the training set; overfitting means the model's learning capacity is too strong and it treats the idiosyncrasies of the training set as if they were universal laws. Overfitting is a central difficulty in machine learning: it cannot be avoided, only alleviated, through techniques such as regularization and dropout.
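A hedged sketch of the three regimes on assumed toy data (the dataset, degrees, and ridge penalty are illustrative only): a low-degree model underfits, a high-degree model drives the training error toward zero by fitting noise, and regularization, here ridge, alleviates the overfitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)  # noisy sine

for name, model in [
    ("degree 1 (underfits)", make_pipeline(PolynomialFeatures(1), LinearRegression())),
    ("degree 15 (overfits)", make_pipeline(PolynomialFeatures(15), LinearRegression())),
    ("degree 15 + ridge", make_pipeline(PolynomialFeatures(15), Ridge(alpha=1e-3))),
]:
    model.fit(X, y)
    # High training R^2 alone is not success: the degree-15 fit chases noise.
    print(name, "training R^2 =", round(model.score(X, y), 3))
```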
3.2 Evaluation method
Usually, after learning a model on the training set, we need to measure its performance on test samples and use the test error as an approximation of the generalization error. Note that the test samples must never have appeared in the training set; otherwise we will obtain overly "optimistic" test results.
However, when we start training a model, we have only a single data set. How do we use it both to train and to test? The answer is to divide it into two parts, a training set and a test set. The common ways of splitting are as follows.
3.2.1 Hold-out method
The idea of the hold-out method is to divide the data set into two disjoint subsets, one used for training and the other for testing. Typically 70% of the original data set serves as the training set and the remaining 30% as the test set.
Some points to note when using the hold-out method:
Stratified sampling: when splitting into training/test sets, preserve the original sample distribution as much as possible to avoid introducing bias during sampling. For example, if the original data set contains 500 positive and 500 negative samples, then a 70/30 split should give a training set with 350 positive and 350 negative samples, and a test set with 150 positive and 150 negative samples. This avoids introducing new bias during the split.
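A minimal scikit-learn sketch of a stratified 70/30 hold-out split, mirroring the hypothetical 500/500 example above (the features are random placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 5)             # placeholder features
y = np.array([1] * 500 + [0] * 500)      # 500 positive, 500 negative

# stratify=y keeps the 1:1 class ratio in both subsets:
# 700 training samples (350/350) and 300 test samples (150/150).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
print(np.bincount(y_train), np.bincount(y_test))   # [350 350] [150 150]
```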
Repeated splits: in the example above, choosing 350 of the 500 positive samples for the training set can be done in $\binom{500}{350}$ ways, and different training sets have some influence on the final test result. The usual practice is therefore to perform several random splits, repeat the experiment, and take the mean as the final hold-out estimate.
The hold-out method also has an inherent tension. If the training set keeps most of the samples of the original data set, the trained model is closer to the one that would be trained on the full data set, but the test set is then too small and the performance estimate has high variance; if the training set's share is reduced, the model trained on it deviates more from the model trained on the full data set. There is no complete solution to this problem; generally, about 2/3 to 4/5 of the original data set is used for training and the rest for testing.
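A sketch of the repeated random hold-out estimate described above, on an assumed synthetic dataset (the classifier choice is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
scores = []
for seed in range(10):  # 10 random stratified 70/30 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))
print(np.mean(scores))  # mean accuracy as the hold-out estimate
```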
3.2.2 Cross-validation
Cross-validation divides the data set into K equal-sized, mutually exclusive parts, keeping the data distribution as consistent as possible (each subset is obtained by stratified sampling). Each time, K-1 of the parts are used as the training set and the remaining part as the test set, giving K rounds of training and testing; the mean of the K test results is returned as the result of a single run of cross-validation. This is known as K-fold cross-validation; a common choice is K = 10.
Also, to reduce the impact of how the data happens to be partitioned on the final estimate, K-fold cross-validation is often repeated several times: for example, 10-fold cross-validation repeated 20 times, with the final result taken as the mean of the 200 test results.
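A sketch of 10-fold cross-validation repeated 20 times with scikit-learn, on assumed synthetic data; the 200 fold scores are averaged just as described above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(len(scores), scores.mean())   # 200 fold scores, averaged into one estimate
```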
Leave-one-out: when K = m, only one sample is held out as the test set each time, so every training set is nearly identical to the original data set and the performance estimate is generally regarded as quite accurate (though with a single sample per test, individual results fluctuate a lot). The advantage of leave-one-out is accuracy; the drawback is that when m is large, m models must be trained, which is computationally expensive.
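A leave-one-out sketch using scikit-learn's LeaveOneOut splitter; the dataset is a small assumed example, since K = m means m models must be trained:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)    # m = 150, so 150 fits
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print(scores.mean())   # each fold score is 0 or 1; the mean is the estimate
```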
3.2.3 Bootstrap method
The bootstrap uses sampling with replacement: m samples are drawn at random, with replacement, from the original data set of m samples to form the training set. Some samples of the original data set therefore appear in the training set several times, while others never appear at all; the samples that never appear are used as the test set. In this way roughly 2/3 (more precisely, about 63.2%) of the original samples end up in the training set, and the remaining samples serve as the test set.
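A small numpy sketch of bootstrap sampling on an assumed index set, verifying the roughly 63.2% / 36.8% in-bag vs. out-of-bag split (the limit of (1 - 1/m)^m is 1/e ≈ 0.368):

```python
import numpy as np

m = 10_000
rng = np.random.default_rng(0)
boot = rng.integers(0, m, size=m)          # m draws with replacement
in_bag = np.unique(boot)                   # distinct indices in the training set
oob = np.setdiff1d(np.arange(m), in_bag)   # out-of-bag indices -> test set
print(len(in_bag) / m)   # ~0.632 of samples appear at least once
print(len(oob) / m)      # ~0.368 of samples are never drawn
```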
The bootstrap is most useful when the original data set is small and hard to split effectively into training and test sets. However, it changes the original distribution of the data and introduces extra bias, so when enough samples are available, models are usually evaluated with the hold-out method or cross-validation instead.