Random forest method in machine learning
1. Introduction to Random Forest Principles
A random forest is a classifier that trains multiple trees and combines them to predict samples.
In simple terms, a random forest consists of multiple CARTs (Classification And Regression Trees).
For each tree, the training set it uses is sampled with replacement from the full training set. This means that some samples from the full training set may appear repeatedly in a given tree's training set, while others may never appear in it at all. When training the nodes of each tree, the features used are a randomly drawn fraction of all the features.
According to Leo Breiman's proposal, if the total number of features is N, this fraction can be
a. sqrt(N),
b. (1/2)sqrt(N)
c. 2sqrt(N)
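As a quick illustration, the following Python snippet prints the three candidate per-node feature counts for a hypothetical total of N = 64 features (the value of N is assumed for the example):

```python
import math

# Hypothetical total feature count for the example.
N = 64

# Breiman's three suggested per-node feature-subset sizes.
candidates = {
    "sqrt(N)":       math.sqrt(N),
    "(1/2)*sqrt(N)": 0.5 * math.sqrt(N),
    "2*sqrt(N)":     2 * math.sqrt(N),
}
for name, f in candidates.items():
    print(f"{name}: use about {max(1, round(f))} features per node")
```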
Thus, the random forest training process can be summarized as follows (a minimal code sketch is given after these steps):
(1) Given a training set S, a test set T, and feature dimension F, specify the parameters: the number of CARTs to use t, the maximum depth of each tree d, the number of features used at each node f, and the termination conditions: the minimum number of samples at a node s and the minimum information gain at a node m.
(2) From S, draw with replacement a training set S(i) of the same size as S to serve as the samples at the root node, and start training from the root node.
(3) If the termination condition is met at the current node, mark the current node as a leaf node. For a classification problem, the leaf's predicted output is the class c(j) with the largest number of samples at the current node, and the probability p is the proportion of c(j) in the current sample set; for a regression problem, the predicted output is the mean of the sample values at the current node. Then continue training the other nodes. If the current node does not meet the termination condition, randomly select f features without replacement from the F features. Using these f features, find the best single feature k and its threshold th; samples at the current node whose k-th feature value is less than th go to the left child node, and the rest go to the right child node. Continue training the other nodes. The criterion for judging split quality is discussed later.
(4) Repeat (2) and (3) until every node has been trained or marked as a leaf node.
(5) Repeat (2), (3), and (4) until all t CARTs have been trained.
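To make steps (1) to (5) concrete, here is a minimal Python sketch of the training loop, using scikit-learn's DecisionTreeClassifier as the per-tree CART learner and a synthetic dataset; the parameter names t, d, f, s, m mirror the ones defined above, and all concrete values are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=16, random_state=0)

t, d, s, m = 25, 8, 5, 0.0              # trees, max depth, min node samples, min gain
f = int(np.sqrt(X.shape[1]))            # features considered at each node

forest = []
for i in range(t):
    # Step (2): bootstrap sample S(i), same size as S, drawn with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # Steps (3)-(4): grow one CART; max_features=f randomly restricts the
    # candidate features at every node, drawn without replacement.
    tree = DecisionTreeClassifier(max_depth=d, min_samples_split=s,
                                  min_impurity_decrease=m, max_features=f,
                                  random_state=i)
    tree.fit(X[idx], y[idx])
    forest.append(tree)                  # Step (5): repeat until t trees exist.
```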
The process of applying a random forest is as follows (a companion prediction sketch is given after these steps):
For each of the t trees, i = 1..t:
(1) Starting from the root node of the current tree, use the current node's threshold th to decide whether to descend into the left child (< th) or the right child (>= th), until a leaf node is reached and its predicted value is output.
(2) Repeat (1) until all t trees have output predicted values. For a classification problem, the final output is the class with the largest sum of predicted probabilities over all trees, i.e., the accumulated p for each c(j); for a regression problem, the output is the mean of the outputs of all trees.
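The aggregation rule for classification can be checked directly against scikit-learn, whose RandomForestClassifier also predicts by summing (equivalently, averaging) the per-tree class probabilities; the dataset and sizes below are again illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)
rf = RandomForestClassifier(n_estimators=25, max_depth=8, random_state=0).fit(X, y)

# Step (1): each tree routes a sample to a leaf and emits class probabilities.
per_tree = np.stack([tree.predict_proba(X[:5]) for tree in rf.estimators_])

# Step (2): accumulate p over all trees and take the class with the largest sum.
summed = per_tree.sum(axis=0)
print(summed.argmax(axis=1))
print(rf.predict(X[:5]))  # agrees with the manual accumulation
```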
Note: as for the split criterion, since CART is used, CART's criterion is adopted as well, which differs from ID3 and C4.5.
For classification problems (assigning a sample to one of several classes), i.e., discrete targets, CART uses the Gini value as the split criterion.
It is defined as Gini = 1 - ∑(P(i)*P(i)), where P(i) is the proportion of class i among the samples at the current node. For example, with 2 classes and 100 samples at the current node, 70 belonging to the first class and 30 to the second, Gini = 1 - 0.7×0.7 - 0.3×0.3 = 0.42. The more evenly the classes are distributed, the larger the Gini value; the more uneven the class distribution, the smaller the Gini value. When searching for the best split feature and threshold, the evaluation criterion is argmax(Gini - GiniLeft - GiniRight): find the feature f and threshold th that maximize the current node's Gini value minus the Gini values of the left and right child nodes (in standard CART, the children's Gini values are weighted by their sample fractions).
For regression problems, the criterion is even simpler: argmax(Var - VarLeft - VarRight), i.e., maximize the variance Var of the current node's training samples minus the variance VarLeft of the left child node and the variance VarRight of the right child node.
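A short sketch of both criteria, with the children's scores weighted by their sample fractions as in standard CART (the 70/30 example reproduces the Gini value computed above):

```python
import numpy as np

def gini(labels):
    """Gini value of a node: 1 - sum_i P(i)^2 (larger = more mixed)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# The example from the text: 70 samples of one class, 30 of the other.
print(gini(np.array([0] * 70 + [1] * 30)))  # 0.42

def gini_decrease(parent, left, right):
    """Classification split score: Gini - weighted GiniLeft - weighted GiniRight."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)

def var_decrease(parent, left, right):
    """Regression split score: Var - weighted VarLeft - weighted VarRight."""
    n = len(parent)
    return np.var(parent) - len(left) / n * np.var(left) - len(right) / n * np.var(right)
```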
2. Practice
Random Forest Principles and Applications:
Principle: Random Forest is an ensemble learning approach that combines multiple weak classifiers into a strong classifier.
Bagging vs. Boosting: They are alike in that both combine classifiers of the same type, and both sample a certain percentage of the samples with replacement for each weak classifier. The difference is that boosting trains in sequence: each new classifier is trained according to the prediction results of the previous ones. GBDT uses boosting, while RandomForest uses bagging.
Construction of RandomForest: construct K decision trees in parallel. Each decision tree randomly selects samples at a specified ratio p (e.g., 0.6) and features at a specified ratio q (e.g., 0.5) for training.
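A hedged construction sketch in scikit-learn with the example ratios p = 0.6 and q = 0.5; note that scikit-learn applies max_features per split rather than per tree, so it approximates rather than exactly reproduces the per-tree scheme described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,   # K trees
    max_samples=0.6,    # sample ratio p, drawn per tree (bootstrap)
    max_features=0.5,   # feature ratio q (applied per split in scikit-learn)
    bootstrap=True,
    n_jobs=-1,          # trees are independent, so they can train in parallel
    random_state=0,
).fit(X, y)
print(rf.score(X, y))
```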
RandomForest's prediction: the weak classifiers vote to produce the final classification result; see Figure 1.
Fig. 1 Random Forest Prediction
Training tuning
Tuning proceeds mainly along three dimensions: samples, features, and models/parameters.
Feature tuning:
First, on a small sample set (a few thousand samples), tests were run using the training data also as the prediction data, adding features to observe the effect. The initial number of features was 11; performance improved after increasing to 20, and more features were added after that. Once new features stopped producing an obvious improvement, the feature set was provisionally fixed at those 20 items.
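A sketch of this kind of incremental feature evaluation on synthetic data (the feature counts 11, 15, and 20 echo the progression described above; cross-validation is used here instead of scoring on the training data itself):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in dataset with 20 candidate features.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=12,
                           random_state=0)
for n_feats in (11, 15, 20):
    score = cross_val_score(RandomForestClassifier(random_state=0),
                            X[:, :n_feats], y, cv=3).mean()
    print(f"{n_feats} features: CV accuracy = {score:.3f}")
```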
Sample adjustments:
For the class-imbalance problem, two of the classes had far too many samples. We wrote rules to filter out some of the low-quality samples and, at the same time, used more reliable label data.
Increase the number of samples in classes 0 and 1 to bring the three classes as close to 1:1:1 as possible.
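Two common ways to push an imbalanced three-class problem toward 1:1:1, sketched on synthetic data (the class proportions are assumed for the example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced three-class problem standing in for the 0/1/2 labels above.
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           weights=[0.8, 0.1, 0.1], random_state=0)

# Option 1: oversample the minority classes until all three are equal in size.
rng = np.random.default_rng(0)
target = np.bincount(y).max()
idx = np.concatenate([rng.choice(np.where(y == c)[0], size=target, replace=True)
                      for c in np.unique(y)])
rf_resampled = RandomForestClassifier(random_state=0).fit(X[idx], y[idx])

# Option 2: keep the data as-is and let the forest reweight the classes.
rf_weighted = RandomForestClassifier(class_weight="balanced",
                                     random_state=0).fit(X, y)
```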
Training on a small number of samples exhibited overfitting. As shown in Fig. 4, when a small set of training samples was also used as the test data, accuracy and recall showed less than 1% error, but the error reached 30% when switching to other test data. After increasing the training sample size, performance improved again.
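The diagnostic amounts to comparing the score on the training data with the score on held-out data, as in this sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# Scoring on the training data reproduces the near-zero-error illusion;
# the held-out score is the one that actually matters.
print("train accuracy:   ", rf.score(X_tr, y_tr))
print("held-out accuracy:", rf.score(X_te, y_te))
```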
Adjustment of models and parameters:
A simple comparison of GBDT, RF, and SVM showed that RF performed best. By default, RF used 200 trees, with each tree randomly using 60% of the samples and 60% of the features. Adjusting the sample ratio and the feature ratio had little effect on the final result.
Cross-validation was used to examine the RF metrics.
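A sketch of that comparison via cross-validation, with the RF configured to the 200-tree, 60%-sample, 60%-feature setup quoted above (all other settings are scikit-learn defaults and the dataset is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    "RF":   RandomForestClassifier(n_estimators=200, max_samples=0.6,
                                   max_features=0.6, random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
    "SVM":  SVC(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```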