AI machine learning decision tree
First, information entropy
1, Bits
When all values of a random variable are equally probable, each value can be encoded with the same number of bits; with four equally likely values, two bits are enough to represent each value.
E.g. P(X=A)=P(X=B)=P(X=C)=P(X=D)=1/4
E(X) = (1/4 * 2) * 4 = -log2(1/4) * (1/4) * 4 = 2
When the values of the random variable have different probabilities, values that occur with higher probability are represented with fewer bits, and values that occur with lower probability are represented with more bits.
E.g. P(X=A)=1/2, P(X=B)=1/4, P(X=C)=P(X=D)=1/8
E(X) = 1/2 * 1 + 1/4 * 2 + 1/8 * 3 + 1/8 * 3
= -log2(1/2) * (1/2) - log2(1/4) * (1/4) - log2(1/8) * (1/8) * 2 = 7/4
Let the random variable X have m possible values V1, V2, ..., Vm, with probabilities of occurrence p1, p2, ..., pm. For this information sequence, the expected number of bits needed to describe one value is the expectation E(X) = -∑ pi * log2(pi), summed over i = 1, ..., m.
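As a quick check of the two examples above, here is a small Python sketch (not from the original article) that computes the expected number of bits for a discrete distribution:

```python
import math

def expected_bits(probs):
    """Expected number of bits: -sum(p * log2(p)) over all outcomes with p > 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Equal probabilities: four values, each 1/4 -> 2 bits
print(expected_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0

# Unequal probabilities: 1/2, 1/4, 1/8, 1/8 -> 7/4 bits
print(expected_bits([0.5, 0.25, 0.125, 0.125]))  # 1.75
```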
2, Information entropy
Information content: the amount of information carried by an event. The higher the probability of an event, the less information it is considered to carry. An event with probability 1 (a certain event) carries 0 information, because its occurrence removes no uncertainty.
Information entropy: describes the uncertainty of the information in a system and can be used as a measure of the system's complexity. The more complex and disordered a system is, and the more distinct states it can take, the larger its information entropy. A non-uniform distribution has lower entropy than a uniform one, so when encoding, events with higher probability can be given shorter codes and events with lower probability longer codes, making the total code length shorter.
For a source, what matters is not the uncertainty of a single symbol but the average uncertainty over all symbols the source may emit. If the source symbol can take n values with different probabilities, and the symbols appear independently of one another, the average uncertainty of the source is the expectation of the single-symbol uncertainty -log pi. This satisfies the requirement that low-probability events carry a large amount of information, and using base 2 follows the universal convention of information-theoretic coding problems.
Therefore, the information entropy of the random variable X can be expressed as:
H(X) = -∑ pi * log2(pi)
where pi is the probability of the i-th category. The smaller H(X) is, the higher the purity of X.
Properties of information entropy:
(1) Monotonicity
The higher the probability of an event is, the lower the information entropy it carries.
(2) Non-negativity
Information entropy is never negative.
(3) Additivity
When multiple independent random events occur together, the measure of the total uncertainty can be expressed as the sum of the measures of each event's uncertainty:
If p(A,B) = p(A)p(B), then H(A,B) = H(A) + H(B).
Shannon proved mathematically that the uncertainty measure function of random variables satisfying the above three conditions has a unique form, which is the information entropy formula.
3, conditional entropy
Given X (taking the expectation over all values that X can take), the conditional entropy H(Y|X) is the expected entropy of the conditional probability distribution of another random variable Y: H(Y|X) = ∑x p(x) * H(Y|X=x).
Conditional entropy is also the joint entropy H(X,Y) minus the entropy of X alone, that is, the "new" entropy brought by Y once X is known: H(Y|X) = H(X,Y) - H(X).
The relationship between mutual information, conditional entropy and joint entropy:
H(X,Y) is the joint entropy and I(X;Y) is the mutual information (information gain): I(X;Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y|X) = H(X) - H(X|Y).
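A minimal sketch of these relationships in Python, assuming a small joint distribution p(x, y) given as a dictionary (the numbers are made up for illustration):

```python
import math

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y)
p_xy = {
    ('x1', 'y1'): 0.25, ('x1', 'y2'): 0.25,
    ('x2', 'y1'): 0.40, ('x2', 'y2'): 0.10,
}

# Marginal distributions p(x) and p(y)
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

H_X, H_Y = H(p_x.values()), H(p_y.values())
H_XY = H(p_xy.values())            # joint entropy H(X, Y)
H_Y_given_X = H_XY - H_X           # conditional entropy H(Y|X) = H(X,Y) - H(X)
I_XY = H_Y - H_Y_given_X           # mutual information I(X;Y) = H(Y) - H(Y|X)

print(H_XY, H_Y_given_X, I_XY)     # I(X;Y) also equals H(X) + H(Y) - H(X,Y)
```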
4, cross-entropy
Cross entropy is used to measure the difference between two probability distributions. In information theory, it involves two probability distributions p(x) and q(x) over the same set of events, where p(x) is the true distribution and q(x) is the non-true (assumed) distribution: it is the average number of bits needed to identify an event when the code is built from the non-true distribution q.
That is, if the true distribution p(x) is used, the expected code length required to identify a sample is: H(p) = -∑ p(x) * log2(p(x)).
If the code is instead built from the non-true distribution q(x) while samples still follow the true distribution p(x), the average code length is: H(p, q) = -∑ p(x) * log2(q(x)).
Cross entropy thus measures the effort required to eliminate the system's uncertainty when using the strategy specified by the non-true distribution q under a given true distribution p.
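A short Python sketch of the two quantities just described, with made-up distributions:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum(p(x) * log2(q(x))): average bits needed when events follow
    the true distribution p but the code is built from the distribution q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # true distribution
q = [0.25, 0.25, 0.25, 0.25]    # non-true (assumed) distribution

print(cross_entropy(p, p))      # 1.75 bits: this is just the entropy H(p)
print(cross_entropy(p, q))      # 2.0 bits: always >= H(p); the gap is the KL divergence
```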
Second, the decision tree model
A decision tree is a commonly used supervised classification algorithm. It builds a tree for analysis based on the probabilities with which various conditions occur, and it is an intuitive, graphical application of probabilistic analysis. A decision tree only needs to be constructed once and can then be used repeatedly; its computational complexity is low, its output is easy to understand, and it is robust, but it can suffer from overfitting.
In general, a decision tree contains one root node, several internal nodes, and several leaf nodes. Leaf nodes correspond to decision results, every other node corresponds to an attribute test, each branch represents one outcome of a test, and the root node contains the full set of samples.
The decision process starts from the root node and recursively tests the corresponding feature attribute at each split until a leaf node is reached; the path followed corresponds to a sequence of decision tests. Depending on the type of feature attribute, the tree can be built in different ways:
The attribute takes discrete values and a binary decision tree is not required; in this case each attribute value becomes one branch;
The attribute takes discrete values and a binary tree is required; in this case a subset of the attribute values is used for the test, and the data is divided into two branches according to whether a sample belongs to that subset or not;
The attribute takes continuous values; a value is chosen as the split point split_point, and two branches are generated according to whether the value is greater than split_point or less than or equal to split_point.
The decision tree is divided into two categories: a classification tree and a regression tree. The former is used to classify tag values, and the latter is used to predict continuous values. Commonly used algorithms include ID3, C4.5, and CART.
A classification tree uses information gain, information gain ratio, or Gini impurity to evaluate the quality of the tree; the predicted class is generally the class with the highest frequency among the samples in the leaf node.
In a regression tree, the predicted value of a leaf node is generally the mean of all target values in that leaf, and the mean squared error (MSE) is generally used as the evaluation metric for the tree.
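To make the regression-tree rule concrete, here is a tiny illustration with made-up values that ended up in one leaf: the leaf predicts their mean, and MSE measures how well that prediction fits them.

```python
# Hypothetical target values that fell into one leaf of a regression tree
leaf_values = [3.0, 3.5, 4.0, 5.5]

prediction = sum(leaf_values) / len(leaf_values)   # the leaf predicts the mean
mse = sum((v - prediction) ** 2 for v in leaf_values) / len(leaf_values)

print(prediction, mse)   # 4.0 0.875
```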
Decision tree split attribute selection
The decision tree algorithm follows a 'greedy' strategy: it only considers the best split for the current data and features and cannot backtrack.
For the current data set, a split is tried on every feature attribute, the 'purity' of the resulting subsets is compared, and the feature attribute that yields the higher purity is chosen to split the data set. This is repeated on the resulting subsets until a stopping condition is reached.
Decision Tree Algorithm Stop Condition
The construction of a decision tree is a recursive process, so stopping conditions must be given, otherwise the recursion will never terminate. The following stopping conditions are commonly used (the sketch after the list shows how they map to library hyperparameters):
1) The depth of the tree exceeds the set maximum depth max_depth
2) The number of samples in an internal node is smaller than the set minimum number of samples required for a split
3) The number of samples in a leaf node is smaller than the set minimum
4) The number of leaf nodes exceeds the set maximum
5) The impurity decrease of a node split is smaller than the set threshold (or all samples in the current node belong to the same class)
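These stopping conditions correspond to common hyperparameters in decision tree libraries. A minimal sketch using scikit-learn's DecisionTreeClassifier (the data set and the parameter values are only illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    criterion="entropy",        # purity measure: "gini" or "entropy"
    max_depth=4,                # 1) maximum depth of the tree
    min_samples_split=10,       # 2) minimum samples required to split an internal node
    min_samples_leaf=5,         # 3) minimum samples required in a leaf node
    max_leaf_nodes=8,           # 4) maximum number of leaf nodes
    min_impurity_decrease=0.01, # 5) minimum impurity decrease required for a split
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```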
Third, quantifying decision tree purity and model evaluation
The purity of the decision tree can be measured using three criteria, namely Gini impurity, entropy, and classification error.
(1) Gini impurity:
It is the expected error rate when an item drawn at random from the set is randomly labelled according to the distribution of labels in the set; it measures how mixed the set is. Simply put, it is the probability that an element randomly drawn from the data set is misclassified into another group: Gini(D) = 1 - ∑ pi².
(2) Information entropy: H(D) = -∑ pi * log2(pi)
(3) Classification error: Error(D) = 1 - max(pi)
The larger the value of any of these three measures, the more impure the data; conversely, the smaller the value, the purer it is. The three behave similarly, and the entropy formula is the most commonly used.
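A small Python sketch of the three impurity measures, computed from the class labels of a node (the labels are made up for illustration):

```python
import math
from collections import Counter

def class_probs(labels):
    """Proportion of each class among the labels of a node."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p ** 2 for p in class_probs(labels))

def entropy(labels):
    """Information entropy: -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in class_probs(labels) if p > 0)

def classification_error(labels):
    """Classification error: 1 - max(p_i)."""
    return 1.0 - max(class_probs(labels))

node = ["yes", "yes", "yes", "no", "no"]
print(gini(node), entropy(node), classification_error(node))
```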
Comparison of three purity metrics
Information gain
After computing the quantified purity for each feature attribute, the information gain is used to select the splitting attribute for the current data set: the larger the information gain, the more impurity the split on that attribute removes, and the closer to the top of the decision tree the attribute should be. The formula is:
Gain=Δ=H(D)-H(D|A)
Gain is the information gain of feature A on the training data set D, defined as the difference between the empirical entropy H(D) of the set D and the empirical conditional entropy H(D|A) of D given the feature A.
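A sketch of Gain = H(D) - H(D|A) for a discrete feature, with a made-up toy data set ("outlook" and the labels are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Gain(D, A) = H(D) - sum_v (|D_v| / |D|) * H(D_v)."""
    n = len(labels)
    h_d_given_a = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        h_d_given_a += (len(subset) / n) * entropy(subset)
    return entropy(labels) - h_d_given_a

outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
label   = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(information_gain(outlook, label))   # H(D) = 1.0, H(D|A) = 1/3, Gain ≈ 0.667
```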
Effect evaluation
As with general classification algorithms, a confusion matrix can be used to compute accuracy, recall, and precision; for regression trees, the R2 value can be used.
You can also use the sum of the impurity values of the leaf nodes to evaluate the tree: the smaller the sum, the better the result.
Fourth, decision tree generation algorithm
1, ID3 algorithm
ID3 is the classic decision tree construction algorithm. It uses information entropy and builds the tree according to the principle of maximum information gain: each iteration selects the feature attribute with the largest information gain as the splitting attribute, and once a feature has been used for a split it is no longer used in the rest of the algorithm. An ID3 decision tree can have more than two branches per node, but it cannot handle continuous feature values.
Assume that the proportion of the k-th class in the current sample set D is pk (k = 1, 2, ..., |Y|). The information entropy of D is then: Ent(D) = -∑ pk * log2(pk), summed over k = 1, ..., |Y|.
Suppose the discrete attribute a has V possible values {a1, a2, ..., aV}. If a is used to split the sample set D, V branch nodes are generated, where the v-th branch node contains all samples of D whose value on attribute a is av, denoted Dv. The information gain obtained by splitting D on a is then: Gain(D, a) = Ent(D) - ∑ (|Dv|/|D|) * Ent(Dv), summed over v = 1, ..., V.
Advantages and disadvantages of ID3 algorithm
Advantages: The decision tree is fast to build and simple to implement.
Disadvantages: the gain criterion favours features with many distinct values, and the attribute with the most values is not necessarily the best; ID3 is not an incremental algorithm; ID3 builds a univariate decision tree and does not consider relationships between feature attributes; it has poor noise immunity; it is only suitable for small data sets, since the data must fit in memory.
2, C4.5 algorithm
The C4.5 algorithm improves on ID3: it uses the information gain ratio as the criterion for selecting branches, adds a pruning step during tree construction, and can handle feature attributes with continuous values.
C4.5 selects attributes by the information gain ratio Gain_ratio(D, a) = Gain(D, a) / IV(a), where IV(a) = -∑ (|Dv|/|D|) * log2(|Dv|/|D|) is the intrinsic value of a; the more possible values a has, the larger IV(a) tends to be.
According to Zhou Zhihua's "Machine Learning", the gain ratio criterion prefers attributes with a small number of values. Therefore, the C4.5 algorithm does not directly choose the candidate splitting attribute with the largest gain ratio; instead it uses a heuristic: first select, from the candidate splitting attributes, those whose information gain is above the average, and then choose among them the one with the highest gain ratio.
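A sketch of that heuristic, reusing the information_gain function from the earlier sketch; the features dictionary (attribute name -> column of values) is hypothetical:

```python
import math
from collections import Counter

def intrinsic_value(feature_values):
    """IV(a) = -sum_v (|D_v|/|D|) * log2(|D_v|/|D|)."""
    n = len(feature_values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(feature_values).values())

def c45_select(features, labels, information_gain):
    """features: {attribute_name: list of values}. Among attributes whose information
    gain is at least the average, pick the one with the highest gain ratio."""
    gains = {a: information_gain(vals, labels) for a, vals in features.items()}
    avg_gain = sum(gains.values()) / len(gains)
    candidates = [a for a, g in gains.items() if g >= avg_gain]
    # guard against IV = 0 when an attribute takes only a single value
    return max(candidates, key=lambda a: gains[a] / (intrinsic_value(features[a]) or 1e-12))
```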
The advantages and disadvantages of the C4.5 algorithm
Advantages: The rules produced are easy to understand, the accuracy of the model is high, and the implementation is simple.
Disadvantages: the data set needs to be scanned and sorted multiple times during construction, which makes it inefficient; it is only suitable for small data sets, since the data must fit in memory.
3, CART algorithm (Classification And Regression Tree)
The classification and regression tree (CART) is a binary tree: every split partitions the current node into exactly two branches (typically using Gini impurity as the split criterion for classification and the squared error for regression).
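A minimal sketch of one CART-style binary split on a continuous feature: candidate split points are the midpoints between consecutive sorted values, and the split with the lowest weighted Gini impurity is chosen (the data is made up for illustration; scikit-learn's tree module is based on an optimised version of CART):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try midpoints between consecutive sorted feature values as split_point and
    return the one that minimises the weighted Gini impurity of the two branches."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated
        split_point = (pairs[i - 1][0] + pairs[i][0]) / 2
        left  = [y for x, y in pairs if x <= split_point]
        right = [y for x, y in pairs if x > split_point]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (split_point, score)
    return best

print(best_binary_split([2.1, 1.4, 3.8, 3.0, 0.9], ["no", "no", "yes", "yes", "no"]))
# -> (2.55, 0.0): splitting at 2.55 separates the two classes perfectly
```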