A quick introduction to the AlphaGo algorithm
Go (Weiqi) is a game of perfect information. A perfect-information game can usually be reduced to a tree search for the optimal value over roughly b^d possible move sequences, where b is the game's breadth (the number of legal moves per position) and d is its depth (the game length). In chess, b ≈ 35 and d ≈ 80; in Go, b ≈ 250 and d ≈ 150. Exhaustive search, or search with simple heuristics, is therefore clearly infeasible for Go.
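To get a sense of scale, here is a quick back-of-the-envelope calculation (a minimal Python sketch using the b and d figures above) showing how many digits b^d has in each case:

```python
# Rough size of the game tree, b**d, for chess and Go.
# The numbers are far too large to print in full, so we report digit counts.
import math

for game, b, d in [("chess", 35, 80), ("Go", 250, 150)]:
    digits = int(d * math.log10(b)) + 1
    print(f"{game}: {b}^{d} is a number with about {digits} digits")
# chess: 35^80 is a number with about 124 digits
# Go:    250^150 is a number with about 360 digits
```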
But effective approaches do exist:
- Reduce the search depth through position evaluation.
- Combine policy and value estimates with Monte Carlo tree search (MCTS).
AlphaGo's pipeline consists of the following steps:
First, a supervised learning (SL) policy network p_σ is trained directly on human expert moves, using a 13-layer CNN. The input is a 48 × 19 × 19 image stack (for example, stone colour accounts for 3 of the 19 × 19 feature planes), and the output is a softmax probability distribution over all legal moves. Its move-prediction accuracy is 55.7%.
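As an illustration, here is a minimal sketch of such a policy network in PyTorch. The 13 layers, the 48 input planes, and the softmax output follow the description above; the filter count and kernel sizes are assumptions of this sketch, not a faithful reproduction of the paper's exact architecture.

```python
# A minimal sketch of a 13-layer policy CNN: 48 feature planes on a
# 19x19 board in, a softmax over the 361 board points out.
# Filter count (192) and kernel sizes are assumptions of this sketch.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, planes=48, filters=192):
        super().__init__()
        layers = [nn.Conv2d(planes, filters, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):  # layers 2-12: 3x3 convolutions
            layers += [nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, kernel_size=1)]  # layer 13: one logit per point
        self.body = nn.Sequential(*layers)

    def forward(self, x):                     # x: (batch, 48, 19, 19)
        logits = self.body(x).flatten(1)      # (batch, 361)
        return torch.softmax(logits, dim=1)   # probability of each move

p_sigma = PolicyNet()
probs = p_sigma(torch.randn(2, 48, 19, 19))
print(probs.shape)                            # torch.Size([2, 361])
```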
Second, a fast rollout policy p_π is trained that can sample actions rapidly during play. It is a linear softmax over small, local pattern features. Its accuracy is only 24.2%, but it takes just 2 μs per move, versus 3 ms for p_σ.
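Because the rollout policy is just a linear softmax, it can be sketched in a few lines. The pattern features themselves are the complicated part and are stubbed with random data here; only the linear-softmax form is the point.

```python
# Sketch of a linear softmax rollout policy: each legal move is scored
# by a dot product between its (binary) local-pattern features and a
# learned weight vector, then normalized with a softmax.
import numpy as np

def rollout_policy(features, weights):
    """features: (num_legal_moves, num_features), weights: (num_features,)."""
    scores = features @ weights           # one linear score per legal move
    scores -= scores.max()                # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # move probabilities

# Stand-in data: 50 legal moves, 1000 binary pattern features.
rng = np.random.default_rng(0)
features = rng.integers(0, 2, size=(50, 1000)).astype(float)
weights = rng.normal(size=1000)
print(rollout_policy(features, weights).sum())  # 1.0
```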
Third, a reinforcement learning (RL) policy network p_ρ is trained, improving on the SL policy network by optimizing for game outcomes. That is, the policy network is optimized to win games, not to maximize move-prediction accuracy. Structurally, p_ρ is identical to p_σ, and its weights are initialized to the same values, ρ = σ. The two players in each training game are the current policy network p_ρ and a randomly selected earlier iteration of the policy network (randomizing the opponent prevents overfitting).
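The core of this training signal is a policy-gradient (REINFORCE) update, sketched below. Here `policy_net` is assumed to be a network like the `PolicyNet` sketch above; the real pipeline (opponent pool, mini-batching, baselines) is more involved.

```python
# Sketch of a policy-gradient (REINFORCE) update from one finished game:
# moves from a won game (+1) are made more likely, moves from a lost
# game (-1) less likely.
import torch

def reinforce_update(policy_net, optimizer, states, actions, outcome):
    """states: (T, 48, 19, 19) positions of one game from the learner's
    point of view; actions: (T,) move indices played; outcome: +1 or -1."""
    probs = policy_net(states)                                # (T, 361)
    chosen = probs.gather(1, actions.view(-1, 1)).squeeze(1)  # prob of each played move
    loss = -(outcome * torch.log(chosen)).sum()               # ascend expected outcome
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```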
Fourth, a value network v_θ is trained to predict the winner of games played by the RL policy network against itself. Its architecture is similar to the policy network's, but it has one extra feature plane (the current player's colour), and its output is a single prediction (a regression trained with mean-squared-error loss). Training on positions from complete games straight through easily leads to overfitting, because successive positions within a game are highly correlated, differing by only a single stone. Therefore the training data was regenerated by self-play of the RL policy network, collecting 30 million distinct positions, each drawn from a separate game.
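A matching PyTorch sketch of the value network: one extra input plane for the current player's colour, a single tanh-bounded scalar output, and a mean-squared-error loss against the game outcome z. The layer sizes here are assumptions; the actual network is deeper.

```python
# Sketch of a value network v_theta: convolutional body, one extra input
# plane (current player's colour, 48 + 1 = 49 planes), a single scalar
# output in [-1, 1], trained by regression against the game outcome z.
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    def __init__(self, planes=49, filters=192):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(planes, filters, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(filters, 1, kernel_size=1),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(361, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),
        )

    def forward(self, x):               # x: (batch, 49, 19, 19)
        return self.head(self.conv(x))  # (batch, 1), predicted outcome

v_theta = ValueNet()
z_pred = v_theta(torch.randn(2, 49, 19, 19))
loss = nn.MSELoss()(z_pred, torch.tensor([[1.0], [-1.0]]))  # z = win / loss
```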
Finally, the policy network, the value network, and the fast rollout policy are combined with Monte Carlo tree search. A canonical MCTS iteration consists of four steps: selection, expansion, evaluation, and backup. To keep things easy to follow, we describe only roughly how the selection step scores states (readers interested in the mathematics can find the formulas in the original paper).
State score = value network output + fast rollout result + SL policy network output
The highest-scoring move gets selected. The value network output and the fast rollout result together form the evaluation function, which is computed at leaf nodes. The SL policy network's output probability for each action at the current position is added to the selection score as a prior bonus; this bonus decays as the visit count grows, so as to encourage exploration. Note that the RL policy network plays only an auxiliary role, generating the value network's training data; it is not used directly in the Monte Carlo tree search.
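In code, the scoring just described might look like the sketch below. The function names, the exploration constant c_puct, and the mixing weight λ are assumptions of this sketch; the exact formulas are in the original paper.

```python
# Sketch of the MCTS selection score: mean action value Q plus an
# exploration bonus u that is proportional to the SL policy prior P
# and decays as the edge's visit count N grows.
import math

def selection_score(Q, P, N_parent, N, c_puct=5.0):
    """Q: mean action value; P: prior probability from p_sigma;
    N_parent: visits to the parent node; N: visits to this edge."""
    u = c_puct * P * math.sqrt(N_parent) / (1 + N)
    return Q + u

def leaf_evaluation(value_net_output, rollout_outcome, lam=0.5):
    """Blend the value network's prediction with the fast-rollout
    result at a leaf node (lam is the mixing weight)."""
    return (1 - lam) * value_net_output + lam * rollout_outcome
```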
That is all: the above is the AlphaGo algorithm that defeated human players.