If AlphaZero always played the best moves, it would just generate the same training game over and over again. So where does the randomness come from? When does it decide not to play the optimal move?
1 Answer
During the self-play training process, AlphaZero does not greedily play only the move it thinks is "best" (which would normally be the move with the highest visit count leading out of the root node of the MCTS search tree). Instead, for the purpose of generating a more diverse set of experience, it samples moves proportionally to the visit counts. This means that in any given situation encountered during self-play, the move considered "optimal" still has the largest probability of being picked, but every other legal move also has a smaller, nonzero probability of being picked. In theory, it might even (very rarely) pick the move it expects to be the worst one! A minimal sketch of that sampling step is shown below.
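Here is a minimal sketch of what that sampling step could look like, assuming the root's visit counts have already been collected into a simple array (a hypothetical representation, not the paper's exact data structure):

```python
import numpy as np

def sample_from_visit_counts(visit_counts, rng=None):
    """Sample a move index with probability proportional to its visit count.

    `visit_counts` is a hypothetical 1-D array with one entry per legal move
    at the MCTS root node.
    """
    rng = rng or np.random.default_rng()
    counts = np.asarray(visit_counts, dtype=float)
    probs = counts / counts.sum()          # normalise counts into a distribution
    return int(rng.choice(len(probs), p=probs))

# The most-visited move (index 1) is most likely to be played, but even the
# rarely visited move at index 3 keeps a small (5 / 200) chance of being chosen.
print(sample_from_visit_counts([40, 150, 5, 5]))
```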
If I recall correctly, they only did what I describe above for the first 30 moves of any game, and afterwards switched to greedy play. That still gives a very diverse set of 30-move openings across the games it experiences, though. I'm not 100% sure I remember this detail correctly; maybe they only did this earlier (in AlphaGo Zero, for example) and no longer do it in AlphaZero. I'd have to check the paper to make sure. A rough sketch of this two-phase selection follows.
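In pseudocode terms this amounts to a temperature schedule: sample proportionally to visit counts (temperature 1) during the opening, then play the argmax (temperature going to 0) afterwards. A rough sketch, assuming the 30-move cutoff I remember (again, double-check the paper):

```python
import numpy as np

def select_move(visit_counts, move_number, temperature_cutoff=30, rng=None):
    """Two-phase move selection as I understand it from AlphaGo Zero:
    sample proportionally to root visit counts for the opening moves,
    then play greedily. The cutoff of 30 moves is an assumption based on
    my recollection of the paper, not a verified constant.
    """
    rng = rng or np.random.default_rng()
    counts = np.asarray(visit_counts, dtype=float)
    if move_number < temperature_cutoff:
        probs = counts / counts.sum()      # temperature = 1: proportional sampling
        return int(rng.choice(len(probs), p=probs))
    return int(np.argmax(counts))          # temperature -> 0: greedy play
```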
Additionally, whenever they start a new search process, they perturb the prior probabilities that the learned policy network assigns to the moves available in the root node. This is done non-deterministically using Dirichlet noise. I think this is not explicitly mentioned in the main AlphaZero paper, but it is in the Supplementary Materials (and also in the AlphaGo Zero paper?). In any case, this means that even if precisely the same game state is encountered in two different self-play games, the search behaviour may differ slightly due to the introduced stochasticity, and the search may therefore reach a different conclusion about which move is "optimal".
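A sketch of how that perturbation might look, assuming the mixing scheme (1 − ε)·p + ε·η with η ~ Dir(α); the values ε = 0.25 and α = 0.3 are what I recall being reported for chess, so treat them as assumptions rather than verified figures:

```python
import numpy as np

def add_dirichlet_noise(priors, epsilon=0.25, alpha=0.3, rng=None):
    """Perturb the policy network's prior probabilities at the root node.

    Each prior p_a is mixed with Dirichlet noise as
    (1 - epsilon) * p_a + epsilon * eta_a, where eta ~ Dir(alpha).
    epsilon and alpha values here are assumptions based on my recollection.
    """
    rng = rng or np.random.default_rng()
    priors = np.asarray(priors, dtype=float)
    noise = rng.dirichlet([alpha] * len(priors))
    return (1.0 - epsilon) * priors + epsilon * noise

# Two searches from the same root now start from slightly different priors:
priors = np.array([0.5, 0.3, 0.15, 0.05])
print(add_dirichlet_noise(priors))
print(add_dirichlet_noise(priors))
```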