
I've been doing some research on the principles behind AlphaZero. This cheat sheet (1) and this implementation (2) (for Connect 4) were especially useful.

Yet, I still have two important questions:

  1. How is the policy network updated? In (2), board positions are saved in a dataset of tuples (state, policy, value). The value is derived from the result of the self-played game. However, I'm not sure which policy is saved: the number of times that each move has been played, the prior probabilities for each move (I guess not), or something else?

  2. The cheat sheet says that (for competitive play) the move with the greatest N (i.e. the most visited one) is chosen. Wouldn't it be more logical to choose the move with the highest probability calculated by the policy head?


1 Answer


However, I’m not sure which policy is saved

The policy from the Monte Carlo tree search is stored, i.e. the visit counts of the moves at the root, normalized into a probability distribution. There is no need to store the network's own prior probabilities, because they can be recovered later by passing the stored state through the network. The stored MCTS policy is then used as the target in a cross-entropy loss that updates the network's policy head, summed with a mean squared error loss between the value head's prediction and the actual value/reward of the game.
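To make that concrete, here is a minimal sketch of such a combined update in PyTorch. The network interface (returning policy logits and a scalar value) and the argument names are assumptions for illustration, not the exact code from (2):

```python
import torch.nn.functional as F

def alphazero_loss(network, states, mcts_policies, outcomes):
    """Combined loss for one batch of stored (state, policy, value) tuples.

    states:        batch of encoded board positions (assumed encoding)
    mcts_policies: normalized MCTS visit counts, shape (B, num_actions)
    outcomes:      final game results from the current player's view, shape (B,)
    """
    # Assumed interface: the network returns policy logits and a value estimate.
    policy_logits, value_pred = network(states)

    # Cross entropy between the stored MCTS policy (target) and the policy head.
    policy_loss = -(mcts_policies * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()

    # Mean squared error between the value head and the actual game outcome.
    value_loss = F.mse_loss(value_pred.squeeze(-1), outcomes)

    return policy_loss + value_loss
```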

Wouldn’t it be more logical to choose the move with the highest probability calculated by the policy head?

It depends on the number of searches you have performed: after thousands of simulations, the MCTS visit counts give better results than the raw policy head, because the search refines the network's prior and approximates a minimax tree.
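As an illustration of the difference between the two selection rules, here is a small NumPy sketch; the function and argument names are hypothetical, and the temperature scheme follows the standard AlphaZero description (greedy selection for competitive play, sampling from the visit counts during self-play):

```python
import numpy as np

def select_move(visit_counts, temperature=0.0):
    """Pick a move from MCTS root visit counts.

    temperature == 0 (competitive play): pick the most visited move.
    temperature > 0 (self-play / exploration): sample proportionally to N^(1/T).
    """
    if temperature == 0:
        return int(np.argmax(visit_counts))
    scaled = np.power(visit_counts, 1.0 / temperature)
    probs = scaled / scaled.sum()
    return int(np.random.choice(len(visit_counts), p=probs))

def select_move_from_policy_head(prior_probs):
    """Alternative from the question: trust the raw policy head, no search."""
    return int(np.argmax(prior_probs))
```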