2

I'm working on reimplementing the MuZero paper. In the description of the MCTS (page 12), they indicate that a new node with associated state $s$ is to be initialized with $Q(s,a) = 0$, $N(s,a) = 0$ and $P(s,a) = p_a$. From this, I understand that the root node with state $s_0$ will have edges with zero visits each, zero value and policy evaluated on $s_0$ by the prediction network.

So far so good. Then they explain how actions are selected, according to the equation (also on page 12):

enter image description here

But for the very first action (from the root node) this will give a vector of zeros as argument to the argmax: $Q(s_0,a) = 0$ and $\sum_bN(s_0,b)=0$, so even though $P(s_0,a)$ is not zero, it will be multiplied by a zero weight.

Surely there is a mistake somewhere? Or is it that the very first action is uniformly random?

Ziofil
  • 128
  • 7

1 Answers1

1

Yeah, it seems that you're right and based on the description of the paper it would indeed behave uniformly random at the very first iteration (or maybe just always deterministically pick whichever action happens to be the first one in the list). I can't find anything that would suggest otherwise in the paper, and also the pseudocode they put on arXiv suggests the same behaviour.

I can't really think of any reason why this would be better than playing according to $P(s, a)$ in the very first iteration though. Maybe it creates a tiny little bit more variety in the tree search results you get? I suppose all the stochasticity that is normally inherent to a vanilla MCTS is no longer present in MuZero (or AlphaZero), because they always run for exactly the same number of iterations, and don't have any sort of random rollouts anymore; this would at least introduce a tiny bit of variation.

On the other hand, I also can't imagine this choice really making a meaningful difference in practice, except maybe if you go down to extremely low iteration counts (like if you run fewer iterations than there are legal actions in the root node).

Dennis Soemers
  • 10,519
  • 2
  • 29
  • 70