
I have a problem applying AlphaZero self-play to a game (Connect 6) with a huge branching factor (30,000 on average).

I have implemented the MCTS as described, but I found that during the MCTS simulations for the first move, because P(s, a) is so much smaller than Q(s, a), the search tree is extremely narrow (in fact it had only one branch per level; after I added Dirichlet noise it had two child branches per level instead of one).

With some debugging I found that after the first simulation, the last visited child has a Q value of about 0.5 on average, while all the other children have Q = 0 (because they haven't been explored) and U values that are far smaller, about 1/60000 on average. So comparing Q + U on the second and subsequent simulations selects the same node over and over, building this narrow tree.
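
To make this concrete, here is roughly what my selection step looks like (a simplified sketch rather than my exact code; the `Node` fields and the `c_puct` value are just working names/settings):

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                                   # P(s, a) from the policy network
    visit_count: int = 0                           # N(s, a)
    total_value: float = 0.0                       # W(s, a); Q(s, a) = W / N
    children: dict = field(default_factory=dict)   # move -> Node

def select_child(node: Node, c_puct: float = 1.5):
    """Standard PUCT selection: argmax over Q(s, a) + U(s, a), where
    U(s, a) = c_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a)).
    With ~30,000 children, P(s, a) is ~1/30,000, so U is tiny compared to the
    ~0.5 Q of the one visited child, and that child keeps getting selected."""
    sqrt_total = math.sqrt(sum(c.visit_count for c in node.children.values()))

    def score(child: Node) -> float:
        q = child.total_value / child.visit_count if child.visit_count else 0.0
        u = c_puct * child.prior * sqrt_total / (1 + child.visit_count)
        return q + u

    return max(node.children, key=lambda move: score(node.children[move]))
```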

To make matters worse, because the simulations for the first move built a very narrow tree and we don't clear the statistics before the simulations for the next move, the search for the first move ends up dictating the whole self-play for the next X moves or so (where X is the number of simulations per move). This is because the N values on this path accumulate across the simulations from the prior moves: I run 800 simulations per move, but the N value on the only explored child, inherited from previous simulations, is already greater than 800 when I start the simulations for the 3rd move of the game.
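
For reference, this is roughly how I carry the tree over between moves (again a sketch; `run_mcts_simulation` is just a stand-in for my select/expand/evaluate/backup loop), which is why the inherited N can already exceed the 800-simulation budget by the 3rd move:

```python
def play_one_move(root, num_simulations=800):
    """Run the per-move budget, pick the most-visited move, and reuse the
    chosen child (with all its accumulated N / W statistics) as the next root."""
    for _ in range(num_simulations):
        run_mcts_simulation(root)   # select -> expand -> evaluate -> backup
    move = max(root.children, key=lambda m: root.children[m].visit_count)
    new_root = root.children[move]  # statistics are NOT cleared here
    return move, new_root           # new_root.visit_count can already be > 800
```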

This is basically related to question 2 raised in this thread:

AlphaGo Zero: does $Q(s_t, a)$ dominate $U(s_t, a)$ in difficult game states?

I don't think the answer there addresses the problem I am having. Sure, we are not comparing Q and U directly, but when Q dominates U we end up building a narrow, deep tree, and the accumulated N values are concentrated on this single path, preventing the self-play from exploring meaningful moves.

At the end of the game, the N values at the root of each move played end up training the policy network, reinforcing these moves even more in the next episode of training.

What am I doing wrong here? Or is the AlphaZero algorithm not well suited for games with a branching factor of this order?

Note: as an experiment I tried a variable $C_{puct}$ value, proportional to the number of legal moves in each game state, but this doesn't seem to help either.


1 Answer

There are a few things you can do to address this:

(1) Setting Q correctly on unexplored nodes

> With some debugging I found that after the first simulation, the last visited child has a Q value of about 0.5 on average, while all the other children have Q = 0

This already points to a bug, as the rest of the children should start with the "average Q-value". Remember that Q is just our estimate of the expected value of an action. Your comment says that you set it to 0.5, but that still isn't quite right: you want to initialize it to the average value of a random move in that position.

For chess and Go this is quite close to 0 from most positions, since most valid moves are terrible. For your game this value may be higher, if most moves are "reasonable", but it should still be the average value of a random move in that particular position, not simply 0.5.

When your network is very weak, its chosen move will have a value of approximately the average value of a random move, so setting the initial Q correctly makes the difference in Q between two moves very small, letting the U term dominate early exploration. To be precise, the difference in Q then tells you how much better your current path is than just picking some other random move, which is exactly what you want it to mean.

See Lc0's FPU ("First Play Urgency") configuration, which sets an unvisited node's initial value to 0.44 less than its parent's evaluation; a sketch combining both ideas follows after the bullet below.

  • Theoretically, if you have high confidence in your NN, you can lower FPU for increased efficiency.
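
As a rough sketch of how this changes the selection step (assuming the same kind of node structure as in the question; the `c_puct` value is a placeholder, the `fpu_reduction` default just mirrors the Lc0 number above, and the sign convention assumes values are stored from the perspective of the player to move at `node`):

```python
import math

def select_child(node, c_puct=2.5, fpu_reduction=0.44):
    """PUCT selection where unvisited children fall back to the parent's own
    mean value minus an FPU reduction, instead of Q = 0."""
    # The parent's running mean value is a cheap proxy for "the average value
    # of a random move in this position".
    parent_q = node.total_value / node.visit_count if node.visit_count else 0.0
    first_play_q = parent_q - fpu_reduction   # pessimistic default for unvisited moves

    sqrt_total = math.sqrt(sum(c.visit_count for c in node.children.values()))
    best_move, best_score = None, -float("inf")
    for move, child in node.children.items():
        q = (child.total_value / child.visit_count) if child.visit_count else first_play_q
        u = c_puct * child.prior * sqrt_total / (1 + child.visit_count)
        if q + u > best_score:
            best_move, best_score = move, q + u
    return best_move
```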

(2) Increase $C_{puct}$ with respect to the number of simulations down that edge.

When an extreme number of simulations has gone down a particular path, $C_{puct}$ should begin to increase, to at least "double-check" the moves we've been ignoring thus far. AlphaZero already does this:


[Image: AlphaZero's visit-dependent $C_{puct}$ formula]


With their numbers, $C_{puct}$ starts at 2.5 but rises to 2.8 after about 20k simulations through a node, and to 2.97 after about 40k ($2.5 + \log(2)$ and $2.5 + \log(3)$, respectively). You'll have to tune this manually, often in relation to your simulations per move.
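
A sketch of that schedule, following the shape of the formula in the published AlphaZero pseudocode (which uses $c_{\text{base}} = 19652$ and $c_{\text{init}} = 1.25$; treat both as knobs to retune, e.g. towards the 2.5 starting point above):

```python
import math

def exploration_rate(parent_visits, c_base=19652.0, c_init=1.25):
    """Visit-dependent C_puct in the style of the AlphaZero pseudocode:
    grows logarithmically as more simulations pass through the parent node."""
    return c_init + math.log((parent_visits + c_base + 1) / c_base)

# e.g. exploration_rate(0) ~= 1.25, exploration_rate(19652) ~= 1.94,
# exploration_rate(2 * 19652) ~= 2.35 -- slowly forcing wider exploration.
```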

(3) Increase Dirichlet Noise

I'm not sure what $\alpha$ you used, but if you want to match AlphaZero's configuration of roughly $10/\text{AverageValidMoves}$, you should be aiming for something like $\alpha = 10/60000 \approx 1.7 \times 10^{-4}$.
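
A sketch of the root-noise step (the 0.25 mixing weight $\epsilon$ is AlphaZero's published value; the per-position $\alpha$ follows the 10/valid-moves rule of thumb above):

```python
import numpy as np

def add_dirichlet_noise(priors, epsilon=0.25):
    """Mix Dirichlet noise into the root priors, as AlphaZero does:
    P'(s, a) = (1 - eps) * P(s, a) + eps * Dir(alpha), with alpha scaled
    to the number of legal moves so a few random moves get a real boost."""
    priors = np.asarray(priors, dtype=np.float64)
    alpha = 10.0 / len(priors)                      # ~1.7e-4 for 60,000 moves
    noise = np.random.dirichlet([alpha] * len(priors))
    return (1.0 - epsilon) * priors + epsilon * noise
```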