
I am currently training a self-play Monte Carlo Tree Search (MCTS) algorithm with a neural network prior, and it seems to be working pretty well. However, one problem arises when I compare the new iteration of the player against the previous version to check whether the new one is an improvement.

Ideally I want the two players to play 20 games of Tic Tac Toe, with each being the first player in 10 of them. But what ends up happening is that each set of 10 games plays out identically: the MCTS in each player is reset at the beginning of each game, and since they are playing to win, they both take the move with the highest probability rather than randomly drawing actions based on the probabilities, so each player makes exactly the same decisions as it did in the previous game.
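Roughly, my evaluation loop looks like this (a simplified sketch; `reset_mcts`, `mcts_action_probs`, and the `game` interface are stand-ins for my own classes, not a specific library):

```python
import numpy as np

def play_eval_game(first_player, second_player, game):
    """Play one evaluation game; both players greedily take the argmax move."""
    players = [first_player, second_player]
    for p in players:
        p.reset_mcts()                       # fresh search tree at the start of every game
    state = game.initial_state()
    to_move = 0
    while not game.is_terminal(state):
        probs = players[to_move].mcts_action_probs(state)  # visit-count distribution
        action = int(np.argmax(probs))       # deterministic choice -> identical games
        state = game.apply(state, action)
        to_move = 1 - to_move
    return game.winner(state)
```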

So I understand why this isn't working, but I'm not sure what people commonly do to fix this problem. I could choose not to reset the MCTS between games, but that feels like a weird fix, since the players then keep learning as the games are played and game 10 would be quite different from game 1. Maybe that is just how people normally do this?

Tue

3 Answers


If I understand your question correctly, your goal is to figure out whether or not the quality of your neural network is improving as training progresses.

The core issue seems to be that you do this by evaluating the playing strength of MCTS+DNN agents, but the game of Tic-Tac-Toe is so simple that an MCTS agent, even with a completely random DNN (or no DNN at all), is likely already capable of optimal play. So you cannot measure any improvement in playing strength.

I would suggest one of the following two solutions:

  1. Instead of evaluating the playing strength of an MCTS+DNN agent, just evaluate the playing strength of the raw DNN without any search. You could simply make this agent play according to the policy head of the network (proportionally or by maximising over its outputs) and ignore the value head. A pure MCTS (without neural network) could be used as the opponent to evaluate against (or you could evaluate against past versions of your DNN). A rough sketch follows after this list.

  2. If you do want to continue evaluating MCTS+DNN agents, you could try to severely constrain the amount of computation the MCTS part is allowed to use, such that the MCTS by itself can no longer guarantee optimal play. For example, you could let your MCTS run only, say, 10 or 20 iterations for every root state encountered.
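A minimal sketch of option 1, assuming the network exposes a `predict(state)` call returning `(policy, value)`; the names and the legal-move handling are placeholders for whatever interface your own code uses:

```python
import numpy as np

def raw_policy_move(network, state, legal_moves, greedy=True):
    """Choose a move from the policy head alone, ignoring the value head and any search."""
    policy, _value = network.predict(state)               # value head is ignored here
    masked = np.zeros_like(policy)
    masked[legal_moves] = policy[legal_moves]              # keep only legal moves
    masked /= masked.sum()                                 # renormalise
    if greedy:
        return int(np.argmax(masked))                      # maximise over the outputs
    return int(np.random.choice(len(masked), p=masked))    # or play proportionally
```

Option 2 is then just a matter of passing a very small iteration budget (e.g. 10 or 20) to the MCTS you already have.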

Dennis Soemers

The MCTS bandit (selection) phase chooses an action via the UCB1 algorithm. Given the value estimates $Q(s,a)$ and the visit counts of a non-leaf node, it chooses an action branch in the tree to descend via:

$$ a = \arg \max_{a'}\left[Q(s,a') + c \sqrt{\frac{\log N}{N(s,a')}}\right]$$

Here $N$ is the total number of visits to the node and $N(s,a')$ is the number of times action $a'$ has been taken from it. The $c$ parameter is the exploration temperature. If $c \gg Q$ then the second term always dominates and the algorithm will randomly explore the whole game tree, without any value-based preference. So one possible problem could be that your $c$ is too low.
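As a concrete illustration of the formula (a minimal sketch; the `node.children`, `visits`, and `q_value` attributes are assumed data structures, not from any particular library):

```python
import math

def ucb_select(node, c):
    """Pick the child action maximising Q(s,a) + c * sqrt(ln N / N(s,a))."""
    total_visits = sum(child.visits for child in node.children.values())
    best_action, best_score = None, -math.inf
    for action, child in node.children.items():
        if child.visits == 0:
            return action  # unvisited actions are tried before anything else
        score = child.q_value + c * math.sqrt(math.log(total_visits) / child.visits)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```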

Another possible problem is a pretty common misconception about the way MCTS works. It could be that you are performing a single MCTS iteration for each actual in-game move. That's not how it is supposed to work: MCTS can be considered an online "improvement" algorithm for an offline-learned value function, so one should perform many MCTS iterations and build a reasonably large search tree for every in-game move. Quoting Wikipedia:

Rounds of search are repeated as long as the time allotted to a move remains. Then the move with the most simulations made (i.e. the highest denominator) is chosen as the final answer.
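In code, the per-move procedure looks roughly like this (a sketch assuming a hypothetical `run_one_iteration` that performs one selection/expansion/evaluation/backup pass over the tree):

```python
def choose_move(root, num_iterations=800):
    """Build a search tree with many iterations, then play the most-visited child."""
    for _ in range(num_iterations):
        run_one_iteration(root)   # selection -> expansion -> evaluation/rollout -> backup
    # Final decision: the action with the highest visit count at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```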

Kostya

I have to answer since I can't comment yet. I have the exact same problem: all evaluation games play out with the same moves again and again.

I currently pick a random first action for both players, but it feels like there should be a better solution. I guess I could reuse the MCTS between games, but that seems like an even worse hack.
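Concretely, the workaround is just something like this once per player at the start of each evaluation game (a sketch; `game.legal_moves` and `game.apply` are placeholders for your own game interface):

```python
import random

def random_opening(game, state):
    """Play one uniformly random legal move so evaluation games don't all repeat."""
    action = random.choice(game.legal_moves(state))
    return game.apply(state, action)
```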

PWE