
The first step of MCTS is to repeatedly select child nodes according to the Upper Confidence Bound applied to trees (UCT) until a leaf node is reached, where UCT is defined as

$$\frac{w_i}{n_i}+c\sqrt{\frac{\ln t}{n_i}},$$

where

  • $w_i$ = number of wins after the $i$-th move
  • $n_i$ = number of simulations after the $i$-th move
  • $c$ = exploration parameter (theoretically equal to $\sqrt{2}$)
  • $t$ = total number of simulations for the parent node
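As a minimal sketch, the formula above can be computed directly from these four quantities (the function and parameter names here are my own, not from the question):

```python
import math

def uct(wins, visits, parent_visits, c=math.sqrt(2)):
    """UCT score for one child node.

    wins          -- number of wins recorded through this node (w_i)
    visits        -- number of simulations through this node (n_i)
    parent_visits -- total simulations of the parent node (t)
    c             -- exploration parameter, sqrt(2) by default
    """
    # Exploitation term (average win rate) plus exploration bonus,
    # which grows as the parent is visited but this child is not.
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)
```

Note that the formula is undefined for `visits == 0`, which is exactly the situation the question below is about.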

I don't really understand how this equation prevents sibling nodes from being starved, i.e. never explored. Say you have 3 nodes, and one of them, call it node A, is chosen at random to be explored and happens to simulate a win. Then node A's UCT is $1+\sqrt{2}\sqrt{\frac{\ln(1)}{1}} = 1$, while the other 2 nodes have UCT $= 0$, because they are unexplored and the game just started. So, by UCT, the other 2 nodes will never be explored, no? After this, the algorithm enters the expansion phase, and expansion only happens when it reaches a leaf node in the tree. Since node A is the only node with a UCT $> 0$, the algorithm will choose a child of node A and keep going down that subtree, because all of node A's siblings have a UCT of 0 and so never get explored.

user8714896

1 Answer


First, explore each of the nodes A, B, and C once.

For reference, see the paper Combining Online and Offline Knowledge in UCT by Sylvain Gelly and David Silver:

If any action from the current state $s$ is not represented in the tree, $\exists a \in \mathcal{A}(s),(s, a) \notin \mathcal{T},$ then the uniform random policy $\pi_{\text {random }}$ is used to select an action from all unrepresented actions, $\tilde{\mathcal{A}}(s)=\{a \mid(s, a) \notin \mathcal{T}\}$.
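A sketch of that selection rule, with the UCT formula from the question (the function name, the `(wins, visits)` representation of children, and the tie-breaking details are my own illustration, not from the paper):

```python
import math
import random

def select_child(children, parent_visits, c=math.sqrt(2)):
    """Pick a child index from `children`, a list of (wins, visits) pairs.

    If any child is unvisited, choose uniformly at random among the
    unvisited ones (the uniform random policy over unrepresented
    actions); only once every child has been visited at least once is
    the UCT formula applied.
    """
    unvisited = [i for i, (_, n) in enumerate(children) if n == 0]
    if unvisited:
        return random.choice(unvisited)

    def uct(w, n):
        return w / n + c * math.sqrt(math.log(parent_visits) / n)

    # All children visited: maximize the UCT score.
    return max(range(len(children)), key=lambda i: uct(*children[i]))
```

With this rule, the scenario in the question cannot occur: after node A is simulated once (a win), its two unvisited siblings are still selected before A's UCT score is ever compared against theirs, so no sibling is starved.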

Cohensius