
AlphaGo and AlphaGo Zero use random play to generate data and use that data to train the deep neural network (DNN). "Random play" means that there is a positive probability that AlphaGo plays some suboptimal moves according to the current DNN; this is for exploration and learning purposes (please correct me if my understanding is wrong).

In a real tournament, does AlphaGo still play random moves? Is random play only used in the training phase?

If AlphaGo does not play random moves in a real competition, then I think AlphaGo is not learning during that competition. Human players do something similar to "random play": they often play random or strange moves in minor contests, just to test out new strategies; in major tournaments, they are more serious and play fewer unprepared moves.

So, a related and broader question is: does AlphaGo learn in real time from the game it is playing against a human?

I think the second question is less important, because AlphaGo's learning curve is extremely flat compared to a human's: AlphaGo learns only an epsilon from a single game, while a human can learn a lot.

High GPA

3 Answers


Question 1: I don't think they ran AlphaGo or AlphaGo Zero in training mode during tournament matches, because the computing power required for this is really large. I don't recall whether this is described in the documentary, but see this quote from the AlphaZero paper (page 4):

using 5,000 first-generation TPUs (15) to generate self-play games and 64 second-generation TPUs to train the neural networks.

Question 2: Following my answer to question 1, I assume the network is run in inference mode, i.e., taking the observation of the environment as input and outputting the action (next move) according to the network's strongest strategy. In that case, AlphaGo does not learn anything, since it does not evaluate whether the taken action is better than its current strategy; the two are the same. The way the network learns during training/exploration is by comparing the random action that was taken to the best action known by the network.


You are correct in your understanding of self-play and exploration. This is done in RL to try actions that have not been taken yet and that might turn out to be better than the best-known strategy.

But during inference, you would want to greedily take the best-known actions because an action taken at random would most likely be way worse.
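
As a minimal sketch of that difference (assuming a hypothetical `policy_probs` vector of move probabilities; the real system derives such a distribution from MCTS visit counts rather than from the raw policy head):

```python
import numpy as np

def select_move(policy_probs, training=True):
    """Pick a move index from a vector of move probabilities.

    policy_probs is a hypothetical 1-D array summing to 1.
    """
    if training:
        # Exploration (self-play): sample, so suboptimal moves still
        # get tried occasionally and can be learned from.
        return int(np.random.choice(len(policy_probs), p=policy_probs))
    # Inference (tournament): greedily play the strongest known move.
    return int(np.argmax(policy_probs))
```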

Lars

The core mechanics of AlphaZero during self-play and real tournament games are the same: something similar to Monte Carlo Tree Search (MCTS) is done, but guided by the current neural network instead of random simulations. The network is only doing inference; it is not learning during a tree search. There's a great summary diagram here.
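
For intuition, here is a rough sketch of the network-guided selection step inside such a tree search, in the spirit of the PUCT rule from the AlphaZero paper; the field names and the `c_puct` value are illustrative, not taken from DeepMind's code:

```python
import math

def puct_select(children, c_puct=1.5):
    """One selection step of a network-guided tree search.

    Each child node is assumed to expose:
      prior  -- P(s, a), the network's policy output for the move
      value  -- Q(s, a), mean value of simulations through the move
      visits -- N(s, a), how often the move has been visited
    """
    total_visits = sum(child.visits for child in children)

    def score(child):
        # Exploitation (Q) plus an exploration bonus that is large for
        # moves the network likes (high prior) but has rarely visited.
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visits)
        return child.value + u

    return max(children, key=score)
```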

The differences between self-play and tournament play are:

  • Self-play tree searches get some additional (Dirichlet) noise injected to encourage exploration of different moves and positions. This does not happen during tournament games, since it would likely make play weaker.

  • During self-play, moves are selected randomly, but weighted according to how good the tree search thinks each move is. This means the training process spends most of its time looking at good moves, but still occasionally plays bad moves to keep some variety. During tournaments, AlphaZero just plays what it thinks is the best move (both behaviours are sketched in the code after this list).

  • Self-play games are kept in a large buffer with millions of positions, on which the next neural network is trained. This is the only point at which actual network training happens.

  • Self-play tree searches only look at ~800 nodes per position, which probably takes only a few milliseconds to compute. Tournament games take a lot longer: 3 h per game plus 15 s per move. This means the tree search visits hundreds of thousands, likely millions, of positions for each move, making tournament play a lot stronger than self-play.
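
A small sketch of the first two bullets, assuming the search has already produced per-move visit counts; the Dirichlet parameters are the ones reported for Go in the AlphaZero paper, and in the real pipeline the noise is mixed into the root priors before the search rather than applied afterwards:

```python
import numpy as np

def choose_move(visit_counts, selfplay, alpha=0.03, noise_frac=0.25):
    """Turn MCTS visit counts into a move, self-play vs. tournament style.

    visit_counts: 1-D array with N(s, a) for every legal move.
    NOTE: applying the Dirichlet noise here (after the search) is a
    simplification to keep the sketch to one function.
    """
    probs = visit_counts / visit_counts.sum()
    if selfplay:
        noise = np.random.dirichlet([alpha] * len(probs))
        probs = (1 - noise_frac) * probs + noise_frac * noise
        # Sample: mostly good moves, occasionally a weaker one for variety.
        return int(np.random.choice(len(probs), p=probs))
    # Tournament: deterministically play the most-visited move.
    return int(np.argmax(visit_counts))
```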

These are the only differences between self-play and tournament games. One thing to note is that, after self-play/training is done, AlphaZero is just an immutable neural network with some tree-search code around it. There is no learning happening during a tournament, nor is any additional randomness injected into the process. Instead, to ensure a diversity of games, an opening book is used.

KarelPeeters

AlphaGo/AlphaZero has three main sources of randomness in competitive mode:

  1. Move temperature: The MCTS process outputs a probability distribution P over all candidate moves. The agent chooses a move by sampling from a distribution proportional to P^(1/T) for some move temperature T. Typical configurations start T at some value near 1 and decay it quickly to 0 over the course of the game (the limit of P^(1/T) as T approaches 0 is the argmax function). This mechanism helps to ensure, for example, that the agent will not always play the same first move as the first player (see the sketch after this list).

  2. Symmetries: The game of go has 8-fold symmetry. Before querying the neural network, the agent randomly chooses one of the 8 symmetry transforms of the game and applies it to the board. It then applies the inverse transform to the neural network’s (policy) output. A perfect neural network should exhibit perfect 8-fold symmetry, but in reality the neural network is imperfect and exhibits asymmetrical biases (e.g., it wants to play an atari in board state S, but if you rotate S by 90 degrees, it no longer wants to play an atari). So this operation leads to random deviations in its move choices.

  3. Tree parallelization: MCTS is made multithreaded via a mechanism known as tree parallelization. The move distribution that MCTS outputs is dependent on the order in which the different threads interleave with each other, which is effectively random.
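
A minimal sketch of points 1 and 2, assuming a `probs` vector produced by the search and a 2-D board array; the function names are my own:

```python
import numpy as np

def sample_with_temperature(probs, T):
    """Point 1: sample a move from a distribution proportional to P^(1/T).

    As T -> 0 this approaches argmax; T = 1 samples from P unchanged.
    """
    if T < 1e-3:                      # treat tiny temperatures as greedy
        return int(np.argmax(probs))
    scaled = np.power(probs, 1.0 / T)
    scaled /= scaled.sum()
    return int(np.random.choice(len(probs), p=scaled))

def random_symmetry(board):
    """Point 2: apply one of the 8 symmetries of the Go board.

    board is a 2-D array; the returned index k lets the caller apply
    the inverse transform to the network's policy output afterwards.
    """
    k = np.random.randint(8)
    transformed = np.rot90(board, k % 4)
    if k >= 4:
        transformed = np.fliplr(transformed)
    return transformed, k
```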

It is worth noting that any human could easily defeat any deterministic AI in any deterministic game. Simply pit two instances of the AI against each other, memorize the deterministic sequence of moves, and then use those exact same moves against a single copy of the AI. Because of this possibility, game-playing agents are typically willing to sacrifice a small amount of playing strength for randomness. Move temperature in particular can be justified for this reason. The other two sources of randomness can be seen as byproducts of engineering optimizations.

In other answers/comments, others have correctly noted that single-threaded MCTS is a deterministic algorithm (if executed purely on CPU). Given a game state, it will always output the same move distribution, absent the symmetry mechanism described above.

dshin