
I am learning about Monte Carlo algorithms and struggling to understand the following:

  • If simulations are based on random moves, how can the modeling of the opponent's behavior work well?

For example, if I have a node with 100 children, 99 of which lead to an instant WIN, whereas the last one leads to an instant LOSS.

In reality, the opponent would never play any of the 99 moves that lose for them (assuming the outcomes are obvious, since these are the last moves of the game), and would always play the one move that wins for them. But the Monte Carlo algorithm would still rate this node as extremely favorable (99/100 wins for me), because it treats each of the 100 moves as equally probable.

Is my understanding wrong, or does it mean that in most games such situations do not occur and randomness is a good approximation of opponent behavior?

kgautron

3 Answers


First, we need to distinguish plain Monte-Carlo from Monte-Carlo Tree Search. They're different things.

Monte-Carlo search, in the context of game AI search algorithms, is typically understood to mean that we run many random playouts, average the results, and do nothing else. If that is all we're doing, then yes, your understanding is correct. This is also sometimes referred to as "plain Monte-Carlo (search)" or "pure Monte-Carlo (search)", to make it explicitly clear that we're not doing any tree search as in Monte-Carlo Tree Search (when people just say "Monte-Carlo" in the context of game AI, they will often automatically assume Monte-Carlo Tree Search, due to how popular it is).

Monte-Carlo Tree Search does a lot more than just that, though. It gradually builds up a search tree (through the Expansion step), and within that search tree (which grows over time), it uses a much more sophisticated traversal strategy than pure randomness (the Selection step).

For example, suppose I have a node with 100 children, 99 of which lead to an instant WIN, and the last one leads to an instant LOSS.

Suppose that this node you're talking about, the one with those 100 children, is relatively close to the root node. Then, it is likely that it and all 100 of its children will end up "growing into the search tree" that is slowly built up by the Expansion step. Once they have been added to the search tree, the Selection step will make sure that the vast majority of further iterations visiting this part of the tree will select the instant loss (assuming the opponent is to move in this node). In the limit (after an infinite amount of search time), this bias towards selecting the loss node will be so large that the average evaluation tends to the (correct) minimax evaluation.
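
To make the Selection step concrete, here is a minimal sketch of the standard UCT/UCB1 selection rule at an opponent-to-move node with 99 instant wins for us and one instant loss. It is not a full MCTS implementation, and the exploration constant and iteration count are just illustrative assumptions; the point is that the visit counts concentrate on the instant-loss child, so the node's averaged value drifts toward the correct minimax value of -1.

```python
# Minimal sketch of the UCT (UCB1) Selection rule, not a full MCTS implementation.
# The exploration constant C and the iteration count are illustrative assumptions.
import math

C = math.sqrt(2)                      # common default exploration constant
child_values = [1.0] * 99 + [-1.0]   # value of each child from OUR perspective

visits = [0] * 100                    # visit count per child
totals = [0.0] * 100                  # summed rewards per child, opponent's perspective

def uct_pick(parent_visits):
    """Pick the child maximizing the UCB1 score from the opponent's point of view."""
    best, best_score = None, -float("inf")
    for i in range(100):
        if visits[i] == 0:
            return i                  # visit every child at least once first
        mean = totals[i] / visits[i]
        score = mean + C * math.sqrt(math.log(parent_visits) / visits[i])
        if score > best_score:
            best, best_score = i, score
    return best

for n in range(1, 100001):            # simulate 100,000 selection steps at this node
    i = uct_pick(n)
    visits[i] += 1
    totals[i] += -child_values[i]     # opponent's reward is the negation of ours

print("fraction of visits to the instant-loss child:", visits[99] / sum(visits))
print("average value of this node for us:",
      sum(v * c for v, c in zip(visits, child_values)) / sum(visits))
```

Running this, the fraction of visits going to the loss child is close to 1 and the node's average value is close to -1, which is exactly the bias described above.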


Another way to view the idea of evaluating a position through many random sequences of play is the following: if we're in a very strong position, a good game state, then we're more likely to win than to lose if both players start playing at random. Consider, for example, a game of chess where Player 1 has many pieces left, and Player 2 only has a few pieces left. Imagine both players were to play completely at random from that game state onwards. On average, which player would you expect to win more often? Probably Player 1.

When we're considering game states that are still far away from terminal states, this basic idea tends to work relatively well. It is obviously not always correct, since it's still a heuristic, but it can work reasonably well. When we're already very close to a terminal state that can only be reached through one highly specific sequence of play, we might well miss that sequence through random actions; this is where we need the more informed policy of the Selection step.
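
As a contrast, here is a minimal sketch of the "plain Monte-Carlo" idea described at the start of this answer, applied to a toy take-away game (the game, the playout count, and the function names are assumptions made up for the example): each legal move is scored only by averaging the outcomes of uniformly random playouts, with no tree and no selection policy.

```python
# Minimal sketch of "pure" / "plain" Monte-Carlo evaluation on a toy take-away game:
# players alternately remove 1-3 stones, and whoever takes the last stone wins.
# Every legal move is scored only by the average outcome of uniformly random playouts.
import random

def random_playout(stones, player_to_move):
    """Play uniformly random moves until the pile is empty; return the winner (0 or 1)."""
    while True:
        take = random.randint(1, min(3, stones))
        stones -= take
        if stones == 0:
            return player_to_move     # this player took the last stone and wins
        player_to_move = 1 - player_to_move

def pure_monte_carlo_move(stones, player, playouts=2000):
    """Score each legal move by the fraction of random playouts that `player` wins."""
    scores = {}
    for take in range(1, min(3, stones) + 1):
        wins = 0
        for _ in range(playouts):
            remaining = stones - take
            if remaining == 0:
                winner = player       # taking the last stone ends the game immediately
            else:
                winner = random_playout(remaining, 1 - player)
            wins += (winner == player)
        scores[take] = wins / playouts
    return max(scores, key=scores.get), scores

move, scores = pure_monte_carlo_move(stones=10, player=0)
print("win-rate estimate per move:", scores, "-> chosen move:", move)
```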

Dennis Soemers

The Two Questions

  • Why does Monte Carlo work when a real opponent's behavior may not be random?
  • If simulations are based on random moves, how can the modeling of the opponent's behavior work well?

Directed Graphs Over Trees

Games (or game-like strategic scenarios) should not be represented as trees. If the process being represented has the Markov property, in that each decision lacks knowledge of history and depends only on the current state, then a particular game state can be reached along more than one path, and there may be cyclic paths in which a game state is revisited. Trees have neither feature.

It is best to use directed graph structures to think about these problems. The state of a game is a vertex and vertices are connected by unidirectional edges. This is normally drawn as shapes connected by arrows. When two arrows enter one shape or there is a closed path, it is not a tree.
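
As a small illustration of this point (the "game" below is a made-up placement game, used purely for the example), keying positions by state rather than by move sequence makes two different move orders converge on a single vertex, which is exactly the situation a tree cannot represent:

```python
# Tiny illustration with a made-up placement game: two different move orders reach
# the same position, so when positions are keyed by state (not by move sequence)
# both paths share one vertex, and the structure is a directed graph, not a tree.
from itertools import permutations

def play(moves):
    """The resulting state depends only on which tokens were placed, not the order."""
    return frozenset(moves)

graph = {}          # vertex (state) -> set of successor vertices (directed edges)
start = frozenset()

for order in permutations(["a1", "b2"]):      # the two possible move orders
    state = start
    for move in order:
        nxt = play(set(state) | {move})
        graph.setdefault(state, set()).add(nxt)
        state = nxt

# Two arrows (one from each intermediate vertex) enter the very same final vertex:
print(graph[frozenset({"a1"})] == graph[frozenset({"b2"})])   # -> True
```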

The Scenario Outlined in the Question

In the scenario outlined in this question, there is a vertex representing a game state with 100 outgoing edges representing possible moves for player A. Ninety-nine of the edges lead to an obvious instant game win for A. Exactly one leads to an obvious instant game win for B.

Rewinding the game to the move just before this vertex was reached, it cannot be assumed that game play gave player B the same 100 options. Even if the same 100 moves were available to B, they would not necessarily have been of similar value from B's perspective when deciding that previous move. More likely, B will have had a different set of outgoing edges to choose from, bearing little or no obvious resemblance to A's subsequent options.

Any game where this is not true, where the options remain constant, would be trivial even in comparison with tic-tac-toe.

The Monte Carlo Approach and Its Algorithm Development

There is no specification of a single Monte Carlo algorithm; it does not exist. Goodfellow, Bengio, & Courville state in Deep Learning (2016) that Monte Carlo algorithms (not a singular algorithm) usually return a correct conclusion, but with a non-deterministic chance of returning an incorrect one. There are a variety of approaches and associated algorithms in the literature, for example:

  • The cross-entropy (CE) method, proposed by Rubinstein in 1997
  • The continuation multilevel Monte Carlo algorithm; Collier, Haji-Ali, von Schwerin, & Tempone; 2000
  • The sequential Monte Carlo algorithm; Drovandi, McGree, & Pettitt; 2012
  • The distributed consensus approach from Bayes and Big Data: The Consensus Monte Carlo Algorithm; Scott, Blocker, Bonassi, Chipman, George, & McCulloch; 2014
  • Hamiltonian Monte Carlo, a Markov chain based algorithm designed to avoid "the random walk behavior and sensitivity to correlated parameters"; Hoffman & Gelman; 2014

There are several more. All of them use chaotic perturbation to reduce the time and resources consumed in reaching a decision, by approximating a Monte Carlo simulation of draws from a Bayesian posterior distribution.

In these approaches, the stochastic behavior is usually simulated by injecting a chaotic sequence from a pseudo-random number generator. They are generally not truly stochastic, because acquiring true entropy from within a digital system is itself a bottleneck presenting immense difficulties, but that is an entirely tangential topic.

Direct Answer to the Question

To correct the misconception in the question: this use of chaotic perturbation does not equalize the selection of moves (represented by edges in the game-play's directed graph). The probabilities of success for each available option are still roughly calculated and followed, but only roughly so, because of the pseudo-noise injected by design.

These deliberate disturbances to pure optimization achieve time and resource thrift for the majority of game states (represented by vertices), but they concurrently sacrifice some reliability.
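
One possible reading of the above, sketched very loosely (this is not any specific published algorithm, and the numbers are assumptions made up for the example), is that move selection still follows the estimated success probabilities while a small pseudo-random perturbation is added on top. Applied to the scenario from the question, the perturbation does not make the one bad move anywhere near as likely as the 99 good ones:

```python
# Loose illustration only (not a specific published algorithm): move selection
# follows the estimated success probabilities, with a small pseudo-random
# perturbation added on top. The perturbation does not equalize the moves.
import random

estimated_win_prob = [0.99] * 99 + [0.01]   # the question's scenario, from the mover's view
noise_scale = 0.05                          # assumed magnitude of the injected pseudo-noise

def perturbed_pick(values):
    """Pick the move with the highest (estimated value + small pseudo-random noise)."""
    return max(range(len(values)),
               key=lambda i: values[i] + random.uniform(0, noise_scale))

picks = [perturbed_pick(estimated_win_prob) for _ in range(10000)]
print("fraction of picks that chose the bad move:", picks.count(99) / len(picks))
```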

An Overview of Why the Sacrifice Works

The introduction of chaotic perturbation, mentioned above, modifies the conditions of the optimization search and achieves two very specific gains.

  1. Faster coverage of the contour being searched by increasing entropy (being less organized by adding synthetic Brownian motion) across the set of trials.
  2. Avoidance of local minima in convergence by being less presumptuous about the contour being searched (slightly less reliant on gradient and curvature hints).

This is true of reinforcement learning networks (with real-time feedback during actual use), of pre-trained networks of the supervised type (with labelled data), and of unsupervised training where convergence is determined by fixed criteria.

Douglas Daseeco

I will point out that the Monte Carlo Tree Search algorithm does not make completely random moves. Instead it usually uses some metric to balance between exploration and exploitation when deciding which branch to search (see Upper Confidence Bound and others).
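
For reference, one common such metric is the UCB1 score used in the UCT variant of MCTS:

$$\mathrm{UCB1}(i) = \bar{X}_i + c\sqrt{\frac{\ln N}{n_i}}$$

where $\bar{X}_i$ is the average reward observed for child $i$, $n_i$ is its visit count, $N$ is the visit count of the parent, and $c$ is an exploration constant (often $\sqrt{2}$).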

That being said, you are correct in that a specific line of play which is incredibly troublesome may not be seen and could cause Monte Carlo to make a major mistake. This may have been a cause of AlphaGo losing to Lee Sedol in game 4.

A disadvantage is that, faced in a game with an expert player, there may be a single branch which leads to a loss. Because this is not easily found at random, the search may not "see" it and will not take it into account. It is believed that this may have been part of the reason for AlphaGo's loss in its fourth game against Lee Sedol. In essence, the search attempts to prune sequences which are less relevant. In some cases, a play can lead to a very specific line of play which is significant, but which is overlooked when the tree is pruned, and this outcome is therefore "off the search radar".

Andrew Butler