I'm using a Neural Network as an agent in a simple car racing game. My goal is to train the network to imitate a brute-force tree search to an arbitrary depth.

My algorithm goes something like the following (see the Python sketch after the list):

  • The agent starts with depth 0
  • Generate a bunch of test states
  • For each test state s:
    • For each action a:
      • s' = env.step(s, a)
      • for x in range(agent.depth): s' = env.step(s', agent.predict(s'))
      • calculate reward at state s'
    • Set the label for test state s as whichever action a produced the highest reward
  • Train the agent using the test states and labels, and increment agent.depth
  • Loop until the desired depth is reached
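
In code, the loop above goes roughly like this (just a sketch: `env.random_state`, `env.step`, `env.reward`, `agent.predict` and `agent.train` are placeholder names for my own game and network code, not any particular library):

```python
def train_to_depth(env, agent, max_depth, n_test_states, n_actions):
    """Iteratively train the agent to imitate deeper and deeper lookahead."""
    for depth in range(max_depth + 1):          # the agent starts with depth 0
        states, labels = [], []
        for _ in range(n_test_states):          # generate a bunch of test states
            s = env.random_state()
            best_action, best_reward = None, float("-inf")
            for a in range(n_actions):          # try every first action...
                s_prime = env.step(s, a)
                for _ in range(depth):          # ...then play out with the agent
                    s_prime = env.step(s_prime, agent.predict(s_prime))
                r = env.reward(s_prime)
                if r > best_reward:
                    best_action, best_reward = a, r
            states.append(s)
            labels.append(best_action)          # label = best first action found
        agent.train(states, labels)             # supervised update, then go deeper
    return agent
```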

The idea is that an agent trained in this way to depth N should produce output close to a brute-force tree search to depth N, so by using it to play out N moves it should be able to find the best final state at that depth. In practice, I've found that it performs somewhere between depth N and depth N-1 (though of course it never reaches 100% accuracy).

My question is: what is the name of this algorithm? When I search for tree search with playout, everything talks about MCTS. But since there's no randomness here (the first step is to try ALL actions), what would this be called instead?

Kricket

1 Answer

In short, it looks like you have constructed a valid reinforcement learning method, but it does not have much in common with Monte Carlo Tree Search. It may have some weaknesses compared to more established methods, which means it will work better in some environments than in others.

Your approach may be novel, in that you have combined ideas into a method which has not been used in this exact form before. However, it follows the principles of generalised policy iteration (GPI):

  • Estimate the action values of the current policy.

  • Create a new policy by maximising action choices with respect to the latest estimates.

  • Set the current policy to the new policy and repeat.
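
As a rough sketch of that cycle (not your exact code), where `evaluate_action` stands in for your fixed-horizon rollout and `fit_policy` for your supervised training step:

```python
def generalised_policy_iteration(states, actions, policy,
                                 evaluate_action, fit_policy, n_iterations):
    for _ in range(n_iterations):
        labels = []
        for s in states:
            # Policy evaluation: estimate the value of each action under the
            # current policy (in your case, via a fixed-horizon rollout).
            q = {a: evaluate_action(s, a, policy) for a in actions}
            # Policy improvement: take the greedy action for this state.
            labels.append(max(q, key=q.get))
        # The greedy choices define the new policy, which becomes the current one.
        policy = fit_policy(states, labels)
    return policy
```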

Your method covers exploration with a deterministic policy by sweeping through a list of state and action pairs. This resembles policy iteration, or perhaps exploring starts in Monte Carlo control. It evaluates each state/action pair by following the current policy up to a time-step horizon. This has some weaknesses, but it may work well enough in some cases.

The depth iteration is more complex to analyse. It is not doing what you suggest, i.e. it does not somehow make the whole algorithm equivalent to a tree search. Technically, the value estimates with a short horizon will be poor, biased estimates of the true value in the general case, but on the plus side they may still help differentiate between good and bad action choices in some situations. It may even help as a form of curriculum learning, by gathering and training on data about immediately obvious decisions first (again, I expect that would be situational, not something you could rely on).
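
To make the bias concrete, here is a toy example (the numbers are invented, not taken from your game): if following the current policy earns nothing for several steps and then a large reward, a short horizon makes the state look worthless even though its true value is high.

```python
# Rewards collected by following the current policy from some state,
# one value per time step (purely illustrative numbers).
rewards = [0, 0, 0, 0, 0, 10]
gamma = 0.95  # discount factor

def truncated_return(rewards, gamma, horizon):
    return sum(gamma ** t * r for t, r in enumerate(rewards[:horizon]))

print(truncated_return(rewards, gamma, horizon=3))             # 0.0  -> looks like a bad state
print(truncated_return(rewards, gamma, horizon=len(rewards)))  # ~7.7 -> actually a good state
```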

Overall, although you seem to have found a nice solution for your racing game, I think your method will work unreliably in the general case. Environments with stochastic state transitions and rewards would be a major problem, and that is not something you could fix without major changes to the algorithm.

I would suggest that you try one or more standard published methods, such as DQN or A3C, and compare them with your approach on the same environments in different ways, e.g. how much time and computation each takes to train to an acceptable level, or how close to optimal each method can get.
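
A like-for-like comparison might look something like the sketch below, where `method.train` and `method.act` are placeholders for whatever training and inference interface each agent exposes (not any specific library's API):

```python
import time

def compare_methods(methods, env, start_states, episode_length, training_budget):
    """Train each method with the same budget, then evaluate on the same start states."""
    results = {}
    for name, method in methods.items():
        t0 = time.time()
        method.train(env, training_budget)       # same training budget for everyone
        train_time = time.time() - t0
        returns = []
        for s in start_states:                   # identical evaluation start states
            state, total = s, 0.0
            for _ in range(episode_length):
                state = env.step(state, method.act(state))
                total += env.reward(state)
            returns.append(total)
        results[name] = {"train_seconds": train_time,
                         "mean_return": sum(returns) / len(returns)}
    return results
```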

The main similarity with Monte Carlo Tree Search is that you evaluate each state/action pair with a kind of rollout. There is a lot more going on in MCTS that you are not doing, though, and the iteration with longer rollouts each time is not like anything in MCTS and does not compensate for the missing parts of MCTS in your algorithm.

Neil Slater