
I have already implemented a relatively simple DQN on Pacman.

Now I would like to clearly understand the difference between a DQN and the techniques used by AlphaGo Zero/AlphaZero, but I couldn't find a place where the features of both approaches are compared.

Also, when reading through blogs, I sometimes suspect that different terms are in fact the same mathematical tool, which adds to the difficulty of clearly understanding the differences. For example, variations of DQN such as Double DQN also use two networks, like AlphaZero.

Does anyone have a good reference for this question, be it a book or an online resource?


2 Answers


DQN and AlphaZero do not share much in terms of implementation.

However, they are based on the same Reinforcement Learning (RL) theoretical framework. If you understand terms like MDP, reward, return, value and policy, then these are interchangeable between DQN and AlphaZero. When it comes to implementation, and what each part of the system is doing, things are less interchangeable. For instance, the two networks you have read about in AlphaZero are a policy network and a value network, whilst Double DQN alternates between two value networks.
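To make the "two networks" contrast concrete, here is a minimal sketch of how Double DQN uses its pair of networks, assuming PyTorch-style modules that map a batch of states to a row of $Q(s, \cdot)$ values; the names `q_online` and `q_target` are just illustrative:

```python
import torch

def double_dqn_target(q_online, q_target, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: the online network selects the action,
    the (periodically frozen) target network evaluates it."""
    with torch.no_grad():
        # Action selection with the online network
        best_actions = q_online(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation with the target network
        next_q = q_target(next_states).gather(1, best_actions).squeeze(1)
        # One-step TD target, zeroed at terminal states
        return rewards + gamma * (1.0 - dones) * next_q
```

Both networks here estimate action values, which is quite different from AlphaZero's split into a policy head and a state-value head (sketched further below).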

Probably the best resource that summarises both DQN and AlphaZero, and explains how they extend the basic RL framework in different ways, is Sutton & Barto's Reinforcement Learning: An Introduction (second edition). Chapter 16, sections 5 and 6, covers the designs of DQN Atari, AlphaGo and AlphaZero in some depth.

In brief:

DQN Atari

  • Is model-free
  • Uses an action value estimator for $Q(s,a)$ values, based on a Convolutional Neural Network (CNN)
  • Uses experience replay and a temporarily frozen target network to stabilise the learning process (see the sketch after this list)
  • Uses a variety of tricks to simplify and standardise the state description and reward structure so that the exact same design and hyperparameters work across multiple games, demonstrating that it is a general learner.
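As a rough sketch of how those pieces fit together, here is a toy DQN training step with an experience replay buffer and a frozen target network, written in PyTorch. The sizes and hyperparameters are illustrative only, and a small fully connected network stands in for the Atari CNN:

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Hypothetical sizes; the Atari agent used a CNN over stacked frames instead
STATE_DIM, N_ACTIONS = 8, 4

def make_q_net():
    # Stand-in for the Atari CNN: any network mapping states to Q(s, .) works here
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

q_net, target_net = make_q_net(), make_q_net()
target_net.load_state_dict(q_net.state_dict())    # start the frozen copy in sync
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
replay = deque(maxlen=100_000)                    # experience replay buffer
gamma, batch_size, sync_every = 0.99, 32, 1_000

def store(s, a, r, s2, done):
    replay.append((s, a, r, s2, done))

def train_step(step):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)     # sampling breaks temporal correlation
    s, a, r, s2, d = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                         # targets come from the frozen network
        target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
    loss = nn.functional.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % sync_every == 0:                    # periodically refresh the target network
        target_net.load_state_dict(q_net.state_dict())
```

In use, the agent would push each observed transition with `store(...)` and call `train_step(step)` once per environment step; everything here is model-free, learned purely from sampled experience.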

AlphaZero

  • Is model based (although some of the learning is technically model-free, based on samples of play)
  • Uses a policy network (estimating $\pi(a|s)$) and a state value network (estimating $V(s)$), based on CNNs. In practice, for efficiency, the NNs for these share many layers and parameters, so how many "networks" there are depends on how you want to count them (see the sketch after this list).
    • The earlier AlphaGo version had 4 separate networks: 3 variations of the policy network, used during play at different stages of planning, and one value network.
  • Is designed around self-play
  • Uses Monte Carlo Tree Search (MCTS) as part of estimating returns - MCTS is a planning algorithm critical to AlphaZero's success, and there is no equivalent component in DQN
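
Here is a toy sketch of the shared-trunk, two-headed design mentioned above, again in PyTorch. Board size, channel counts and move count are illustrative, not the published architecture, and the real networks use deep residual trunks rather than this tiny CNN:

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Two-headed network in the AlphaZero style: a shared convolutional trunk
    feeding a policy head (distribution over moves) and a value head
    (scalar evaluation of the position)."""
    def __init__(self, board_size=9, in_planes=3, n_moves=82):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        flat = 64 * board_size * board_size
        self.policy_head = nn.Sequential(nn.Flatten(), nn.Linear(flat, n_moves))
        self.value_head = nn.Sequential(nn.Flatten(), nn.Linear(flat, 1), nn.Tanh())

    def forward(self, board_planes):
        features = self.trunk(board_planes)             # shared parameters
        policy_logits = self.policy_head(features)      # pi(a|s), as logits
        value = self.value_head(features).squeeze(-1)   # V(s) in [-1, 1]
        return policy_logits, value

# One forward pass: both "networks" come out of a single module
net = PolicyValueNet()
policy_logits, value = net(torch.zeros(1, 3, 9, 9))
```

During self-play, MCTS would query this network at each visited position and use the policy logits as priors and the value as a leaf evaluation; that planning loop has no counterpart in DQN.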
Neil Slater

You can actually combine an AlphaZero-like approach with DQN: A* + DQN

mirror2image