In a general DQN framework, if I have some prior idea that certain actions are better than others, is it possible to make the agent select the better actions more often?

user3656142

1 Answer

For single-step Q-learning, the behaviour policy can be any stochastic policy, without any further adjustment to the update rules.

You don't have to use $\epsilon$-greedy based on the current Q-function approximation, although that is a common choice because it works well in general. However, you should always allow some chance of taking every action if you want the algorithm to converge: if you fixed things so that bad actions were never taken, the agent would never learn that they have low value.
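For reference, a minimal $\epsilon$-greedy sketch in Python (assuming `q_values` is the network's output for the current state; the helper name is just illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """Standard epsilon-greedy: every action keeps probability >= epsilon / |A|."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform over all actions
    return int(np.argmax(q_values))              # exploit: greedy on current Q estimates
```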

Probably the simplest way to use your initial idea of best actions is to write a function that returns your assessment of which action to take, and use that with some probability in preference to a completely random choice, as sketched below. At some point you will also want to stop referencing the helper function (unless it is guaranteed perfect) and switch to some form of standard $\epsilon$-greedy based on the current Q values.
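A minimal sketch of that scheme, assuming you have some `heuristic_action(state)` helper encoding your prior idea of the best action (the names `heuristic_action` and `heuristic_prob` are placeholders, and you would typically decay `heuristic_prob` towards zero over training):

```python
import numpy as np

def guided_epsilon_greedy(state, q_values, heuristic_action,
                          epsilon=0.1, heuristic_prob=0.5, rng=None):
    """Explore with probability epsilon; otherwise mix a heuristic with greedy Q."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # keep some chance of every action
    if rng.random() < heuristic_prob:
        a = heuristic_action(state)              # defer to prior knowledge
        if a is not None:                        # helper may have no opinion
            return a
    return int(np.argmax(q_values))              # otherwise greedy on current Q
```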

I have done something similar with a DQN learning to play Connect 4, where the agent used a look-ahead search function that could see e.g. 7 steps ahead. If that search was inconclusive, it fell back to the argmax of the current Q values. Both of these fixed action choices could be replaced, with probability $\epsilon$, by a random action to ensure exploration. It worked very well. You could replace the look-ahead search in my example with any function that returns "best" actions for whatever reason.

There are also other ways to skew action selection towards better-looking choices. You could look into Boltzmann exploration or upper confidence bounds (UCB) as alternative ways to construct behaviour policies for DQN.
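For example, Boltzmann (softmax) exploration samples actions in proportion to $\exp(Q(s,a)/\tau)$, so higher-valued actions are picked more often while every action keeps a nonzero probability. A rough sketch, again assuming `q_values` is the network output for the current state:

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                         # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(probs), p=probs))
```

Lower temperatures make the choice closer to greedy, higher temperatures closer to uniform, so the temperature plays a role similar to $\epsilon$ and can be annealed over training.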

Neil Slater