Many examples work with a table-based method for Q-learning. This may be suitable for a discrete state (observation) or action space, like a robot in a grid world, but is there a way to use Q-learning for continuous spaces like the control of a pendulum?

nbro
Bryan McGill

2 Answers

Q-learning for continuous state spaces

Yes, this is possible, provided you use some mechanism of approximation. One approach is to discretise the state space, and that doesn't have to reduce the space to a small number of states. Provided you can sample and update enough times, then a few million states is not a major problem.
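
As a rough illustration of the discretisation route, here is a minimal sketch using only the Python standard library. The state bounds, bin counts, and the helper names `make_discretiser` and `q_learning_update` are assumptions made for this example, not part of any library. It bins each state dimension and then applies the usual tabular update.

```python
from collections import defaultdict

def make_discretiser(lows, highs, bins_per_dim):
    """Return a function mapping a continuous state to a tuple of bin indices."""
    def discretise(state):
        idx = []
        for x, lo, hi, n in zip(state, lows, highs, bins_per_dim):
            # Clip to the assumed bounds, scale to [0, 1], then pick a bin.
            frac = (min(max(x, lo), hi) - lo) / (hi - lo)
            idx.append(min(int(frac * n), n - 1))
        return tuple(idx)
    return discretise

def q_learning_update(Q, s, a, r, s_next, n_actions,
                      alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning step on already-discretised states."""
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in range(n_actions))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Hypothetical 2-D state: angle in [-pi, pi], angular velocity in [-8, 8].
discretise = make_discretiser(lows=[-3.1416, -8.0], highs=[3.1416, 8.0],
                              bins_per_dim=[20, 20])
Q = defaultdict(float)  # unseen (state, action) pairs start at 0
s = discretise([0.1, -0.5])
s_next = discretise([0.05, -0.4])
q_learning_update(Q, s, a=1, r=-0.2, s_next=s_next, n_actions=3)
```

With 20 bins per dimension this table has only 400 states; finer grids trade memory and sample requirements for resolution.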

However, with large state spaces it is more common to use some form of function approximation for the action value. This is often denoted $\hat{q}(s,a,\theta)$ to show both that it is an estimate (the circumflex over $q$) and that you are learning some function parameters ($\theta$). There are broadly two popular approaches to Q-learning using function approximation:

  • Linear function approximation over a processed version of the state into features. Many ways of generating features have been proposed and tested, including Fourier series, tile coding and radial basis functions. The advantage of these methods is that they are simple, and more robust than non-linear function approximators. Which one to choose depends on what your state space represents and how the value function is likely to vary with location within the state space.

  • Neural network function approximation. This is essentially what Deep Q Networks (DQN) are. Provided you have a Markov state description, you scale it to work sensibly with neural networks, and you follow other DQN best practices (experience replay table, slow-changing target network), this can work well.
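
To make the linear option concrete, here is a minimal sketch of semi-gradient Q-learning over radial basis function features, in plain Python. The centre grid, the width, and the function names (`rbf_features`, `q_hat`, `semi_gradient_update`) are illustrative assumptions; a real problem would need features chosen to match its state space.

```python
import math

def rbf_features(state, centres, width):
    """Gaussian radial basis features for a continuous state."""
    return [
        math.exp(-sum((x - c) ** 2 for x, c in zip(state, centre)) / (2 * width ** 2))
        for centre in centres
    ]

def q_hat(theta, features, action):
    """Linear action value: one weight vector per discrete action."""
    return sum(w * f for w, f in zip(theta[action], features))

def semi_gradient_update(theta, s, a, r, s_next, centres, width,
                         alpha=0.1, gamma=0.99, done=False):
    """One semi-gradient Q-learning step; returns the TD error."""
    phi = rbf_features(s, centres, width)
    phi_next = rbf_features(s_next, centres, width)
    if done:
        target = r
    else:
        target = r + gamma * max(q_hat(theta, phi_next, b) for b in range(len(theta)))
    td_error = target - q_hat(theta, phi, a)
    # For a linear model the gradient of q_hat w.r.t. theta[a] is just phi.
    theta[a] = [w + alpha * td_error * f for w, f in zip(theta[a], phi)]
    return td_error

# Hypothetical 5x5 grid of centres over a 2-D state in [-1, 1] x [-1, 1].
centres = [(x / 2.0, v / 2.0) for x in range(-2, 3) for v in range(-2, 3)]
theta = [[0.0] * len(centres) for _ in range(2)]  # two discrete actions
td = semi_gradient_update(theta, s=(0.0, 0.0), a=0, r=1.0,
                          s_next=(0.1, 0.0), centres=centres, width=0.5)
```

The target is treated as a constant during the update (hence "semi-gradient"), which is the same simplification DQN makes with its target network.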

Q-learning for continuous action spaces

Unless you discretise the action space, this becomes very unwieldy.

The problem is that, given $s,a,r,s'$, Q-learning needs to evaluate the TD target:

$$Q_{target}(s,a) = r + \gamma \max_{a'} \hat{q}(s',a',\theta)$$

The process for evaluating the maximum becomes less efficient and less accurate the larger the space that it needs to check.
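
A small sketch of that trade-off, assuming the maximum is approximated by evaluating $\hat{q}$ on a finite grid of candidate actions. The function `toy_q` is a made-up stand-in for a learned action-value function, with its true maximum of 0 at $a = 0.3$.

```python
def action_grid(low, high, n):
    """n evenly spaced candidate actions covering [low, high]."""
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]

def td_target(q_hat, r, s_next, candidate_actions, gamma=0.99, done=False):
    """r + gamma * max over the candidate actions of q_hat(s_next, a)."""
    if done:
        return r
    return r + gamma * max(q_hat(s_next, a) for a in candidate_actions)

def toy_q(state, action):
    """Hypothetical action value that ignores the state and peaks at a = 0.3."""
    return -(action - 0.3) ** 2

# Three candidates miss the peak; 201 candidates get close but cost 67x more evaluations.
coarse = td_target(toy_q, r=0.0, s_next=None,
                   candidate_actions=action_grid(-1.0, 1.0, 3), gamma=1.0)
fine = td_target(toy_q, r=0.0, s_next=None,
                 candidate_actions=action_grid(-1.0, 1.0, 201), gamma=1.0)
```

Every update pays for one evaluation per candidate, and with several action dimensions the grid size grows exponentially.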

For somewhat large action spaces, double Q-learning can help. It keeps two estimates of Q, one to pick the target action and the other to estimate that action's value, and swaps their roles between steps. This helps avoid maximisation bias, where picking an action because it has the highest estimated value and then using that same estimate in the update leads to over-estimating the value.
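
A minimal tabular sketch of the double Q-learning update follows. The function name and the injectable `rng` argument are choices made for this example. Which table is updated is chosen at random on each step, which is one common way to swap the roles of the two estimates.

```python
import random
from collections import defaultdict

def double_q_update(q_a, q_b, s, a, r, s_next, actions,
                    alpha=0.1, gamma=0.99, done=False, rng=random):
    """Update one of two Q tables, using the other to value the chosen action."""
    # Randomly decide which table selects the action and gets updated.
    if rng.random() < 0.5:
        select, evaluate = q_a, q_b
    else:
        select, evaluate = q_b, q_a
    if done:
        target = r
    else:
        # Pick the greedy next action with one table...
        best = max(actions, key=lambda b: select[(s_next, b)])
        # ...but take its value from the other table.
        target = r + gamma * evaluate[(s_next, best)]
    select[(s, a)] += alpha * (target - select[(s, a)])

q_a, q_b = defaultdict(float), defaultdict(float)
double_q_update(q_a, q_b, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```

Because the selecting table's noise is independent of the evaluating table's noise, an action that looks best by chance is not also credited with its inflated value.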

For very large or continuous action spaces, it is not usually practical to check all values. The alternative to Q-learning in this case is to use a policy gradient method such as Actor-Critic which can cope with very large or continuous action spaces, and does not rely on maximising over all possible actions in order to enact or evaluate a policy.

Controlling a pendulum

For a discrete action space, e.g. applying one of a choice of forces on each time step, this can be done using a DQN approach or any other function approximation. The classic example here might be an environment like OpenAI Gym's CartPole-v1, where the state space is continuous but there are only two possible actions. This can be solved easily using DQN; it is something of a beginner's problem.

Adding a continuous action space ends up with something like the Pendulum-v0 environment. This can be solved to some degree using DQN and discretising the action space (to e.g. 9 different actions). However, it is possible to find better solutions using an Actor-Critic algorithm like A3C.
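
As a sketch of the discretised-action route, the snippet below maps 9 action indices to evenly spaced torques and picks one epsilon-greedily. The torque limit of 2.0 is an assumption (check the bounds of your environment's action space), and `q_values` is a made-up list standing in for the output of a trained network.

```python
import random

def make_torque_table(n_actions=9, max_torque=2.0):
    """Evenly spaced torques in [-max_torque, max_torque], one per action index."""
    step = 2 * max_torque / (n_actions - 1)
    return [-max_torque + i * step for i in range(n_actions)]

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

torques = make_torque_table()
# Hypothetical action values for one state, one entry per discrete torque.
q_values = [0.0, 0.1, 0.4, 0.2, 0.9, 0.3, 0.0, -0.1, -0.5]
index = epsilon_greedy(q_values, epsilon=0.0)
torque = torques[index]  # the continuous value actually sent to the environment
```

The agent only ever learns over the 9 indices; the table is the single place where the continuous action appears.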

Neil Slater

Q-Learning for continuous state space

Reinforcement learning algorithms (e.g. Q-learning) can be applied to both discrete and continuous spaces. If you understand how it works in discrete mode, then you can easily move to continuous mode. That's why the introductory material in the literature focuses on discrete mode, as it's easier to model (table, grid, etc.).

Supposing you have a discrete number of actions, the only difference in a continuous space is that you will be sampling the state every $X$ units of time ($X$ being a number you can choose depending on your use case). So, basically, you end up with a discrete sequence of decisions, but probably with an infinite number of possible states. You then apply the same approach you learned for discrete mode.

Let's take the example of self-driving cars. Every $X$ ms (e.g. $X=1$), you compute the state of the car, which gives your input features (e.g. direction, orientation, rotation, distance to the pavement, relative position in the lane, etc.), and decide which action to take, as in discrete mode. The approach is the same in other use cases, like playing games, walking robots, and so on.

Note (continuous action space):

If you have continuous actions, then in almost all use cases the best approach is to discretize your actions. I can't think of an example where discretizing your actions will lead to a considerable deficiency.

nbro
HLeb