Can Q-learning (and SARSA) be directly used in a Partially Observable Markov Decision Process (POMDP)? If not, why not? My intuition is that the policies learned will be terrible because of partial observability. Are there ways to transform these algorithms so that they can be easily used in a POMDP?
1 Answer
The usual $Q$-learning and SARSA algorithms (as presented in Reinforcement Learning: An Introduction) use (and update) a function of a state $s$ and an action $a$, $Q(s, a)$. These algorithms assume that the current state $s$ is known. In a POMDP, however, at each time step the agent does not know the current state; it only maintains a "belief" (mathematically, a probability distribution) over what the current state might be, so it cannot maintain (an approximation of) the function $Q(s, a)$. Hence, the usual $Q$-learning and SARSA algorithms are not directly applicable to a POMDP.
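To make that assumption concrete, here is a minimal tabular $Q$-learning sketch in Python (the state/action counts and hyper-parameters are illustrative assumptions). The update indexes the table by the exact current and next states, which is precisely what a POMDP does not give the agent.

```python
import numpy as np

# Illustrative sizes and hyper-parameters (not from the question).
n_states, n_actions = 10, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1

Q = np.zeros((n_states, n_actions))  # tabular Q(s, a)

def q_learning_update(s, a, r, s_next):
    # The update looks up Q by the *true* states s and s_next, so the
    # agent must observe them exactly. In a POMDP it only receives an
    # observation (or a belief over states), so this lookup is no
    # longer well-defined.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def epsilon_greedy(s):
    # Action selection also requires knowing the current state s.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))
```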
However, $Q$-learning is often used in contexts where the observations emitted by the environment (or transformations of the raw observations) are used to build the current state (which is then assumed to be Markov, even if it is not). For example, in the original DQN, the action taken at the current step, the raw observation and the reward emitted by the environment (after this action is taken) are combined to produce the current (assumed Markov) state. The way the action, the reward and the observation are combined may not be sufficient to fully describe the current state (which might not even be Markov).
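As a rough illustration of this idea (and only that; the window size and the choice of what to stack are assumptions, not the exact DQN preprocessing), one can build the "state" by stacking the most recent observations into a single vector that the $Q$-function then treats as Markov:

```python
from collections import deque
import numpy as np

class StackedState:
    """Builds an (assumed-Markov) state by stacking the last k observations."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, obs):
        # At the start of an episode, fill the window with the first observation.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(obs)
        return self.state()

    def step(self, obs):
        # Append the newest observation; the oldest one is dropped automatically.
        self.frames.append(obs)
        return self.state()

    def state(self):
        # Concatenate the last k observations into one vector that the
        # Q-function treats as the "current state".
        return np.concatenate(list(self.frames), axis=0)
```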
In the report Deep Reinforcement Learning with POMDPs, the author attempts to use $Q$-learning in a POMDP setting. He suggests representing a function, either $Q(b, a)$ or $Q(h, a)$, where $b$ is the "belief" over the states and $h$ is the history of previously executed actions, with neural networks. The resulting parameterized functions are then denoted by $Q(b, a; \theta)$ or $Q(h, a; \theta)$, where $\theta$ is a vector of parameters of the corresponding neural network. Essentially, the author uses a DQN (with an experience replay buffer and a target network), but the results are not great: the $Q$ values converge, but the policies do not, and they are not robust (in the sense that they are sensitive to small perturbations).
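For intuition, one common way to parameterize something like $Q(h, a; \theta)$ (in the spirit of the idea above, but not necessarily the architecture used in the report) is to compress the history with a recurrent network. A minimal PyTorch sketch, with all layer sizes and the choice of inputs being assumptions:

```python
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    """Sketch of Q(h, a; theta): an LSTM summarizes the history of
    (observation, previous action) pairs, and a linear head outputs one
    Q-value per action. Sizes are illustrative, not from the report."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim + n_actions, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, prev_action_seq, hidden_state=None):
        # obs_seq: (batch, time, obs_dim)
        # prev_action_seq: (batch, time, n_actions), one-hot encoded
        x = torch.cat([obs_seq, prev_action_seq], dim=-1)
        out, hidden_state = self.lstm(x, hidden_state)
        q_values = self.head(out)  # (batch, time, n_actions)
        return q_values, hidden_state
```

The recurrent hidden state plays the role of a learned summary of the history, which is one way to sidestep the fact that the true state is never observed.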