
I was following a reinforcement learning course on Coursera, and in this video at 2:57 the instructor says:

Expected SARSA and SARSA both allow us to learn an optimal $\epsilon$-soft policy, but, Q-learning does not

From what I understand, SARSA and Q-learning both give us an estimate of the optimal action-value function. SARSA does this on-policy, following, for example, an $\epsilon$-greedy policy, whereas the action-values learned by Q-learning are those of a deterministic policy, which is always greedy.

But, can't we use these action values generated by the Q-learning algorithm to form an $\epsilon$-greedy policy?

We can, for instance, in each state assign probability $1 - \epsilon + \frac{\epsilon}{|\mathcal{A}|}$ to the action with the greatest action-value and probability $\frac{\epsilon}{|\mathcal{A}|}$ to each of the remaining actions, where $|\mathcal{A}|$ is the number of actions. We do a similar thing with SARSA, where we derive the policy from the current estimate of the action-values after each update.
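To make the construction above concrete, here is a minimal sketch of deriving an $\epsilon$-greedy policy from a table of action-values (the function name and the use of NumPy are my own choices, not from the course):

```python
import numpy as np

def epsilon_greedy_policy(Q, epsilon):
    """Derive an epsilon-greedy policy, as a table of action probabilities,
    from action-values Q of shape (n_states, n_actions)."""
    n_states, n_actions = Q.shape
    # every action gets the exploration probability epsilon / |A|
    policy = np.full((n_states, n_actions), epsilon / n_actions)
    # the greedy action additionally gets the remaining 1 - epsilon
    greedy = Q.argmax(axis=1)
    policy[np.arange(n_states), greedy] += 1.0 - epsilon
    return policy

Q = np.array([[1.0, 0.5],
              [0.2, 0.9]])
pi = epsilon_greedy_policy(Q, epsilon=0.1)
# each row sums to 1; the greedy action gets 1 - eps + eps/|A| = 0.95
```

Ties in `argmax` go to the first maximal action here; a fully stochastic variant would split the remaining probability among all maximisers.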

nbro

1 Answer


If we assume a tabular setting, then Q-learning converges to the optimal state-action value function, from which an optimal policy can be derived, provided a few conditions are met.
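For reference, the tabular update that this convergence result concerns can be sketched as follows (the signature and constants are illustrative, not taken from any particular source):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update. Off-policy: it bootstraps on the
    greedy (max) action-value in the next state, regardless of which
    action the behaviour policy actually takes there."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((2, 2))
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1)
# Q[0, 0] is now 0.1 * (1.0 + 0.99 * 0 - 0) = 0.1
```

The conditions mentioned above include, among others, visiting every state-action pair infinitely often and appropriately decaying the learning rate `alpha`.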

In finite MDPs, there is at least one optimal (stationary) deterministic policy, but there can also be optimal stochastic policies: specifically, if two or more actions have the same optimal value, then you can stochastically choose between them. However, if you stochastically choose between all actions (including non-optimal ones), then you will not behave optimally.

SARSA also converges to the optimal state-action value function, but only if the learning policy eventually becomes greedy (for example, by decaying $\epsilon$ towards zero). See this post and theorem 1 (p. 294) of this paper.
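The contrast with Q-learning, and the "eventually greedy" condition, can be sketched like this (again with illustrative names; the decay schedule is just one common choice, not the only one the theorem admits):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular SARSA update. On-policy: it bootstraps on the value of
    a_next, the action actually selected by the current learning policy,
    rather than on the max over actions as Q-learning does."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def epsilon_at(episode, eps0=1.0):
    """An illustrative decay schedule: epsilon -> 0 as episodes grow, so
    the epsilon-greedy learning policy becomes greedy in the limit,
    which is the condition the convergence theorem requires."""
    return eps0 / (episode + 1)
```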

So, even with SARSA, if you want to behave optimally, you can't just arbitrarily choose any stochastic policy derived from this found optimal value function (note also that SARSA is on-policy). However, SARSA can also find a policy that is optimal within a restricted class of policies, such as the $\epsilon$-soft ones. See theorem 2 (p. 297) of this paper for more details.

To answer your question directly, Q-learning can find an optimal stochastic policy, provided that it exists.
