
If I run Sarsa with an epsilon-greedy behaviour policy (keeping epsilon fixed across episodes) and then form a greedy policy from the learned Q-values, will that policy converge to the optimal solution of the MDP? I am observing that sometimes it does not. Specifically, for Example 6.5 in Reinforcement Learning: An Introduction by Sutton and Barto, when I run Sarsa for 500 episodes and then act greedily (no exploration after 500 episodes) with respect to the Q-values of the epsilon-greedy policy, different runs give different results. Is the reason for the suboptimal solution that the Q-values converge to estimates of the next state-action value, which can be random under epsilon-greedy (with no epsilon decay)?
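For reference, here is a minimal sketch of what I am doing (my own simplified reimplementation of the windy gridworld from Example 6.5; the hyperparameters $\alpha = 0.5$, $\epsilon = 0.1$ and the helper names are my own choices):

```python
# Tabular Sarsa with a fixed epsilon on the windy gridworld of Example 6.5,
# followed by reading a purely greedy policy off the learned Q-table.
import numpy as np

ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]          # upward wind strength per column
START, GOAL = (3, 0), (3, 7)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

def step(state, a):
    """One transition; reward is -1 per step until the goal is reached."""
    r, c = state
    dr, dc = ACTIONS[a]
    new_r = min(max(r + dr - WIND[c], 0), ROWS - 1)  # wind pushes the agent up
    new_c = min(max(c + dc, 0), COLS - 1)
    return (new_r, new_c), -1

def epsilon_greedy(Q, s, eps, rng):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax(Q[s]))

def sarsa(episodes=500, alpha=0.5, eps=0.1, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((ROWS, COLS, len(ACTIONS)))
    for _ in range(episodes):
        s = START
        a = epsilon_greedy(Q, s, eps, rng)
        while s != GOAL:
            s2, reward = step(s, a)
            a2 = epsilon_greedy(Q, s2, eps, rng)
            # Sarsa update: the target uses the action actually taken next
            Q[s][a] += alpha * (reward + gamma * Q[s2][a2] - Q[s][a])
            s, a = s2, a2
    return Q

Q = sarsa(seed=0)
greedy_policy = np.argmax(Q, axis=-1)   # greedy policy read off the Q-table
```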

Here are the two greedy policies (derived from the epsilon-greedy policy's Q-values after 500 episodes) from two different runs of Sarsa:

Policy 1 Policy 2

1 Answer


In general, the on-policy Sarsa algorithm with your epsilon-greedy behaviour policy is the action-value version of the TD(0) algorithm, so it has the same convergence properties as TD(0). This is confirmed in the same section of the reference you cite:

The convergence properties of the Sarsa algorithm depend on the nature of the policy’s dependence on $Q$. For example, one could use $\epsilon$-greedy or $\epsilon$-soft policies. Sarsa converges with probability 1 to an optimal policy and action-value function, under the usual conditions on the step sizes (2.7), as long as all state–action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy (which can be arranged, for example, with $\epsilon$-greedy policies by setting $\epsilon = 1/t$).
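As a hedged illustration of those conditions, one standard choice is to decay both epsilon and the step size as sketched below (the function names and the per-visit counting convention are my own, not the book's code):

```python
# Schedules that satisfy the quoted conditions:
#  - epsilon_t = 1/t makes the behaviour policy greedy in the limit (GLIE),
#  - alpha_n = 1/n per (state, action) visit meets the step-size
#    conditions (2.7): the sum of alpha_n diverges, the sum of alpha_n^2 is finite.

def epsilon_schedule(t: int) -> float:
    """Exploration rate for episode t (1-indexed); decays to zero."""
    return 1.0 / t

def alpha_schedule(n: int) -> float:
    """Step size for the n-th visit of a given (state, action) pair."""
    return 1.0 / n
```

Inside the Sarsa loop one would keep a per-(state, action) visit counter and compute the step size from it, instead of using a single constant alpha and constant epsilon as in your 500-episode run.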

In practice, your learning algorithm may not strictly satisfy all of the theoretical stochastic-convergence conditions mentioned above: you are likely using a constant learning rate, and in your specific problem not all state–action pairs will be visited an infinite number of times within a mere 500 episodes. You could try increasing the number of episodes, or the epsilon value (for more exploration), to see a better convergence result. Your reference also mentions this practical issue, and notes that it is sometimes even desirable:

[Both convergence conditions are met for the sample-average case, $\alpha_n(a)=\frac{1}{n}$,] but not for the case of constant step-size parameter, $\alpha_n(a)=\alpha$. In the latter case, the second condition is not met, indicating that the estimates never completely converge but continue to vary in response to the most recently received rewards. As we mentioned above, this is actually desirable in a nonstationary environment, and problems that are effectively nonstationary are the most common in reinforcement learning. In addition, sequences of step-size parameters that meet the conditions (2.7) often converge very slowly or need considerable tuning in order to obtain a satisfactory convergence rate. Although sequences of step-size parameters that meet these convergence conditions are often used in theoretical work, they are seldom used in applications and empirical research.
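To make the constant-step-size point concrete, here is a small illustrative experiment of my own (not from the book): estimating the value of a single action with noisy rewards, once with $\alpha_n = 1/n$ and once with a constant $\alpha = 0.1$:

```python
# With alpha_n = 1/n (which satisfies conditions 2.7) the estimate settles
# down to the true value; with a constant alpha it keeps fluctuating around it.
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
n_steps = 100_000

q_decay, q_const = 0.0, 0.0
alpha_const = 0.1
for n in range(1, n_steps + 1):
    reward = true_value + rng.normal(scale=1.0)   # noisy reward sample
    q_decay += (1.0 / n) * (reward - q_decay)     # 1/n step size: sample average
    q_const += alpha_const * (reward - q_const)   # constant step size: second condition in (2.7) fails

print(f"1/n step size : {q_decay:.4f}")   # close to 1.0
print(f"constant 0.1  : {q_const:.4f}")   # still varies with recent rewards
```

With the constant step size the estimate keeps tracking the most recent rewards, which is exactly the behaviour the quote describes as desirable for nonstationary problems but which prevents exact convergence.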

cinch