
Is it correct that, for SARSA to converge to the optimal value function (and policy), the following conditions must hold?

  1. The learning rate parameter $\alpha$ must satisfy the conditions $$\sum_{k} \alpha_{n^k(s,a)} = \infty \quad \text{and} \quad \sum_{k} \alpha_{n^k(s,a)}^{2} < \infty \quad \forall (s,a) \in \mathcal{S} \times \mathcal{A},$$ where $n^k(s,a)$ denotes the $k^\text{th}$ time $(s,a)$ is visited (a concrete schedule satisfying these conditions is shown below).

  2. $\epsilon$ (of the $\epsilon$-greedy policy) must be decayed so that the policy converges to a greedy policy.

  3. Every state-action pair is visited infinitely many times.

Are any of these conditions redundant?
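As a concrete illustration of the first condition, one standard choice is a step size of $1/k$ on the $k^\text{th}$ visit to $(s,a)$, i.e. $\alpha_{n^k(s,a)} = 1/k$, which satisfies both requirements, since $$\sum_{k=1}^{\infty} \frac{1}{k} = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \frac{1}{k^{2}} = \frac{\pi^{2}}{6} < \infty.$$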


2 Answers


The paper Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms by Satinder Singh et al. proves that SARSA(0), with a tabular representation of the value function, converges to the optimal value function, provided that certain assumptions are met:

  1. Infinite visits to every state-action pair
  2. The learning policy becomes greedy in the limit

These properties are stated more formally in lemma 1 (page 7 of the pdf) and theorem 1 (page 8). Note that the Robbins–Monro conditions, because they are stated with step sizes indexed by the visits to each $(s,a)$ pair, can only hold if every state-action pair is visited infinitely often, so that requirement is already implied by condition 1 of the question.
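To make the two conditions concrete, here is a minimal sketch (my own illustration, not code from the paper) of tabular SARSA(0) on a small deterministic chain MDP. The step size for each $(s,a)$ pair is $1/k$ on its $k^\text{th}$ visit, which satisfies the Robbins–Monro conditions, and $\epsilon$ decays per episode so that the learning policy becomes greedy in the limit while every action keeps positive probability. The environment, schedules and hyperparameters are all illustrative choices.

```python
import numpy as np

# Toy deterministic chain MDP (an illustrative choice, not from the paper):
# states 0..4, actions 0 = left, 1 = right, reward 1 on reaching state 4.
N_STATES, N_ACTIONS, TERMINAL, GAMMA = 5, 2, 4, 0.99

def step(state, action):
    """Deterministic transition: action 1 moves right, action 0 moves left."""
    next_state = min(state + 1, TERMINAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward, next_state == TERMINAL

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))
visits = np.zeros((N_STATES, N_ACTIONS))  # counts the k-th visit to each (s, a)

def epsilon_greedy(state, eps):
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[state]))

for episode in range(1, 5001):
    eps = 1.0 / episode  # decays to zero, so the learning policy is greedy in the limit
    state = 0
    action = epsilon_greedy(state, eps)
    done = False
    while not done:
        next_state, reward, done = step(state, action)
        next_action = epsilon_greedy(next_state, eps)  # on-policy: next action from the same policy
        visits[state, action] += 1
        # Step size 1/k on the k-th visit: sum(1/k) diverges, sum(1/k^2) converges.
        alpha = 1.0 / visits[state, action]
        target = reward + (0.0 if done else GAMMA * Q[next_state, next_action])
        Q[state, action] += alpha * (target - Q[state, action])
        state, action = next_state, next_action

print(Q)  # Q[s, 1] should approach GAMMA ** (TERMINAL - 1 - s) for s < TERMINAL
```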


The conditions for convergence can be found in these notes: SARSA convergence, by Nahum Shimkin.

  1. The Robbins–Monro conditions above hold for $\alpha_t$.

  2. Every state-action pair is visited infinitely often.

  3. In the limit, the policy becomes greedy with respect to the learned $Q$ function.

  4. The controlled Markov chain is communicating: every state can be reached from any other with positive probability (under some policy); see the sketch after this list.

  5. $\operatorname{Var}[R(s, a)] < \infty$, where $R$ is the reward function.
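Regarding condition 4: for a small MDP whose transition kernel is known explicitly, the communicating property can be checked mechanically. The sketch below is my own illustration (it does not appear in Shimkin's notes); it stores the kernel as an array `P[s, a, s']` and tests whether the directed graph that has an edge from `s` to `s'` whenever some action reaches `s'` with positive probability is strongly connected.

```python
import numpy as np

def is_communicating(P, tol=0.0):
    """Return True if every state can reach every other state under some policy.

    P has shape (n_states, n_actions, n_states), with P[s, a, s'] the
    probability of moving from s to s' when taking action a. The chain is
    communicating iff the directed graph with an edge s -> s' whenever some
    action gives positive transition probability is strongly connected.
    """
    n_states = P.shape[0]
    # Edge s -> s' exists if any action reaches s' with positive probability.
    adj = P.max(axis=1) > tol

    def reachable_from(s):
        seen = {s}
        stack = [s]
        while stack:
            u = stack.pop()
            for v in map(int, np.flatnonzero(adj[u])):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    return all(len(reachable_from(s)) == n_states for s in range(n_states))

# Example: a 3-state MDP where action 0 stays put and action 1 cycles forward.
P = np.zeros((3, 2, 3))
for s in range(3):
    P[s, 0, s] = 1.0            # action 0: stay in place
    P[s, 1, (s + 1) % 3] = 1.0  # action 1: move to the next state (cyclically)
print(is_communicating(P))  # True: every state reaches every other via action 1
```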
