
Is it correct that, for SARSA to converge to the optimal value function (and policy), the following conditions must hold?

  1. The learning rate parameter $\alpha$ must satisfy the conditions $$\sum_{k} \alpha_{n^k(s,a)} = \infty \quad \text{and} \quad \sum_{k} \alpha_{n^k(s,a)}^{2} < \infty \quad \forall (s,a) \in \mathcal{S} \times \mathcal{A},$$ where $n^k(s,a)$ denotes the $k^\text{th}$ time $(s,a)$ is visited (a concrete schedule satisfying these conditions is shown below).

  2. $\epsilon$ (of the $\epsilon$-greedy policy) must be decayed so that the policy converges to a greedy policy.

  3. Every state-action pair is visited infinitely many times.

Are any of these conditions redundant?
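As a concrete illustration of the first condition, one standard choice is a step size of $1/k$ on the $k^\text{th}$ visit to $(s,a)$, i.e. $\alpha_{n^k(s,a)} = 1/k$, which satisfies both requirements, since $$\sum_{k=1}^{\infty} \frac{1}{k} = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \frac{1}{k^{2}} = \frac{\pi^{2}}{6} < \infty.$$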


2 Answers


The paper Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms by Satinder Singh et al. proves that SARSA(0), with a tabular representation of the value function, converges to the optimal value function, provided that certain assumptions are met:

  1. Infinite visits to every state-action pair
  2. The learning policy becomes greedy in the limit

These properties are stated more formally in lemma 1 (page 7 of the pdf) and theorem 1 (page 8). Note that the Robbins–Monro conditions, because they are stated with step sizes indexed by the visits to each $(s,a)$ pair, can only hold if every state-action pair is visited infinitely often, so that requirement is already implied by condition 1 of the question.
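To make the two conditions concrete, here is a minimal sketch (my own illustration, not code from the paper) of tabular SARSA(0) on a small deterministic chain MDP. The step size for each $(s,a)$ pair is $1/k$ on its $k^\text{th}$ visit, which satisfies the Robbins–Monro conditions, and $\epsilon$ decays per episode so that the learning policy becomes greedy in the limit while every action keeps positive probability. The environment, schedules and hyperparameters are all illustrative choices.

```python
import numpy as np

# Toy deterministic chain MDP (an illustrative choice, not from the paper):
# states 0..4, actions 0 = left, 1 = right, reward 1 on reaching state 4.
N_STATES, N_ACTIONS, TERMINAL, GAMMA = 5, 2, 4, 0.99

def step(state, action):
    """Deterministic transition: action 1 moves right, action 0 moves left."""
    next_state = min(state + 1, TERMINAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward, next_state == TERMINAL

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))
visits = np.zeros((N_STATES, N_ACTIONS))  # counts the k-th visit to each (s, a)

def epsilon_greedy(state, eps):
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[state]))

for episode in range(1, 5001):
    eps = 1.0 / episode  # decays to zero, so the learning policy is greedy in the limit
    state = 0
    action = epsilon_greedy(state, eps)
    done = False
    while not done:
        next_state, reward, done = step(state, action)
        next_action = epsilon_greedy(next_state, eps)  # on-policy: next action from the same policy
        visits[state, action] += 1
        # Step size 1/k on the k-th visit: sum(1/k) diverges, sum(1/k^2) converges.
        alpha = 1.0 / visits[state, action]
        target = reward + (0.0 if done else GAMMA * Q[next_state, next_action])
        Q[state, action] += alpha * (target - Q[state, action])
        state, action = next_state, next_action

print(Q)  # Q[s, 1] should approach GAMMA ** (TERMINAL - 1 - s) for s < TERMINAL
```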


The conditions for convergence can be found in these notes: SARSA convergence, by Nahum Shimkin.

  1. The Robbins–Monro conditions above hold for $\alpha_t$.

  2. Every state-action pair is visited infinitely often.

  3. In the limit, the policy becomes greedy with respect to the learned $Q$ function.

  4. The controlled Markov chain is communicating: every state can be reached from any other with positive probability (under some policy); see the sketch after this list.

  5. $\operatorname{Var}[R(s, a)] < \infty$, where $R$ is the reward function.
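Regarding condition 4: for a small MDP whose transition kernel is known explicitly, the communicating property can be checked mechanically. The sketch below is my own illustration (it does not appear in Shimkin's notes); it stores the kernel as an array `P[s, a, s']` and tests whether the directed graph that has an edge from `s` to `s'` whenever some action reaches `s'` with positive probability is strongly connected.

```python
import numpy as np

def is_communicating(P, tol=0.0):
    """Return True if every state can reach every other state under some policy.

    P has shape (n_states, n_actions, n_states), with P[s, a, s'] the
    probability of moving from s to s' when taking action a. The chain is
    communicating iff the directed graph with an edge s -> s' whenever some
    action gives positive transition probability is strongly connected.
    """
    n_states = P.shape[0]
    # Edge s -> s' exists if any action reaches s' with positive probability.
    adj = P.max(axis=1) > tol

    def reachable_from(s):
        seen = {s}
        stack = [s]
        while stack:
            u = stack.pop()
            for v in map(int, np.flatnonzero(adj[u])):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    return all(len(reachable_from(s)) == n_states for s in range(n_states))

# Example: a 3-state MDP where action 0 stays put and action 1 cycles forward.
P = np.zeros((3, 2, 3))
for s in range(3):
    P[s, 0, s] = 1.0            # action 0: stay in place
    P[s, 1, (s + 1) % 3] = 1.0  # action 1: move to the next state (cyclically)
print(is_communicating(P))  # True: every state reaches every other via action 1
```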
