
I have some difficulty understanding the difference between Q-learning and SARSA. Here (What are the differences between SARSA and Q-learning?) the following update rules are given:

Q-Learning

$$Q(s,a) = Q(s,a) + \alpha (R_{t+1} + \gamma \max_{a'}Q(s',a') - Q(s,a))$$

SARSA

$$Q(s,a) = Q(s,a) + \alpha (R_{t+1} + \gamma Q(s',a') - Q(s,a))$$

I know that SARSA is on-policy while Q-learning is off-policy. So, in Q-learning, an epsilon-greedy policy (or epsilon-soft or softmax policy) is used for selecting actions, while the greedy policy is used in the update target for the Q-values. In SARSA, the same epsilon-greedy (or epsilon-soft or softmax) policy is used both for selecting actions and in the update target.
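To make that distinction concrete, here is a minimal Python sketch of the two update targets. The representation of `Q` as a dictionary keyed by (state, action) pairs and the `epsilon_greedy` helper are illustrative assumptions, not part of either algorithm's specification:

```python
from collections import defaultdict
import random

Q = defaultdict(float)  # Q-values default to 0.0 for unseen (state, action) pairs

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Behaviour policy: random action with probability epsilon, greedy action otherwise."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_target(Q, reward, next_state, actions, gamma=0.99):
    # Off-policy: bootstrap on the greedy (max) action, regardless of which action is taken next.
    return reward + gamma * max(Q[(next_state, a)] for a in actions)

def sarsa_target(Q, reward, next_state, next_action, gamma=0.99):
    # On-policy: bootstrap on the action the behaviour policy actually selected for the next step.
    return reward + gamma * Q[(next_state, next_action)]
```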

So, actually, I have a question about that:

On this website (https://www.cse.unsw.edu.au/~cs9417ml/RL1/algorithms.html) it is written for SARSA:

As you can see, there are two action selection steps needed, for determining the next state-action pair along with the first.

What is meant by two action selections? Normally you can only take one action per iteration, so the other "selection" should just be the one used for the update.

PeterBe

1 Answer


In my view, the best way to understand these algorithms is to read the pseudocode (multiple times, if necessary!).

Here's the pseudocode of Q-learning.

[image: Q-learning pseudocode]
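A rough Python rendering of that loop, under illustrative assumptions (the `env.reset()`/`env.step()` interface, `num_episodes`, `alpha`, `gamma`, and an `epsilon_greedy` helper like the one sketched in the question are placeholders, not the exact pseudocode shown in the image):

```python
# Q-learning: select (and take) one epsilon-greedy action per step,
# but bootstrap the update on the greedy (max) action in the next state.
for _ in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        action = epsilon_greedy(Q, state, actions)
        next_state, reward, done = env.step(action)
        target = reward + (0.0 if done else gamma * max(Q[(next_state, a)] for a in actions))
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state
```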

Here's the pseudocode of SARSA.

[image: SARSA pseudocode]
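And a matching sketch of SARSA under the same illustrative assumptions. Notice that one action is selected before the inner loop starts and that, at each step, the next action is selected and used in the update before it is actually taken:

```python
# SARSA: one action is selected before the inner loop starts;
# at each step the *next* action is selected, used in the update, and then actually taken.
for _ in range(num_episodes):
    state = env.reset()
    action = epsilon_greedy(Q, state, actions)                  # first selection, before the loop
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        next_action = epsilon_greedy(Q, next_state, actions)    # second selection, used in the update
        target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state, action = next_state, next_action                 # the selected action is taken next step
```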

So, as you can see, in SARSA, we choose one action before the episode starts and, during the episode, we choose (and take) further actions at each step. In both cases, we choose these actions with the same policy (e.g. $\epsilon$-greedy), which is derived from $Q$. In Q-learning, we do not choose an action before the episode starts: we only choose and take an action at each step of the episode (like in SARSA). Hence, in SARSA, we choose actions in two places, but we still take only one action at each step of the episode. Note the difference between choosing/selecting an action and taking an action in the environment: you may choose an action just to update the Q-function, i.e. without taking it in the environment!

nbro