
Sutton & Barto, page 132:

[Figure: performance of Sarsa and Q-learning on the cliff-walking task]

> The graph to the right shows the performance of the Sarsa and Q-learning methods with $\epsilon$-greedy action selection, $\epsilon = 0.1$. After an initial transient, Q-learning learns values for the optimal policy, that which travels right along the edge of the cliff. Unfortunately, this results in its occasionally falling off the cliff because of the $\epsilon$-greedy action selection. Sarsa, on the other hand, takes the action selection into account and learns the longer but safer path through the upper part of the grid.

SARSA also uses $\epsilon$-greedy action selection, so by the same reasoning it should occasionally fall off the cliff, just like Q-learning. I do not understand why this (falling off the cliff) does not happen with SARSA.
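For reference, the standard tabular updates from the book differ only in the bootstrap term of the TD target:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big] \quad \text{(Sarsa)}$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\big] \quad \text{(Q-learning)}$$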

DSPinfinity

1 Answer


I think the point here is that Q-learning learns the optimal value function, whose greedy policy travels right along the edge of the cliff. But during learning you still behave with the $\epsilon$-greedy policy derived from that (near-)optimal value function, so the agent occasionally takes an exploratory step off the cliff. SARSA, being on-policy, evaluates the $\epsilon$-greedy policy it actually follows: its update target uses the next action it really takes, so the states next to the cliff receive lower values (an exploratory action there sometimes falls off), and the policy it learns prefers the longer, safer path through the upper part of the grid. That is why SARSA falls off the cliff less often during learning, even though it also explores. If we stopped learning and exploring, and Q-learning had found the optimal policy, the agent would always act greedily and would not fall off the cliff (assuming the optimal policy itself doesn't do that).
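To make the difference concrete, here is a minimal tabular sketch of the cliff-walking gridworld (4x12 grid, $-1$ per step, $-100$ and a return to the start for stepping into the cliff, roughly as in Example 6.6). The environment code, function names, and hyperparameters are my own illustrative assumptions, not the book's; the only line that differs between the two algorithms is the TD target.

```python
import numpy as np

# Minimal cliff-walking sketch: 4x12 grid, start bottom-left, goal bottom-right.
# Layout, names, and hyperparameters here are illustrative assumptions.
ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, a):
    """One transition: -1 per move, -100 and back to START for the cliff."""
    r, c = state
    dr, dc = ACTIONS[a]
    r = min(max(r + dr, 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    if r == 3 and 0 < c < COLS - 1:          # stepped into the cliff
        return START, -100.0, False
    return (r, c), -1.0, (r, c) == GOAL

def eps_greedy(Q, s, eps, rng):
    """Behaviour policy used by BOTH algorithms during learning."""
    if rng.random() < eps:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax(Q[s]))

def run(algo="sarsa", episodes=500, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = {(r, c): np.zeros(len(ACTIONS)) for r in range(ROWS) for c in range(COLS)}
    for _ in range(episodes):
        s, a = START, eps_greedy(Q, START, eps, rng)
        done = False
        while not done:
            s2, reward, done = step(s, a)
            a2 = eps_greedy(Q, s2, eps, rng)            # action actually taken next
            if done:
                target = reward
            elif algo == "sarsa":
                target = reward + gamma * Q[s2][a2]     # on-policy bootstrap
            else:
                target = reward + gamma * np.max(Q[s2]) # off-policy (greedy) bootstrap
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s2, a2
    return Q
```

Comparing the greedy paths extracted from `run("sarsa")` and `run("q")` should reproduce the qualitative picture in the figure: because SARSA's target uses $Q(S', A')$ for the $\epsilon$-greedy action it actually takes next, the occasional $-100$ from an exploratory step leaks into the values of the edge states and its greedy path keeps a safety margin, while Q-learning's $\max_a Q(S', a)$ target never includes those exploratory falls, so it converges to the edge-hugging path and only suffers the falls at behaviour time.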

nbro