Sutton-Barto, page 132:
The graph to the right shows the performance of the Sarsa and Q-learning methods with $\epsilon$-greedy action selection, $\epsilon = 0.1$. After an initial transient, Q-learning learns values for the optimal policy, that which travels right along the edge of the cliff. Unfortunately, this results in its occasionally falling off the cliff because of the $\epsilon$-greedy action selection. Sarsa, on the other hand, takes the action selection into account and learns the longer but safer path through the upper part of the grid.
SARSA also uses $\epsilon$-greedy action selection, so by the same reasoning it should occasionally fall off the cliff as well. I do not understand why this (falling off the cliff) does not happen in SARSA.
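
For reference, the tabular update rules being compared are

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right] \qquad \text{(Sarsa)}$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right] \qquad \text{(Q-learning)}$$

where in Sarsa the bootstrap action $A_{t+1}$ is the one actually chosen by the $\epsilon$-greedy behaviour policy, whereas Q-learning's target uses the greedy maximum regardless of which action is taken next.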
