Questions tagged [sarsa]
For questions related to SARSA, an on-policy reinforcement learning algorithm whose name comes from the quintuple (s, a, r, s', a') used in its update.
46 questions
11
votes
2 answers
Are Q-learning and SARSA the same when action selection is greedy?
I'm currently studying reinforcement learning and I'm having difficulty with Exercise 6.12 in Sutton and Barto's book.
Suppose action selection is greedy. Is Q-learning then exactly the same algorithm as SARSA? Will they make exactly the same…
hyuj
- 131
- 4
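A minimal tabular sketch (hypothetical numbers, not from the question) of the two update targets. When the next action is selected greedily with respect to the current Q-table, the sampled $a'$ in SARSA is exactly $\arg\max_a Q(s',a)$, so the two targets coincide for that table; whether the two algorithms then make identical action selections and updates over a whole run is the subtle part of the exercise.

```python
import numpy as np

gamma, alpha = 0.99, 0.1
Q = np.zeros((5, 3))  # small hypothetical state-action table

def greedy(Q, s):
    return int(np.argmax(Q[s]))

def q_learning_target(Q, r, s_next):
    return r + gamma * np.max(Q[s_next])      # bootstrap on the max over actions

def sarsa_target(Q, r, s_next, a_next):
    return r + gamma * Q[s_next, a_next]      # bootstrap on the sampled next action

s, a, r, s_next = 0, 1, 1.0, 2
a_next = greedy(Q, s_next)                    # greedy action selection
assert np.isclose(q_learning_target(Q, r, s_next),
                  sarsa_target(Q, r, s_next, a_next))
Q[s, a] += alpha * (sarsa_target(Q, r, s_next, a_next) - Q[s, a])
```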
10
votes
1 answer
Can Q-learning be used in a POMDP?
Can Q-learning (and SARSA) be directly used in a Partially Observable Markov Decision Process (POMDP)? If not, why not? My intuition is that the policies learned will be terrible because of partial observability. Are there ways to transform these…
drerD
- 298
- 2
- 6
8
votes
2 answers
How should I handle action selection in the terminal state when implementing SARSA?
I recently started learning about reinforcement learning. Currently, I am trying to implement the SARSA algorithm. However, I do not know how to deal with $Q(s', a')$ when $s'$ is the terminal state. First, there is no action to choose from in this…
Hai Nguyen
- 572
- 5
- 14
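A common way to handle this, shown as a minimal sketch below (hypothetical table sizes), is to drop the bootstrap term in the terminal state, i.e. treat $Q(s', a')$ as 0, so the target is just the reward and no action needs to be selected in the terminal state at all.

```python
import numpy as np

gamma, alpha, eps = 0.99, 0.1, 0.1
Q = np.zeros((16, 4))                         # e.g. a small grid world
rng = np.random.default_rng(0)

def eps_greedy(Q, s):
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))  # explore
    return int(np.argmax(Q[s]))               # exploit

def sarsa_update(Q, s, a, r, s_next, a_next, terminal):
    # When s' is terminal the episode ends, so the bootstrap term is dropped.
    target = r if terminal else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```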
7
votes
2 answers
In the n-step off-policy SARSA update, why do we multiply the entire update by $\rho$?
On page 149 of Sutton & Barto's book (2nd ed.), there is equation 7.11.
I am having a hard time understanding this equation.
I would have thought that we should be moving $Q$ towards $G$, where $G$ would be corrected by importance sampling, but only…
Antoine Savine
- 173
- 4
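For reference, the update in question has roughly this form (my reconstruction of Eq. 7.11 and the importance-sampling ratio defined just before it; worth double-checking the exact indices against the book):

$$
Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha\, \rho_{t+1:t+n}\,\big[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)\big],
\qquad
\rho_{t:h} \doteq \prod_{k=t}^{\min(h,\,T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.
$$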
6
votes
1 answer
Is Expected SARSA an off-policy or on-policy algorithm?
I understand that SARSA is an On-policy algorithm, and Q-learning an off-policy one.
Sutton and Barto's textbook describes Expected Sarsa as follows:
In these cliff walking results Expected Sarsa was used on-policy, but
in general it might use a…
Y. Xu
- 63
- 1
- 4
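For context, the Expected Sarsa target averages over the target policy $\pi$ instead of sampling the next action; when $\pi$ is the greedy policy this reduces to the Q-learning target, which is why the on-policy/off-policy classification depends on how $\pi$ relates to the behaviour policy that generates the data:

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \Big].
$$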
5
votes
1 answer
Expected SARSA vs SARSA in "RL: An Introduction"
In the 2018 edition of "Reinforcement Learning: An Introduction", in the context of Expected SARSA (p. 133), Sutton and Barto state the following:
Expected SARSA is more complex computationally than Sarsa but, in return, it eliminates the…
F.M.F.
- 321
- 3
- 7
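A minimal sketch (assuming an epsilon-greedy target policy) of where the extra computation comes from: Expected Sarsa sums over every action at each step, in exchange for removing the variance introduced by sampling $a'$.

```python
import numpy as np

gamma, eps = 0.99, 0.1

def sarsa_target(Q, r, s_next, a_next):
    return r + gamma * Q[s_next, a_next]                 # one table lookup

def expected_sarsa_target(Q, r, s_next):
    n_actions = Q.shape[1]
    probs = np.full(n_actions, eps / n_actions)          # epsilon-greedy probabilities
    probs[np.argmax(Q[s_next])] += 1.0 - eps
    return r + gamma * float(np.dot(probs, Q[s_next]))   # sum over all actions
```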
4
votes
2 answers
In the frozen lake environment of Gymnasium, why aren't the holes negatively rewarded?
In this given map, for example, the agent needs to perform the downward action twice to reach the reward. Considering that initially all actions are equally likely, the probability of reaching the reward is really low. If the agent never encounters…
DeadAsDuck
- 103
- 6
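If one wants to experiment with penalising the holes anyway, a small wrapper is one way to do it; this is a sketch assuming the standard FrozenLake-v1 rewards (1 for reaching the goal, 0 otherwise), where an episode that terminates with reward 0 means the agent fell into a hole.

```python
import gymnasium as gym

class HolePenaltyWrapper(gym.Wrapper):
    """Return -1 when an episode terminates without reaching the goal (a hole)."""
    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if terminated and reward == 0:
            reward = -1.0
        return obs, reward, terminated, truncated, info

env = HolePenaltyWrapper(gym.make("FrozenLake-v1", is_slippery=False))
```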
4
votes
2 answers
Is the optimal policy the one with the highest cumulative reward (Q-learning vs SARSA)?
I was looking at the following diagram,
The reward obtained with SARSA is higher. However, the path that Q-learning chooses is eventually the optimal one, isn't it? Why is the SARSA reward higher if it is not choosing the best path? Shouldn't the…
Pulse9
- 282
- 1
- 7
4
votes
1 answer
How should I generate datasets for a SARSA agent when the environment is not simple?
I am currently working on my master's thesis and am going to apply Deep SARSA as my DRL algorithm. The problem is that there are no datasets available, and I guess that I should generate them somehow. Dataset generation seems to be a common feature in this…
Shahin
- 153
- 4
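Because SARSA is on-policy, the "dataset" is normally generated by the agent itself while it interacts with the environment, rather than prepared in advance; a minimal interaction loop (with hypothetical `env` and `agent` objects following the Gymnasium step convention) looks like this:

```python
def run_episode(env, agent):
    """Collect one episode of on-policy transitions, updating the agent as we go."""
    transitions = []
    s, _ = env.reset()
    a = agent.act(s)
    done = False
    while not done:
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        a_next = agent.act(s_next) if not done else None
        transitions.append((s, a, r, s_next, a_next, done))
        agent.update(s, a, r, s_next, a_next, done)   # Deep-SARSA update on fresh data
        s, a = s_next, a_next
    return transitions
```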
4
votes
1 answer
When do SARSA and Q-Learning converge to optimal Q values?
Here's another interesting multiple-choice question that puzzles me a bit.
In tabular MDPs, if using a decision policy that visits all states an infinite number of times, and in each state, randomly selects an action, then:
Q-learning will…
stoic-santiago
- 1,201
- 9
- 22
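For reference, the standard tabular results require every state-action pair to be updated infinitely often with step sizes satisfying the Robbins-Monro conditions; under those conditions Q-learning converges to $q_*$ for any sufficiently exploratory behaviour policy, while SARSA additionally needs the policy to become greedy in the limit (GLIE):

$$
\sum_{t} \alpha_t(s,a) = \infty
\qquad \text{and} \qquad
\sum_{t} \alpha_t(s,a)^2 < \infty
\quad \text{for every pair } (s,a).
$$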
3
votes
1 answer
Convergence of an epsilon-greedy policy (with no epsilon decay) using TD learning?
If I create a policy from the Q-values learned by the Sarsa algorithm under an epsilon-greedy policy (without decaying epsilon across episodes), will it converge to the optimal solution of the MDP? I am observing that sometimes it does not. Specifically,…
Prabhjot Singh Rai
- 131
- 3
3
votes
1 answer
Can we also estimate $V_{\pi}$ with SARSA?
For SARSA, I know we can estimate the action value $Q(s,a)$, and the relationship between $V(s)$ and $Q(s,a)$ is $V_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a|s) Q_{\pi}(s,a)$.
So my question is, can we simply estimate $V_{\pi}$ by applying the above…
Dingzhi Hu
- 31
- 1
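A minimal sketch of the weighted sum described in the question, assuming the policy being evaluated is epsilon-greedy with respect to the learned tabular Q:

```python
import numpy as np

def v_from_q(Q, eps=0.1):
    """V_pi(s) = sum_a pi(a|s) * Q_pi(s, a) for an epsilon-greedy policy pi."""
    n_states, n_actions = Q.shape
    V = np.empty(n_states)
    for s in range(n_states):
        probs = np.full(n_actions, eps / n_actions)   # exploration mass
        probs[np.argmax(Q[s])] += 1.0 - eps           # greedy action gets the rest
        V[s] = float(np.dot(probs, Q[s]))
    return V
```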
3
votes
1 answer
When does backward propagation occur in n-step SARSA?
I am trying to understand the algorithm for n-step SARSA from Sutton and Barto (2nd Edition). As I understand it, this algorithm should update n state-action values, but I cannot see where it is propagated backward. Can someone explain to me how…
nehalem
- 131
- 2
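For reference, the return that n-step Sarsa bootstraps on is

$$
G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n}),
$$

and in the book's pseudocode each time step updates only the single pair $(S_\tau, A_\tau)$ with $\tau = t - n + 1$; values propagate backward only indirectly, through these lagged one-pair updates (and the remaining updates performed after the terminal step), not through an explicit backward pass.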
3
votes
1 answer
How to determine if Q-learning has converged in practice?
I am using Q-learning and SARSA to solve a problem. The agent learns to go from the start to the goal without falling in the holes.
At each state, I can choose the action corresponding to the maximum Q value at the state (the greedy action that the…
WANGWANGZI
- 31
- 4
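A common practical heuristic, shown as a minimal sketch below (not a formal criterion), is to track the largest change to any Q-value between checkpoints and stop when it stays below a small threshold; monitoring the greedy policy or the evaluation return over recent episodes serves the same purpose.

```python
import numpy as np

def has_converged(q_old, q_new, tol=1e-4):
    """Heuristic check: no Q-value moved by more than `tol` since the last checkpoint."""
    return float(np.max(np.abs(q_new - q_old))) < tol

# Hypothetical use inside a training loop:
#   q_prev = Q.copy()
#   run_one_episode(Q)          # placeholder for the actual training step
#   if has_converged(q_prev, Q):
#       ...                     # consider training finished
```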
3
votes
1 answer
How are we calculating the average reward ($r(\pi)$) if the policy changes over time?
In the average reward setting, the quality of a policy is defined as:
$$ r(\pi) = \lim_{h\to\infty}\frac{1}{h} \sum_{j=1}^{h}E[R_j] $$
When we reach the steady-state distribution, we can write the above equation as follows:
$$ r(\pi) =…
Swakshar Deb
- 703
- 4
- 12
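For reference, under the steady-state distribution $\mu_\pi$ the same quantity is usually written as (my reconstruction of the book's formula; verify the exact form and conditioning there):

$$
r(\pi) = \sum_{s} \mu_\pi(s) \sum_{a} \pi(a \mid s) \sum_{s',\,r} p(s', r \mid s, a)\, r .
$$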