Questions tagged [sarsa]
For questions related to SARSA, an on-policy reinforcement learning algorithm whose name comes from the quintuple (s, a, r, s', a') used in its update.
46 questions
11
votes
2 answers
Are Q-learning and SARSA the same when action selection is greedy?
I'm currently studying reinforcement learning and I'm having difficulty with Exercise 6.12 in Sutton and Barto's book.
Suppose action selection is greedy. Is Q-learning then exactly the same algorithm as SARSA? Will they make exactly the same…
hyuj
- 131
- 4
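A minimal tabular sketch (hypothetical numbers, not from the question) of the two update targets. When the next action is selected greedily with respect to the current Q-table, the sampled $a'$ in SARSA is exactly $\arg\max_a Q(s',a)$, so the two targets coincide for that table; whether the two algorithms then make identical action selections and updates over a whole run is the subtle part of the exercise.

```python
import numpy as np

gamma, alpha = 0.99, 0.1
Q = np.zeros((5, 3))  # small hypothetical state-action table

def greedy(Q, s):
    return int(np.argmax(Q[s]))

def q_learning_target(Q, r, s_next):
    return r + gamma * np.max(Q[s_next])      # bootstrap on the max over actions

def sarsa_target(Q, r, s_next, a_next):
    return r + gamma * Q[s_next, a_next]      # bootstrap on the sampled next action

s, a, r, s_next = 0, 1, 1.0, 2
a_next = greedy(Q, s_next)                    # greedy action selection
assert np.isclose(q_learning_target(Q, r, s_next),
                  sarsa_target(Q, r, s_next, a_next))
Q[s, a] += alpha * (sarsa_target(Q, r, s_next, a_next) - Q[s, a])
```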
10
votes
1 answer
Can Q-learning be used in a POMDP?
Can Q-learning (and SARSA) be directly used in a Partially Observable Markov Decision Process (POMDP)? If not, why not? My intuition is that the policies learned will be terrible because of partial observability. Are there ways to transform these…
drerD
- 298
- 2
- 6
8
votes
2 answers
How should I handle action selection in the terminal state when implementing SARSA?
I recently started learning about reinforcement learning. Currently, I am trying to implement the SARSA algorithm. However, I do not know how to deal with $Q(s', a')$ when $s'$ is the terminal state. First, there is no action to choose from in this…
Hai Nguyen
- 572
- 5
- 14
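A common way to handle this, shown as a minimal sketch below (hypothetical table sizes), is to drop the bootstrap term in the terminal state, i.e. treat $Q(s', a')$ as 0, so the target is just the reward and no action needs to be selected in the terminal state at all.

```python
import numpy as np

gamma, alpha, eps = 0.99, 0.1, 0.1
Q = np.zeros((16, 4))                         # e.g. a small grid world
rng = np.random.default_rng(0)

def eps_greedy(Q, s):
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))  # explore
    return int(np.argmax(Q[s]))               # exploit

def sarsa_update(Q, s, a, r, s_next, a_next, terminal):
    # When s' is terminal the episode ends, so the bootstrap term is dropped.
    target = r if terminal else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```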
7
votes
2 answers
In the n-step off-policy SARSA update, why do we multiply the entire update by $\rho$?
On page 149 of Sutton & Barto's book (2nd ed.), there is equation 7.11.
I am having a hard time understanding this equation.
I would have thought that we should be moving $Q$ towards $G$, where $G$ would be corrected by importance sampling, but only…
Antoine Savine
- 173
- 4
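For reference, the update in question has roughly this form (my reconstruction of Eq. 7.11 and the importance-sampling ratio defined just before it; worth double-checking the exact indices against the book):

$$
Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha\, \rho_{t+1:t+n}\,\big[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)\big],
\qquad
\rho_{t:h} \doteq \prod_{k=t}^{\min(h,\,T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.
$$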
6
votes
1 answer
Is Expected SARSA an off-policy or on-policy algorithm?
I understand that SARSA is an On-policy algorithm, and Q-learning an off-policy one.
Sutton and Barto's textbook describes Expected Sarsa as follows:
In these cliff walking results Expected Sarsa was used on-policy, but
in general it might use a…
Y. Xu
- 63
- 1
- 4
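For context, the Expected Sarsa target averages over the target policy $\pi$ instead of sampling the next action; when $\pi$ is the greedy policy this reduces to the Q-learning target, which is why the on-policy/off-policy classification depends on how $\pi$ relates to the behaviour policy that generates the data:

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \Big].
$$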
5
votes
1 answer
Expected SARSA vs SARSA in "RL: An Introduction"
In the 2018 edition of "Reinforcement Learning: An Introduction", in the context of Expected SARSA (p. 133), Sutton and Barto state the following:
Expected SARSA is more complex computationally than Sarsa but, in return, it eliminates the…
F.M.F.
- 321
- 3
- 7
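A minimal sketch (assuming an epsilon-greedy target policy) of where the extra computation comes from: Expected Sarsa sums over every action at each step, in exchange for removing the variance introduced by sampling $a'$.

```python
import numpy as np

gamma, eps = 0.99, 0.1

def sarsa_target(Q, r, s_next, a_next):
    return r + gamma * Q[s_next, a_next]                 # one table lookup

def expected_sarsa_target(Q, r, s_next):
    n_actions = Q.shape[1]
    probs = np.full(n_actions, eps / n_actions)          # epsilon-greedy probabilities
    probs[np.argmax(Q[s_next])] += 1.0 - eps
    return r + gamma * float(np.dot(probs, Q[s_next]))   # sum over all actions
```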
4
votes
2 answers
In the frozen lake environment of Gymnasium, why aren't the holes negatively rewarded?
In this given map, for example, the agent needs to perform the downward action twice to reach the reward. Considering that initially all actions are equally likely, the probability of reaching the reward is really low. If the agent never encounters…
DeadAsDuck
- 103
- 6
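If one wants to experiment with penalising the holes anyway, a small wrapper is one way to do it; this is a sketch assuming the standard FrozenLake-v1 rewards (1 for reaching the goal, 0 otherwise), where an episode that terminates with reward 0 means the agent fell into a hole.

```python
import gymnasium as gym

class HolePenaltyWrapper(gym.Wrapper):
    """Return -1 when an episode terminates without reaching the goal (a hole)."""
    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if terminated and reward == 0:
            reward = -1.0
        return obs, reward, terminated, truncated, info

env = HolePenaltyWrapper(gym.make("FrozenLake-v1", is_slippery=False))
```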
4
votes
2 answers
Is the optimal policy the one with the highest cumulative reward (Q-learning vs SARSA)?
I was looking at the following diagram,
The reward obtained with SARSA is higher. However, the path that Q-learning chooses is eventually the optimal one, isn't it? Why is the SARSA reward higher if it is not choosing the best path? Shouldn't the…
Pulse9
- 282
- 1
- 7
4
votes
1 answer
How should I generate datasets for a SARSA agent when the environment is not simple?
I am currently working on my master's thesis and am going to apply Deep SARSA as my DRL algorithm. The problem is that there are no datasets available, and I guess that I should generate them somehow. Dataset generation seems to be a common feature in this…
Shahin
- 153
- 4
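Because SARSA is on-policy, the "dataset" is normally generated by the agent itself while it interacts with the environment, rather than prepared in advance; a minimal interaction loop (with hypothetical `env` and `agent` objects following the Gymnasium step convention) looks like this:

```python
def run_episode(env, agent):
    """Collect one episode of on-policy transitions, updating the agent as we go."""
    transitions = []
    s, _ = env.reset()
    a = agent.act(s)
    done = False
    while not done:
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        a_next = agent.act(s_next) if not done else None
        transitions.append((s, a, r, s_next, a_next, done))
        agent.update(s, a, r, s_next, a_next, done)   # Deep-SARSA update on fresh data
        s, a = s_next, a_next
    return transitions
```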
4
votes
1 answer
When do SARSA and Q-Learning converge to optimal Q values?
Here's another interesting multiple-choice question that puzzles me a bit.
In tabular MDPs, if using a decision policy that visits all states an infinite number of times, and in each state, randomly selects an action, then:
Q-learning will…
stoic-santiago
- 1,201
- 9
- 22
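For reference, the standard tabular results require every state-action pair to be updated infinitely often with step sizes satisfying the Robbins-Monro conditions; under those conditions Q-learning converges to $q_*$ for any sufficiently exploratory behaviour policy, while SARSA additionally needs the policy to become greedy in the limit (GLIE):

$$
\sum_{t} \alpha_t(s,a) = \infty
\qquad \text{and} \qquad
\sum_{t} \alpha_t(s,a)^2 < \infty
\quad \text{for every pair } (s,a).
$$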
3
votes
1 answer
Convergence of an epsilon-greedy policy (with no epsilon decay) using TD learning?
If I create a policy from the Q-values learned by the Sarsa algorithm under an epsilon-greedy policy (without decaying epsilon across episodes), will it converge to the optimal solution of the MDP? I am observing that sometimes it does not. Specifically,…
Prabhjot Singh Rai
- 131
- 3
3
votes
1 answer
Can we also estimate $V_{\pi}$ with SARSA?
For SARSA, I know we can estimate the action value $Q(s,a)$, and the relationship between $V(s)$ and $Q(s,a)$ is $V_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a|s) Q_{\pi}(s,a)$.
So my question is, can we simply estimate $V_{\pi}$ by applying the above…
Dingzhi Hu
- 31
- 1
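A minimal sketch of the weighted sum described in the question, assuming the policy being evaluated is epsilon-greedy with respect to the learned tabular Q:

```python
import numpy as np

def v_from_q(Q, eps=0.1):
    """V_pi(s) = sum_a pi(a|s) * Q_pi(s, a) for an epsilon-greedy policy pi."""
    n_states, n_actions = Q.shape
    V = np.empty(n_states)
    for s in range(n_states):
        probs = np.full(n_actions, eps / n_actions)   # exploration mass
        probs[np.argmax(Q[s])] += 1.0 - eps           # greedy action gets the rest
        V[s] = float(np.dot(probs, Q[s]))
    return V
```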
3
votes
1 answer
When does backward propagation occur in n-step SARSA?
I am trying to understand the algorithm for n-step SARSA from Sutton and Barto (2nd Edition). As I understand it, this algorithm should update n state-action values, but I cannot see where it is propagated backward. Can someone explain to me how…
nehalem
- 131
- 2
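For reference, the return that n-step Sarsa bootstraps on is

$$
G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n}),
$$

and in the book's pseudocode each time step updates only the single pair $(S_\tau, A_\tau)$ with $\tau = t - n + 1$; values propagate backward only indirectly, through these lagged one-pair updates (and the remaining updates performed after the terminal step), not through an explicit backward pass.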
3
votes
1 answer
How to determine if Q-learning has converged in practice?
I am using Q-learning and SARSA to solve a problem. The agent learns to go from the start to the goal without falling in the holes.
At each state, I can choose the action corresponding to the maximum Q value at the state (the greedy action that the…
WANGWANGZI
- 31
- 4
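A common practical heuristic, shown as a minimal sketch below (not a formal criterion), is to track the largest change to any Q-value between checkpoints and stop when it stays below a small threshold; monitoring the greedy policy or the evaluation return over recent episodes serves the same purpose.

```python
import numpy as np

def has_converged(q_old, q_new, tol=1e-4):
    """Heuristic check: no Q-value moved by more than `tol` since the last checkpoint."""
    return float(np.max(np.abs(q_new - q_old))) < tol

# Hypothetical use inside a training loop:
#   q_prev = Q.copy()
#   run_one_episode(Q)          # placeholder for the actual training step
#   if has_converged(q_prev, Q):
#       ...                     # consider training finished
```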
3
votes
1 answer
How are we calculating the average reward ($r(\pi)$) if the policy changes over time?
In the average reward setting, the quality of a policy is defined as:
$$ r(\pi) = \lim_{h\to\infty}\frac{1}{h} \sum_{j=1}^{h}E[R_j] $$
When we reach the steady-state distribution, we can write the above equation as follows:
$$ r(\pi) =…
Swakshar Deb
- 703
- 4
- 12
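For reference, under the steady-state distribution $\mu_\pi$ the same quantity is usually written as (my reconstruction of the book's formula; verify the exact form and conditioning there):

$$
r(\pi) = \sum_{s} \mu_\pi(s) \sum_{a} \pi(a \mid s) \sum_{s',\,r} p(s', r \mid s, a)\, r .
$$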