Questions tagged [off-policy-methods]

For questions related to off-policy reinforcement learning algorithms, which evaluate or improve one policy (the target policy) while using data generated by another policy (the behavior policy); this separation helps ensure that all states are sufficiently explored. An example of an off-policy algorithm is Q-learning.
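As a minimal illustration of the off-policy idea, here is a sketch of tabular Q-learning in Python. The environment interface (`env.reset()` returning a state index, `env.step(a)` returning `(next_state, reward, done)`) and the names `n_states`/`n_actions` are hypothetical assumptions, not from any particular library; the point is that actions are chosen by an ε-greedy behavior policy while the update bootstraps from the greedy target policy via the max.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch: the behavior policy is epsilon-greedy,
    the target policy is greedy (the max in the update), hence off-policy."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                      # assumed: returns a state index
        done = False
        while not done:
            # Behavior policy: epsilon-greedy w.r.t. the current Q
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)    # assumed interface
            # Target policy: greedy, regardless of how `a` was chosen
            td_target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```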

76 questions
15 votes · 1 answer

What is the relation between online (or offline) learning and on-policy (or off-policy) algorithms?

In the context of RL, there is the notion of on-policy and off-policy algorithms. I understand the difference between on-policy and off-policy algorithms. Moreover, in RL, there's also the notion of online and offline learning. What is the relation…
12 votes · 1 answer

Do off-policy policy gradient methods exist?

Do off-policy policy gradient methods exist? I know that policy gradient methods themselves use the policy function for sampling rollouts. But can't we easily have a model for sampling from the environment? If so, I've never seen this done before.
echo · 713
11 votes · 1 answer

Why is the n-step tree backup algorithm an off-policy algorithm?

In the reinforcement learning book by Sutton & Barto (2018 edition), specifically in section 7.5, they present an n-step off-policy algorithm that doesn't require importance sampling, called the n-step tree backup algorithm. In other…
Brale · 2,416
7 votes · 2 answers

How can the importance sampling ratio be different than zero when the target policy is deterministic?

In the book Reinforcement Learning: An Introduction (2nd edition), Sutton and Barto define, on page 104 (p. 126 of the PDF), in equation (5.3), the importance sampling ratio $\rho_{t:T-1}$ as follows: $$\rho…
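For reference (the equation in the excerpt is cut off by the preview), the ratio in question is the standard product of per-step action-probability ratios along the trajectory. When the target policy $\pi$ is deterministic, each factor $\pi(A_k \mid S_k)$ is either $0$ or $1$, so the ratio can be nonzero only if every action taken by the behavior policy matches the target policy's action:

$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$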
7 votes · 2 answers

In the n-step off-policy SARSA update, why do we multiply the entire update by $\rho$?

In Sutton & Barto's book (2nd ed), page 149, there is equation 7.11. I am having a hard time understanding this equation. I would have thought that we should be moving $Q$ towards $G$, where $G$ would be corrected by importance sampling, but only…
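Schematically (subscripts omitted; see the book for the exact index ranges), the update being asked about scales the whole TD error by the importance sampling ratio $\rho$, rather than applying $\rho$ only to the return $G$:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\, \rho\, \big[ G_{t:t+n} - Q(S_t, A_t) \big].$$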
6 votes · 2 answers

What is the difference between on and off-policy deterministic actor-critic?

In the paper Deterministic Policy Gradient Algorithms, I am really confused about sections 4.1 and 4.2, "On- and Off-Policy Deterministic Actor-Critic". I don't know what the difference between the two algorithms is. I only noticed that the…
6 votes · 1 answer

Why do we need importance sampling?

I was studying off-policy policy improvement methods when I encountered importance sampling. I completely understand the mathematics behind the calculation, but I am wondering what a practical example of importance sampling looks like. For instance,…
6 votes · 1 answer

Is Expected SARSA an off-policy or on-policy algorithm?

I understand that SARSA is an on-policy algorithm and Q-learning an off-policy one. Sutton and Barto's textbook describes Expected Sarsa as follows: In these cliff walking results Expected Sarsa was used on-policy, but in general it might use a…
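A small sketch of the Expected Sarsa backup may make that distinction concrete (Python, with a hypothetical array `pi` holding the target policy's action probabilities): the update averages over the target policy rather than using the action actually taken, so the target policy need not equal the behavior policy.

```python
import numpy as np

def expected_sarsa_target(Q, s_next, reward, gamma, pi):
    """Expected Sarsa backup: expectation of Q over the target policy pi.

    Q       : array of shape (n_states, n_actions)
    s_next  : index of the next state
    pi      : array of shape (n_states, n_actions) with target policy probabilities

    If pi is the same epsilon-greedy policy used to act, this is on-policy;
    if pi is the greedy policy, the target reduces to the Q-learning target.
    """
    expected_q = np.dot(pi[s_next], Q[s_next])   # sum_a pi(a|s') Q(s', a)
    return reward + gamma * expected_q
```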
5 votes · 1 answer

Why does off-policy learning outperform on-policy learning?

I am self-studying Reinforcement Learning using different online resources. I now have a basic understanding of how RL works. I saw this in a book: Q-learning is an off-policy learner. An off-policy learner learns the value of an optimal…
5 votes · 1 answer

Are model-free and off-policy algorithms the same?

With respect to RL, are model-free and off-policy the same thing, just different terminology? If not, what are the differences? I've read that the policy can be thought of as 'the brain', or decision-making part, of a machine learning application, where…
5 votes · 1 answer

How do I compute the variance of the return of an evaluation policy using two behaviour policies?

Suppose there is an evaluation policy called $\pi_{e}$ and there are two behavior policies $\pi_{b1}$ and $\pi_{b2}$. I know that it is possible to estimate the return of policy $\pi_{e}$ through behavior policies via importance sampling, which is…
Amin · 481
4 votes · 3 answers

What is the intuition behind importance sampling for off-policy value evaluation?

The technique for off-policy value evaluation comes from importance sampling, which states that $$E_{x \sim q}[f(x)] \approx \frac{1}{n}\sum_{i=1}^n f(x_i)\frac{q(x_i)}{p(x_i)},$$ where $x_i$ is sampled from $p$. In the application of importance…
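A small numerical sketch of that estimator (hypothetical densities, Python/NumPy): samples are drawn from the proposal $p$ but reweighted by $q/p$ to estimate an expectation under $q$.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), used here as a stand-in for p and q."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical example: estimate E_{x~q}[x^2] with q = N(1, 1),
# using samples drawn from the proposal p = N(0, 1).
x = rng.normal(0.0, 1.0, size=100_000)                          # x_i ~ p
weights = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 1.0)     # q(x_i) / p(x_i)
is_estimate = np.mean(x ** 2 * weights)                         # (1/n) * sum f(x_i) w_i

print(is_estimate)   # should be close to E_{x~N(1,1)}[x^2] = 1^2 + 1 = 2
```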
4 votes · 1 answer

What is the difference between on-policy and off-policy for continuous environments?

I'm trying to understand RL applied to time series (so with an infinite horizon) that have a continuous state space and a discrete action space. First, some preliminary questions: in this case, what is the optimal policy? Given the infinite horizon…
3 votes · 1 answer

Why is the actor-critic algorithm limited to using on-policy data?

Why is the actor-critic algorithm limited to using on-policy data? Or can we use the actor-critic algorithm with off-policy data?
3 votes · 1 answer

Why are Q values updated according to the greedy policy?

Apparently, in the Q-learning algorithm, the Q values are not updated according to the "current policy", but according to a "greedy policy". Why is that the case? I think this is related to the fact that Q-learning is off-policy, but I am also not…
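For comparison, the standard one-step targets (stated here only to frame the question): Sarsa bootstraps from the action actually taken by the behavior policy, while Q-learning bootstraps from the greedy action, which is what makes it off-policy.

$$\text{Sarsa: } R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}), \qquad \text{Q-learning: } R_{t+1} + \gamma\, \max_a Q(S_{t+1}, a).$$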