For questions related to off-policy reinforcement learning algorithms, which estimate one policy (the target policy) while acting according to another (the behavior policy) during learning; this separation ensures that all states are sufficiently explored. Q-learning is a canonical example of an off-policy algorithm (sketched below).
Questions tagged [off-policy-methods]
76 questions
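To make the tag description concrete, here is a minimal sketch of tabular Q-learning, assuming the classic Gym-style `env.reset()`/`env.step()` interface and integer state and action spaces (the environment interface and all names here are assumptions for illustration, not part of the tag wiki):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning. The agent *behaves* epsilon-greedily
    (behavior policy) but *bootstraps* from the greedy action
    (target policy), which is what makes it off-policy."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()  # classic Gym API: reset() -> state
        done = False
        while not done:
            # Behavior policy: epsilon-greedy over current Q.
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # Target policy: greedy max over next-state actions,
            # regardless of which action is actually taken next.
            td_target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```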
15
votes
1 answer
What is the relation between online (or offline) learning and on-policy (or off-policy) algorithms?
In the context of RL, there are the notions of on-policy and off-policy algorithms. I understand the difference between on-policy and off-policy algorithms. Moreover, in RL, there's also the notion of online and offline learning.
What is the relation…
nbro
- 42,615
- 12
- 119
- 217
12
votes
1 answer
Do off-policy policy gradient methods exist?
Do off-policy policy gradient methods exist?
I know that policy gradient methods themselves use the policy function for sampling rollouts. But couldn't we just as easily use a model for sampling from the environment? If so, I've never seen this done before.
echo
- 713
- 1
- 6
- 12
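Such methods do exist: for example, the Off-PAC algorithm of Degris, White & Sutton (2012) adapts the likelihood-ratio gradient to data from a behavior policy $b$ with a per-step importance weight. Up to the approximations made in that paper, the gradient has the form

$$\nabla_\theta J_b(\theta) \approx \mathbb{E}_{s \sim d^b,\, a \sim b}\!\left[\frac{\pi_\theta(a \mid s)}{b(a \mid s)}\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\right].$$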
11
votes
1 answer
Why is the n-step tree backup algorithm an off-policy algorithm?
In the Reinforcement Learning book by Sutton & Barto (2018 edition), specifically in Section 7.5, they present an n-step off-policy algorithm that doesn't require importance sampling, called the n-step tree backup algorithm.
In other…
Brale
- 2,416
- 1
- 7
- 15
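The reason tree backup needs no importance sampling is visible already in its one-step target, which (in Sutton & Barto's notation) takes an expectation over the target policy $\pi$ instead of using the sampled next action:

$$G_{t:t+1} \doteq R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q_t(S_{t+1}, a).$$

Because every action at the next state is already weighted by its probability under $\pi$, no ratio correcting for the behavior policy is required.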
7
votes
2 answers
How can the importance sampling ratio be different than zero when the target policy is deterministic?
In the book Reinforcement Learning: An Introduction (2nd edition), Sutton and Barto define on page 104 (p. 126 of the PDF), in equation (5.3), the importance sampling ratio, $\rho_{t:T-1}$, as follows:
$$\rho…
F.M.F.
- 321
- 3
- 7
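For context, equation (5.3) defines the ratio as a product of per-step probability ratios along the trajectory (reconstructed here from the book's standard notation):

$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},$$

so with a deterministic target policy each factor $\pi(A_k \mid S_k)$ is either 0 or 1, and the ratio is nonzero exactly when the behavior policy happens to take the target policy's action at every step.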
7
votes
2 answers
In the n-step off-policy SARSA update, why do we multiply the entire update by $\rho$?
In Sutton & Barto's book (2nd ed.), page 149, there is equation 7.11.
I am having a hard time understanding this equation.
I would have thought that we should be moving $Q$ towards $G$, where $G$ would be corrected by importance sampling, but only…
Antoine Savine
- 173
- 4
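For reference, the update in question has (up to the book's exact indexing of $\rho$) the form

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha\, \rho_{t+1:t+n}\,\big[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)\big],$$

where the importance sampling ratio multiplies the whole TD error rather than just the return $G_{t:t+n}$.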
6
votes
2 answers
What is the difference between on and off-policy deterministic actor-critic?
In the paper Deterministic Policy Gradient Algorithms, I am really confused about Sections 4.1 and 4.2, "On and Off-Policy Deterministic Actor-Critic".
I don't know what the difference between the two algorithms is.
I only noticed that the…
fish_tree
- 247
- 2
- 6
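For context, both variants in that paper build on the deterministic policy gradient theorem, which (in the paper's notation) reads

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\Big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}\Big];$$

the on-policy variant (Section 4.1) estimates this from trajectories generated by $\mu_\theta$ itself, while the off-policy variant (Section 4.2) uses trajectories from a separate stochastic behavior policy.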
6
votes
1 answer
Why do we need importance sampling?
I was studying the off-policy policy improvement method when I encountered importance sampling. I completely understand the mathematics behind the calculation, but I am wondering what a practical example of importance sampling would be.
For instance,…
Alireza Hosseini
- 61
- 3
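As a concrete illustration of the kind asked for here, the following self-contained sketch (all distributions and rewards are arbitrary example numbers) estimates an expectation under a "target" policy using only samples drawn from a different "behavior" policy, which is exactly the role importance sampling plays in off-policy evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two discrete "policies" over three actions (numbers are arbitrary).
target = np.array([0.7, 0.2, 0.1])    # pi: the policy we want to evaluate
behavior = np.array([1/3, 1/3, 1/3])  # b: the policy that generated the data
rewards = np.array([1.0, 0.0, -1.0])  # deterministic reward per action

# Sample actions from the behavior policy only.
actions = rng.choice(3, size=100_000, p=behavior)

# Reweight each sample by pi(a)/b(a) to estimate E_pi[reward].
weights = target[actions] / behavior[actions]
is_estimate = (weights * rewards[actions]).mean()

print(is_estimate)       # close to the true value below
print(target @ rewards)  # exact: 0.7*1 + 0.2*0 + 0.1*(-1) = 0.6
```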
6
votes
1 answer
Is Expected SARSA an off-policy or on-policy algorithm?
I understand that SARSA is an On-policy algorithm, and Q-learning an off-policy one.
Sutton and Barto's textbook describes Expected Sarsa as follows:
In these cliff walking results Expected Sarsa was used on-policy, but
in general it might use a…
Y. Xu
- 63
- 1
- 4
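For reference, the Expected Sarsa target replaces the sampled next action with an expectation over some policy $\pi$:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\Big[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)\Big].$$

If $\pi$ is the behavior policy itself, the algorithm is on-policy; if $\pi$ differs from the behavior policy (e.g., the greedy policy, which recovers Q-learning), it is off-policy, which is why the book says it "might" be either.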
5
votes
1 answer
Why does off-policy learning outperform on-policy learning?
I am self-studying reinforcement learning using various online resources. I now have a basic understanding of how RL works.
I saw this in a book:
Q-learning is an off-policy learner. An off-policy learner learns the value of an optimal…
Exploring
- 371
- 7
- 18
5
votes
1 answer
Are model-free and off-policy algorithms the same?
With respect to RL, are model-free and off-policy the same thing, just different terminology? If not, what are the differences? I've read that the policy can be thought of as 'the brain', or decision-making part, of a machine learning application, where…
mason7663
- 653
- 4
- 12
5
votes
1 answer
How do I compute the variance of the return of an evaluation policy using two behaviour policies?
Suppose there is an evaluation policy called $\pi_{e}$ and there are two behavior policies $\pi_{b1}$ and $\pi_{b2}$. I know that it is possible to estimate the return of policy $\pi_{e}$ through behavior policies via importance sampling, which is…
Amin
- 481
- 2
- 12
4
votes
3 answers
What is the intuition behind importance sampling for off-policy value evaluation?
The technique for off-policy value evaluation comes from importance sampling, which states that
$$E_{x \sim q}[f(x)] \approx \frac{1}{n}\sum_{i=1}^n f(x_i)\frac{q(x_i)}{p(x_i)},$$ where $x_i$ is sampled from $p$.
In the application of importance…
calveeen
- 1,311
- 9
- 18
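The identity in the excerpt follows from a one-line change of measure (assuming $p(x) > 0$ wherever $q(x)f(x) \neq 0$):

$$E_{x \sim q}[f(x)] = \int f(x)\, q(x)\, dx = \int f(x)\, \frac{q(x)}{p(x)}\, p(x)\, dx = E_{x \sim p}\!\left[f(x)\, \frac{q(x)}{p(x)}\right],$$

so averaging $f(x_i)\, q(x_i)/p(x_i)$ over samples $x_i \sim p$ gives an unbiased estimate of the expectation under $q$.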
4
votes
1 answer
What is the difference between on-policy and off-policy for continuous environments?
I'm trying to understand RL applied to time series (so with an infinite horizon), which have a continuous state space and a discrete action space.
First, some preliminary questions: in this case, what is the optimal policy? Given the infinite horizon…
unter_983
- 331
- 1
- 7
3
votes
1 answer
Why is the actor-critic algorithm limited to using on-policy data?
Why is the actor-critic algorithm limited to using on-policy data? Or can we use the actor-critic algorithm with off-policy data?
apuffin
- 41
- 2
3
votes
1 answer
Why are Q values updated according to the greedy policy?
Apparently, in the Q-learning algorithm, the Q values are not updated according to the "current policy", but according to a "greedy policy". Why is that the case? I think this is related to the fact that Q-learning is off-policy, but I am also not…
Shifat E Arman
- 83
- 3
- 5
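For reference, the update this question refers to is the standard Q-learning rule, in which the bootstrap term uses the greedy (max) action regardless of which action the behavior policy actually takes next:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\Big[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\Big].$$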