Questions tagged [trust-region-policy-optimization]
For questions about the Trust Region Policy Optimization (TRPO) algorithm, which was introduced in the paper "Trust Region Policy Optimization" (2015) by J. Schulman et al.
21 questions
17 votes · 1 answer
How can policy gradients be applied in the case of multiple continuous actions?
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are two cutting-edge policy gradient algorithms.
When using a single continuous action, normally, you would use some probability distribution (for example, Gaussian)…
Evalds Urtans · 377 · 4 · 9
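A minimal sketch of the construction this question is about, assuming PyTorch and a diagonal Gaussian over the action vector (class and variable names here are illustrative, not from any particular library): the network outputs one mean per action dimension, a log-standard-deviation vector sets the spread, and the log-probability of a multi-dimensional action is the sum of the per-dimension log-probabilities.

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy for a multi-dimensional continuous action space."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim)
        )
        # One log-std per action dimension, shared across states.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        std = self.log_std.exp()
        # Independent Normal per dimension; log-probs are summed over the action vector.
        return torch.distributions.Normal(mean, std)

policy = GaussianPolicy(obs_dim=8, act_dim=3)
obs = torch.randn(1, 8)
dist = policy(obs)
action = dist.sample()                    # shape (1, 3): one value per action dimension
log_prob = dist.log_prob(action).sum(-1)  # log pi(a|s) for the whole action vector

A full-covariance Gaussian is also possible, but the diagonal (independent-dimensions) form is the common default in TRPO/PPO implementations.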
7 votes · 2 answers
Why is the log probability replaced with the importance sampling in the loss function?
In the Trust-Region Policy Optimisation (TRPO) algorithm (and subsequently in PPO also), I do not understand the motivation behind replacing the log probability term from standard policy gradients
$$L^{PG}(\theta) = \hat{\mathbb{E}}_t[\log…
Mark · 106 · 4
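For reference, the term that replaces $\log \pi_\theta(a_t \mid s_t)$ is the probability ratio with respect to the data-collecting policy, giving a surrogate of the form
$$\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} \hat{A}_t\right].$$
At $\theta = \theta_\text{old}$ its gradient coincides with the standard policy-gradient term, since $\nabla_\theta \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} \big|_{\theta_\text{old}} = \nabla_\theta \log \pi_\theta(a_t \mid s_t) \big|_{\theta_\text{old}}$, which is one common way to see why the two objectives agree locally.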
5 votes · 1 answer
What is the difference between an on-policy distribution and state visitation frequency?
On-policy distribution is defined as follows in Sutton and Barto:
On the other hand, state visitation frequency is defined as follows in Trust Region Policy Optimization:
$$\rho_{\pi}(s) = \sum_{t=0}^{T} \gamma^t P(s_t=s|\pi)$$
Question: What is…
user529295 · 379 · 2 · 12
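A note on how the two quantities relate in the discounted, infinite-horizon setting (a general fact, not a quote from either source): the unnormalized frequencies sum to $\sum_s \rho_\pi(s) = \sum_{t \geq 0} \gamma^t = \frac{1}{1-\gamma}$, so the corresponding normalized state distribution is
$$d_\pi(s) = (1-\gamma)\, \rho_\pi(s),$$
i.e. the two notions differ only in whether the discounted visit counts are normalized into a probability distribution.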
5 votes · 1 answer
Is (log-)standard deviation learned in TRPO and PPO or fixed instead?
After having read Williams (1992), where it was suggested that actually both the mean and standard deviation can be learned while training a REINFORCE algorithm on generating continuous output values, I assumed that this would be common practice…
Daniel B. · 835 · 1 · 6 · 14
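As a point of comparison, a pattern seen in many public TRPO/PPO implementations (a sketch under that assumption, not a prescription from the papers) is to treat the log-standard-deviation as a free, state-independent parameter trained alongside the network weights; a fixed-std variant simply stores it as a constant excluded from optimization.

import torch
import torch.nn as nn

act_dim = 3

# Learned: log_std is a trainable parameter, updated by the policy gradient.
log_std_learned = nn.Parameter(-0.5 * torch.ones(act_dim))

# Fixed: log_std is a constant tensor, never passed to the optimizer.
log_std_fixed = torch.full((act_dim,), -0.5)

def make_dist(mean, log_std):
    # Same Gaussian head either way; only the treatment of log_std differs.
    return torch.distributions.Normal(mean, log_std.exp())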
4 votes · 1 answer
In lemma 1 of the TRPO paper, why isn't the expectation over $s'∼P(s'|s,a)$?
In the Trust Region Policy Optimization paper, in Lemma 1 of Appendix A, I didn't quite understand the transition from (20) to (21). In going from (20) to (21), $A^\pi(s_t, a_t)$ is substituted with its value. The value of $A^\pi(s_t, a_t)$ is…
A Das · 141 · 2
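A sketch of the step usually invoked here, using the paper's definitions: the advantage itself already averages over the next state,
$$A^\pi(s,a) = \mathbb{E}_{s' \sim P(s' \mid s,a)}\left[r(s) + \gamma V^\pi(s') - V^\pi(s)\right],$$
and when $A^\pi(s_t, a_t)$ is expanded inside an expectation over trajectories, the transition $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ is already part of the trajectory distribution, so by the tower property the inner expectation over $s'$ is absorbed by the outer one rather than written out separately.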
3 votes · 1 answer
Maximizing or Minimizing in Trust Region Policy Optimization?
I happened to discover that the v1 (19 Feb 2015) and the v5 (20 Apr 2017) versions of the TRPO paper reach two different conclusions. Equation (15) in v1 is $\min_\theta$ while Equation (14) in v5 is $\max_\theta$. So I'm a little bit confused…
fish_tree · 247 · 2 · 6
3 votes · 1 answer
How can I implement the reward function for an 8-DOF robot arm with TRPO?
I need to get an 8-DOF (degrees of freedom) robot arm to move to a specified point. I need to implement the TRPO RL code using OpenAI Gym. I already have the Gazebo environment, but I am unsure of how to write the code for the reward functions and the…
user1690356 · 31 · 1
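One common starting point for a reach task is a dense, distance-based reward. The sketch below assumes hypothetical end_effector_pos and target_pos values supplied by the Gazebo/Gym environment wrapper; it only illustrates the shaping idea and is not a recommended final design.

import numpy as np

def compute_reward(end_effector_pos, target_pos, tol=0.05):
    """Dense distance-based reward for a reach task (illustrative only)."""
    dist = np.linalg.norm(np.asarray(end_effector_pos) - np.asarray(target_pos))
    reward = -dist           # closer to the target is better
    done = dist < tol
    if done:
        reward += 10.0       # bonus for reaching within tolerance
    return reward, done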
3 votes · 1 answer
Are these two TRPO objective functions equivalent?
In the TRPO paper, the objective to maximize is (equation 14)
$$
\mathbb{E}_{s\sim\rho_{\theta_\text{old}},a\sim q}\left[\frac{\pi_\theta(a|s)}{q(a|s)} Q_{\theta_\text{old}}(s,a) \right]
$$
which involves an expectation over states sampled with some…
udscbt · 31 · 2
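For context, the identity that connects the two forms (assuming $q(\cdot \mid s)$ is nonzero wherever $\pi_\theta(\cdot \mid s)$ is) is the basic importance-sampling rewrite
$$\mathbb{E}_{a \sim q}\left[\frac{\pi_\theta(a \mid s)}{q(a \mid s)} Q_{\theta_\text{old}}(s,a)\right] = \sum_a q(a \mid s) \frac{\pi_\theta(a \mid s)}{q(a \mid s)} Q_{\theta_\text{old}}(s,a) = \mathbb{E}_{a \sim \pi_\theta}\left[Q_{\theta_\text{old}}(s,a)\right],$$
so the sampling distribution over actions can be swapped without changing the expectation, as long as the ratio is included.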
2 votes · 1 answer
How is inequality 31 derived from equality 30 in lemma 2 of the "Trust Region Policy Optimization" paper?
In the Trust Region Policy Optimization paper, in Lemma 2 of Appendix A (p. 11), I didn't quite understand how inequality (31) is derived from equality (30), which is:
$$\bar{A}(s) = P(a \neq \tilde{a} | s) \mathbb{E}_{(a, \tilde{a}) \sim (\pi,…
Afshin Oroojlooy · 175 · 1 · 7
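A sketch of the kind of argument typically used for this step (not a verbatim restatement of the paper): every advantage value satisfies $|A_\pi(s,a)| \leq \max_{s,a} |A_\pi(s,a)|$, so the conditional expectation of a difference of two advantage terms is bounded in absolute value by twice that maximum, which turns the equality into a bound of the form
$$|\bar{A}(s)| \leq P(a \neq \tilde{a} \mid s) \cdot 2 \max_{s,a} |A_\pi(s,a)|.$$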
2 votes · 1 answer
What makes TRPO an actor-critic method? Where is the critic?
From what I understand, Trust Region Policy Optimization (TRPO) is a modification on Natural Policy Gradient (NPG) that derives the optimal step size $\beta$ from a KL constraint between the new and old policy.
NPG is a modification to "vanilla"…
thesofakillers · 329 · 3 · 15
2 votes · 0 answers
How does the TRPO surrogate loss account for the error in the policy?
In the Trust Region Policy Optimization (TRPO) paper, on page 10, it is stated
An informal overview is as follows. Our proof relies on the notion of coupling, where we jointly define the policies $\pi$ and $\pi'$ so that they choose the …
olliejday · 21 · 3
1 vote · 1 answer
Trust region policy optimization guarantees for continuous case
The original TRPO paper stated an algorithm that used optimization of the following surrogate objective:
$$
L_\pi(\tilde{\pi})=\eta(\pi)+\sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a)
$$
where $\eta$ is the expected discounted return, $\rho_\pi$ is…
Rubi Shnol · 121 · 3
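For reference, the guarantee the TRPO paper attaches to this surrogate (Theorem 1) lower-bounds the true objective:
$$\eta(\tilde{\pi}) \geq L_\pi(\tilde{\pi}) - \frac{4 \epsilon \gamma}{(1-\gamma)^2} D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \qquad \epsilon = \max_{s,a} |A_\pi(s,a)|,$$
so improving the penalized surrogate guarantees monotonic improvement of $\eta$.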
1 vote · 1 answer
Confusion between versions of trust-region policy optimization
The original TRPO paper stated an algorithm that used optimization of the following surrogate objective:
$$
L_\pi(\tilde{\pi})=\eta(\pi)+\sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a)
$$
where $\eta$ is the expected discounted return, $\rho_\pi$ is…
Rubi Shnol · 121 · 3
1 vote · 1 answer
Unclear point in TRPO
Slide 11, https://cs.uwaterloo.ca/~ppoupart/teaching/cs885-spring18/slides/cs885-lecture15b.pdf
Why is $t$ included under the expectation? Normally, instead of "t", I would expect $(a_t,s_t) \sim \pi_{\theta_{old}} $.
DSPinfinity · 1,223 · 4 · 10
1 vote · 1 answer
Does importance sampling really improve sampling efficiency of TRPO or PPO?
Vanilla policy gradient has a loss function:
$$\mathcal{L}_{\pi_{\theta}(\theta)} = E_{\tau \sim \pi_{\theta}}[\sum\limits_{t = 0}^{\infty}\gamma^{t}r_{t}]$$
while in TRPO it is:
$$\mathcal{L}_{\pi_{\theta_{old}}(\theta)} = \frac{1}{1 - \gamma}E_{s,…
Magi Feeney · 51 · 1 · 5
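A compact sketch of how the old-policy samples are reused (assuming PyTorch and a policy object that returns a torch.distributions distribution; old_log_probs and advantages are assumed to come from a rollout under $\pi_{\theta_\text{old}}$): the surrogate is evaluated and differentiated at the current $\theta$ using only data gathered by the old policy, via the probability ratio.

import torch

def surrogate_loss(policy, obs, actions, old_log_probs, advantages):
    """Importance-weighted surrogate E_old[(pi_theta / pi_old) * A], negated for a minimizer."""
    dist = policy(obs)                                # current policy pi_theta
    new_log_probs = dist.log_prob(actions).sum(-1)    # log pi_theta(a|s), summed over action dims
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_theta(a|s) / pi_old(a|s)
    return -(ratio * advantages).mean()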