Questions tagged [trust-region-policy-optimization]
For questions about the Trust Region Policy Optimization (TRPO) algorithm, which was introduced in the paper "Trust Region Policy Optimization" (2015) by J. Schulman et al.
21 questions
17 votes · 1 answer
How can policy gradients be applied in the case of multiple continuous actions?
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are two cutting-edge policy gradient algorithms.
When using a single continuous action, normally, you would use some probability distribution (for example, Gaussian)…
Evalds Urtans · 377 · 4 · 9
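A minimal sketch of the construction this question is about, assuming PyTorch and a diagonal Gaussian over the action vector (class and variable names here are illustrative, not from any particular library): the network outputs one mean per action dimension, a log-standard-deviation vector sets the spread, and the log-probability of a multi-dimensional action is the sum of the per-dimension log-probabilities.

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy for a multi-dimensional continuous action space."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim)
        )
        # One log-std per action dimension, shared across states.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        std = self.log_std.exp()
        # Independent Normal per dimension; log-probs are summed over the action vector.
        return torch.distributions.Normal(mean, std)

policy = GaussianPolicy(obs_dim=8, act_dim=3)
obs = torch.randn(1, 8)
dist = policy(obs)
action = dist.sample()                    # shape (1, 3): one value per action dimension
log_prob = dist.log_prob(action).sum(-1)  # log pi(a|s) for the whole action vector

A full-covariance Gaussian is also possible, but the diagonal (independent-dimensions) form is the common default in TRPO/PPO implementations.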
7 votes · 2 answers
Why is the log probability replaced with the importance sampling in the loss function?
In the Trust-Region Policy Optimisation (TRPO) algorithm (and subsequently in PPO also), I do not understand the motivation behind replacing the log probability term from standard policy gradients
$$L^{PG}(\theta) = \hat{\mathbb{E}}_t[\log…
Mark · 106 · 4
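For reference, the term that replaces $\log \pi_\theta(a_t \mid s_t)$ is the probability ratio with respect to the data-collecting policy, giving a surrogate of the form
$$\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} \hat{A}_t\right].$$
At $\theta = \theta_\text{old}$ its gradient coincides with the standard policy-gradient term, since $\nabla_\theta \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} \big|_{\theta_\text{old}} = \nabla_\theta \log \pi_\theta(a_t \mid s_t) \big|_{\theta_\text{old}}$, which is one common way to see why the two objectives agree locally.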
5 votes · 1 answer
What is the difference between an on-policy distribution and state visitation frequency?
On-policy distribution is defined as follows in Sutton and Barto:
On the other hand, state visitation frequency is defined as follows in Trust Region Policy Optimization:
$$\rho_{\pi}(s) = \sum_{t=0}^{T} \gamma^t P(s_t=s|\pi)$$
Question: What is…
user529295 · 379 · 2 · 12
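A note on how the two quantities relate in the discounted, infinite-horizon setting (a general fact, not a quote from either source): the unnormalized frequencies sum to $\sum_s \rho_\pi(s) = \sum_{t \geq 0} \gamma^t = \frac{1}{1-\gamma}$, so the corresponding normalized state distribution is
$$d_\pi(s) = (1-\gamma)\, \rho_\pi(s),$$
i.e. the two notions differ only in whether the discounted visit counts are normalized into a probability distribution.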
5 votes · 1 answer
Is (log-)standard deviation learned in TRPO and PPO or fixed instead?
After having read Williams (1992), where it was suggested that actually both the mean and standard deviation can be learned while training a REINFORCE algorithm on generating continuous output values, I assumed that this would be common practice…
Daniel B. · 835 · 1 · 6 · 14
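As a point of comparison, a pattern seen in many public TRPO/PPO implementations (a sketch under that assumption, not a prescription from the papers) is to treat the log-standard-deviation as a free, state-independent parameter trained alongside the network weights; a fixed-std variant simply stores it as a constant excluded from optimization.

import torch
import torch.nn as nn

act_dim = 3

# Learned: log_std is a trainable parameter, updated by the policy gradient.
log_std_learned = nn.Parameter(-0.5 * torch.ones(act_dim))

# Fixed: log_std is a constant tensor, never passed to the optimizer.
log_std_fixed = torch.full((act_dim,), -0.5)

def make_dist(mean, log_std):
    # Same Gaussian head either way; only the treatment of log_std differs.
    return torch.distributions.Normal(mean, log_std.exp())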
4 votes · 1 answer
In lemma 1 of the TRPO paper, why isn't the expectation over $s'∼P(s'|s,a)$?
In the Trust Region Policy Optimization paper, in Lemma 1 of Appendix A, I didn't quite understand the transition from (20) to (21). In going from (20) to (21), $A^\pi(s_t, a_t)$ is substituted with its value. The value of $A^\pi(s_t, a_t)$ is…
A Das · 141 · 2
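A sketch of the step usually invoked here, using the paper's definitions: the advantage itself already averages over the next state,
$$A^\pi(s,a) = \mathbb{E}_{s' \sim P(s' \mid s,a)}\left[r(s) + \gamma V^\pi(s') - V^\pi(s)\right],$$
and when $A^\pi(s_t, a_t)$ is expanded inside an expectation over trajectories, the transition $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ is already part of the trajectory distribution, so by the tower property the inner expectation over $s'$ is absorbed by the outer one rather than written out separately.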
3 votes · 1 answer
Maximizing or Minimizing in Trust Region Policy Optimization?
I happened to discover that the v1 (19 Feb 2015) and the v5 (20 Apr 2017) versions of the TRPO paper reach two different conclusions. Equation (15) in v1 is $\min_\theta$ while Equation (14) in v5 is $\max_\theta$. So I'm a little bit confused…
fish_tree · 247 · 2 · 6
3 votes · 1 answer
How can I implement the reward function for an 8-DOF robot arm with TRPO?
I need to get an 8-DOF (degrees of freedom) robot arm to move to a specified point. I need to implement the TRPO RL code using OpenAI Gym. I already have the Gazebo environment, but I am unsure of how to write the code for the reward functions and the…
user1690356 · 31 · 1
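One common starting point for a reach task is a dense, distance-based reward. The sketch below assumes hypothetical end_effector_pos and target_pos values supplied by the Gazebo/Gym environment wrapper; it only illustrates the shaping idea and is not a recommended final design.

import numpy as np

def compute_reward(end_effector_pos, target_pos, tol=0.05):
    """Dense distance-based reward for a reach task (illustrative only)."""
    dist = np.linalg.norm(np.asarray(end_effector_pos) - np.asarray(target_pos))
    reward = -dist           # closer to the target is better
    done = dist < tol
    if done:
        reward += 10.0       # bonus for reaching within tolerance
    return reward, done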
3 votes · 1 answer
Are these two TRPO objective functions equivalent?
In the TRPO paper, the objective to maximize is (equation 14)
$$
\mathbb{E}_{s\sim\rho_{\theta_\text{old}},a\sim q}\left[\frac{\pi_\theta(a|s)}{q(a|s)} Q_{\theta_\text{old}}(s,a) \right]
$$
which involves an expectation over states sampled with some…
udscbt · 31 · 2
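For context, the identity that connects the two forms (assuming $q(\cdot \mid s)$ is nonzero wherever $\pi_\theta(\cdot \mid s)$ is) is the basic importance-sampling rewrite
$$\mathbb{E}_{a \sim q}\left[\frac{\pi_\theta(a \mid s)}{q(a \mid s)} Q_{\theta_\text{old}}(s,a)\right] = \sum_a q(a \mid s) \frac{\pi_\theta(a \mid s)}{q(a \mid s)} Q_{\theta_\text{old}}(s,a) = \mathbb{E}_{a \sim \pi_\theta}\left[Q_{\theta_\text{old}}(s,a)\right],$$
so the sampling distribution over actions can be swapped without changing the expectation, as long as the ratio is included.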
2 votes · 1 answer
How is inequality 31 derived from equality 30 in lemma 2 of the "Trust Region Policy Optimization" paper?
In the Trust Region Policy Optimization paper, in Lemma 2 of Appendix A (p. 11), I didn't quite understand how inequality (31) is derived from equality (30), which is:
$$\bar{A}(s) = P(a \neq \tilde{a} | s) \mathbb{E}_{(a, \tilde{a}) \sim (\pi,…
Afshin Oroojlooy · 175 · 1 · 7
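A sketch of the kind of argument typically used for this step (not a verbatim restatement of the paper): every advantage value satisfies $|A_\pi(s,a)| \leq \max_{s,a} |A_\pi(s,a)|$, so the conditional expectation of a difference of two advantage terms is bounded in absolute value by twice that maximum, which turns the equality into a bound of the form
$$|\bar{A}(s)| \leq P(a \neq \tilde{a} \mid s) \cdot 2 \max_{s,a} |A_\pi(s,a)|.$$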
2 votes · 1 answer
What makes TRPO an actor-critic method? Where is the critic?
From what I understand, Trust Region Policy Optimization (TRPO) is a modification on Natural Policy Gradient (NPG) that derives the optimal step size $\beta$ from a KL constraint between the new and old policy.
NPG is a modification to "vanilla"…
thesofakillers · 329 · 3 · 15
2 votes · 0 answers
How does the TRPO surrogate loss account for the error in the policy?
In the Trust Region Policy Optimization (TRPO) paper, on page 10, it is stated
An informal overview is as follows. Our proof relies on the notion of coupling, where we jointly define the policies $\pi$ and $\pi'$ so that they choose the …
olliejday · 21 · 3
1 vote · 1 answer
Trust region policy optimization guarantees for continuous case
The original TRPO paper stated an algorithm that used optimization of the following surrogate objective:
$$
L_\pi(\tilde{\pi})=\eta(\pi)+\sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a)
$$
where $\eta$ is the expected discounted return, $\rho_\pi$ is…
Rubi Shnol · 121 · 3
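For reference, the guarantee the TRPO paper attaches to this surrogate (Theorem 1) lower-bounds the true objective:
$$\eta(\tilde{\pi}) \geq L_\pi(\tilde{\pi}) - \frac{4 \epsilon \gamma}{(1-\gamma)^2} D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \qquad \epsilon = \max_{s,a} |A_\pi(s,a)|,$$
so improving the penalized surrogate guarantees monotonic improvement of $\eta$.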
1 vote · 1 answer
Confusion between versions of trust-region policy optimization
The original TRPO paper stated an algorithm that used optimization of the following surrogate objective:
$$
L_\pi(\tilde{\pi})=\eta(\pi)+\sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a)
$$
where $\eta$ is the expected discounted return, $\rho_\pi$ is…
Rubi Shnol · 121 · 3
1 vote · 1 answer
Unclear point in TRPO
Slide 11, https://cs.uwaterloo.ca/~ppoupart/teaching/cs885-spring18/slides/cs885-lecture15b.pdf
Why is $t$ included under the expectation? Normally, instead of "t", I would expect $(a_t,s_t) \sim \pi_{\theta_{old}} $.
DSPinfinity · 1,223 · 4 · 10
1 vote · 1 answer
Does importance sampling really improve sampling efficiency of TRPO or PPO?
Vanilla policy gradient has a loss function:
$$\mathcal{L}_{\pi_{\theta}(\theta)} = E_{\tau \sim \pi_{\theta}}[\sum\limits_{t = 0}^{\infty}\gamma^{t}r_{t}]$$
while in TRPO it is:
$$\mathcal{L}_{\pi_{\theta_{old}}(\theta)} = \frac{1}{1 - \gamma}E_{s,…
Magi Feeney · 51 · 1 · 5
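A compact sketch of how the old-policy samples are reused (assuming PyTorch and a policy object that returns a torch.distributions distribution; old_log_probs and advantages are assumed to come from a rollout under $\pi_{\theta_\text{old}}$): the surrogate is evaluated and differentiated at the current $\theta$ using only data gathered by the old policy, via the probability ratio.

import torch

def surrogate_loss(policy, obs, actions, old_log_probs, advantages):
    """Importance-weighted surrogate E_old[(pi_theta / pi_old) * A], negated for a minimizer."""
    dist = policy(obs)                                # current policy pi_theta
    new_log_probs = dist.log_prob(actions).sum(-1)    # log pi_theta(a|s), summed over action dims
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_theta(a|s) / pi_old(a|s)
    return -(ratio * advantages).mean()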