
The original TRPO paper presented an algorithm based on optimizing the following surrogate objective:

$$ L_\pi(\tilde{\pi})=\eta(\pi)+\sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) $$

where $\eta(\pi)$ is the expected discounted return (performance) of policy $\pi$, $\rho_\pi$ is the unnormalized discounted state-visitation frequency, $\pi$ and $\tilde \pi$ are the old and new policies respectively, and $A_\pi(s,a)$ is the advantage of action $a$ in state $s$ under policy $\pi$. The objective $L_\pi$ is a local approximation to the value of the new policy.
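To be precise, "local approximation" here is in the sense stated in the paper: for a parameterized policy $\pi_\theta$, the surrogate matches $\eta$ to first order at the current parameters $\theta_0$,

$$ L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\Big|_{\theta=\theta_0} = \nabla_\theta \eta(\pi_\theta)\Big|_{\theta=\theta_0}. $$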

The key theoretical result of the original paper, and seemingly of all TRPO theory, is the following bound relating the values of the two policies:

$$ \eta(\tilde{\pi}) \geq L_\pi(\tilde{\pi})-C D_{\mathrm{KL}}^{\max }(\pi, \tilde{\pi}) $$

where $C$ is a constant that depends on the discount factor and on the maximum absolute advantage under the old policy, and $D_{\mathrm{KL}}^{\max }$ is the KL divergence between the old and new policies, maximized over states.
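For reference, the constant in the paper's Theorem 1 is

$$ C = \frac{4 \epsilon \gamma}{(1-\gamma)^2}, \qquad \epsilon = \max_{s,a} \left|A_\pi(s,a)\right|. $$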

With this result at hand, it is straightforward to show that maximizing the penalized surrogate $L_\pi(\tilde{\pi}) - C D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi})$ (in practice, maximizing the surrogate subject to a KL-divergence constraint) yields a non-decreasing value under TRPO.
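Concretely, the standard minorization-maximization argument: if $M_i(\tilde{\pi}) = L_{\pi_i}(\tilde{\pi}) - C\, D_{\mathrm{KL}}^{\max}(\pi_i, \tilde{\pi})$ and $\pi_{i+1}$ maximizes $M_i$, then

$$ \eta(\pi_{i+1}) \geq M_i(\pi_{i+1}) \geq M_i(\pi_i) = L_{\pi_i}(\pi_i) = \eta(\pi_i), $$

so the value never decreases.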

The proofs go along two routes:

  • through $\alpha$-coupling of policies
  • through perturbation theory

Both make quite explicit use of finiteness of the state space.

Schulman et al. claim in the paper that the performance-difference result for continuous state and action spaces is essentially the same, with the sums replaced by integrals.

I fail to see that and would appreciate a clarification.

In fact, it's even more complicated than that: the practical surrogate objectives used in TRPO and PPO do not work with visitation measures directly but are estimated from sampled trajectories, so the objectives look different, and the stated guarantees do not extend to them trivially.
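For concreteness, the kind of practical surrogate I have in mind is a sample-based estimate along these lines (a minimal sketch; function and variable names are mine, not from any particular codebase):

```python
import torch

def surrogate_objectives(logp_new, logp_old, advantages, clip_eps=0.2):
    """Sample-based surrogate objectives used in practice (to be maximized).

    logp_new:   log pi_theta(a_t | s_t) under the policy being optimized
    logp_old:   log pi_theta_old(a_t | s_t) under the data-collecting policy
    advantages: advantage estimates A_hat(s_t, a_t), e.g. from GAE

    All inputs are 1-D tensors over sampled (s_t, a_t) pairs from rollouts.
    """
    ratio = torch.exp(logp_new - logp_old)  # importance weights pi_new / pi_old

    # TRPO-style surrogate: maximized subject to a KL constraint (not shown here).
    trpo_surrogate = (ratio * advantages).mean()

    # PPO-style clipped surrogate: the KL trust region is replaced by clipping.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    ppo_surrogate = torch.min(ratio * advantages, clipped).mean()

    return trpo_surrogate, ppo_surrogate
```

Note that the state distribution enters only implicitly, through which $(s_t, a_t)$ pairs happen to be sampled, not through an explicit $\rho_\pi$ as in the theoretical objective above.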

Recent papers (like this one and this one) claim that the guarantees of TRPO for the general state and action space case were somewhat shaky in the earlier papers.

What they suggest is to enforce convexity via regularization.

Rubi Shnol

1 Answer


From your previous post, you may already see that the discrete sums (or the corresponding expectation) in the weighted-advantage term of TRPO's surrogate objective, $\sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a)$, can be replaced by integrals straightforwardly. The same goes for TRPO's practical implementation via the trajectory-based surrogate objective: the weighted sum over discrete trajectories is simply replaced by a weighted integral. And since in your previous post I showed a proof of the equivalence of the two objectives, there is no theoretical performance difference between them.
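Written out, the continuous-space analogue of the surrogate objective would simply be (assuming $\rho_\pi$ and $\tilde{\pi}$ admit densities with respect to suitable base measures):

$$ L_\pi(\tilde{\pi}) = \eta(\pi) + \int_{\mathcal{S}} \rho_\pi(s) \int_{\mathcal{A}} \tilde{\pi}(a \mid s)\, A_\pi(s, a)\, \mathrm{d}a\, \mathrm{d}s. $$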

Further note that the expectation of the weighted advantage $A_\pi(s,a)$ with respect to $(s \sim \rho_\pi(s),\ a \sim \tilde{\pi}(a \mid s))$ in the OP's surrogate objective above can be rewritten as an expectation with respect to $(s \sim \rho_\pi(s),\ a \sim \pi(a \mid s))$ of the old policy alone, by adjusting the integrand with an importance-sampling ratio, giving $\frac{\tilde{\pi}(a \mid s)}{\pi(a \mid s)} A_\pi(s,a)$ (see the block below). Also note from that surrogate objective that the net effect of TRPO's gradient update, together with its step size, is exactly this importance-sampling-adjusted advantage expectation over the old policy's state-action pairs; the policy-gradient ascent step is therefore equivalent to finding, as the new policy's parameters, an argmax of this reweighted advantage expectation.

Since $D_{\mathrm{KL}}^{\max }(\pi, \tilde{\pi})$ is kept small and applies equally to discrete PMFs and continuous PDFs, the constrained argmax problem almost always has a solution, so TRPO almost always improves $\pi$ at each gradient update, regardless of whether the state and action spaces are discrete or continuous. Without the KL constraint, policy-gradient (PG) algorithms cannot ensure the same, due to many practical factors such as (bootstrapped) value-function approximation error during early training, biased or high-variance rollout samples, overly large step sizes, and so on. TRPO, as a stable PG algorithm, is therefore a kind of strict policy-iteration (PI) algorithm with the policy-improvement property at each gradient update (like PI with exact value functions in the model-based case).
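Written out, the importance-sampling identity and the resulting practical optimization problem (as in the TRPO paper's practical formulation, with parameterized $\pi_\theta$ and data-collecting policy $\pi_{\theta_{\text{old}}}$; the expectations are sums or integrals depending on the spaces) are:

$$ \mathbb{E}_{s \sim \rho_\pi,\, a \sim \tilde{\pi}(\cdot \mid s)}\!\left[A_\pi(s, a)\right] = \mathbb{E}_{s \sim \rho_\pi,\, a \sim \pi(\cdot \mid s)}\!\left[\frac{\tilde{\pi}(a \mid s)}{\pi(a \mid s)}\, A_\pi(s, a)\right] $$

$$ \max_{\theta}\ \mathbb{E}_{s \sim \rho_{\pi_{\theta_{\text{old}}}},\, a \sim \pi_{\theta_{\text{old}}}(\cdot \mid s)}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A_{\pi_{\theta_{\text{old}}}}(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\pi_{\theta_{\text{old}}}}}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\big)\right] \leq \delta. $$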

Of course, in practice the guarantees of TRPO for the general state and action space case may still be shaky due to advantage or value-function approximation errors, even with a good bias-variance trade-off such as GAE, and your other references discuss generalizing the surrogate objective to address these practical weaknesses further.

cinch