
The original TRPO paper proposed an algorithm that optimizes the following surrogate objective:

$$ L_\pi(\tilde{\pi})=\eta(\pi)+\sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) $$

where $\eta$ is the expected discounted return of a policy, $\rho_\pi$ is the unnormalized discounted state visitation frequency, $\pi$ and $\tilde \pi$ are the old and new policies respectively, and $A_\pi(s, a)$ is the advantage of action $a$ in state $s$ under policy $\pi$. The objective $L_\pi$ is a local approximation to the return of the new policy.
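
In particular, as far as I understand, $L_\pi$ matches $\eta$ to first order at the old policy (for a parameterized policy $\pi_\theta$ with old parameters $\theta_0$):

$$ L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta=\theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta=\theta_0} $$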

The key theoretical result of the original paper, and seemingly of all TRPO theory, is the value comparison bound:

$$ \eta(\tilde{\pi}) \geq L_\pi(\tilde{\pi})-C D_{\mathrm{KL}}^{\max }(\pi, \tilde{\pi}) $$

where $C$ is a constant that depends on the discount factor and the maximum advantage under the old policy, and $D_{\mathrm{KL}}^{\max }$ is the state-wise maximum KL divergence between the old and new policies.
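
If I recall the paper correctly, the constant is

$$ C = \frac{4 \epsilon \gamma}{(1-\gamma)^2}, \qquad \epsilon = \max_{s, a} \big| A_\pi(s, a) \big|, $$

so the bound tightens as the maximum KL between the two policies shrinks.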

Later papers (like this one) pose the surrogate objective in a form like:

$$ L_{\pi}(\tilde \pi)= \mathbb E_{\pi} \bigg[ \prod_{t=0}^{T-1} \frac{\tilde \pi\left(a_t \mid s_t\right)}{\pi\left(a_t \mid s_t\right)} \sum\limits_{t = 0}^{T-1}\gamma^{t} A_\pi(s_t, a_t) \bigg] $$

where $T$ is the mini-batch (rollout) length.

This expression is nothing but the value difference (formula (1) in the original paper) with importance sampling applied.
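
To spell out what I mean: formula (1) expresses the value difference as an expectation over trajectories from $\tilde{\pi}$, and reweighting those trajectories by their likelihood ratio under $\pi$ (truncating at horizon $T$; formula (1) itself is over the infinite horizon) gives the expression above:

$$ \eta(\tilde{\pi}) - \eta(\pi) = \mathbb{E}_{\tau \sim \tilde{\pi}} \bigg[ \sum_{t=0}^{T-1} \gamma^t A_\pi(s_t, a_t) \bigg] = \mathbb{E}_{\tau \sim \pi} \bigg[ \prod_{t=0}^{T-1} \frac{\tilde{\pi}(a_t \mid s_t)}{\pi(a_t \mid s_t)} \sum_{t=0}^{T-1} \gamma^t A_\pi(s_t, a_t) \bigg] $$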

My questions are:

  1. What is the relation between the two surrogate objectives? I can't seem to find where this is explained.

  2. Is there any formal value comparison result for algorithms that use the second surrogate objective?

Rubi Shnol

1 Answer


In TRPO the expectation in the objective is taken with respect to the state visitation distribution under the old policy, which keeps the policy optimization step from diverging too far, while in the GAE-style objective the expectation is taken with respect to the trajectory distribution under the old policy, with each trajectory reweighted by its importance weight. Pragmatically, the second objective avoids the need to explicitly estimate state visitation frequencies, which is computationally challenging or intractable for very large, high-dimensional state spaces. Thanks to the trajectory-level formulation and importance reweighting, it is empirically much easier to implement from rollouts, with the advantage function estimated by GAE for an automatic bias-variance trade-off. That is why most later papers reference the second objective. In theory both papers essentially aim to optimize the same objective, as shown in the proof sketch below.
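
For example, here is a minimal NumPy sketch of how the second (trajectory-level) surrogate could be estimated from rollouts collected under the old policy. The function and argument names are illustrative, and the per-step advantages are assumed to be precomputed (e.g. with GAE):

```python
import numpy as np

def trajectory_surrogate(old_logp, new_logp, advantages, gamma=0.99):
    """Monte-Carlo estimate of the trajectory-level surrogate objective.

    old_logp, new_logp : lists of 1-D arrays, one per rollout, holding
                         log pi(a_t|s_t) under the old and new policy.
    advantages         : list of 1-D arrays with the (e.g. GAE) advantage
                         estimates A_pi(s_t, a_t) for the same rollouts.
    """
    estimates = []
    for olp, nlp, adv in zip(old_logp, new_logp, advantages):
        T = len(adv)
        # Trajectory-level importance weight: prod_t new_pi / old_pi,
        # computed in log-space for numerical stability.
        log_ratio = np.sum(nlp - olp)
        weight = np.exp(log_ratio)
        # Discounted sum of advantages along the rollout.
        discounted_adv = np.sum((gamma ** np.arange(T)) * adv)
        estimates.append(weight * discounted_adv)
    # Averaging over rollouts approximates the expectation under the old policy.
    return float(np.mean(estimates))
```

Note that the full trajectory-level product of ratios has very high variance in practice, which is why implementations such as TRPO/PPO work with per-step ratios instead; the sketch just mirrors the formula as written.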

Note that $\rho_\pi(s)$ is the (unnormalized) discounted state visitation distribution of the Markov chain induced by the MDP under the old policy $\pi$, so the first surrogate objective is off-policy in nature within each policy iteration step: it computes the expected advantage based on how frequently states are visited under $\pi$, but with actions drawn from the new policy $\tilde{\pi}$. The first objective is therefore essentially to maximize the expectation of the advantage function $A_\pi(s, a)$ with respect to $s \sim \rho_\pi$ and $a \sim \tilde{\pi}(\cdot \mid s)$. Then, by a change of measure, you can rewrite the same expectation over trajectories $\tau$ sampled under the old policy rather than over state-action pairs $(s, a)$. This yields the second surrogate objective, and the importance sampling trick is required precisely because the actions in that expectation are meant to be drawn from $\tilde{\pi}(\cdot \mid s)$ while the data is collected with $\pi$.
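
Concretely, the importance sampling step is just the change of measure over actions (a sketch, not a full proof):

$$ \mathbb{E}_{s \sim \rho_\pi,\, a \sim \tilde{\pi}(\cdot \mid s)} \big[ A_\pi(s, a) \big] = \mathbb{E}_{s \sim \rho_\pi,\, a \sim \pi(\cdot \mid s)} \bigg[ \frac{\tilde{\pi}(a \mid s)}{\pi(a \mid s)} A_\pi(s, a) \bigg] $$

and expanding $\rho_\pi$ as the discounted sum of per-step state distributions under $\pi$ turns the right-hand side into an expectation over rollouts collected with the old policy, which is (up to whether the ratio is applied per step or as a whole-trajectory product) the second surrogate objective.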

cinch