The original TRPO paper proposed an algorithm that optimizes the following surrogate objective:
$$ L_\pi(\tilde{\pi})=\eta(\pi)+\sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) $$
where $\eta(\pi)$ is the expected discounted return of $\pi$, $\rho_\pi$ is the unnormalized discounted state-visitation frequency, $\pi$ and $\tilde{\pi}$ are the old and new policies, respectively, and $A_\pi(s, a)$ is the advantage of action $a$ under the old policy $\pi$. The objective $L_\pi$ is a local approximation to the value of the new policy.
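For reference, these quantities are defined in the paper as
$$ \eta(\pi)=\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r\left(s_{t}\right)\right], \qquad \rho_{\pi}(s)=\sum_{t=0}^{\infty} \gamma^{t} P\left(s_{t}=s\right), \qquad A_{\pi}(s, a)=Q_{\pi}(s, a)-V_{\pi}(s), $$
with the expectation and the state probabilities taken over trajectories generated by following $\pi$ from the start-state distribution.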
The key theoretical result of the original paper, and seemingly of all TRPO theory, is the value comparison:
$$ \eta(\tilde{\pi}) \geq L_\pi(\tilde{\pi})-C D_{\mathrm{KL}}^{\max }(\pi, \tilde{\pi}) $$
where $C$ is a constant that depends on the discount factor and on the maximum absolute advantage under the old policy, and $D_{\mathrm{KL}}^{\max}$ is the state-wise maximal KL divergence between the old and new policies.
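Concretely, Theorem 1 of the paper takes
$$ C=\frac{4 \epsilon \gamma}{(1-\gamma)^{2}}, \quad \epsilon=\max _{s, a}\left|A_{\pi}(s, a)\right|, \qquad D_{\mathrm{KL}}^{\max }(\pi, \tilde{\pi})=\max _{s} D_{\mathrm{KL}}\big(\pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s)\big). $$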
With this result at hand (maximize the surrogate objective subject to a KL-divergence penalty or constraint), it is straightforward to show non-decrease of the value under TRPO.
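For completeness, the argument is the standard minorization-maximization step: set $M_i(\pi)=L_{\pi_i}(\pi)-C\, D_{\mathrm{KL}}^{\max }(\pi_i, \pi)$ and let $\pi_{i+1}$ be a maximizer of $M_i$; then
$$ \eta\left(\pi_{i+1}\right) \geq M_i\left(\pi_{i+1}\right) \geq M_i\left(\pi_i\right)=\eta\left(\pi_i\right), $$
where the first inequality is the bound above and the last equality uses $L_{\pi_i}(\pi_i)=\eta(\pi_i)$ and $D_{\mathrm{KL}}^{\max }(\pi_i, \pi_i)=0$.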
The proofs go along two routes:
- through $\alpha$-coupling of policies
- through perturbation theory
Both make quite explicit use of the finiteness of the state space.
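For reference, the coupling route rests on Definition 1 of the paper: $(\pi, \tilde{\pi})$ is an $\alpha$-coupled policy pair if it defines a joint distribution over $(a, \tilde{a}) \mid s$, with marginals $\pi(\cdot \mid s)$ and $\tilde{\pi}(\cdot \mid s)$, such that
$$ P(a \neq \tilde{a} \mid s) \leq \alpha \quad \text{for all } s. $$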
As Schulman et al. claim in the paper, the performance-difference result for the case of continuous state and action spaces is essentially the same, with sums replaced by integrals.
I fail to see that and would appreciate a clarification.
In fact, it is even more complicated: the practical surrogate objectives used in TRPO and PPO do not involve the visitation measure explicitly but are rendered differently, as sample-based estimates, and the stated guarantees do not extend to them trivially.
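To make the contrast concrete, writing $r_t(\theta)=\pi_\theta\left(a_t \mid s_t\right) / \pi_{\theta_{\text{old}}}\left(a_t \mid s_t\right)$ for the importance ratio and $\hat{A}_t$ for an advantage estimate, the objectives actually optimized are, roughly,
$$ L^{\mathrm{TRPO}}(\theta)=\hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat{A}_t\right] \quad \text{s.t.} \quad \hat{\mathbb{E}}_t\left[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big)\right] \leq \delta, $$
$$ L^{\mathrm{PPO}}(\theta)=\hat{\mathbb{E}}_t\left[\min \big(r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon_{\text{clip}},\, 1+\epsilon_{\text{clip}}\right) \hat{A}_t\big)\right], $$
where $\hat{\mathbb{E}}_t$ is an empirical average over sampled time steps and the KL term is an average (not maximal) divergence, neither of which appears in the theoretical bound in this form.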
Recent papers, like this one and this one, have claimed that the guarantees of TRPO for the general state and action space case were somewhat shaky in earlier papers.
What they suggest is to force convexity via regularization.
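As a rough illustration of the idea only (not necessarily the exact formulation used in those papers): a strongly convex regularizer such as a KL penalty is added directly to the surrogate,
$$ \max _{\tilde{\pi}}\ L_\pi(\tilde{\pi})-\lambda\, \bar{D}_{\mathrm{KL}}(\pi, \tilde{\pi}), \qquad \lambda>0, $$
so that each policy-update subproblem is concave in $\tilde{\pi}$ under a direct parameterization.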