The original TRPO paper presents an algorithm that optimizes the following surrogate objective:
$$ L_\pi(\tilde{\pi})=\eta(\pi)+\sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) $$
where $\eta$ is the expected discounted return (policy performance), $\rho_\pi$ is the unnormalized discounted state-visitation frequency, $\pi$ and $\tilde \pi$ are the old and new policies respectively, and $A_\pi(s, a)$ is the advantage of action $a$ in state $s$ under policy $\pi$. The objective $L_\pi$ is a local (first-order) approximation to the value $\eta(\tilde\pi)$ under the new policy.
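To make the notation concrete, here is a minimal tabular sketch of how this surrogate could be evaluated. All names and numbers are hypothetical, and the quantities $\eta(\pi)$, $\rho_\pi$ and $A_\pi$ are assumed to be known or estimated beforehand:

```python
import numpy as np

# rho_pi[s]    - unnormalized discounted state-visitation frequency under the old policy pi
# pi_new[s, a] - candidate new policy  pi~(a|s)
# A_pi[s, a]   - advantage of action a in state s under the old policy pi
# eta_pi       - expected discounted return of the old policy, eta(pi)

def surrogate_L(eta_pi, rho_pi, pi_new, A_pi):
    """L_pi(pi~) = eta(pi) + sum_s rho_pi(s) sum_a pi~(a|s) A_pi(s, a)."""
    return eta_pi + np.sum(rho_pi[:, None] * pi_new * A_pi)

# Toy example with 3 states and 2 actions (numbers are arbitrary):
rho_pi = np.array([1.0, 0.7, 0.4])
pi_new = np.array([[0.6, 0.4], [0.3, 0.7], [0.5, 0.5]])
A_pi   = np.array([[0.2, -0.2], [0.1, -0.1], [0.0, 0.0]])
print(surrogate_L(eta_pi=1.5, rho_pi=rho_pi, pi_new=pi_new, A_pi=A_pi))
```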
The key theoretical result of the original paper, and seemingly of all TRPO theory, is the following value-comparison (policy improvement) bound:
$$ \eta(\tilde{\pi}) \geq L_\pi(\tilde{\pi})-C D_{\mathrm{KL}}^{\max }(\pi, \tilde{\pi}) $$
where $C$ is a constant depending on the discount factor and the maximum advantage under the old policy ($C = 4\epsilon\gamma/(1-\gamma)^2$ with $\epsilon = \max_{s,a}|A_\pi(s,a)|$ in the paper's Theorem 1), and $D_{\mathrm{KL}}^{\max}$ is the state-maximized KL divergence between the old and new policies.
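As a sketch of how the right-hand side of this bound could be evaluated in the same hypothetical tabular setting (reusing `surrogate_L`, `rho_pi`, `pi_new` and `A_pi` from the previous snippet):

```python
import numpy as np

def kl_max(pi_old, pi_new, eps=1e-12):
    """State-maximized KL divergence  max_s KL( pi(.|s) || pi~(.|s) )."""
    kl_per_state = np.sum(pi_old * np.log((pi_old + eps) / (pi_new + eps)), axis=1)
    return np.max(kl_per_state)

def lower_bound(eta_pi, rho_pi, pi_old, pi_new, A_pi, gamma):
    """Right-hand side of the bound:  L_pi(pi~) - C * D_KL^max(pi, pi~)."""
    eps = np.max(np.abs(A_pi))                  # epsilon = max_{s,a} |A_pi(s,a)|
    C = 4.0 * eps * gamma / (1.0 - gamma) ** 2  # penalty coefficient from Theorem 1
    return surrogate_L(eta_pi, rho_pi, pi_new, A_pi) - C * kl_max(pi_old, pi_new)

pi_old = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
print(lower_bound(eta_pi=1.5, rho_pi=rho_pi, pi_old=pi_old,
                  pi_new=pi_new, A_pi=A_pi, gamma=0.99))
```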
Later papers (like this one) pose the surrogate objective in a form like:
$$ L_{\pi}(\tilde \pi)= \mathbb E_{\pi} \bigg[ \prod_{t'=0}^{T-1} \frac{\tilde \pi\left(a_{t'} \mid s_{t'}\right)}{\pi\left(a_{t'} \mid s_{t'}\right)} \sum_{t = 0}^{T-1}\gamma^{t} A_\pi(s_t, a_t) \bigg] $$
where $T$ is the mini-batch (rollout) length.
This expression is actually nothing but the value difference (formula (1) in the original paper) with importance sampling applied.
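For illustration only (this is my reading of the formula, not any paper's algorithm), the second form could be estimated from trajectories sampled under the old policy roughly like this; the per-step tuple layout `(logp_old, logp_new, advantage)` is a hypothetical convention:

```python
import numpy as np

def surrogate_is(trajectories, gamma):
    """
    Monte-Carlo estimate of
        E_pi[ (prod_t pi~(a_t|s_t)/pi(a_t|s_t)) * sum_t gamma^t A_pi(s_t, a_t) ]
    from trajectories sampled under the old policy pi.
    Each trajectory is a list of (logp_old, logp_new, advantage) tuples per time step.
    """
    estimates = []
    for traj in trajectories:
        logp_old = np.array([step[0] for step in traj])
        logp_new = np.array([step[1] for step in traj])
        adv      = np.array([step[2] for step in traj])
        ratio = np.exp(np.sum(logp_new - logp_old))              # full-trajectory importance weight
        discounted = np.sum(gamma ** np.arange(len(adv)) * adv)  # sum_t gamma^t A_pi(s_t, a_t)
        estimates.append(ratio * discounted)
    return np.mean(estimates)

# Toy usage: two trajectories of length 3, values are arbitrary placeholders.
toy = [
    [(-0.7, -0.6, 0.5), (-1.2, -1.0, -0.2), (-0.9, -0.9, 0.1)],
    [(-0.8, -0.9, 0.3), (-1.1, -1.2, 0.4), (-0.6, -0.5, -0.1)],
]
print(surrogate_is(toy, gamma=0.99))
```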
My questions are:
1. What is the relation between the two surrogate objectives? I can't seem to find where it is explained.
2. Is there any formal value-comparison result (analogous to the bound above) for algorithms that use the second surrogate objective?