
Vanilla policy gradient has a loss function:

$$\mathcal{L}_{\pi_{\theta}}(\theta) = E_{\tau \sim \pi_{\theta}}\left[\sum\limits_{t = 0}^{\infty}\gamma^{t}r_{t}\right]$$

while in TRPO it is:

$$\mathcal{L}_{\pi_{\theta_{old}}}(\theta) = \frac{1}{1 - \gamma}E_{s, a \sim \pi_{\theta_{old}}}\left[\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}A^{\pi_{\theta_{old}}}(s,a)\right]$$
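
To make the comparison concrete, here is a minimal NumPy sketch of how I understand the two objectives being estimated from a sampled batch (all array names and values are placeholders, not taken from either algorithm's actual implementation):

```python
import numpy as np

T = 100                                  # placeholder horizon
gamma = 0.99

# Vanilla PG: Monte Carlo estimate of the discounted return under pi_theta
rewards = np.random.rand(T)              # r_t sampled under the *current* policy
vanilla_objective = np.sum(gamma ** np.arange(T) * rewards)

# TRPO surrogate: importance-weighted advantages on samples from pi_theta_old
logp_new = np.random.randn(T)            # log pi_theta(a_t | s_t)      (placeholder)
logp_old = np.random.randn(T)            # log pi_theta_old(a_t | s_t)  (placeholder)
advantages = np.random.randn(T)          # A^{pi_theta_old}(s_t, a_t)   (placeholder)

ratio = np.exp(logp_new - logp_old)      # pi_theta / pi_theta_old, the importance ratio
surrogate_objective = np.mean(ratio * advantages) / (1.0 - gamma)
```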

Vanilla policy gradient has some problems, such as the mismatch between distances in parameter space and distances in policy space, and poor sample efficiency.

To tackle these problems, TRPO introduces importance sampling. However, when I compare the pseudocode of the two algorithms, I don't see any obvious evidence supporting that point. Both first sample multiple trajectories under the current policy. The former then simply uses the estimated gradient of the loss function to update the parameter $\theta$. The latter also uses the estimated gradient, but under a KL constraint, to update the parameter $\theta$. They look like almost the same procedure, differing only in the details of the optimization step.

My question is: would it be more accurate to say that importance sampling just increases the stability of the learning process, or rather, avoids overly aggressive updates, but does not directly improve sample efficiency?

Pseudocode of the two algorithms (from the OpenAI Spinning Up documentation):

[Image: Vanilla Policy Gradient pseudocode]

[Image: TRPO pseudocode]


1 Answer


The point of importance sampling is to use the same episode(s) to do multiple policy gradient updates. It will definitely increase sample efficiency over the strictly on-policy case, simply because in an off-policy algorithm we can use a single transition to do multiple updates, whereas we only get a single update from that same experience in the on-policy case.

First of all, in the pseudocode for TRPO that you give, I see no importance sampling being done, and the pseudocode makes it look like the algorithm is run on-policy.

Basically, an on-policy algorithm would look like:

\begin{align*} & \text{ for each iteration }: \\ & \qquad \text{ for t in range(n)}: \\ & \qquad \qquad \text{sample } a_t; \text{ get reward } r_t \text{ and next state } s_{t+1} \\ & \qquad \text{ do policy update based on } a_1, r_1, s_2, a_2, \ldots \end{align*}
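
In Python form, the same on-policy loop might look like the sketch below (the environment, policy, and update functions are just stand-ins, not a real implementation):

```python
import numpy as np

def sample_action(state):                 # placeholder stochastic policy
    return np.random.randint(2)

def env_step(state, action):              # placeholder environment dynamics
    return np.random.rand(), np.random.randn(4)   # reward, next state

def policy_update(batch):                 # placeholder policy-gradient step
    pass

n = 128
for iteration in range(1000):
    state, batch = np.random.randn(4), []
    for t in range(n):
        action = sample_action(state)
        reward, next_state = env_step(state, action)
        batch.append((state, action, reward))
        state = next_state
    policy_update(batch)                   # one update, then the batch is thrown away
```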

whereas an off-policy algorithm looks like:

\begin{align*} & \text{ for each iteration }: \\ & \qquad \text{ for t in range(n)}: \\ & \qquad \qquad \text{sample } a_t; \text{ get reward } r_t \text{ and next state } s_{t+1} \\ & \qquad \text{ for m epochs}: \\ & \qquad \qquad \text{ for k mini-batches}: \\ & \qquad \qquad \qquad \text{make a mini-batch from } \{a_1, r_1, s_2, a_2, \ldots\} \text{ and do a policy gradient update } \end{align*}

So, in an off-policy algorithm, you may have something like $m=10$ (10 epochs), meaning every transition you collect gets used in 10 policy gradient updates. An on-policy algorithm would only use that transition once.
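
As a rough sketch of that reuse (placeholder functions again; the ratio is where the importance weighting would enter, PPO/TRPO-style):

```python
import numpy as np

def surrogate_update(minibatch, logp_old):
    # placeholder: recompute logp_new = log pi_theta(a|s) for the mini-batch,
    # form ratio = exp(logp_new - logp_old), and step on mean(ratio * advantage)
    pass

# transitions collected once under pi_theta_old (placeholders)
batch = [(np.random.randn(4), 0, 0.0) for _ in range(256)]
logp_old = np.random.randn(len(batch))    # log-probs recorded at collection time

m, k = 10, 4                              # epochs and mini-batches per epoch
for epoch in range(m):                    # each transition is reused m times
    for idx in np.array_split(np.random.permutation(len(batch)), k):
        minibatch = [batch[i] for i in idx]
        surrogate_update(minibatch, logp_old[idx])
```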
