
In the TRPO and PPO papers, it is mentioned that large policy updates often lead to performance drops in policy gradient methods.

By "large policy updates," they mean a significant KL divergence between subsequent policies, not necessarily a large difference in the policies' function parameters.

I have trouble understanding why this would be the case. If we've conducted proper exploration in the current iteration and our value function is precise enough, it seems logical that updating our policy towards more optimal actions should lead to an improvement, even if this update is substantial.

One disadvantage I do see is that a large policy update could disrupt our value function, making the advantage estimates in the next iteration less precise. However, in that scenario we could simply allocate more computation to learning a new value function before the next policy update, as in the standard policy evaluation + improvement approach.

So, my question is: What are the reasons for preventing big policy updates?

Druudik

1 Answer


One prominent reason for preventing large policy updates has to do with the function approximation used in deep reinforcement learning algorithms: because a neural network's parameters are shared across all inputs, updating the policy at certain states also changes the policy at other states. Larger updates generally result in larger collateral changes. Once in a while, the unintended change in policy at some state is so large that the algorithm's entire new policy suffers and yields a low expected return.
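
As a small illustration of that shared-parameter effect, here is a hedged PyTorch sketch (the network, the sizes, and the step size are all made up for illustration): one deliberately large policy-gradient-style step computed only on a batch of "update" states also moves the policy, measurably, at states that never appeared in that batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small policy network: 8-dim states, 4 discrete actions
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))

def action_probs(states):
    return torch.softmax(policy(states), dim=-1)

update_states = torch.randn(64, 8)        # states the update is computed on
other_states = torch.randn(64, 8) + 3.0   # a different region of the state space

with torch.no_grad():
    probs_before = action_probs(other_states)

# Pretend the advantage estimates say "action 0 is great" at the update states,
# and take one large policy-gradient-style step on those states only.
loss = -torch.log(action_probs(update_states)[:, 0]).mean()
loss.backward()
with torch.no_grad():
    for p in policy.parameters():
        p -= 1.0 * p.grad                 # large step size on purpose

with torch.no_grad():
    probs_after = action_probs(other_states)
    kl = (probs_before * (probs_before.log() - probs_after.log())).sum(-1).mean()
print(f"Mean KL at states never seen in the update batch: {kl.item():.4f}")
```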

The above summary may sound like a slippery slope argument, but from what I have encountered in practice and read (including the two papers you cited), this decline in performance is a very real phenomenon and happens more often than a practitioner would hope. For example, there may be a state with a high influence on the policy's expected return (i.e. a critical state), and an unintended change of policy at that state can immediately degrade the performance of the entire policy until the better behavior at that state is relearned. Other challenges of reinforcement learning can intensify the decline: the credit assignment problem makes it harder to determine which states and actions are responsible for the low expected return, especially under sparse rewards, and exploration hyperparameters may already have been annealed by the time the decline occurs, so the now-unlearned actions are rarely retried.

There may be other reasons for the performance decline depending on the environment, but the reason described above is the one I have most commonly seen in my own work and the literature.
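
For completeness, this is roughly how the PPO paper operationalizes the limit on update size: the probability ratio between the new and old policies is clipped, so the surrogate objective gives no incentive to push the policy more than a factor of 1 ± ε away from the data-collecting policy in a single update. A minimal sketch (the function name and signature are my own):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective in the spirit of the PPO paper (negated for minimization).

    Clipping the probability ratio to [1 - eps, 1 + eps] removes the incentive
    for one update to move the new policy far from the policy that collected the data.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```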

DeepQZero