In the TRPO and PPO papers, it is mentioned that large policy updates often lead to performance drops in policy gradient methods.
By "large policy updates," they mean a significant KL divergence between subsequent policies, not necessarily a large difference in the policies' function parameters.
I have trouble understanding why this would be the case. If we've conducted proper exploration in the current iteration and our value function is precise enough, it seems logical that updating our policy towards more optimal actions should lead to an improvement, even if this update is substantial.
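By "updating towards more optimal actions" I mean, roughly, the greedy step from classical policy iteration, assuming the estimated advantages $\hat{A}$ are accurate:

$$
\pi_{\text{new}}(s) \in \arg\max_{a} \hat{A}^{\pi_{\text{old}}}(s,a) = \arg\max_{a} \hat{Q}^{\pi_{\text{old}}}(s,a).
$$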
One disadvantage I see is that a large policy update could leave our value function estimate badly out of date, making the advantage estimates in the next iteration less accurate. However, in that scenario, we could allocate more computational resources to learning a new value function before updating our policy again, as in the standard policy evaluation + improvement approach.
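To make that alternative concrete, here is a toy sketch of what I have in mind in a tabular setting (the MDP itself is just a random placeholder): do an essentially exact policy evaluation, then take an unrestricted greedy improvement step.

```python
# A minimal sketch of the evaluate-then-improve loop I have in mind, for a
# tabular MDP. The transition tensor P, reward matrix R, and sizes are all
# hypothetical placeholders, just to make the loop concrete.
import numpy as np

n_states, n_actions, gamma = 5, 3, 0.99
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a]

policy = rng.integers(n_actions, size=n_states)  # deterministic policy: s -> a

for _ in range(100):
    # Policy evaluation: spend as much compute as needed to make V^pi accurate.
    V = np.zeros(n_states)
    P_pi = P[np.arange(n_states), policy]   # transitions under the current policy
    R_pi = R[np.arange(n_states), policy]   # rewards under the current policy
    for _ in range(1000):
        V = R_pi + gamma * P_pi @ V
    # Policy improvement: jump straight to the greedy policy -- an arbitrarily
    # "big" update, which here cannot decrease performance because the
    # evaluation step was (essentially) exact.
    Q = R + gamma * P @ V                   # Q[s, a]
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break                               # converged
    policy = new_policy
```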
So, my question is: what are the reasons for preventing large policy updates?