What strategies are there to reduce the variance of the policy gradient estimator of the REINFORCE algorithm?

I know of three options. One is to subtract a baseline, for example a running average of the returns from past mini-batches. Another is to standardise the returns within each mini-batch, i.e. subtract their mean and divide by their standard deviation. A third is to use larger batch sizes.
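To make the first two options concrete, here is a minimal NumPy sketch of what I have in mind. The function names, the `decay` parameter and the `eps` constant are my own placeholders rather than anything from a particular library, and I am assuming one scalar return per trajectory.

```python
import numpy as np


def running_baseline_advantages(returns, baseline, decay=0.9):
    """Subtract a running-average baseline from the returns of one mini-batch.

    `baseline` is the running average carried over from past mini-batches
    (hypothetical name). Returns the centred returns and the updated baseline.
    """
    returns = np.asarray(returns, dtype=np.float64)
    # Centre the returns with the baseline computed from *past* batches only.
    advantages = returns - baseline
    # Exponential moving average update using the current batch mean.
    new_baseline = decay * baseline + (1.0 - decay) * returns.mean()
    return advantages, new_baseline


def standardise_returns(returns, eps=1e-8):
    """Standardise the returns of one mini-batch to zero mean and unit variance."""
    returns = np.asarray(returns, dtype=np.float64)
    # eps (an assumed small constant) avoids division by zero
    # when all returns in the batch are identical.
    return (returns - returns.mean()) / (returns.std() + eps)


if __name__ == "__main__":
    batch_returns = [1.0, 2.0, 3.0, 4.0]
    adv, baseline = running_baseline_advantages(batch_returns, baseline=0.0)
    print(adv, baseline)
    print(standardise_returns(batch_returns))
```

In both cases the resulting values would replace the raw returns as the weights on the log-probability gradients in the REINFORCE update.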

Which of these is considered the most effective? What other methods are there?

nbro

0 Answers