
Do off-policy policy gradient methods exist?

I know that policy gradient methods use the policy itself to sample rollouts. But couldn't we just as easily use a separate model (a behaviour policy) to sample from the environment? If so, I've never seen it done.


1 Answer


Absolutely, and it's a really interesting problem. Here is a paper detailing off-policy actor-critic. This is important because the method also supports continuous actions.

The general idea of off-policy algorithms is to compare the actions performed by a behaviour policy (the policy that is actually acting in the world) with the actions the target policy (the policy we want to learn) would have chosen. From this comparison we compute an importance-sampling ratio $\rho = \frac{\pi(a \mid s)}{b(a \mid s)} \geq 0$, where $\pi$ is the target policy and $b$ is the behaviour policy, and $\rho$ scales the update to the target policy. The higher $\rho$, the more likely the target policy is to take that action relative to the behaviour policy, and the larger the magnitude of the learning update for that step. If $\rho = 0$, the target policy would never take that action, so the update is ignored.
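To make this concrete, here is a minimal sketch (not the paper's exact algorithm) of a single off-policy policy-gradient step in which the update is scaled by the importance-sampling ratio. The linear-softmax parameterisation and all names (`target_probs`, `off_policy_update`, `theta`, etc.) are illustrative assumptions, not from the paper.

```python
import numpy as np

def target_probs(theta, x):
    """pi(.|s) for a linear-softmax target policy; x is the feature vector of s."""
    prefs = x @ theta                      # action preferences h(s, a)
    prefs -= prefs.max()                   # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def grad_log_pi(theta, x, a):
    """Gradient of log pi(a|s) w.r.t. theta for the linear-softmax policy."""
    pi = target_probs(theta, x)
    grad = -np.outer(x, pi)                # -x(s) * pi(b|s) for every action b
    grad[:, a] += x                        # +x(s) for the action actually taken
    return grad

def off_policy_update(theta, x, a, reward, behaviour_prob, alpha=0.1):
    """One update scaled by rho = pi(a|s) / b(a|s)."""
    pi = target_probs(theta, x)
    rho = pi[a] / behaviour_prob           # importance-sampling ratio, rho >= 0
    # rho = 0: the target policy would never take this action, so no update.
    # rho > 1: the target policy prefers this action more than the behaviour
    # policy does, so the update is amplified.
    return theta + alpha * rho * reward * grad_log_pi(theta, x, a)

# Toy usage: 4 state features, 3 actions, one transition generated by a
# uniform-random behaviour policy with b(a|s) = 1/3.
rng = np.random.default_rng(0)
theta = np.zeros((4, 3))
x = rng.normal(size=4)
a = int(rng.integers(3))
theta = off_policy_update(theta, x, a, reward=1.0, behaviour_prob=1 / 3)
```

The key design point is that the behaviour policy only needs to assign non-zero probability to the actions it takes; the ratio then reweights each sample so the expected update matches what the target policy would have produced.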
