
Why is the actor-critic algorithm limited to using on-policy data? Or can we use the actor-critic algorithm with off-policy data?

nbro
apuffin

1 Answer


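It's because, in the actor-critic algorithm, the objective function is an expectation over trajectories $\tau$ sampled from the current policy, so the gradient estimate is only valid when the data come from that same policy. If we want to use off-policy data, i.e. data collected by some other policy, we have to resort to importance sampling: each sample is reweighted by the ratio between its probability under the current policy and its probability under the policy that generated it.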
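To make the reweighting concrete, here is a minimal sketch on a one-step problem with three actions. It is not an actor-critic implementation; it only illustrates why an expectation under one policy cannot be estimated naively from another policy's samples, and how importance weights correct for that. The names `pi` (target policy), `mu` (data-collecting policy), `rewards`, and both helper functions are hypothetical and chosen for illustration.

```python
import random

def naive_estimate(rewards, mu, n_samples, seed=0):
    """Average reward of actions sampled from mu, with no correction.

    This estimates E_{a~mu}[r(a)], not the expectation under pi.
    """
    rng = random.Random(seed)
    actions = list(range(len(rewards)))
    total = 0.0
    for _ in range(n_samples):
        a = rng.choices(actions, weights=mu)[0]
        total += rewards[a]
    return total / n_samples

def importance_sampling_estimate(rewards, pi, mu, n_samples, seed=0):
    """Estimate E_{a~pi}[r(a)] from actions sampled from mu.

    Each sample is reweighted by pi(a) / mu(a). This requires
    mu(a) > 0 wherever pi(a) > 0.
    """
    rng = random.Random(seed)
    actions = list(range(len(rewards)))
    total = 0.0
    for _ in range(n_samples):
        a = rng.choices(actions, weights=mu)[0]
        total += (pi[a] / mu[a]) * rewards[a]
    return total / n_samples

# Illustrative numbers (assumptions, not taken from any real task).
rewards = [1.0, 0.0, 5.0]
pi = [0.7, 0.2, 0.1]   # policy we want to evaluate
mu = [0.2, 0.3, 0.5]   # policy that generated the data

# Exact expectations, computed directly for comparison.
true_value_pi = sum(p * r for p, r in zip(pi, rewards))
true_value_mu = sum(m * r for m, r in zip(mu, rewards))

print("exact value under pi:", true_value_pi)
print("exact value under mu:", true_value_mu)
print("naive estimate from mu data:", naive_estimate(rewards, mu, 20000))
print("importance-sampling estimate:",
      importance_sampling_estimate(rewards, pi, mu, 20000))
```

The naive average converges to the value under `mu`, while the reweighted average converges to the value under `pi`. For full trajectories the weight becomes a product of per-step ratios, which can have high variance; that is one reason off-policy actor-critic methods need extra care.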

nbro
apuffin