
In slide 16 of lecture 5 of his course "Reinforcement Learning", David Silver introduces GLIE Monte-Carlo Control.

[Slide 16: "GLIE Monte-Carlo Control" pseudocode from David Silver's lecture 5]

But why is this an on-policy control method? The sampling follows a policy $\pi$, while the improvement step produces an $\epsilon$-greedy policy, so isn't it off-policy control?

fish_tree

1 Answer


In this case, $\pi$ is always an $\epsilon$-greedy policy. In every iteration, this $\pi$ is used to generate ($\epsilon$-greedily) a trajectory, from which the new $Q(s, a)$ values are computed. The last line of the pseudocode says that, in the next iteration, $\pi$ will be the $\epsilon$-greedy policy derived from the updated $Q$. Since the policy that is improved and the policy that generates the samples are the same, the method is considered on-policy.

If the last line were $\mu \leftarrow \epsilon\text{-greedy}(Q)$, i.e. if the episodes were still sampled with $\pi$ while a different policy $\mu$ was being improved, it would be an off-policy method.
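To make the loop concrete, here is a minimal Python sketch of the same idea on a made-up 3-cell corridor (the environment, the every-visit update, the undiscounted return and the random tie-breaking are my own assumptions for illustration, not taken from the slide). The only point it is meant to show is that the policy used to generate each episode and the policy that gets improved are the same $\epsilon$-greedy policy derived from the current $Q$.

```python
import random
from collections import defaultdict

N_STATES, N_ACTIONS = 3, 2   # a 3-cell corridor; actions: 0 = "quit", 1 = "right"

def step(state, action):
    """Toy dynamics (invented for this sketch): 'quit' ends the episode with
    reward 0; 'right' moves toward the end of the corridor, and stepping off
    the last cell ends the episode with reward 1."""
    if action == 0:
        return None, 0.0
    if state + 1 == N_STATES:
        return None, 1.0
    return state + 1, 0.0

def greedy(Q, state):
    """Greedy action w.r.t. Q, breaking ties uniformly at random."""
    best = max(Q[(state, a)] for a in range(N_ACTIONS))
    return random.choice([a for a in range(N_ACTIONS) if Q[(state, a)] == best])

def epsilon_greedy(Q, state, epsilon):
    """Sample an action from the epsilon-greedy policy derived from Q."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    return greedy(Q, state)

Q = defaultdict(float)   # action-value estimates Q(s, a)
N = defaultdict(int)     # visit counts N(s, a)

for k in range(1, 1001):                      # k-th episode
    epsilon = 1.0 / k                         # GLIE schedule: epsilon_k = 1/k
    # 1) Sample an episode with the *current* epsilon-greedy policy pi.
    state, episode = 0, []
    while state is not None:
        action = epsilon_greedy(Q, state, epsilon)   # behaviour policy = pi
        state_next, reward = step(state, action)
        episode.append((state, action, reward))
        state = state_next
    # 2) Every-visit MC update of Q along the episode (undiscounted return).
    G = 0.0
    for s, a, r in reversed(episode):
        G += r
        N[(s, a)] += 1
        Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]     # incremental mean
    # 3) "pi <- epsilon-greedy(Q)": nothing explicit to do here, because the
    #    next episode is again sampled with epsilon-greedy(Q). The improved
    #    policy and the behaviour policy are one and the same, which is the
    #    on-policy property described above.

print({(s, a): round(Q[(s, a)], 2) for s in range(N_STATES) for a in range(N_ACTIONS)})
```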

Hai Nguyen