
In these notes, https://lrscy.github.io/2020/07/09/Coursera-Reinforcement-Learning-Course2-Week2-Notes (see the "Monte Carlo Control" and then "Solutions of Two Assumptions" sections), two approaches to resolving the "infinite number of episodes" assumption in Monte Carlo control with exploring starts are given as follows:

(i) One is to hold firm to the idea of approximating $q_{\pi_k}$ in each policy evaluation. However, this is likely to require far too many episodes to be useful in practice on any but the smallest problems.

(ii) The other is similar to the idea of GPI: on each evaluation step we move the value function toward $q_{\pi_k}$, but we do not expect it to actually get close except over many steps. One extreme form of this idea is to alternately apply policy evaluation and policy improvement.

I do not understand the difference between (i) and (ii). To me, (i) is also a form of GPI.
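For concreteness, here is a rough Python sketch of how I currently picture the two variants. Everything in it is my own illustration: the tabular layout and the `generate_episode(policy)` helper, which I assume returns a list of `(state, action, return)` triples collected with exploring starts.

```python
import random
from collections import defaultdict

def mc_control_variant_i(states, actions, generate_episode,
                         improvement_steps=10, episodes_per_evaluation=10_000):
    """Variant (i): approximate q_{pi_k} as exactly as possible before
    every policy improvement step (many episodes per evaluation)."""
    policy = {s: random.choice(actions) for s in states}
    for _ in range(improvement_steps):
        # Policy evaluation: average returns over many episodes so that
        # Q is close to q_{pi_k} before the policy is touched.
        returns = defaultdict(list)
        for _ in range(episodes_per_evaluation):
            for s, a, g in generate_episode(policy):
                returns[(s, a)].append(g)
        q = {sa: sum(gs) / len(gs) for sa, gs in returns.items()}
        # Policy improvement: greedy with respect to the (near-)exact q_{pi_k}.
        for s in states:
            policy[s] = max(actions, key=lambda a: q.get((s, a), float("-inf")))
    return policy

def mc_control_variant_ii(states, actions, generate_episode, n_episodes=100_000):
    """Variant (ii), extreme form: alternate evaluation and improvement
    after every single episode, GPI style."""
    policy = {s: random.choice(actions) for s in states}
    q = defaultdict(float)
    visits = defaultdict(int)
    for _ in range(n_episodes):
        episode = generate_episode(policy)
        for s, a, g in episode:
            # Incremental evaluation: nudge Q(s, a) toward q_{pi_k}
            # without waiting for convergence.
            visits[(s, a)] += 1
            q[(s, a)] += (g - q[(s, a)]) / visits[(s, a)]
        # Immediate improvement, but only at states visited in this episode.
        for s, _, _ in episode:
            policy[s] = max(actions, key=lambda a: q[(s, a)])
    return policy
```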

DSPinfinity

1 Answer


This lingering question is closely related to your recent question about how to derive value iteration from policy iteration. The policy evaluation (PE) step required by the policy iteration method needs an exact (or very well approximated) state or action value function at each iteration, since the method rests on the policy improvement theorem (PIT) and the Bellman equation; in that sense it is the original, pure (not generalized) policy iteration, which is what approach (i) holds firm to. And you are right that it can still be called a form of GPI, since GPI includes pure policy iteration as a special case.
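As a minimal sketch of that pure policy iteration (my own tabular layout, not from the question: `P[s][a]` is assumed to be a list of `(prob, next_state, reward)` tuples, similar to the common Gym convention):

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation (PE): sweep until V is numerically exact for the
        # current policy, i.e. solve the Bellman expectation equation.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily w.r.t. the evaluated values,
        # justified by the policy improvement theorem (PIT).
        stable = True
        for s in range(n_states):
            q_sa = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in range(n_actions)]
            best = int(np.argmax(q_sa))
            stable = stable and best == policy[s]
            policy[s] = best
        if stable:
            return policy, V
```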

At the other extreme, value iteration is based on the idea of solving the Bellman optimality equation: on each policy update (PU) step we move the value function toward the exact $q_{\pi_k}$, but we do not expect it to actually get close except over many steps. Therefore it is a kind of generalized policy iteration, which is exactly the truncated-evaluation idea described in approach (ii).
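And the corresponding value iteration sketch under the same assumed `P` layout: each sweep applies a single Bellman optimality backup per state (evaluation truncated to one step, improvement implicit in the max), so the value function only drifts toward the optimal one over many sweeps.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # One backup of the Bellman optimality equation: the max over
            # actions plays the role of an implicit greedy improvement.
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in range(n_actions))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Read off the greedy policy once the values have (approximately) converged.
    policy = np.array([
        int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in range(n_actions)]))
        for s in range(n_states)
    ])
    return policy, V
```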

cinch