5

In value iteration, we have a model of the environment's dynamics, i.e $p(s', r \mid s, a)$, which we use to update an estimate of the value function.

In the case of temporal-difference and Monte Carlo methods, we do not use $p(s', r \mid s, a)$, but then how do these methods work?

nbro
  • 42,615
  • 12
  • 119
  • 217

1 Answers1

5

The main idea is that you can estimate $V^\pi(s)$, the value of a state $s$ under a given policy $\pi$, even if you don't have a model of the environment, by visiting that state $s$ and following the policy $\pi$ after that state. If you repeat this process many times, you'll get many samples of trajectories starting at $s$ with some total return associated with them. If you average them, you'll have a sample-based estimate of $V^\pi(s)$. In a similar way, you can do a sample-based approach to estimate $Q^\pi(s, a)$, where you start at $s$, take some arbitrary action $a$, and then continue following $\pi$. After you collect many samples, you can also average the returns to get a sample-based estimate of $Q^\pi(s, a)$. Once you have a good enough estimate of $Q^\pi(s, a)$, you can see that it's easy to improve the current policy $\pi$ to get a new policy $\pi'$ that always chooses the best action on any given state: $\pi'(s) = \underset{a}{argmax}\ Q^\pi(s, a)$.

The procedure I described in the last paragraph where you sample an entire trajectory and wait until the end of the episode to estimate a return is the Monte Carlo approach. In contrast, TD exploits the recursive nature of the Bellman equation to learn as you go, even before the episode ends. For example, you can estimate the value function like this: $V^\pi(s_0) = r_0 + \gamma V^\pi(s_1)$, where $r_0$ comes from experience and instead of collecting more $r_t$ until the end of the episode, you rely on your estimate of the value of the next state (which you also get from experience). At the beginning, your estimates might be random, but after many iterations the process converges to good estimates. The same idea of exploiting this recursive structure can be used to estimate $Q^\pi(s, a)$. I guess this approach is known as temporal difference because you sample $r_0$ which is in some sense the temporal difference between $s_0$ and $s_1$.

Mei Zhang
  • 202
  • 1
  • 6