
A friend and I were discussing the differences between Dynamic Programming, Monte Carlo, and Temporal Difference (TD) learning as policy evaluation methods, and we agreed that Dynamic Programming requires the Markov assumption while Monte Carlo policy evaluation does not.

However, he also pointed out that Temporal Difference learning cannot handle non-Markovian domains, i.e. it depends on the Markov assumption. Why is that the case?

The way I understand it, the TD learning update is essentially the same as the Monte Carlo update, except that the return, instead of being computed from the entire trajectory, is bootstrapped from the current estimate of the value function. That is, we can update the value as soon as we encounter an $(s,a,r,s')$ tuple; we don't have to wait for the episode (if finite) to terminate.
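To make my understanding concrete, here is a minimal tabular sketch of the two update rules (the names and the dictionary-based value table are just illustrative, not taken from any particular library):

```python
def mc_update(V, episode, alpha, gamma):
    """Every-visit Monte Carlo: wait until the (finite) episode ends, then
    update each visited state with its full sampled return."""
    G = 0.0
    for s, a, r, s_next in reversed(episode):  # episode: list of (s, a, r, s') tuples
        G = r + gamma * G                      # return from time t onwards
        V[s] += alpha * (G - V[s])


def td0_update(V, s, r, s_next, alpha, gamma):
    """TD(0): update immediately from a single (s, a, r, s') transition,
    bootstrapping from the current estimate V[s_next]."""
    td_target = r + gamma * V[s_next]
    V[s] += alpha * (td_target - V[s])
```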

Where is the Markov assumption, i.e. that the future is independent of the past given the present, being used here?

stoic-santiago

1 Answer


The Markov assumption is used when deriving the Bellman equation for state values:

$$v(s) = \sum_a \pi(a|s)\sum_{r,s'} p(r,s'|s,a)(r + \gamma v(s'))$$

One requirement for this equation to hold is that $p(r,s'|s,a)$ is consistent, i.e. a fixed, well-defined function of only $s$ and $a$. The current state $s$ is a key argument of that function, and there is no adjustment for the history of previous states, actions or rewards. This is the same as requiring the Markov property for the state, i.e. that $s$ holds all the information necessary to predict the outcome probabilities of the next step.
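As a rough sketch of what that requirement looks like computationally (the dictionary layout for $\pi$, $p$ and the value table below is hypothetical, purely for illustration), evaluating the right-hand side of the Bellman equation for a single state only ever consults the current state $s$:

```python
def bellman_state_value(s, pi, p, V, gamma):
    """Right-hand side of the Bellman equation for one state s.

    Assumed (hypothetical) layout:
      pi[s][a]  -> pi(a|s)
      p[(s, a)] -> list of (prob, r, s') with prob = p(r, s'|s, a)

    p is indexed by the current state and action only; no history of earlier
    states, actions or rewards appears anywhere. That is the Markov assumption.
    """
    total = 0.0
    for a, pi_a in pi[s].items():
        for prob, r, s_next in p[(s, a)]:
            total += pi_a * prob * (r + gamma * V[s_next])
    return total
```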

The one step TD target that is sampled in basic TD learning is simply the inner part of this:

$$G_{t:t+1} = R_{t+1} + \gamma \hat{v}(S_{t+1})$$

which, when sampled, is equal to $v(s)$ in expectation*, when $S_t = s$. That is, when you measure a single instance of the TD target and use it to update a value function, you implicitly assume that the values of $r_{t+1}$ and $s_{t+1}$ that you observed occur with probabilities determined by $\pi(a|s)$ and $p(r,s'|s,a)$, as shown in the Bellman equation.
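To illustrate what "equal in expectation" means here, the following sketch (reusing the hypothetical `pi`, `p` and `V` containers from the snippet above) averages many sampled one-step TD targets from $s$; the average approaches the Bellman value only because each $(r, s')$ is assumed to be drawn from $p(r,s'|s,a)$, conditioned on the current state alone:

```python
import random

def sample_td_target(s, pi, p, V, gamma):
    """One sampled TD target r + gamma * V(s'), following pi from state s."""
    actions = list(pi[s])
    a = random.choices(actions, weights=[pi[s][x] for x in actions])[0]
    outcomes = p[(s, a)]                                   # list of (prob, r, s')
    _, r, s_next = random.choices(outcomes, weights=[o[0] for o in outcomes])[0]
    return r + gamma * V[s_next]

def average_td_target(s, pi, p, V, gamma, n=100_000):
    """Mean of many sampled targets; roughly bellman_state_value(s, pi, p, V, gamma)
    when (r, s') really depend only on (s, a)."""
    return sum(sample_td_target(s, pi, p, V, gamma) for _ in range(n)) / n
```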

So the theory behind TD learning relies on the Markov assumption; otherwise, the sampled TD targets would be incorrect.

In practice you can get away with slightly non-Markov environments - for instance, most state measurements for machinery are approximations that ignore details at some level, yet TD learning can still solve optimal control in many robotics environments. However, Monte Carlo methods are more robust against state representations that are not fully Markov.


* Technically, this sample is biased because $\hat{v}(S_{t+1})$ is not correct when learning starts. The bias reduces over time, with repeated updates, so the expected value during learning is only approximately equal to the true value given by the Bellman equation.

Neil Slater