4

I was going through Sutton's book and, when using sample-based learning to estimate expectations, we have this update rule:

$$ \text{new estimate} = \text{old estimate} + \alpha(\text{target} - \text{old estimate}) $$
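In code, I picture this update as something like the following (just my own toy sketch; the names and numbers are made up):

```python
# Incremental update: move the old estimate a step of size alpha towards the target.
def incremental_update(old_estimate, target, alpha=0.1):
    return old_estimate + alpha * (target - old_estimate)

estimate = 0.0
for sample in [4.0, 6.0, 5.0, 5.5]:   # each sample plays the role of the "target"
    estimate = incremental_update(estimate, sample)
```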

What I don't quite understand is why it is called the target: since it is only a sample, it is not the actual target value, so why are we moving towards a wrong value?

nbro

2 Answers

6

It is our "current" target. We assume that the value we get now is at least a closer approximation to the "true" target than our old estimate was.

We're not so much moving towards a wrong value as we are moving away from a more wrong value.

Of course, it is all based on random trials, so anything definite (such as "we are guaranteed to improve at each step") is hard to show without working probabilistically. The expectation of the error of the value function (compared to the true value function) decreases; that is all we can say.
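A quick toy sketch of that last point (my own illustration, not from the book): repeatedly applying the incremental rule to noisy samples of a fixed true value and averaging over many runs, the expected error shrinks, even though individual steps can move the wrong way.

```python
import random

TRUE_VALUE, ALPHA, STEPS, RUNS = 5.0, 0.1, 200, 2000

error_early, error_late = 0.0, 0.0
for _ in range(RUNS):
    estimate = 0.0
    for step in range(STEPS):
        target = TRUE_VALUE + random.gauss(0.0, 1.0)  # noisy sample, not the true value
        estimate += ALPHA * (target - estimate)       # incremental update towards the sample
        if step == 9:
            error_early += abs(estimate - TRUE_VALUE)
    error_late += abs(estimate - TRUE_VALUE)

print("mean |error| after  10 steps:", error_early / RUNS)
print("mean |error| after 200 steps:", error_late / RUNS)
```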

0

It would be helpful if you could specify the section and page number of Sutton's book, but, as far as I understand your question, I will try to explain it. Think of the TD update. The sample contains $(s_t, a_t, r_{t+1}, s_{t+1})$. Using an incremental update, we can write:

$$ v_{t}(s_t) = \frac{1}{t} \sum_{j=1}^{t}\left(r_{j+1} + \gamma v(s_{j+1})\right) $$

$$ v_{t}(s_t) = v_{t-1}(s_t) + \alpha \left(r_{t+1} + \gamma v_{t-1}(s_{t+1}) - v_{t-1}(s_t)\right) $$

We call $r_{t+1} + \gamma v_{t-1}(s_{t+1})$ the TD target. From the above equations, you can already see that $r_{t+1} + \gamma v_{t-1}(s_{t+1})$ is actually an unbiased estimate of $v(s_t)$. We call it unbiased because $\mathbb{E}[r_{t+1} + \gamma v_{t-1}(s_{t+1})] = v(s_t)$, i.e. the expectation of $r_{t+1} + \gamma v_{t-1}(s_{t+1})$ leads us to the true state-value function $v(s_t)$, even though any single sample of it does not.
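A minimal sketch of that TD(0) update in code (my own illustration; the tabular value function and the specific states are assumptions, not from the book):

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99
v = defaultdict(float)  # tabular state-value estimates v[s]

def td0_update(s, r, s_next):
    """One TD(0) step: move v[s] towards the sampled TD target."""
    td_target = r + GAMMA * v[s_next]      # r_{t+1} + gamma * v(s_{t+1})
    v[s] += ALPHA * (td_target - v[s])     # new estimate = old + alpha * (target - old)

# e.g. after observing the transition s_t = "A", r_{t+1} = 1.0, s_{t+1} = "B":
td0_update("A", 1.0, "B")
```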

The same explanation applies to the Monte Carlo update, where the target is the sampled return $G_t$ instead of $r_{t+1} + \gamma v_{t-1}(s_{t+1})$. I hope this answers your question.
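For comparison, here is a similar sketch of the Monte Carlo version (again just an illustration; the episode format, a list of (state, reward) pairs, is my own assumption):

```python
ALPHA, GAMMA = 0.1, 0.99
v = {}  # tabular state-value estimates

def mc_update(episode):
    """Every-visit Monte Carlo step: move v[s] towards the sampled return G_t."""
    g = 0.0
    for s, r in reversed(episode):      # episode = [(s_0, r_1), (s_1, r_2), ...]
        g = r + GAMMA * g               # G_t = r_{t+1} + gamma * G_{t+1}
        old = v.get(s, 0.0)
        v[s] = old + ALPHA * (g - old)  # the target here is the return G_t

mc_update([("A", 0.0), ("B", 0.0), ("C", 1.0)])
```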

Swakshar Deb