
In section 4.1 of Sutton and Barto's book, the Bellman equation is turned into an update rule simply by changing its indices. How is this mathematically justified? I didn't quite get the intuition of why we are allowed to do that.

$$\begin{aligned}
v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t \mid S_t = s] \\
&= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_{\pi}(s')\big]
\end{aligned}$$

from which the book goes to the update rule:

$$\begin{aligned}
v_{k+1}(s) &= \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{k}(S_{t+1}) \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_{k}(s')\big]
\end{aligned}$$
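
To make the update concrete for myself, here is a minimal sketch of how I understand it being applied in code. The two-state MDP, the dictionaries `P` and `pi`, and all other names below are made up for illustration, not taken from the book:

```python
import numpy as np

# A made-up two-state, two-action MDP, just to make the update rule concrete.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}  # equiprobable random policy
gamma = 0.9

v = np.zeros(2)  # v_0: an arbitrary initial estimate
for k in range(1000):
    v_new = np.zeros_like(v)
    for s in P:
        # v_{k+1}(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v_k(s')]
        v_new[s] = sum(
            pi[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
    v = v_new  # the old estimate v_k is replaced by v_{k+1}
print(v)  # the iterates approach v_pi for this toy problem (here, [5, 5])
```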

Saeid Ghafouri

3 Answers


Why are we allowed to convert the Bellman equations into update rules?

There is a simple reason for this: convergence. Chapter 4 of the same book mentions it. For example, in the case of policy evaluation, the produced sequence of estimates $\{v_k\}$ is guaranteed to converge to $v_\pi$ as $k$ (i.e. the number of iterations) goes to infinity. There are other RL algorithms that are also guaranteed to converge (e.g. tabular Q-learning).
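
One way to see this more formally (the standard fixed-point argument, in my own notation; it is not spelled out like this in chapter 4): for $\gamma < 1$, the right-hand side of the update, viewed as an operator $T^\pi$ on value functions, is a $\gamma$-contraction in the max norm with unique fixed point $v_\pi$, so

$$\|v_{k+1} - v_\pi\|_\infty = \|T^\pi v_k - T^\pi v_\pi\|_\infty \le \gamma \,\|v_k - v_\pi\|_\infty \le \gamma^{k+1}\,\|v_0 - v_\pi\|_\infty,$$

which goes to $0$ as $k \to \infty$, no matter how $v_0$ is initialized.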

To conclude, in many cases the update rules of simple reinforcement learning (or dynamic programming) algorithms closely mirror their mathematical formalization because algorithms based on those update rules are often guaranteed to converge. However, note that many more advanced reinforcement learning algorithms (especially the ones that use function approximators, such as neural networks, to represent the value functions or policies) are not guaranteed or known to converge.

nbro

To me, the Bellman update is simply supervised learning: the right-hand side (the bootstrap target) is a sample of the left-hand side (a conditional expectation). The Bellman equation simply says that the right-hand side is such a sample.
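
As a purely illustrative sketch of that view (not from the answer itself): a TD(0)-style update treats a single sampled right-hand side as a noisy regression target for the left-hand side. The environment, states, and names below are made up:

```python
import random

gamma, alpha = 0.9, 0.1
V = {0: 0.0, 1: 0.0}  # tabular value estimates for a made-up two-state chain

def sample_transition(s):
    # Made-up dynamics: jump to a uniformly random state,
    # receive reward 1 only when landing in state 1.
    s_next = random.choice([0, 1])
    r = 1.0 if s_next == 1 else 0.0
    return r, s_next

s = 0
for _ in range(100000):
    r, s_next = sample_transition(s)
    target = r + gamma * V[s_next]     # a sample ("bootstrap") of the right-hand side
    V[s] += alpha * (target - V[s])    # regress V(s) toward that sampled target
    s = s_next
print(V)  # with a constant step size, both values hover around 0.5 / (1 - 0.9) = 5
```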

kaiwenw

You're asking why finite-horizon policy evaluation converges to the infinite-horizon value, right?

Since the total return is bounded (because of the discount factor), you know that you can make your finite-horizon policy evaluation arbitrarily close to it in a finite number of steps.
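
For example (my notation, not part of the original answer): if rewards are bounded in magnitude by $R_{\max}$ and $\gamma < 1$, then everything beyond step $k$ contributes at most

$$\sum_{t=k}^{\infty} \gamma^{t} R_{\max} = \frac{\gamma^{k}}{1-\gamma}\, R_{\max},$$

which vanishes as $k \to \infty$, so the truncated (finite-horizon) evaluation can be made arbitrarily close to the infinite-horizon value.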

People praise Sutton and Barto's book, but I find it annoying to read, as the authors are not formal enough with the mathematics.

FourierFlux