
In section 1.5 of the book "Reinforcement Learning: An Introduction" by Sutton and Barto, they use tic-tac-toe as an example of an RL use case. In that section they provide the following temporal-difference update rule: $$ V(S_{t}) \leftarrow V(S_t) + \alpha[V(S_{t+1}) - V(S_t)] \tag{1} $$ However, in the chapter on TD methods they state that the "simplest TD method makes the update": $$ V(S_{t}) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] \tag{2} $$

Does anyone have an explanation for where $(1)$ comes from? I can't see how it is equivalent to $(2)$.

I am also wondering if anyone has a way of relating this update rule to dynamic programming. Is it supposed to be some sort of approximation of value iteration when the environment's dynamics, $p(s', r \mid s, a)$, are unknown?


1 Answer


Equation $(1)$ is a simplified form of the temporal-difference TD(0) update rule for the special case of an episodic task with no discounting ($\gamma = 1$) in which all intermediate rewards are zero and the outcome is encoded directly in the value of the terminal state, which is exactly the tic-tac-toe setup of Chapter 1. Under those assumptions the $R_{t+1}$ term in $(2)$ vanishes for every transition and the discount factor disappears, so $(2)$ reduces to $(1)$. Equation $(2)$ is the more general form, applicable to both episodic and continuing tasks with discounted future rewards, and it is the one typically presented for TD learning methods in RL.
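To make the reduction concrete, here is a minimal sketch (the state names, values, and step size are made up for illustration, not taken from the book) showing that with $\gamma = 1$ and zero intermediate rewards the update targets of $(1)$ and $(2)$ coincide:

```python
# Minimal sketch: TD(0) updates for an episodic task like tic-tac-toe,
# assuming gamma = 1 and all rewards are zero, with the outcome folded
# into the terminal state's value (V = 1 for a win), as in Chapter 1.

alpha = 0.1   # step size (illustrative choice)
gamma = 1.0   # no discounting in the episodic tic-tac-toe setting

# Hypothetical values for three successive states in one episode;
# the last state is terminal and its value encodes the outcome (a win).
V = {"s0": 0.5, "s1": 0.6, "s_terminal_win": 1.0}
trajectory = [("s0", 0.0, "s1"), ("s1", 0.0, "s_terminal_win")]

for s, reward, s_next in trajectory:
    # General TD(0) target from equation (2): R + gamma * V(s')
    general_target = reward + gamma * V[s_next]

    # Simplified target from equation (1): V(s')
    simplified_target = V[s_next]

    # With reward == 0 and gamma == 1 the two targets coincide.
    assert general_target == simplified_target

    # Apply the update: V(s) <- V(s) + alpha * [target - V(s)]
    V[s] += alpha * (general_target - V[s])

print(V)
```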

As for the connection to value iteration in dynamic programming, there is not much of one: the value-iteration update is based on the Bellman optimality operator of the Bellman optimality equation (BOE), and its convergence follows from the contraction-mapping fixed-point theorem, whereas the convergence of TD learning rests on Dvoretzky's convergence theorem from stochastic approximation.
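For reference (this is just the standard form from the DP chapter, not anything specific to the tic-tac-toe example), the value-iteration update with known dynamics $p(s', r \mid s, a)$ is $$ V(s) \leftarrow \max_{a} \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma V(s')\bigr], $$ which backs up over the full model and maximizes over actions at every state, whereas the TD update in $(2)$ uses a single sampled transition under the current policy and no model. That difference in the nature of the backup is why the two methods are analysed with different convergence tools.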
