
There are two types of value functions in reinforcement learning: the state value function $V^{\pi}(s)$ and the state-action value function $Q^{\pi}(s, a)$.

State value function:

This value tells us how good it is to be in state $s$ if we are following policy $\pi$. Formally, it is defined as the expected return obtained from time step $t$ onward, starting from state $s$ and following policy $\pi$.

$$V^{\pi}(s) = \mathbb{E}_{\pi}[R_{t} \mid s_t = s] = \mathbb{E}_{\pi} \left[ \sum \limits_{k=0}^{\infty} \gamma^{k}r_{t+k+1} \mid s_t = s\right]$$

State-action value function:

This value tells us how good it is to perform action $a$ in state $s$ if we are following policy $\pi$. Formally, it is defined as the expected return obtained from time step $t$ onward, starting from state $s$, taking action $a$, and following policy $\pi$ thereafter.

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[R_{t} \mid s_t = s, a_t = a] = \mathbb{E}_{\pi} \left[ \sum \limits_{k=0}^{\infty} \gamma^{k}r_{t+k+1} \mid s_t = s, a_t = a\right]$$
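For concreteness, here is a minimal sketch of how I understand these expectations: averaging sampled discounted returns over trajectories generated while following $\pi$ (every-visit Monte Carlo averaging). The data format and function names are just illustrative, not from any particular library.

```python
from collections import defaultdict

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^k * r_{t+k+1} over the remaining rewards of one trajectory."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

def mc_estimate_v_and_q(episodes, gamma=0.9):
    """Estimate V(s) and Q(s, a) by averaging observed discounted returns.

    Each episode is assumed to be a list of (state, action, reward) tuples,
    where reward is the reward received after taking the action.
    """
    v_returns = defaultdict(list)
    q_returns = defaultdict(list)
    for episode in episodes:
        for t, (s, a, _) in enumerate(episode):
            g = discounted_return([r for (_, _, r) in episode[t:]], gamma)
            v_returns[s].append(g)        # return observed starting from state s
            q_returns[(s, a)].append(g)   # return observed starting from (s, a)
    V = {s: sum(gs) / len(gs) for s, gs in v_returns.items()}
    Q = {sa: sum(gs) / len(gs) for sa, gs in q_returns.items()}
    return V, Q

# Example usage with two hand-written trajectories
episodes = [
    [(0, 1, 0.0), (1, 0, 1.0)],
    [(0, 1, 0.0), (2, 0, 0.5)],
]
V, Q = mc_estimate_v_and_q(episodes)
```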

Now, the Q-learning and SARSA algorithms are commonly used to update the $Q$ function under policy $\pi$, using the following recurrences respectively:

$$Q(s_t,a_t) = Q(s_t,a_t) + \alpha[r_{t+1} + \gamma \max\limits_{a} Q(s_{t+1},a) - Q(s_t,a_t)] $$

$$Q(s_t,a_t) = Q(s_t,a_t) + \alpha[r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)] $$
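To make sure I have these two updates right, here is a minimal tabular sketch of a single step of each (the dictionary-based table and default hyperparameters are just illustrative):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: bootstrap with max over actions in s_{t+1}."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One tabular SARSA step: bootstrap with the action actually taken in s_{t+1}."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Example usage with a table that defaults to 0.0 for unseen (state, action) pairs
Q = defaultdict(float)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```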

Now my question is about the recurrence relations in Temporal Difference (TD) algorithms that update state value functions. Are they analogous to the recurrences above, i.e. the following?

$$V(s_t) = V(s_t) + \alpha[r_{t+1} + \gamma \max V(s_{t+1}) - V(s_t)] $$

$$V(s_t) = V(s_t) + \alpha[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)] $$

If yes, what are the names of the algorithms that use these recurrences?
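For concreteness, this is how I picture a single step of the second recurrence being applied in a tabular setting (the table layout, states, and defaults are only illustrative):

```python
from collections import defaultdict

def td_state_value_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Update V(s_t) toward r_{t+1} + gamma * V(s_{t+1}), as in the second recurrence."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

# Example usage with a state-value table that defaults to 0.0
V = defaultdict(float)
td_state_value_update(V, s=0, r=1.0, s_next=2)
```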

hanugm
