
In Sutton & Barto's Reinforcement Learning: An Introduction (page 63), the authors introduce the optimal state-value function into the expression of the optimal action-value function as follows: $q_{*}(s,a)=\mathbb{E}[R_{t+1}+\gamma v_{*}(S_{t+1})|S_{t}=s, A_{t}=a], \forall s \in S, \forall a \in A$.

I don't understand what $v_{*}(S_{t+1})$ could possibly mean, since $v_{*}$ is a mapping, under the optimal policy $\pi_{*}$, from states to numbers, namely the expected returns starting from those states at different time steps.

I believe that the authors use the same notation to denote both the state-value function $v$, which satisfies $v(s)=\mathbb{E}[G_{t}|S_{t}=s], \forall s \in S$, and the random variable $\mathbb{E}[G_{t+1}|S_{t+1}]$, but I'm not sure.

Daviiid

1 Answer


I am not sure if it is standard notation, but Sutton & Barto use the convention that a function of a random variable is a new random variable: it maps values of the old variable to values of the new one using the function, without affecting the probability distribution (except that the function may be many-to-one, so probabilities can effectively combine, e.g. if several states all had $v_*(s) = 5$).

Given this convention, $v_*(S_{t+1})$ is a random variable over the optimal state values of the possible states at time step $t+1$. That is, it is distributed with the same probabilities, determined by the policy and the state transition rules, as $S_{t+1}$, but takes the associated value $v_*(s)$ for each possible next state $s$.
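
As a concrete illustration, here is a minimal Python sketch of that convention. The next-state probabilities and $v_*$ values are made-up numbers, and the names (`p_next_state`, `v_star`, `sample_v_star_next`) are placeholders for this example only:

```python
import random

# Hypothetical numbers for illustration only: a conditional distribution over
# next states given (s, a), and an optimal state-value table v_*.
p_next_state = {"s1": 0.7, "s2": 0.3}   # P(S_{t+1} = s' | S_t = s, A_t = a)
v_star = {"s1": 5.0, "s2": 2.0}         # v_*(s') for each possible next state

def sample_v_star_next():
    # Draw S_{t+1} from its distribution, then apply v_*: the result is one
    # sample of the random variable v_*(S_{t+1}).
    s_next = random.choices(list(p_next_state), weights=list(p_next_state.values()))[0]
    return v_star[s_next]

samples = [sample_v_star_next() for _ in range(10_000)]
print(sum(samples) / len(samples))  # close to 0.7*5.0 + 0.3*2.0 = 4.1
```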

The actual distribution of $v_{*}(S_{t+1})$ will vary depending on the conditions in the context where it is evaluated.

If you resolve the expectation in the first equation, which is conditioned on $S_t$ and $A_t$:

$q_{*}(s,a)=\mathbb{E}[R_{t+1}+\gamma v_{*}(S_{t+1})|S_{t}=s, A_{t}=a]$

$\qquad\quad= \sum_{r,s'} p(r,s'|s,a)(r + \gamma v_*(s'))$

. . . which expresses the action value $q_*(s,a)$ in terms of the state transition rules, the immediate reward, and the state value $v_*(s')$ one half-step ahead. That is, at the next state, but before the next (optimally chosen) action is taken.
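
A minimal sketch of that sum in Python, again with a made-up transition model $p(r,s'|s,a)$ and value table for a single $(s,a)$ pair (all names and numbers here are assumptions for illustration):

```python
# Hypothetical transition model p(r, s' | s, a) for one fixed (s, a), stored as
# {(r, s'): probability}; gamma and the v_* table are made-up illustration values.
gamma = 0.9
p = {(1.0, "s1"): 0.7, (0.0, "s2"): 0.3}
v_star = {"s1": 5.0, "s2": 2.0}

# q_*(s, a) = sum over (r, s') of p(r, s' | s, a) * (r + gamma * v_*(s'))
q_sa = sum(prob * (r + gamma * v_star[s_next]) for (r, s_next), prob in p.items())
print(q_sa)  # 0.7*(1 + 0.9*5.0) + 0.3*(0 + 0.9*2.0) = 4.39
```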

Neil Slater