
By substituting the optimal policy $\pi_{\star}$ into the Bellman equation, we get the Bellman equation for $v_{\pi_{\star}}(s)=v_{\star}(s)$:

$$ v_{\star}(s) = \sum\limits_a \pi_{\star}(a|s) \sum\limits_{s'} \sum_r p(s', r | s, a)[r + \gamma v_{\star}(s')]$$

From the above equation, how can we obtain this one?

$$ v_{\star}(s) = \max\limits_a \sum\limits_{s'} \sum\limits_r p(s', r|s,a) [r + \gamma v_{\star}(s')]$$


1 Answer


Your first equation is the definition of any state value function, so it must also be the definition of the optimal state value function $v_*$.

The second equation is the definition of $v_*$ in terms of the state-action value function $\color{green}{q_*}$.

In fact, the first equation is also the definition of $v_*$ in terms of $\color{green}{q_*}$, which is what you want to see :)

First, note that

\begin{align} v_\pi(s) &= \sum_a \pi(a \mid s) \color{blue}{\sum_{s'} \sum_r p(s', r|s, a) [r + \gamma v_\pi(s')]} \\ &= \sum_a \pi(a \mid s) \color{blue}{q_\pi(s, a)} \tag{1}\label{1}. \end{align}

Now, denote the optimal state value function by $v_{\pi_*} = v_*$. If we plug $v_*$, $\pi_*$ and $\color{green}{q_*}$ into the equation above, we get your first equation, but we write it as
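To make equation \ref{1} concrete, here is a minimal NumPy sketch. Everything in it (`p`, `r`, `gamma`, the policy `pi` and the value estimate `v_pi`) is made-up toy data, not anything from the question; the reward is assumed to be a deterministic function of $(s, a, s')$, so the double sum over $s'$ and $r$ collapses into a single sum over $s'$:

```python
import numpy as np

# Hypothetical toy MDP with 2 states and 2 actions.
# p[s, a, s2] = p(s' = s2 | s, a); r[s, a, s2] = reward for that transition.
gamma = 0.9
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
v_pi = np.array([3.0, 4.0])              # some estimate of v_pi
pi = np.array([[0.6, 0.4], [0.5, 0.5]])  # pi[s, a] = pi(a | s)

# Blue term: q_pi(s, a) = sum_{s'} p(s' | s, a) [r(s, a, s') + gamma v_pi(s')]
q_pi = (p * (r + gamma * v_pi)).sum(axis=2)

# Equation (1): v_pi(s) = sum_a pi(a | s) q_pi(s, a)
v_from_q = (pi * q_pi).sum(axis=1)
print(v_from_q)
```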

\begin{align} v_*(s) &= \sum_a \pi_*(a \mid s) \color{green}{q_*(s, a)} \tag{2}\label{2}. \end{align}

For finite MDPs, there always exists a deterministic optimal policy, i.e. one that chooses a single action (the optimal one) with probability $1$, which means that \ref{2} can be written as

\begin{align} v_*(s) &= 0 \cdot \color{green}{q_*(s, a_1)} + \cdots + 1 \cdot \color{green}{q_*(s, a_*)} + \cdots + 0 \cdot \color{green}{q_*(s, a_N)} \\ &= \color{green}{q_*(s, a_*)} \\ &= \max_a \color{green}{q_*(s, a)} \\ &= \max_a \color{green}{\sum_{s'} \sum_r p(s', r|s,a) [r + \gamma v_{*}(s')]} \tag{3}\label{3}, \end{align}

where $a_* = \text{argmax}_a \pi_*(a \mid s) = \text{argmax}_a \color{green}{q_*(s, a)}$ is the optimal action. By definition, the optimal action in state $s$ is the one that leads to the highest expected return. See also this answer.
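Equation \ref{3} is exactly the backup used in value iteration. Continuing the toy sketch above (reusing the assumed `p`, `r` and `gamma`), iterating this backup converges to $v_*$, and taking the argmax recovers a deterministic optimal policy:

```python
# Equation (3) as a fixed-point iteration (value iteration) on the toy MDP above.
v = np.zeros(2)
for _ in range(200):
    q = (p * (r + gamma * v)).sum(axis=2)  # q(s, a) under the current estimate v
    v = q.max(axis=1)                      # Bellman optimality backup: max over actions
print(v)                                   # approximates v_* (gamma < 1, so this contracts)

a_star = q.argmax(axis=1)                  # deterministic optimal policy: a_*(s)
print(a_star)
```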

Finally, note that optimal value functions are unique for finite MDPs, so $\color{green}{q_*}$ and $v_*$ are unique.
