
One can easily retrieve the optimal policy from the action value function, but how can it be obtained from the state value function?

Mika

1 Answer


You can obtain the optimal policy from the optimal state value function if you also have the state transition and reward model for the environment, $p(s',r|s,a)$: the probability of receiving reward $r$ and arriving in state $s'$ when starting in state $s$ and taking action $a$.

This looks like:

$$\pi^*(s) = \operatorname{argmax}_a \sum_{s',r} p(s',r|s,a)\,\big(r + \gamma v^*(s')\big)$$
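As a concrete illustration, here is a minimal Python sketch of this one-step lookahead for a tabular environment. The interface is an assumption for the example, not something from the question: `model[(s, a)]` is a list of `(probability, next_state, reward)` tuples and `v` maps states to their optimal values.

```python
import numpy as np

def greedy_policy_from_v(v, model, states, actions, gamma=0.99):
    """Extract a greedy policy from optimal state values v*.

    Assumed (hypothetical) interface:
      v[s]          -> value of state s
      model[(s, a)] -> list of (prob, next_state, reward) tuples
    """
    policy = {}
    for s in states:
        # One-step lookahead: expected return of each action under v
        q = np.array([
            sum(prob * (reward + gamma * v[s_next])
                for prob, s_next, reward in model[(s, a)])
            for a in actions
        ])
        policy[s] = actions[int(np.argmax(q))]  # greedy action choice
    return policy
```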

There are variations of this function, depending on how you represent knowledge of the environment. For instance, you don't actually need the full distribution model for the reward; an expected reward function combined with a separate distribution model for state transitions would also work, as sketched below.
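The same extraction with the model split that way might look like the following sketch, again under assumed interfaces: `r[(s, a)]` gives the expected immediate reward and `p[(s, a)]` gives a list of `(probability, next_state)` pairs.

```python
def greedy_policy_from_v_split(v, r, p, states, actions, gamma=0.99):
    """Greedy policy extraction with a split model (assumed interfaces):
      r[(s, a)] -> expected immediate reward for taking a in s
      p[(s, a)] -> list of (prob, next_state) transition pairs
    """
    policy = {}
    for s in states:
        # Expected reward plus discounted expected next-state value
        q = [r[(s, a)] + gamma * sum(prob * v[s_next]
                                     for prob, s_next in p[(s, a)])
             for a in actions]
        policy[s] = actions[max(range(len(actions)), key=lambda i: q[i])]
    return policy
```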

Without at least an approximate model of the environment, you cannot derive a policy from state values. If all you have is state values, then to pick an optimal action you absolutely need the ability to look ahead one time step and evaluate what the next state might be for each action choice.

Neil Slater