"If a model is not available, then it is particularly useful to estimate action values (the values of state-action pairs) rather than state values. With a model, state values alone are sufficient to determine a policy; one simply looks ahead one step and chooses whichever action leads to the best combination of reward and next state, as we did in the chapter on DP. Without a model, however, state values alone are not sufficient. One must explicitly estimate the value of each action in order for the values to be useful in suggesting a policy."

The above extract is from Sutton and Barto's Reinforcement Learning, Section 5.2 - part of the chapter on Monte Carlo Methods.

Could someone please explain in more detail why it is necessary to determine the value of each action (i.e. why state values alone are not sufficient) in order to suggest a policy in a model-free setting?


P.S.

From what I know, state values refer to the expected return one gets when starting from a state (we know that we'll reach a terminal state, since we're dealing with Monte Carlo methods which, at least in the book, look only at episodic MDPs). That being said, why is it not possible to suggest a policy solely on the basis of state values; why do we need state-action values? I'm a little confused, and it'd really help if someone could clear it up.

stoic-santiago

1 Answer

"why is it not possible to suggest a policy solely on the basis of state-values; why do we need state-action values?"

A policy function takes state as an argument and returns an action $a = \pi(s)$, or it may return a probability distribution over actions $\mathbf{Pr}\{A_t=a|S_t=s \} =\pi(a|s)$.

In order to do this rationally, an agent needs to use the knowledge it has gained to select the best action. In value-based methods, the agent needs to identify the action that has the highest expected return. As an aside, whilst learning it may not take that action because it has decided to explore. However, if it is not capable of even identifying a best action, then there is no hope of it ever finding an optimal policy, and it cannot even perform $\epsilon$-greedy action selection, which is a very basic exploration approach.

If you use an action value estimate, then the agent can select the greedy action simply:

$$\pi(s) = \text{argmax}_a Q(s,a)$$
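As a minimal sketch of the formula above, assuming a small tabular problem: the table `Q`, its state and action names, and the helper functions are all made up for illustration. The point is only that nothing besides the table is consulted, and that $\epsilon$-greedy selection falls out of the same lookup.

```python
import random

# Hypothetical action-value table: Q[state][action] -> estimated return.
Q = {
    "s0": {"left": 0.2, "right": 1.5},
    "s1": {"left": 0.7, "right": 0.1},
}

def greedy_action(Q, state):
    """Return argmax_a Q(state, a). No model of the environment is needed."""
    return max(Q[state], key=Q[state].get)

def epsilon_greedy_action(Q, state, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(list(Q[state]))
    return greedy_action(Q, state)

print(greedy_action(Q, "s0"))               # greedy choice in s0
print(epsilon_greedy_action(Q, "s1", 0.1))  # mostly greedy, sometimes random
```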

If you have state values, then the agent can select the greedy action directly only if it knows the model distribution $p(r,s'|s,a)$:

$$\pi(s) = \text{argmax}_a \sum_{r,s'}p(r,s'|s,a)(r + \gamma V(s'))$$
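A rough sketch of the same one-step lookahead in code, assuming a tabular setting. The state values `V`, the dictionary `model` standing in for $p(r,s'|s,a)$, and the numbers in them are invented for the example. Notice that the function cannot be written without `model` as an input.

```python
GAMMA = 0.9

# Hypothetical state values.
V = {"s0": 0.0, "s1": 2.0, "s2": 10.0}

# Hypothetical model: model[state][action] is a list of
# (probability, reward, next_state) triples, i.e. p(r, s' | s, a).
model = {
    "s0": {
        "safe": [(1.0, 1.0, "s1")],
        "risky": [(0.5, 0.0, "s2"), (0.5, -1.0, "s0")],
    }
}

def lookahead_value(model, V, state, action, gamma=GAMMA):
    """Expected value of r + gamma * V(s') under the model for one action."""
    return sum(p * (r + gamma * V[s_next])
               for p, r, s_next in model[state][action])

def greedy_action_from_v(model, V, state, gamma=GAMMA):
    """argmax over actions of the one-step lookahead; requires the model."""
    return max(model[state],
               key=lambda a: lookahead_value(model, V, state, a, gamma))

for a in model["s0"]:
    print(a, lookahead_value(model, V, "s0", a))
print(greedy_action_from_v(model, V, "s0"))
```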

In other words, to find the best action, the agent needs to look ahead one time step and work out the distribution of rewards and next states that would follow each action. If the only values the agent knows are state values, this is the only way it can determine the impact of any specific action.
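One way to see why state values alone cannot pin down the action is a toy case where the same state values are paired with two different sets of dynamics. Everything here (states `A` and `B`, the two dictionaries `model_1` and `model_2`, the values) is a made-up illustration, not something from the book.

```python
GAMMA = 1.0

# Hypothetical state values, identical in both scenarios.
V = {"start": 0.0, "A": 0.0, "B": 5.0}

# Two hypothetical models for the actions available in "start".
# Each entry is a list of (probability, reward, next_state) triples.
model_1 = {"left": [(1.0, 0.0, "A")], "right": [(1.0, 0.0, "B")]}
model_2 = {"left": [(1.0, 0.0, "B")], "right": [(1.0, 0.0, "A")]}

def greedy(model, V, gamma=GAMMA):
    """One-step lookahead greedy action from "start" under the given model."""
    def backup(action):
        return sum(p * (r + gamma * V[s_next])
                   for p, r, s_next in model[action])
    return max(model, key=backup)

# Same V, different dynamics, different best action.
print(greedy(model_1, V))
print(greedy(model_2, V))
```

Since `V` is the same in both calls, an agent holding only `V` has no way to tell which of the two actions leads to `B`.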

Although there are alternatives to this specific equation, there is no alternative that does not use a model in some form. For instance, if you can simulate the environment, you could simulate taking each action in turn, and look over multiple simulation runs to see which choice ends up with the best $(r + \gamma V(s'))$ on average. That would be a type of planning, and perhaps the start of a more sophisticated approach such as MCTS. However, that simulation is a model - it needs access to the transition probabilities in some form in order to correctly run.
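The simulation idea can be sketched roughly as follows. The function `simulate_step` is a hypothetical simulator invented for this example, and it is exactly the piece that counts as a model: without it, the averaging loop has nothing to sample from.

```python
import random

GAMMA = 0.9

# Hypothetical state values.
V = {"s0": 0.0, "s1": 2.0, "s2": 10.0}

def simulate_step(state, action, rng):
    """Hypothetical simulator: sample (reward, next_state). This is a model."""
    if action == "safe":
        return 1.0, "s1"
    # "risky": half the time reach s2, otherwise pay a cost and stay in s0.
    if rng.random() < 0.5:
        return 0.0, "s2"
    return -1.0, "s0"

def estimate_action_value(state, action, V, n_samples, rng, gamma=GAMMA):
    """Average r + gamma * V(s') over simulated one-step outcomes."""
    total = 0.0
    for _ in range(n_samples):
        reward, s_next = simulate_step(state, action, rng)
        total += reward + gamma * V[s_next]
    return total / n_samples

def choose_action(state, actions, V, n_samples=2000, seed=0):
    """Pick the action with the best simulated average; also return the estimates."""
    rng = random.Random(seed)
    estimates = {a: estimate_action_value(state, a, V, n_samples, rng)
                 for a in actions}
    return max(estimates, key=estimates.get), estimates

best, estimates = choose_action("s0", ["safe", "risky"], V)
print(best, estimates)
```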

It is possible to have an entirely separate policy function that you train alongside a state value function. This is the basis of Actor-Critic methods, which make use of policy gradients to adjust the policy function, and one of the value-based methods, such as TD learning, to learn a value function that assists with calculating the updates to the policy function. In that case you would not be using a value-based method on its own, so the quote from that part of Sutton & Barto does not apply.
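To make the Actor-Critic idea a bit more concrete, here is a very small sketch, assuming a one-state, one-step episodic problem and a softmax policy over action preferences. The environment `step`, the learning rates and the episode count are all invented for illustration, and this is only one simple variant rather than a reference implementation. The critic learns a state value `v` by a TD update, and the actor is adjusted using the TD error, so the policy never needs action values or a model.

```python
import math
import random

ACTIONS = ["left", "right"]

def step(action):
    """Hypothetical one-step environment: 'right' pays 1, 'left' pays 0, then ends."""
    return 1.0 if action == "right" else 0.0

def softmax_policy(prefs):
    """Turn action preferences into probabilities."""
    m = max(prefs.values())
    exps = {a: math.exp(h - m) for a, h in prefs.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def train(episodes=2000, alpha_actor=0.1, alpha_critic=0.1, seed=0):
    rng = random.Random(seed)
    prefs = {a: 0.0 for a in ACTIONS}  # actor parameters for the single state
    v = 0.0                            # critic: value estimate of the single state
    for _ in range(episodes):
        pi = softmax_policy(prefs)
        action = rng.choices(ACTIONS, weights=[pi[a] for a in ACTIONS])[0]
        reward = step(action)
        # TD error; the next state is terminal, so its value is taken as 0.
        td_error = reward - v
        v += alpha_critic * td_error
        # Policy-gradient step: grad of log softmax is 1[a == action] - pi(a).
        for a in ACTIONS:
            grad_log_pi = (1.0 if a == action else 0.0) - pi[a]
            prefs[a] += alpha_actor * td_error * grad_log_pi
    return softmax_policy(prefs), v

policy, value = train()
print(policy, value)
```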

Neil Slater