
The definition (section 3.6 of Sutton & Barto) of the ordering used to define the optimal policy states that $\pi \geq \pi'$ if and only if $v_{\pi}(s) \geq v_{\pi'}(s)$ for all $s \in S$.

I have difficulty understanding why the value under the optimal policy should be at least as high for every state. Suppose there are two policies:

  1. Policy 1, which gives a high value (say 10) for one particular state and very low values (say 1) for all other states.
  2. Policy 2, which gives moderate values (say 3) for all states.

If the agent's path goes through the same 10 states, Policy 1 nets a return (no discounting) of 19 (10×1 + 1×9) and Policy 2 gets 30 (3×10). The agent following Policy 2 will net a higher return than under Policy 1. Yet the definition prevents Policy 2 from being chosen as optimal over Policy 1.
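Written out (treating each state's value as the reward collected in that step of the path, as in my example above), the undiscounted returns I am comparing are:

$$G_{\pi_1} = 10 + 9 \times 1 = 19, \qquad G_{\pi_2} = 10 \times 3 = 30.$$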

What am I missing here? Can someone who understands this better please give some tips?

ahron
  • Another thing, for anyone having the same confusion, is to remember that the value of each state accounts for the (expected) reward from all subsequent states. – ahron Apr 21 '25 at 05:13

1 Answer


In RL, the definition of policy optimality is not about the return along some particular trajectory; it is about guaranteeing the best expected outcome, expressed here via state value functions, no matter which state the process starts from. Policy optimality is therefore defined pointwise, not in aggregate: for a policy $\pi^*$ to be considered optimal, it must yield a value that is at least as high as any other policy's value in every single state, i.e. $v_{\pi^*}(s) \ge v_{\pi}(s)$ for all $s \in S$ and all policies $\pi$. This is uniform dominance.

If a policy does very well in one state but poorly in others, it might yield a higher total return along some particular trajectories, but it is not robustly optimal. Note also that this ordering is only a partial order: neither of your two policies dominates the other (Policy 1 wins in the special state, Policy 2 wins everywhere else), so the definition does not rank them at all; it does not force Policy 1 to be preferred over Policy 2. What the theory guarantees for finite MDPs is that there always exists at least one policy whose value is at least as high as that of every other policy in every state, and that is what is meant by an optimal policy. In RL we typically want such a policy because it guarantees the best performance from every possible start state, even if some other policy could do extremely well from one particular state.
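To make this concrete, here is a minimal sketch in Python. The two-state deterministic MDP, its rewards, and the discount $\gamma = 0.5$ are made up for illustration (roughly mirroring your numbers), not taken from the book. It evaluates both policies exactly, shows that neither dominates the other in every state, and then runs value iteration to find a third policy that is at least as good as both everywhere, which is what the definition calls optimal.

```python
import numpy as np

# A tiny deterministic 2-state MDP (hypothetical, for illustration only).
# Action 0 = "Policy 1's choice", action 1 = "Policy 2's choice".
gamma = 0.5
n_states, n_actions = 2, 2

# next_state[s, a] and reward[s, a]
next_state = np.array([[1, 1],    # from the "special" state 0, both actions lead to state 1
                       [1, 1]])   # state 1 is absorbing
reward = np.array([[10.0, 3.0],   # state 0: big one-off reward vs moderate reward
                   [ 1.0, 3.0]])  # state 1: small reward vs moderate reward

def evaluate(policy):
    """Exact policy evaluation for a deterministic policy by solving
    v = r_pi + gamma * P_pi v."""
    P = np.zeros((n_states, n_states))
    r = np.zeros(n_states)
    for s in range(n_states):
        a = policy[s]
        P[s, next_state[s, a]] = 1.0
        r[s] = reward[s, a]
    return np.linalg.solve(np.eye(n_states) - gamma * P, r)

pi1 = [0, 0]          # grabs the big reward once, then small rewards forever
pi2 = [1, 1]          # moderate rewards everywhere
v1, v2 = evaluate(pi1), evaluate(pi2)
print("v_pi1 =", v1)  # [11.  2.] -> better than pi2 in state 0 only
print("v_pi2 =", v2)  # [ 6.  6.] -> better than pi1 in state 1 only
print("pi1 >= pi2 everywhere?", np.all(v1 >= v2))   # False
print("pi2 >= pi1 everywhere?", np.all(v2 >= v1))   # False -> incomparable

# Value iteration finds a policy that dominates both in every state.
v = np.zeros(n_states)
for _ in range(1000):
    v = np.max(reward + gamma * v[next_state], axis=1)
pi_star = np.argmax(reward + gamma * v[next_state], axis=1)
print("v* =", v, "pi* =", pi_star)                  # [13.  6.], actions [0 1]
print("pi* dominates both?", np.all(v >= v1) and np.all(v >= v2))  # True
```

The optimal policy here simply picks the best of both worlds: the big reward in the special state and the moderate reward elsewhere, so its value is at least as high as either of your two policies in every state.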

cinch