The definition used for optimal policies (Sutton and Barto, Section 3.6) states that $\pi \geq \pi'$ if and only if $v_{\pi}(s) \geq v_{\pi'}(s)$ for all $s \in S$.
I have difficulty understanding why the value under the optimal policy should be at least as high for every state. Suppose there are two policies:
- Policy 1 gives a high value (say 10) for one particular state and very low values (say 1) for all other states.
- Policy 2 gives moderate values (say 3) for all states.
If the agent's path goes through the same 10 states under both policies, policy 1 nets an undiscounted return of 19 ($10 \times 1 + 1 \times 9$) and policy 2 nets 30 ($3 \times 10$). So the agent following policy 2 collects a higher return than under policy 1. Yet the definition prevents policy 2 from being ranked above policy 1, because policy 1 has the higher value in one state.
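To make my arithmetic concrete, here is a minimal Python sketch of the example. The numbers are the made-up ones from above, the helper name `dominates` is my own, and I am assuming the state-by-state comparison is non-strict. It also sums the per-state numbers along the path as I did above, which may itself be part of what I am getting wrong.

```python
# Hypothetical numbers from the example above: 10 states, indexed 0..9.
policy1_values = [10] + [1] * 9  # one high-value state, nine low-value states
policy2_values = [3] * 10        # a moderate value in every state

# Totals along a path that visits each of the 10 states once,
# adding up the per-state numbers as done in the question.
total1 = sum(policy1_values)
total2 = sum(policy2_values)


def dominates(v_a, v_b):
    """Return True if v_a(s) >= v_b(s) for every state s (assumed non-strict)."""
    return all(a >= b for a, b in zip(v_a, v_b))


print(total1, total2)
print(dominates(policy1_values, policy2_values))
print(dominates(policy2_values, policy1_values))
```

With these numbers, neither policy satisfies the state-by-state condition against the other, even though the totals differ.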
What am I missing here? Could someone who understands this better explain?