In Sutton & Barto (Section 3.6 - Optimal Policies and Optimal Value Functions), they say that:
Value functions define a partial ordering over policies. A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi \ge \pi'$ if and only if $v_\pi(s) \ge v_{\pi'}(s)$ for all $s \in \mathcal{S}$.
My question is: why is a better policy defined as one whose value is greater at every single state, rather than by some combined metric (e.g. a sum or weighted average) over all the state values of a policy?
If there is a policy that achieves a higher value in 99 out of 100 states, but a lower value in the remaining state than a second policy (which performs poorly in the other 99 states), would this first policy then not be considered better, or optimal, according to the definition above?
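To make what I mean concrete, here is a minimal sketch (not from the book; the value numbers are made up) showing that, under the definition above, neither of the two policies in my scenario dominates the other, so they end up incomparable under the partial ordering:

```python
import numpy as np

# Hypothetical state values for illustration only: 100 states, two policies.
# Policy A has the higher value in states 0-98; policy B is higher only in state 99.
v_a = np.full(100, 10.0)
v_a[99] = 1.0
v_b = np.full(100, 2.0)
v_b[99] = 5.0

a_geq_b = np.all(v_a >= v_b)  # does A dominate B at every state?
b_geq_a = np.all(v_b >= v_a)  # does B dominate A at every state?

print(a_geq_b, b_geq_a)       # False, False -> A and B are incomparable
```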