
I am trying to solve the following exercise from Sutton and Barto:

Sutton and Barto, Exercise 3.27: Give an equation for $\pi_*$ in terms of $q_*(s,a)$.

However, I am struggling to do so. I know that $\pi_*$ is the policy that picks the action with the highest expected return, given that we know the optimal action values. So, intuitively, I would express the optimal policy like this: $$\pi_*(\text{argmax}_a q_*(s,a) \mid s) = 1.$$ Expressing it as $$\pi_* = \text{argmax}_a q_*(s,a)$$

seems wrong, since $\pi_*$ is a probability distribution. What am I not getting right here?


1 Answer


An optimal policy is just a greedy policy with respect to the optimal state-action value function (which is unique for a given MDP). So, $\pi_* = \text{argmax}_a q_*(s,a)$ is almost correct: it should be

$$\pi_*(s) = \text{argmax}_a q_*(s,a), \quad \forall s \in \mathcal{S}.$$

In this case, $\pi_*$ is a decision rule, i.e. a function from states to actions.

If you define $\pi_*$ as a probability distribution instead, then you can do something like this

$$ \pi_*(a \mid s)= \begin{cases} 1, & \text{if } a = \text{argmax}_{a'} q_*(s,a')\\ 0, & \text{otherwise,}\end{cases} $$ $\forall s \in \mathcal{S}$.

If $\text{argmax}_{a'} q_*(s,a')$ is a set (i.e. several actions tie for the maximum), you can distribute the probability among the actions in that set with any rule you want and, of course, still assign zero probability to the non-optimal actions (see the sketch below).
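As a concrete illustration, here is a minimal NumPy sketch (the `q_star` table and both helper functions are made up for this answer, not taken from the book) that extracts the deterministic and the stochastic form of the greedy policy from a tabular $q_*$, including the tie case:

```python
import numpy as np

# Hypothetical tabular q_* with 2 states and 3 actions; the numbers are
# made up purely to illustrate the two forms of the greedy policy.
q_star = np.array([
    [1.0, 3.0, 3.0],   # state 0: actions 1 and 2 tie for the maximum
    [0.5, 0.2, 0.1],   # state 1: action 0 is uniquely optimal
])

def greedy_decision_rule(q):
    """Deterministic form: pi_*(s) = argmax_a q_*(s, a).
    np.argmax breaks ties by returning the first maximising action."""
    return np.argmax(q, axis=1)

def greedy_distribution(q):
    """Stochastic form: pi_*(a | s) puts probability only on maximising
    actions; here it is spread uniformly over the tied actions."""
    is_max = q == q.max(axis=1, keepdims=True)
    return is_max / is_max.sum(axis=1, keepdims=True)

print(greedy_decision_rule(q_star))   # [1 0]
print(greedy_distribution(q_star))    # [[0.  0.5 0.5]
                                      #  [1.  0.  0. ]]
```

Both outputs describe an optimal policy: the deterministic rule simply commits to one of the tied actions, while the distribution spreads the probability over all of them.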

This is true for finite MDPs (i.e. MDPs with finite state, action, and reward spaces).

There's a result (Puterman, 1994) that states that every finite MDP has an optimal policy that is deterministic and Markovian. So, optimal policies aren't (necessarily) probability distributions; whether you write $\pi_*$ as a function or as a (degenerate) distribution is a matter of convention and convenience.
