
I am reading Sutton & Barto's book "Reinforcement Learning: An Introduction". In this book, they define the optimal value function as:

$$v_*(s) = \max_{\pi} v_\pi(s),$$ for all $s \in \mathcal{S}$.

Do we take the max over all deterministic policies, or do we also consider stochastic policies? (Is there an example where a stochastic policy always performs better than any deterministic one?)

My intuition is that the value function of a stochastic policy is more or less a linear combination of the value functions of the deterministic policies it mixes; however, because of the self-references in the value function, this is not mathematically exact.

If we do maximise over all stochastic policies, shouldn't we take the supremum? Or do we know that the supremum is attained, and therefore it is truly a maximum?

nbro
Tamar

1 Answer


The value function is defined as $v_\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]$, where $G_t$ is the (discounted) return from time step $t$. The expectation is taken with respect to the policy $\pi$ and the transition dynamics of the MDP.
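To make this concrete, here is a minimal sketch of how $v_\pi$ can be computed for any policy, stochastic or deterministic, by solving the linear Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$. The 2-state, 2-action MDP, its numbers, and the helper `evaluate` are all made up purely for illustration:

```python
import numpy as np

# Toy 2-state, 2-action MDP; all numbers are invented purely for illustration.
# P[s, a, s'] is the transition probability, R[s, a] the expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def evaluate(pi):
    """Return v_pi by solving (I - gamma * P_pi) v = r_pi for a policy pi[s, a]."""
    r_pi = (pi * R).sum(axis=1)                # expected one-step reward under pi
    P_pi = np.einsum('sa,sat->st', pi, P)      # state-to-state transitions under pi
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

# A stochastic policy (each action with probability 1/2) and a deterministic one.
print(evaluate(np.full((2, 2), 0.5)))
print(evaluate(np.array([[1.0, 0.0], [0.0, 1.0]])))
```

The same `evaluate` function handles both kinds of policy, since a deterministic policy is just the special case where each row of `pi` puts probability 1 on a single action.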

Now, as you pointed out, the optimal value function is defined as $v_*(s) = \max_\pi v_\pi(s)\; ; \;\forall s \in \mathcal{S}$. All we are doing here is choosing a policy $\pi$ that maximises the value function; this can be a deterministic or a stochastic policy, though intuitively it will be deterministic unless for some states there are two (or more) actions with the same expected value, in which case you can take any of those actions with equal probability, making the policy stochastic (without changing its value).

For a finite MDP (which is what I assumed above too), we know that an optimal policy achieving the optimal value function exists (this is mentioned in the book), so taking the maximum, rather than the supremum, is fine here.
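As a toy demonstration of both points, the sketch below runs value iteration on the same made-up MDP as above, extracts the greedy deterministic policy, and brute-forces all four deterministic policies to show that the maximum is attained by one of them; the MDP, its numbers, and the helper `evaluate_deterministic` are again purely illustrative assumptions:

```python
import numpy as np
from itertools import product

# Same toy 2-state, 2-action MDP as in the previous sketch (invented numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality operator to approximate v_*.
v = np.zeros(2)
for _ in range(1000):
    v = (R + gamma * np.einsum('sat,t->sa', P, v)).max(axis=1)

# Greedy deterministic policy with respect to (the approximation of) v_*.
q = R + gamma * np.einsum('sat,t->sa', P, v)    # q[s, a] = one-step lookahead value
greedy_actions = q.argmax(axis=1)
print("v_* (approx):", v, "greedy deterministic policy:", greedy_actions)

# Brute force over all 2^2 = 4 deterministic policies: the best of them matches v_*,
# so the max over policies is attained here by a deterministic policy.
def evaluate_deterministic(actions):
    P_pi = P[np.arange(2), actions]             # P_pi[s, s'] when taking actions[s] in s
    r_pi = R[np.arange(2), actions]
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

for actions in product(range(2), repeat=2):
    print(actions, evaluate_deterministic(np.array(actions)))
```

In this toy example, the value function of the greedy deterministic policy matches $v_*$ in every state, which is exactly the sense in which the maximum in $v_*(s) = \max_\pi v_\pi(s)$ is attained rather than merely approached.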

David