Could someone please help me gain some intuition as to why the optimal policy for a Markov Decision Process in the infinite horizon case (agent acts forever) is deterministic?
2 Answers
Suppose you learned your action-value function perfectly. Recall that the action-value function measures the expected return after taking a given action in a given state. Now, the goal when solving an MDP is to find a policy that maximizes expected return. Suppose you're in state $s$ and, according to your action-value function, action $a$ maximizes the expected return. Then, given that goal, the only action you would ever need to take from state $s$ is $a$. In other words, $\pi(a'\mid s) = \mathbf{1}[a'=a]$, which is a deterministic policy.
Now, you might argue that your action-value function will never be perfect. That just means you need more exploration, which can manifest itself as stochasticity in the policy while you are still learning. In the limit of infinite data, though, the action-value function will have converged to the optimum, and acting greedily with respect to it gives a deterministic optimal policy.
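To make the argument above concrete, here is a minimal sketch in plain Python. It assumes a small tabular MDP with known transitions and a discount factor below 1; the toy MDP and the names `q_value_iteration` and `greedy_policy` are made up for illustration and are not from any library. It computes $Q^*$ by value iteration and then builds the one-hot policy $\pi(a'\mid s) = \mathbf{1}[a'=a]$.

```python
def q_value_iteration(P, R, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Approximate Q* for a tabular MDP.

    P[s][a][t] is the probability of moving from state s to state t under
    action a, and R[s][a] is the expected immediate reward.
    """
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_iter):
        V = [max(row) for row in Q]  # V(s) = max_a Q(s, a)
        # Bellman optimality backup for every (s, a) pair.
        Q_new = [
            [
                R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        delta = max(
            abs(Q_new[s][a] - Q[s][a])
            for s in range(n_states)
            for a in range(n_actions)
        )
        Q = Q_new
        if delta < tol:
            break
    return Q


def greedy_policy(Q):
    """One-hot policy that puts all probability on an argmax action."""
    policy = []
    for row in Q:
        best = row.index(max(row))  # ties are broken by the lowest index
        policy.append([1.0 if a == best else 0.0 for a in range(len(row))])
    return policy


# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# State 0: action 0 stays in state 0 (reward 0), action 1 moves to state 1 (reward 1).
# State 1: action 0 stays in state 1 (reward 2), action 1 moves to state 0 (reward 0).
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [
    [0.0, 1.0],
    [2.0, 0.0],
]

Q_star = q_value_iteration(P, R, gamma=0.9)
pi_star = greedy_policy(Q_star)
print(pi_star)
```

Every row of `pi_star` has a single 1 and zeros elsewhere, so no randomness is needed once $Q^*$ is known. The policy is also the same at every time step, which is where the infinite horizon matters: the problem looks identical from any state no matter how long the agent has already been acting.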
 
    
The premise of this question is somewhat misleading. An MDP always has a deterministic optimal policy, but this does not mean that a stochastic optimal policy never exists. Talking about "the" optimal policy can itself be misleading, as there may be many different optimal policies.
For example, we could certainly imagine an MDP where $Q^*(s,a_1) = Q^*(s,a_2)$ for two different actions $a_1$ and $a_2$ that both maximize the optimal action-value function $Q^*$ at some state $s$. Then a stochastic policy choosing randomly between $a_1$ and $a_2$ at $s$ is optimal, but so is a deterministic policy that always picks $a_1$ at $s$, and so is a deterministic policy that always picks $a_2$ at $s$.
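As a rough numerical check of the tie argument, the sketch below evaluates three policies on a toy MDP constructed so that two actions tie at state 0. The MDP, the discount factor of 0.5, and the helper name `evaluate_policy` are all illustrative assumptions, not taken from the question. Index 0 plays the role of $a_1$ and index 1 the role of $a_2$.

```python
def evaluate_policy(P, R, pi, gamma, tol=1e-12, max_iter=100_000):
    """Iterative policy evaluation for a tabular MDP.

    pi[s][a] is the probability of taking action a in state s.
    Returns the list of state values V^pi(s).
    """
    n_states = len(R)
    V = [0.0] * n_states
    for _ in range(max_iter):
        V_new = [
            sum(
                pi[s][a]
                * (R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_states)))
                for a in range(len(R[s]))
            )
            for s in range(n_states)
        ]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            break
    return V


# Hypothetical MDP with gamma = 0.5, built so that Q*(0, a1) = Q*(0, a2) = 2.
# State 0: a1 stays in state 0 with reward 1; a2 moves to state 1 with reward 0.
# State 1: both actions stay in state 1 with reward 2.
gamma = 0.5
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
R = [
    [1.0, 0.0],
    [2.0, 2.0],
]

always_a1 = [[1.0, 0.0], [1.0, 0.0]]   # deterministic: a1 at state 0
always_a2 = [[0.0, 1.0], [1.0, 0.0]]   # deterministic: a2 at state 0
coin_flip = [[0.5, 0.5], [1.0, 0.0]]   # stochastic: uniform over a1, a2 at state 0

v_a1 = evaluate_policy(P, R, always_a1, gamma)
v_a2 = evaluate_policy(P, R, always_a2, gamma)
v_mix = evaluate_policy(P, R, coin_flip, gamma)
print(v_a1, v_a2, v_mix)
```

All three policies give the same value at every state, so each of them is optimal in this toy example. A stochastic policy can only be optimal in this way, by spreading probability over actions that tie for the maximum of $Q^*$; putting weight on a strictly worse action would lower the value.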
 
    
 
    