
In the context of reinforcement learning, a policy, $\pi$, is often defined as a function from the space of states, $\mathcal{S}$, to the space of actions, $\mathcal{A}$, that is, $\pi : \mathcal{S} \rightarrow \mathcal{A}$. This function is the "solution" to a problem, which is represented as a Markov decision process (MDP), so we often say that $\pi$ is a solution to the MDP. In general, for each MDP $\mathcal{M}$, we want to find the optimal policy $\pi^*$, that is, the policy which makes the agent behave optimally (i.e. obtain the highest "cumulative future discounted reward", or, in short, the highest "return").

However, when discussing RL algorithms such as Q-learning, people often mention "policies" like $\epsilon$-greedy, greedy, soft-max, etc., without ever saying whether or not these policies are solutions to some MDP. These seem to be two different types of policies: for example, the "greedy policy" always chooses the action with the highest expected return, no matter which state we are in (and similarly for the "$\epsilon$-greedy policy"), whereas a policy which is a solution to an MDP is a map between states and actions.

What, then, is the relation between a policy which is a solution to an MDP and a policy like $\epsilon$-greedy? Is a policy like $\epsilon$-greedy a solution to any MDP? How can we formalise a policy like $\epsilon$-greedy in the same way that I formalised a policy which is a solution to an MDP?

I understand that "$\epsilon$-greedy" can be called a policy because, in algorithms like Q-learning, it is used to select actions (i.e. it allows the agent to behave), and this is the fundamental definition of a policy.

nbro

1 Answer


for example, the "greedy policy" always chooses the action with the highest expected return, no matter which state we are in

The "no matter which state we are in" there is generally not true; in general, the expected return depends on the state we are in and the action we choose, not just the action.

In general, I wouldn't say that a policy is a mapping from states to actions, but a mapping from states to probability distributions over actions. That would only be equivalent to a mapping from states to actions for deterministic policies, not for stochastic policies.

Assuming that our agent has access to (estimates of) value functions $Q(s, a)$ for state-action pairs, the greedy and $\epsilon$-greedy policies can be described in precisely the same way.

Let $\pi_g (s, a)$ denote the probability assigned to an action $a$ in a state $s$ by the greedy policy. For simplicity, I'll assume there are no ties (otherwise it would in practice be best to randomize uniformly across the actions leading to the highest values). This probability is given by:

$$ \pi_g (s, a) = \begin{cases} 1, & \text{if } a = \arg\max_{a'} Q(s, a') \\ 0, & \text{otherwise} \end{cases} $$
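For concreteness, here is a minimal Python sketch of that greedy policy as a distribution over actions, assuming the estimates $Q(s, \cdot)$ for the current state are available as a NumPy array (the array name `q_values` and the function name are just illustrative):

```python
import numpy as np

def greedy_policy(q_values: np.ndarray) -> np.ndarray:
    """Return pi_g(s, .) as a probability vector over actions,
    given Q-value estimates Q(s, .) for a single state s.
    Assumes there are no ties among the Q-values."""
    probs = np.zeros_like(q_values, dtype=float)
    probs[np.argmax(q_values)] = 1.0  # all probability mass on the greedy action
    return probs
```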

Similarly, let $\pi_{\epsilon} (s, a)$ denote the probability assigned to an action $a$ in a state $s$ by an $\epsilon$-greedy strategy, given by:

$$ \pi_{\epsilon} (s, a) = \begin{cases} (1 - \epsilon) + \frac{\epsilon}{\vert \mathcal{A}(s) \vert}, & \text{if } a = \arg\max_{a'} Q(s, a') \\ \frac{\epsilon}{\vert \mathcal{A}(s) \vert}, & \text{otherwise} \end{cases} $$ where $\vert \mathcal{A}(s) \vert$ denotes the size of the set of legal actions in state $s$.
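And the corresponding sketch for the $\epsilon$-greedy distribution, under the same assumptions (a `q_values` array for the current state plus a chosen `epsilon`); the commented line shows how you could sample an action from it, e.g. inside a Q-learning loop:

```python
import numpy as np

def epsilon_greedy_policy(q_values: np.ndarray, epsilon: float) -> np.ndarray:
    """Return pi_eps(s, .) as a probability vector over actions:
    every action gets epsilon / |A(s)| probability mass, and the
    greedy action additionally gets the remaining (1 - epsilon)."""
    num_actions = len(q_values)
    probs = np.full(num_actions, epsilon / num_actions)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

# Example: sample an action for the current state
# action = np.random.choice(len(q_values), p=epsilon_greedy_policy(q_values, epsilon=0.1))
```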

Dennis Soemers