
In Sutton & Barto's book on reinforcement learning (section 5.4, p. 100) we have the following:

> The on-policy method we present in this section uses $\epsilon$-greedy policies, meaning that most of the time they choose an action that has maximal estimated action value, but with probability $\epsilon$ they instead select an action at random. That is, all nongreedy actions are given the minimal probability of selection, $\frac{\epsilon}{|\mathcal{A}|}$, and the remaining bulk of the probability, $1-\epsilon+\frac{\epsilon}{|\mathcal{A}|}$, is given to the greedy action.

I understand the probability of selecting an action at random: the total probability of a random selection is $\epsilon$, and any of the $|\mathcal{A}|$ actions can be the one chosen, so each individual action is selected at random with probability $\frac{\epsilon}{|\mathcal{A}|}$.

However, I did not understand how the probability $1-\epsilon+\frac{\epsilon}{|\mathcal{A}|}$ for greedy action selection was derived. How is it calculated?

user3489173

2 Answers


> I did not understand how the probability $1-\epsilon+\frac{\epsilon}{|\mathcal{A}|}$ for greedy action selection was derived.

It is the sum of the probabilities of two mutually exclusive possibilities:

  • The agent chooses to exploit, selecting the greedy action, with probability $1 - \epsilon$.

  • The agent chooses to explore (probability $\epsilon$) and happens to select the greedy action at random (probability $\frac{1}{|\mathcal{A}|}$), for a combined probability of $\frac{\epsilon}{|\mathcal{A}|}$.

Although you might expect exploration to exclude the greedy action, in the $\epsilon$-greedy approach it does not. This keeps the implementation simple, although it does mean this extra term crops up in the expression for the probability of choosing the greedy action.

Moving the excess probability to the other actions, or adjusting the original exploration probability, would each result in a small extra term cropping up elsewhere in the theory. There isn't a "clean" way to build an $\epsilon$-soft action probability distribution without this slight added complexity.
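To make this concrete, here is a minimal sketch in Python with NumPy (the Q-values and $\epsilon$ are made up for illustration; none of this comes from the book) of the simple implementation described above, together with a Monte Carlo check that the greedy action is selected with probability $1-\epsilon+\frac{\epsilon}{|\mathcal{A}|}$:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """Sample an action epsilon-greedily.

    With probability epsilon, pick uniformly at random from ALL actions
    (the greedy one included); otherwise pick the greedy action.
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: greedy action not excluded
    return int(np.argmax(q_values))              # exploit: greedy action

# Monte Carlo check that P(greedy) = 1 - epsilon + epsilon / |A|
rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 0.2, 0.3])  # hypothetical estimates; greedy action is index 1
epsilon = 0.2
greedy = int(np.argmax(q_values))

n = 100_000
hits = sum(epsilon_greedy_action(q_values, epsilon, rng) == greedy for _ in range(n))
print(f"empirical:   {hits / n:.4f}")
print(f"theoretical: {1 - epsilon + epsilon / len(q_values):.4f}")
```

For $\epsilon = 0.2$ and $|\mathcal{A}| = 4$, both printed values should be close to $(1 - 0.2) + 0.2/4 = 0.85$.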

Neil Slater

> I did not understand how the probability $1-\epsilon+\frac{\epsilon}{|\mathcal{A}|}$ for greedy action selection was derived.

It follows from summing over mutually exclusive events. If $A$ and $B$ are mutually exclusive events and $C$ can happen under both (here $A$ is the event that the agent exploits, $B$ that it explores, and $C$ that the greedy action is selected):

(Figure: diagram of mutually exclusive events $A$ and $B$, with $C$ overlapping both.)

Then $P(C) = P(C \cap A) + P(C \cap B) = (1-\epsilon) + \frac{\epsilon}{|\mathcal{A}|}$.

In English: the probability of $C$ happening equals (the probability of both $C$ and $A$ happening) plus (the probability of both $C$ and $B$ happening).
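Spelling the same computation out with conditional probabilities, since exploiting always selects the greedy action while exploring selects it with probability $\frac{1}{|\mathcal{A}|}$:

$$P(C) = P(C \mid A)\,P(A) + P(C \mid B)\,P(B) = 1 \cdot (1-\epsilon) + \frac{1}{|\mathcal{A}|} \cdot \epsilon = 1-\epsilon+\frac{\epsilon}{|\mathcal{A}|}.$$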