
In Sutton & Barto's book on reinforcement learning (section 5.4, p. 100) we have the following:

> The on-policy method we present in this section uses $\epsilon$-greedy policies, meaning that most of the time they choose an action that has maximal estimated action value, but with probability $\epsilon$ they instead select an action at random. That is, all nongreedy actions are given the minimal probability of selection, $\frac{\epsilon}{|\mathcal{A}|}$, and the remaining bulk of the probability, $1-\epsilon+\frac{\epsilon}{|\mathcal{A}|}$, is given to the greedy action.

I understand the probability of selecting an action at random: the total probability of a random selection is $\epsilon$, and any of the $|\mathcal{A}|$ actions can be the one chosen, so each individual action is selected at random with probability $\frac{\epsilon}{|\mathcal{A}|}$.

However, I did not understand how the probability $1-\epsilon+\frac{\epsilon}{|\mathcal{A}|}$ for greedy action selection was derived. How is it calculated?

user3489173

2 Answers


> I did not understand how the probability $1-\epsilon+\frac{\epsilon}{|\mathcal{A}|}$ for greedy action selection was derived.

It is the sum of the probabilities of two mutually exclusive possibilities:

  • The agent chooses to exploit, selecting the greedy action, with probability $1 - \epsilon$.

  • The agent chooses to explore (probability $\epsilon$) and happens to select the greedy action at random (probability $\frac{1}{|\mathcal{A}|}$), for a combined probability of $\frac{\epsilon}{|\mathcal{A}|}$.

Although you might expect exploration to exclude the greedy action, in the $\epsilon$-greedy approach it does not. This keeps the implementation simple, although it does mean this extra term crops up in the expression for the probability of choosing the greedy action.

Moving the excess probability to the other actions, or adjusting the original exploration probability, would each result in a small extra term cropping up elsewhere in the theory. There isn't a "clean" way to build an $\epsilon$-soft action probability distribution without this slight added complexity.
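To make this concrete, here is a minimal sketch in Python with NumPy (the Q-values and $\epsilon$ are made up for illustration; none of this comes from the book) of the simple implementation described above, together with a Monte Carlo check that the greedy action is selected with probability $1-\epsilon+\frac{\epsilon}{|\mathcal{A}|}$:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """Sample an action epsilon-greedily.

    With probability epsilon, pick uniformly at random from ALL actions
    (the greedy one included); otherwise pick the greedy action.
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: greedy action not excluded
    return int(np.argmax(q_values))              # exploit: greedy action

# Monte Carlo check that P(greedy) = 1 - epsilon + epsilon / |A|
rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 0.2, 0.3])  # hypothetical estimates; greedy action is index 1
epsilon = 0.2
greedy = int(np.argmax(q_values))

n = 100_000
hits = sum(epsilon_greedy_action(q_values, epsilon, rng) == greedy for _ in range(n))
print(f"empirical:   {hits / n:.4f}")
print(f"theoretical: {1 - epsilon + epsilon / len(q_values):.4f}")
```

For $\epsilon = 0.2$ and $|\mathcal{A}| = 4$, both printed values should be close to $(1 - 0.2) + 0.2/4 = 0.85$.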

Neil Slater

> I did not understand how the probability $1-\epsilon+\frac{\epsilon}{|\mathcal{A}|}$ for greedy action selection was derived.

It follows from summing over mutually exclusive events. If $A$ and $B$ are mutually exclusive events and $C$ can happen under both (here $A$ is the event that the agent exploits, $B$ that it explores, and $C$ that the greedy action is selected):

(Figure: diagram of mutually exclusive events $A$ and $B$, with $C$ overlapping both.)

Then $P(C) = P(C \cap A) + P(C \cap B) = (1-\epsilon) + \frac{\epsilon}{|\mathcal{A}|}$.

In English: the probability of $C$ happening equals (the probability of both $C$ and $A$ happening) plus (the probability of both $C$ and $B$ happening).
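Spelling the same computation out with conditional probabilities, since exploiting always selects the greedy action while exploring selects it with probability $\frac{1}{|\mathcal{A}|}$:

$$P(C) = P(C \mid A)\,P(A) + P(C \mid B)\,P(B) = 1 \cdot (1-\epsilon) + \frac{1}{|\mathcal{A}|} \cdot \epsilon = 1-\epsilon+\frac{\epsilon}{|\mathcal{A}|}.$$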