For questions about the $\epsilon$-greedy policy, which is typically used as a behavioural policy (i.e. the policy a reinforcement learning agent follows to select actions while interacting with the environment).
Questions tagged [epsilon-greedy-policy]
29 questions
7 votes · 1 answer
What happens when you select actions using softmax instead of epsilon greedy in DQN?
I understand the two major branches of RL are Q-Learning and Policy Gradient methods.
From my understanding (correct me if I'm wrong), policy gradient methods have an inherent exploration built-in as it selects actions using a probability…
Linsu Han · 73
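The trade-off this question asks about shows up directly in the two selection rules. Here is a minimal sketch of both (the Q-values, `eps`, and `temperature` are illustrative values, not taken from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
q_values = np.array([1.0, 2.0, 0.5])   # hypothetical Q-values for one state

def epsilon_greedy(q, eps, rng):
    # With probability eps pick uniformly at random, otherwise pick argmax.
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def softmax_action(q, temperature, rng):
    # Boltzmann exploration: sample each action in proportion to exp(Q/T).
    z = q / temperature
    z = z - z.max()                     # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(q), p=probs))
```

With `eps = 0` the first rule is fully greedy, while softmax still assigns every action a non-zero, Q-value-dependent probability — that is the qualitative difference between the two exploration schemes.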
6 votes · 1 answer
What is the probability of selecting the greedy action in a 0.5-greedy selection method for the 2-armed bandit problem?
I'm new to reinforcement learning and I'm going through Sutton and Barto. Exercise 2.1 states the following:
In $\varepsilon$-greedy action selection, for the case of two actions and $\varepsilon=0.5$, what is the probability that the greedy action…
Daviiid · 585
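The quantity the exercise asks for decomposes as the exploit probability plus the chance that a uniform exploratory draw happens to land on the greedy arm. A quick numerical sanity check of that decomposition (a sketch, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, n_actions = 0.5, 2

# P(greedy) = P(exploit) + P(explore) * P(uniform pick is greedy)
analytic = (1 - eps) + eps / n_actions          # 0.5 + 0.25 = 0.75

# Monte Carlo check of the same quantity (arm 0 taken as the greedy arm)
exploit = rng.random(200_000) >= eps
uniform_pick = rng.integers(n_actions, size=200_000) == 0
empirical = np.mean(exploit | (~exploit & uniform_pick))
```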
6 votes · 1 answer
Is this proof of $\epsilon$-greedy policy improvement correct?
The following paragraph about $\epsilon$-greedy policies can be found at the end of page 100, under section 5.4, of the book "Reinforcement Learning: An Introduction" by Richard Sutton and Andrew Barto (second edition, 2018).
but with probability…
Jarvis1997 · 157
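For readers checking the proof, the core chain of (in)equalities in Sutton & Barto's argument (§5.4) can be sketched as follows, for an $\varepsilon$-greedy policy $\pi'$ with respect to $q_\pi$, assuming $\pi$ itself is $\varepsilon$-soft:

```latex
\begin{aligned}
q_\pi(s, \pi'(s))
  &= \sum_a \pi'(a \mid s)\, q_\pi(s, a) \\
  &= \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a)
     + (1 - \varepsilon) \max_a q_\pi(s, a) \\
  &\ge \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a)
     + (1 - \varepsilon) \sum_a
       \frac{\pi(a \mid s) - \frac{\varepsilon}{|\mathcal{A}(s)|}}{1 - \varepsilon}\,
       q_\pi(s, a) \\
  &= \sum_a \pi(a \mid s)\, q_\pi(s, a) = v_\pi(s),
\end{aligned}
```

where the inequality holds because a maximum is at least any convex combination: the weights $\bigl(\pi(a \mid s) - \varepsilon/|\mathcal{A}(s)|\bigr)/(1-\varepsilon)$ are nonnegative (by $\varepsilon$-softness of $\pi$) and sum to one.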
5 votes · 1 answer
Why does Q-learning converge under 100% exploration rate?
I am working on this assignment where I made the agent learn state-action values (Q-values) with Q-learning and 100% exploration rate. The environment is the classic gridworld as shown in the following picture.
Here are the values of my…
Rim Sleimi · 215
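The short answer the question is circling is that Q-learning is off-policy: its target bootstraps from $\max_a Q(s', a)$ regardless of how the behaviour policy acts, so even a 100% random ($\epsilon = 1$) behaviour policy converges, provided every state–action pair keeps being visited. A minimal sketch on a hypothetical two-state chain (not the asker's gridworld):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny deterministic chain MDP: states 0 and 1, terminal state 2.
# Action 1 moves right, action 0 stays; reward 1 on reaching the terminal.
def env_step(s, a):
    if a == 1:
        return s + 1, (1.0 if s == 1 else 0.0), s + 1 == 2
    return s, 0.0, False

Q = np.zeros((2, 2))
alpha, gamma = 0.1, 0.9
for _ in range(2000):
    s, done = 0, False
    while not done:
        a = int(rng.integers(2))          # epsilon = 1: purely random behaviour
        s2, r, done = env_step(s, a)
        target = r if done else r + gamma * Q[s2].max()
        Q[s, a] += alpha * (target - Q[s, a])   # off-policy max backup
        s = s2
```

Despite never exploiting during training, the greedy policy read off `Q` is optimal (always move right), because the max backup evaluates the greedy policy, not the random one.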
5 votes · 1 answer
Multi-Armed Bandits with a large number of arms
I'm dealing with a (stochastic) Multi Armed Bandit (MAB) with a large number of arms.
Consider a pizza machine that produces a pizza depending on an input $i$ (equivalent to an arm). The (finite) set of arms $K$ is given by $K=X_1\times X_2 \times…
D. B. · 101
4 votes · 1 answer
What does the term $|\mathcal{A}(s)|$ mean in the $\epsilon$-greedy policy?
I've been looking online for a while for a source that explains these computations, but I can't find anywhere what $|\mathcal{A}(s)|$ means. I guess $\mathcal{A}$ is the action set, but I'm not sure about that notation:
$$\frac{\varepsilon}{|\mathcal{A}(s)|}…
Metrician · 195
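In that formula, $|\mathcal{A}(s)|$ is the number of actions available in state $s$: every action receives the base probability $\varepsilon/|\mathcal{A}(s)|$, and the greedy action gets the remaining $1-\varepsilon$ on top. A small sketch with hypothetical Q-values:

```python
import numpy as np

def epsilon_greedy_probs(q, eps):
    """Action probabilities of an eps-greedy policy over a finite action set."""
    n = len(q)                     # |A(s)|: number of actions in state s
    probs = np.full(n, eps / n)    # eps / |A(s)| base mass for every action
    probs[np.argmax(q)] += 1 - eps # remaining 1 - eps goes to the greedy action
    return probs
```

For example, with three actions and $\varepsilon = 0.3$, every action gets $0.1$ and the greedy action gets $0.1 + 0.7 = 0.8$.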
3 votes · 2 answers
How do I use the decaying $\epsilon$-greedy strategy in Q-learning?
With the decaying $\epsilon$-greedy strategy used in the Q-learning algorithm, when should the decay of $\epsilon$ occur: after each episode (initial state to terminal state), or after each step (one state change) within an episode?
Cathy · 43
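Both conventions appear in practice. A hypothetical exponential schedule makes the choice concrete: you either call `step()` once per episode (common for short episodic tasks) or once per environment step (common for long-horizon ones); the schedule itself is identical either way:

```python
class EpsilonDecay:
    """Multiplicative epsilon decay clipped at a floor (illustrative values)."""

    def __init__(self, start=1.0, end=0.05, decay=0.995):
        self.eps, self.end, self.decay = start, end, decay

    def step(self):
        # Invoke once per episode OR once per environment step --
        # that placement is exactly what the question is asking about.
        self.eps = max(self.end, self.eps * self.decay)
        return self.eps
```

Per-step decay reaches the floor far sooner in wall-clock terms, so the decay rate usually has to be retuned when switching conventions.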
3 votes · 1 answer
Do eligibility traces and epsilon-greedy do the same task in different ways?
I understand that in reinforcement learning algorithms such as Q-learning, in order to avoid selecting the actions with the greatest Q-values too quickly and to allow for exploration, we use eligibility traces.
Does $\epsilon$-greedy solve the same…
Abhishek Dhyani · 41
3 votes · 2 answers
How is the probability of a greedy action in "$\epsilon$-greedy policies" derived?
In Sutton & Barto's book on reinforcement learning (section 5.4, p. 100) we have the following:
The on-policy method we present in this section uses $\epsilon$ greedy policies, meaning that most of the time they choose an action that has maximal…
user3489173 · 309
3 votes · 2 answers
How to fight instability in self-play?
I'm working on a neural network that plays some board games like reversi or tic-tac-toe (zero-sum games, two players). I'm trying to have one network topology for all the games - I specifically don't want to set any limit for the number of available…
Maras · 141
3 votes · 1 answer
What should the value of epsilon be in Q-learning?
I am trying to understand reinforcement learning and have already explored various YouTube videos, blog posts, and Wikipedia articles.
What I don't understand is the impact of $\epsilon$. What value should it take? $0.5$, $0.6$, or $0.7$?
What does it…
Exploring · 371
3 votes · 1 answer
Can we stop training as soon as epsilon is small?
I'm new to reinforcement learning.
As it is common in RL, $\epsilon$-greedy search for the behavior/exploration is used. So, at the beginning of the training, $\epsilon$ is high, and therefore a lot of random actions are chosen. With time,…
Micha Christ · 31
3 votes · 1 answer
Is there an advantage in decaying $\epsilon$ during Q-Learning?
If the agent is following an $\epsilon$-greedy policy derived from Q, is there any advantage to decaying $\epsilon$ even though $\epsilon$ decay is not required for convergence?
KaneM · 307
3 votes · 1 answer
What is the difference between the $\epsilon$-greedy and softmax policies?
Could someone explain to me the key difference between the $\epsilon$-greedy policy and the softmax policy, in particular in the context of the SARSA and Q-Learning algorithms? I understood the main difference between these two algorithms, but…
FraMan · 199
2 votes · 1 answer
How to code an $\epsilon$-soft policy for on-policy Monte Carlo control?
I was trying to code the on-policy Monte Carlo control method. The initial policy chosen needs to be an $\epsilon$-soft policy.
Can someone tell me how to code an $\epsilon$-soft policy?
I know how to code the $\epsilon$-greedy. In $\epsilon$-soft,…
A Q · 23
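For what it's worth, the $\epsilon$-greedy policy already *is* $\epsilon$-soft: the definition only requires $\pi(a \mid s) \ge \varepsilon/|\mathcal{A}(s)|$ for every action, which $\epsilon$-greedy satisfies. So the usual approach for on-policy Monte Carlo control is simply to sample from the $\epsilon$-greedy distribution. A sketch with hypothetical Q-values:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_soft_action(q, eps, rng):
    """Sample from an eps-greedy distribution, which meets the eps-soft
    constraint pi(a|s) >= eps / |A(s)| for every action."""
    n = len(q)
    probs = np.full(n, eps / n)       # base probability for every action
    probs[np.argmax(q)] += 1 - eps    # extra mass on the greedy action
    return int(rng.choice(n, p=probs)), probs
```

Any other distribution that keeps every action's probability at or above $\varepsilon/|\mathcal{A}(s)|$ (e.g. a softmax with a floor) would also qualify as $\epsilon$-soft.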