For questions about the $\epsilon$-greedy policy, which is typically used as a behavioural policy (i.e. the policy a reinforcement learning agent follows to select actions while interacting with the environment).
Questions tagged [epsilon-greedy-policy]
29 questions
7 votes · 1 answer
What happens when you select actions using softmax instead of epsilon greedy in DQN?
I understand the two major branches of RL are Q-Learning and Policy Gradient methods.
From my understanding (correct me if I'm wrong), policy gradient methods have an inherent exploration built-in as it selects actions using a probability…
Linsu Han · 73
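The trade-off this question asks about shows up directly in the two selection rules. Here is a minimal sketch of both (the Q-values, `eps`, and `temperature` are illustrative values, not taken from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
q_values = np.array([1.0, 2.0, 0.5])   # hypothetical Q-values for one state

def epsilon_greedy(q, eps, rng):
    # With probability eps pick uniformly at random, otherwise pick argmax.
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def softmax_action(q, temperature, rng):
    # Boltzmann exploration: sample each action in proportion to exp(Q/T).
    z = q / temperature
    z = z - z.max()                     # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(q), p=probs))
```

With `eps = 0` the first rule is fully greedy, while softmax still assigns every action a non-zero, Q-value-dependent probability — that is the qualitative difference between the two exploration schemes.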
6 votes · 1 answer
What is the probability of selecting the greedy action in a 0.5-greedy selection method for the 2-armed bandit problem?
I'm new to reinforcement learning and I'm going through Sutton and Barto. Exercise 2.1 states the following:
In $\varepsilon$-greedy action selection, for the case of two actions and $\varepsilon=0.5$, what is the probability that the greedy action…
Daviiid · 585
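The quantity the exercise asks for decomposes as the exploit probability plus the chance that a uniform exploratory draw happens to land on the greedy arm. A quick numerical sanity check of that decomposition (a sketch, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, n_actions = 0.5, 2

# P(greedy) = P(exploit) + P(explore) * P(uniform pick is greedy)
analytic = (1 - eps) + eps / n_actions          # 0.5 + 0.25 = 0.75

# Monte Carlo check of the same quantity (arm 0 taken as the greedy arm)
exploit = rng.random(200_000) >= eps
uniform_pick = rng.integers(n_actions, size=200_000) == 0
empirical = np.mean(exploit | (~exploit & uniform_pick))
```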
6 votes · 1 answer
Is this proof of $\epsilon$-greedy policy improvement correct?
The following paragraph about $\epsilon$-greedy policies can be found at the end of page 100, under section 5.4, of the book "Reinforcement Learning: An Introduction" by Richard Sutton and Andrew Barto (second edition, 2018).
but with probability…
Jarvis1997 · 157
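For readers checking the proof, the core chain of (in)equalities in Sutton & Barto's argument (§5.4) can be sketched as follows, for an $\varepsilon$-greedy policy $\pi'$ with respect to $q_\pi$, assuming $\pi$ itself is $\varepsilon$-soft:

```latex
\begin{aligned}
q_\pi(s, \pi'(s))
  &= \sum_a \pi'(a \mid s)\, q_\pi(s, a) \\
  &= \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a)
     + (1 - \varepsilon) \max_a q_\pi(s, a) \\
  &\ge \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a)
     + (1 - \varepsilon) \sum_a
       \frac{\pi(a \mid s) - \frac{\varepsilon}{|\mathcal{A}(s)|}}{1 - \varepsilon}\,
       q_\pi(s, a) \\
  &= \sum_a \pi(a \mid s)\, q_\pi(s, a) = v_\pi(s),
\end{aligned}
```

where the inequality holds because a maximum is at least any convex combination: the weights $\bigl(\pi(a \mid s) - \varepsilon/|\mathcal{A}(s)|\bigr)/(1-\varepsilon)$ are nonnegative (by $\varepsilon$-softness of $\pi$) and sum to one.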
5 votes · 1 answer
Why does Q-learning converge under 100% exploration rate?
I am working on this assignment where I made the agent learn state-action values (Q-values) with Q-learning and 100% exploration rate. The environment is the classic gridworld as shown in the following picture.
Here are the values of my…
Rim Sleimi · 215
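The short answer the question is circling is that Q-learning is off-policy: its target bootstraps from $\max_a Q(s', a)$ regardless of how the behaviour policy acts, so even a 100% random ($\epsilon = 1$) behaviour policy converges, provided every state–action pair keeps being visited. A minimal sketch on a hypothetical two-state chain (not the asker's gridworld):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny deterministic chain MDP: states 0 and 1, terminal state 2.
# Action 1 moves right, action 0 stays; reward 1 on reaching the terminal.
def env_step(s, a):
    if a == 1:
        return s + 1, (1.0 if s == 1 else 0.0), s + 1 == 2
    return s, 0.0, False

Q = np.zeros((2, 2))
alpha, gamma = 0.1, 0.9
for _ in range(2000):
    s, done = 0, False
    while not done:
        a = int(rng.integers(2))          # epsilon = 1: purely random behaviour
        s2, r, done = env_step(s, a)
        target = r if done else r + gamma * Q[s2].max()
        Q[s, a] += alpha * (target - Q[s, a])   # off-policy max backup
        s = s2
```

Despite never exploiting during training, the greedy policy read off `Q` is optimal (always move right), because the max backup evaluates the greedy policy, not the random one.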
5 votes · 1 answer
Multi-Armed Bandits with a large number of arms
I'm dealing with a (stochastic) Multi Armed Bandit (MAB) with a large number of arms.
Consider a pizza machine that produces a pizza depending on an input $i$ (equivalent to an arm). The (finite) set of arms $K$ is given by $K=X_1\times X_2 \times…
D. B. · 101
4 votes · 1 answer
What does the term $|\mathcal{A}(s)|$ mean in the $\epsilon$-greedy policy?
I've been looking online for a while for a source that explains these computations, but I can't find anywhere what $|\mathcal{A}(s)|$ means. I guess $\mathcal{A}$ is the action set, but I'm not sure about that notation:
$$\frac{\varepsilon}{|\mathcal{A}(s)|}…
Metrician · 195
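In that formula, $|\mathcal{A}(s)|$ is the number of actions available in state $s$: every action receives the base probability $\varepsilon/|\mathcal{A}(s)|$, and the greedy action gets the remaining $1-\varepsilon$ on top. A small sketch with hypothetical Q-values:

```python
import numpy as np

def epsilon_greedy_probs(q, eps):
    """Action probabilities of an eps-greedy policy over a finite action set."""
    n = len(q)                     # |A(s)|: number of actions in state s
    probs = np.full(n, eps / n)    # eps / |A(s)| base mass for every action
    probs[np.argmax(q)] += 1 - eps # remaining 1 - eps goes to the greedy action
    return probs
```

For example, with three actions and $\varepsilon = 0.3$, every action gets $0.1$ and the greedy action gets $0.1 + 0.7 = 0.8$.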
3 votes · 2 answers
How do I use the decaying $\epsilon$-greedy strategy in Q-learning?
With the decaying $\epsilon$-greedy strategy used in the Q-learning algorithm, when should the decay of $\epsilon$ occur: after each episode (initial state to terminal state), or after each step (one state change) within an episode?
Cathy · 43
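Both conventions appear in practice. A hypothetical exponential schedule makes the choice concrete: you either call `step()` once per episode (common for short episodic tasks) or once per environment step (common for long-horizon ones); the schedule itself is identical either way:

```python
class EpsilonDecay:
    """Multiplicative epsilon decay clipped at a floor (illustrative values)."""

    def __init__(self, start=1.0, end=0.05, decay=0.995):
        self.eps, self.end, self.decay = start, end, decay

    def step(self):
        # Invoke once per episode OR once per environment step --
        # that placement is exactly what the question is asking about.
        self.eps = max(self.end, self.eps * self.decay)
        return self.eps
```

Per-step decay reaches the floor far sooner in wall-clock terms, so the decay rate usually has to be retuned when switching conventions.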
3 votes · 1 answer
Do eligibility traces and epsilon-greedy do the same task in different ways?
I understand that in reinforcement learning algorithms such as Q-learning, in order to avoid selecting the actions with the greatest Q-values too quickly and to allow for exploration, we use eligibility traces.
Does $\epsilon$-greedy solve the same…
Abhishek Dhyani · 41
3 votes · 2 answers
How is the probability of a greedy action in "$\epsilon$-greedy policies" derived?
In Sutton & Barto's book on reinforcement learning (section 5.4, p. 100) we have the following:
The on-policy method we present in this section uses $\epsilon$ greedy policies, meaning that most of the time they choose an action that has maximal…
user3489173 · 309
3 votes · 2 answers
How to fight instability in self-play?
I'm working on a neural network that plays some board games like reversi or tic-tac-toe (zero-sum games, two players). I'm trying to have one network topology for all the games - I specifically don't want to set any limit for the number of available…
Maras · 141
3 votes · 1 answer
What should the value of epsilon be in Q-learning?
I am trying to understand reinforcement learning and have already explored various YouTube videos, blog posts, and Wikipedia articles.
What I don't understand is the impact of $\epsilon$. What value should it take? $0.5$, $0.6$, or $0.7$?
What does it…
Exploring · 371
3 votes · 1 answer
Can we stop training as soon as epsilon is small?
I'm new to reinforcement learning.
As it is common in RL, $\epsilon$-greedy search for the behavior/exploration is used. So, at the beginning of the training, $\epsilon$ is high, and therefore a lot of random actions are chosen. With time,…
Micha Christ · 31
3 votes · 1 answer
Is there an advantage in decaying $\epsilon$ during Q-Learning?
If the agent is following an $\epsilon$-greedy policy derived from Q, is there any advantage to decaying $\epsilon$ even though $\epsilon$ decay is not required for convergence?
KaneM · 307
3 votes · 1 answer
What is the difference between the $\epsilon$-greedy and softmax policies?
Could someone explain to me the key difference between the $\epsilon$-greedy policy and the softmax policy, in particular in the context of the SARSA and Q-Learning algorithms? I understood the main difference between these two algorithms, but…
FraMan · 199
2 votes · 1 answer
How to code an $\epsilon$-soft policy for on-policy Monte Carlo control?
I was trying to code the on-policy Monte Carlo control method. The initial policy chosen needs to be an $\epsilon$-soft policy.
Can someone tell me how to code an $\epsilon$-soft policy?
I know how to code the $\epsilon$-greedy. In $\epsilon$-soft,…
A Q · 23
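For what it's worth, the $\epsilon$-greedy policy already *is* $\epsilon$-soft: the definition only requires $\pi(a \mid s) \ge \varepsilon/|\mathcal{A}(s)|$ for every action, which $\epsilon$-greedy satisfies. So the usual approach for on-policy Monte Carlo control is simply to sample from the $\epsilon$-greedy distribution. A sketch with hypothetical Q-values:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_soft_action(q, eps, rng):
    """Sample from an eps-greedy distribution, which meets the eps-soft
    constraint pi(a|s) >= eps / |A(s)| for every action."""
    n = len(q)
    probs = np.full(n, eps / n)       # base probability for every action
    probs[np.argmax(q)] += 1 - eps    # extra mass on the greedy action
    return int(rng.choice(n, p=probs)), probs
```

Any other distribution that keeps every action's probability at or above $\varepsilon/|\mathcal{A}(s)|$ (e.g. a softmax with a floor) would also qualify as $\epsilon$-soft.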