Questions tagged [upper-confidence-bound]

For questions about upper confidence bound (UCB) based algorithms or action selection strategies, e.g. in the context of bandit or reinforcement learning problems.

17 questions
6
votes
1 answer

In MCTS, what to do if I do not want to simulate till the end of the game?

I'm trying to implement MCTS with UCT for a board game and I'm kinda stuck. The state space is quite large (3e15), and I'd like to compute a good move in less than 2 seconds. I already have MCTS implemented in Java from here, and I noticed that it…
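A common workaround for this situation is to cut the rollout off after a fixed depth and back up a heuristic evaluation instead of a terminal result. A minimal sketch, assuming hypothetical state methods `is_terminal`, `get_legal_moves`, `apply_move`, `result`, and `heuristic_value` (none of which come from the question):

```python
import random

# Hypothetical rollout that stops after `max_depth` moves and backs up a
# heuristic evaluation instead of playing until a terminal state is reached.
def truncated_rollout(state, max_depth=20):
    depth = 0
    while not state.is_terminal() and depth < max_depth:
        move = random.choice(state.get_legal_moves())
        state = state.apply_move(move)
        depth += 1
    if state.is_terminal():
        return state.result()          # exact outcome, e.g. 1 = win, 0 = loss
    return state.heuristic_value()     # estimate on the same scale as result()
```

For the backed-up statistics to stay comparable, the heuristic value should live on the same scale as the terminal results.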
5
votes
1 answer

What should the initial UCT value be in MCTS when a leaf's simulation count is zero? Infinity?

I am implementing a Monte Carlo Tree Search algorithm, where the selection process is done through the Upper Confidence Bound formula: def uct(state): log_n = math.log(state.parent.sim_count) explore_term = self.exploration_weight *…
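The snippet in the question stops mid-expression; a minimal completed sketch, rewritten as a free function and returning infinity for unvisited children as the title suggests (the attribute names `win_count` and the default exploration weight are assumptions, not taken from the question):

```python
import math

# Assumes each node exposes `sim_count`, `win_count`, and `parent`.
# Returning infinity for unvisited nodes forces every child to be
# simulated at least once before the formula starts discriminating.
def uct(node, exploration_weight=math.sqrt(2)):
    if node.sim_count == 0:
        return float('inf')
    log_n = math.log(node.parent.sim_count)
    exploit_term = node.win_count / node.sim_count
    explore_term = exploration_weight * math.sqrt(log_n / node.sim_count)
    return exploit_term + explore_term
```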
5
votes
1 answer

Multi-Armed Bandits with a large number of arms

I'm dealing with a (stochastic) Multi Armed Bandit (MAB) with a large number of arms. Consider a pizza machine that produces a pizza depending on an input $i$ (equivalent to an arm). The (finite) set of arms $K$ is given by $K=X_1\times X_2 \times…
4
votes
1 answer

Why do we have two similar action selection strategies for UCB1?

In the literature, there are at least two action selection strategies associated with UCB1. For example, in the paper Algorithms for the multi-armed bandit problem (2000/2014), at time step $t$, an action is…
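For readers comparing the two without opening the paper, these are sketches of the standard statements rather than quotes from it: the original UCB1 of Auer, Cesa-Bianchi and Fischer (2002) plays the arm maximizing $$\bar{x}_j+\sqrt{\frac{2\ln n}{n_j}},$$ where $\bar{x}_j$ is the empirical mean reward of arm $j$, $n_j$ its pull count and $n$ the total number of pulls, while the tunable form used e.g. by Sutton & Barto plays the arm maximizing $$Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}},$$ which reduces to the first when $c=\sqrt{2}$.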
4
votes
2 answers

Why do we use $X_{I_t,t}$ and $v_{I_t}$ to denote the reward received at time step $t$ and the distribution of the chosen arm $I_t$?

I'm doing some introductory research on classical (stochastic) MABs. However, I'm a little confused about the common notation (e.g. in the popular paper of Auer (2002) or Bubeck and Cesa-Bianchi (2012)). As in the latter study, let us consider an…
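As a sketch of the standard notation (e.g. in Bubeck and Cesa-Bianchi): each arm $i$ has an unknown reward distribution $\nu_i$, $I_t$ is the arm pulled at round $t$, and $X_{I_t,t}\sim\nu_{I_t}$ is the reward observed at that round, so the double subscript records both which arm was pulled and when it was pulled.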
4
votes
2 answers

Should I use exploration strategy in Policy Gradient algorithms?

In policy gradient algorithms the output is a stochastic policy - a probability for each action. I believe that if I follow the policy (sample an action from the policy) I make use of exploration because each action has a certain probability so I…
3
votes
1 answer

How does UCT in the MCTS selection phase avoid starvation?

The first step of MCTS is to keep choosing nodes based on the Upper Confidence Bound applied to trees (UCT) until it reaches a leaf node, where UCT is defined as $$\frac{w_i}{n_i}+c\sqrt{\frac{\ln(t)}{n_i}},$$ where $w_i =$ number of wins after the $i$-th…
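A sketch of the usual no-starvation argument: if a child stops being selected, its count $n_i$ stays fixed while $t$ keeps growing, so its exploration term $c\sqrt{\frac{\ln(t)}{n_i}}$ grows without bound; since the exploitation terms $\frac{w_j}{n_j}$ of its siblings are bounded by 1, the neglected child's UCT value eventually becomes the largest and it is selected again.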
3
votes
3 answers

Why aren't exploration techniques, such as UCB or Thompson sampling, used in full RL problems?

Why aren't exploration techniques, such as UCB or Thompson sampling, that are typically used in bandit problems also used in full RL problems? Monte Carlo Tree Search may use the above-mentioned methods in its selection step, but why do value-based and policy…
2
votes
1 answer

UCB, Thompson sampling, etc. seem myopic/greedy for bandits?

When considering multi-armed bandits in different formats, UCB, $\epsilon$-greedy, Thompson sampling, etc. seem greedy/myopic in the sense that they solely consider the reward for the current timestep. To be more precise, why don't we consider the…
2
votes
1 answer

Difference in UCB performance when scaling the rewards

I notice the following behavior when running experiments with $\epsilon$-greedy and UCB1. If the reward is kept binary (0 or 1), both algorithms' performances are on par with each other. However, if I make the reward continuous (and bounded in [0, 1])…
2
votes
1 answer

In UCB, is the actual upper bound an upper bound of an one-sided or two-sided confidence interval?

I'm a bit confused about the visualization of the upper bound (following the notation of Sutton & Barto (2018)) $$Q_t(a)+C\sqrt{\frac{\ln(t)}{N_t(a)}}$$ In many blog posts about the UCB(1) algorithm, such as visualized in the following…
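A sketch of where the "one-sided" reading comes from, assuming rewards bounded in $[0,1]$: Hoeffding's inequality gives the one-sided tail bound $$P\big(\mu_a \ge \bar{X}_n+\epsilon\big)\le e^{-2n\epsilon^2};$$ setting the right-hand side to $t^{-4}$, as in the classic UCB1 analysis, and solving gives $\epsilon=\sqrt{\frac{2\ln t}{n}}$, i.e. the bonus above with $C=\sqrt{2}$. Only an upper confidence limit is ever constructed; no matching lower limit is tracked.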
2
votes
0 answers

Why is the ideal exploration parameter in the UCT algorithm $\sqrt{2}$?

From Wikipedia, in the Monte-Carlo Tree Search algorithm, you should choose the node that maximizes the value: $$\frac{w_i}{n_i}+c\sqrt{\frac{\ln N_i}{n_i}},$$ where $w_i$ stands for the number of wins for the…
1
vote
0 answers

RL agent focuses too much on early rewards, even with no discounting

How can I guide my RL agent to solve tasks in the correct order? I'm trying to train an agent using reinforcement learning, similar to MuZero. The goal is to solve 4 tasks, A/B/C/D. Each task involves two actions, X1/X2. Initially, only action A1 is…
1
vote
0 answers

UCB-like algorithms: how do you compute the exploration bonus?

My question concerns Stochastic Combinatorial Multiarmed Bandits. More specifically, the algorithm called CombUCB1 presented in this paper. It is a UCB-like algorithm. Essentially, in each round of the sequential game the learner chooses a super-arm…
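A generic sketch of how a per-base-arm bonus can be computed and summed over a super-arm; the `alpha` constant and the dictionary-based interface are assumptions for illustration, not the exact constant or data structures from the paper:

```python
import math

# UCB-style bonus for one base arm e: mean_reward + sqrt(alpha * ln(t) / pulls).
# Unpulled base arms get an infinite bound so they are tried at least once.
def base_arm_ucb(mean_reward, pulls, t, alpha=1.5):
    if pulls == 0:
        return float('inf')
    return mean_reward + math.sqrt(alpha * math.log(t) / pulls)

# The score of a super-arm is the sum of its base arms' upper bounds;
# the learner then plays the feasible super-arm with the highest score.
def super_arm_score(super_arm, mean_reward, pulls, t):
    return sum(base_arm_ucb(mean_reward[e], pulls[e], t) for e in super_arm)
```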
1
vote
1 answer

Why am I getting better performance with Thompson sampling than with UCB or $\epsilon$-greedy in a multi-armed bandit problem?

I ran a test using 3 strategies for multi-armed bandit: UCB, $\epsilon$-greedy, and Thompson sampling. The results for the rewards I got are as follows: Thompson sampling had the highest average reward, UCB was second, $\epsilon$-greedy was third,…
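A minimal, self-contained Bernoulli-bandit comparison sketch that can reproduce this kind of ranking; the arm means, horizon, exploration constants, and Beta(1, 1) priors are arbitrary assumptions, not the asker's setup:

```python
import math
import random

def run(strategy, means, horizon=10_000, eps=0.1, seed=0):
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    wins = [0] * k
    total = 0
    for t in range(1, horizon + 1):
        if strategy == "ucb1":
            if t <= k:                       # play every arm once first
                a = t - 1
            else:                            # mean + sqrt(2 ln t / n)
                a = max(range(k), key=lambda i: wins[i] / pulls[i]
                        + math.sqrt(2 * math.log(t) / pulls[i]))
        elif strategy == "eps-greedy":
            if t <= k:                       # initialise every arm once
                a = t - 1
            elif rng.random() < eps:
                a = rng.randrange(k)
            else:
                a = max(range(k), key=lambda i: wins[i] / pulls[i])
        else:                                # Thompson sampling, Beta(1, 1) prior
            a = max(range(k), key=lambda i:
                    rng.betavariate(1 + wins[i], 1 + pulls[i] - wins[i]))
        reward = 1 if rng.random() < means[a] else 0
        pulls[a] += 1
        wins[a] += reward
        total += reward
    return total / horizon

means = [0.3, 0.5, 0.7]
for name in ("ucb1", "eps-greedy", "thompson"):
    print(name, round(run(name, means), 3))
```

Over short horizons the ranking tends to depend heavily on the exploration constant, $\epsilon$, and the prior, so a single ordering like the one observed is not evidence that one strategy dominates in general.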