Questions tagged [upper-confidence-bound]

For questions about upper confidence bound (UCB) based algorithms or action selection strategies, e.g. in the context of bandit or reinforcement learning problems.

17 questions
6
votes
1 answer

In MCTS, what to do if I do not want to simulate till the end of the game?

I'm trying to implement MCTS with UCT for a board game and I'm kinda stuck. The state space is quite large (3e15), and I'd like to compute a good move in less than 2 seconds. I already have MCTS implemented in Java from here, and I noticed that it…
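A common workaround for this situation is to cut the rollout off after a fixed depth and back up a heuristic evaluation instead of a terminal result. A minimal sketch, assuming hypothetical state methods `is_terminal`, `get_legal_moves`, `apply_move`, `result`, and `heuristic_value` (none of which come from the question):

```python
import random

# Hypothetical rollout that stops after `max_depth` moves and backs up a
# heuristic evaluation instead of playing until a terminal state is reached.
def truncated_rollout(state, max_depth=20):
    depth = 0
    while not state.is_terminal() and depth < max_depth:
        move = random.choice(state.get_legal_moves())
        state = state.apply_move(move)
        depth += 1
    if state.is_terminal():
        return state.result()          # exact outcome, e.g. 1 = win, 0 = loss
    return state.heuristic_value()     # estimate on the same scale as result()
```

For the backed-up statistics to stay comparable, the heuristic value should live on the same scale as the terminal results.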
5
votes
1 answer

What should the initial UCT value be in MCTS when a leaf's simulation count is zero? Infinity?

I am implementing a Monte Carlo Tree Search algorithm, where the selection process is done through the Upper Confidence Bound formula: def uct(state): log_n = math.log(state.parent.sim_count) explore_term = self.exploration_weight *…
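The snippet in the question stops mid-expression; a minimal completed sketch, rewritten as a free function and returning infinity for unvisited children as the title suggests (the attribute names `win_count` and the default exploration weight are assumptions, not taken from the question):

```python
import math

# Assumes each node exposes `sim_count`, `win_count`, and `parent`.
# Returning infinity for unvisited nodes forces every child to be
# simulated at least once before the formula starts discriminating.
def uct(node, exploration_weight=math.sqrt(2)):
    if node.sim_count == 0:
        return float('inf')
    log_n = math.log(node.parent.sim_count)
    exploit_term = node.win_count / node.sim_count
    explore_term = exploration_weight * math.sqrt(log_n / node.sim_count)
    return exploit_term + explore_term
```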
5
votes
1 answer

Multi-Armed Bandits with a large number of arms

I'm dealing with a (stochastic) Multi Armed Bandit (MAB) with a large number of arms. Consider a pizza machine that produces a pizza depending on an input $i$ (equivalent to an arm). The (finite) set of arms $K$ is given by $K=X_1\times X_2 \times…
4
votes
1 answer

Why do we have two similar action selection strategies for UCB1?

In the literature, there are at least two action selection strategies associated with UCB1. For example, in the paper Algorithms for the multi-armed bandit problem (2000/2014), at time step $t$, an action is…
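For readers comparing the two without opening the paper, these are sketches of the standard statements rather than quotes from it: the original UCB1 of Auer, Cesa-Bianchi and Fischer (2002) plays the arm maximizing $$\bar{x}_j+\sqrt{\frac{2\ln n}{n_j}},$$ where $\bar{x}_j$ is the empirical mean reward of arm $j$, $n_j$ its pull count and $n$ the total number of pulls, while the tunable form used e.g. by Sutton & Barto plays the arm maximizing $$Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}},$$ which reduces to the first when $c=\sqrt{2}$.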
4
votes
2 answers

Why do we use $X_{I_t,t}$ and $v_{I_t}$ to denote the reward received at time step $t$ and the distribution of the chosen arm $I_t$?

I'm doing some introductory research on classical (stochastic) MABs. However, I'm a little confused about the common notation (e.g. in the popular paper of Auer (2002) or Bubeck and Cesa-Bianchi (2012)). As in the latter study, let us consider an…
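As a sketch of the standard notation (e.g. in Bubeck and Cesa-Bianchi): each arm $i$ has an unknown reward distribution $\nu_i$, $I_t$ is the arm pulled at round $t$, and $X_{I_t,t}\sim\nu_{I_t}$ is the reward observed at that round, so the double subscript records both which arm was pulled and when it was pulled.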
4
votes
2 answers

Should I use exploration strategy in Policy Gradient algorithms?

In policy gradient algorithms the output is a stochastic policy - a probability for each action. I believe that if I follow the policy (sample an action from the policy) I make use of exploration because each action has a certain probability so I…
3
votes
1 answer

How does UCT in the MCTS selection phase avoid starvation?

The first step of MCTS is to keep choosing nodes based on the Upper Confidence Bound applied to trees (UCT) until it reaches a leaf node, where UCT is defined as $$\frac{w_i}{n_i}+c\sqrt{\frac{\ln(t)}{n_i}},$$ where $w_i =$ number of wins after the $i$-th…
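A sketch of the usual no-starvation argument: if a child stops being selected, its count $n_i$ stays fixed while $t$ keeps growing, so its exploration term $c\sqrt{\frac{\ln(t)}{n_i}}$ grows without bound; since the exploitation terms $\frac{w_j}{n_j}$ of its siblings are bounded by 1, the neglected child's UCT value eventually becomes the largest and it is selected again.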
3
votes
3 answers

Why aren't exploration techniques, such as UCB or Thompson sampling, used in full RL problems?

Why aren't exploration techniques, such as UCB or Thompson sampling, that are typically used in bandit problems also used in full RL problems? Monte Carlo Tree Search may use the above-mentioned methods in its selection step, but why do value-based and policy…
2
votes
1 answer

UCB, Thompson sampling, etc. seem myopic/greedy for bandits?

When considering multi-armed bandits in different formats, UCB, $\epsilon$-greedy, Thompson sampling, etc. seem greedy/myopic in the sense that they solely consider the reward for the current timestep. To be more precise, why don't we consider the…
2
votes
1 answer

Difference in UCB performance when scaling the rewards

I notice the following behavior when running experiments with $\epsilon$-greedy and UCB1. If the reward is kept binary (0 or 1), both algorithms' performances are on par with each other. However, if I make the reward continuous (and bounded in [0, 1])…
2
votes
1 answer

In UCB, is the actual upper bound an upper bound of an one-sided or two-sided confidence interval?

I'm a bit confused about the visualization of the upper bound (following the notation of Sutton & Barto (2018)) $$Q_t(a)+C\sqrt{\frac{\ln(t)}{N_t(a)}}$$ In many blog posts about the UCB(1) algorithm, such as visualized in the following…
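A sketch of where the "one-sided" reading comes from, assuming rewards bounded in $[0,1]$: Hoeffding's inequality gives the one-sided tail bound $$P\big(\mu_a \ge \bar{X}_n+\epsilon\big)\le e^{-2n\epsilon^2};$$ setting the right-hand side to $t^{-4}$, as in the classic UCB1 analysis, and solving gives $\epsilon=\sqrt{\frac{2\ln t}{n}}$, i.e. the bonus above with $C=\sqrt{2}$. Only an upper confidence limit is ever constructed; no matching lower limit is tracked.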
2
votes
0 answers

Why is the ideal exploration parameter in the UCT algorithm $\sqrt{2}$?

From Wikipedia, in the Monte-Carlo Tree Search algorithm, you should choose the node that maximizes the value: $$\frac{w_i}{n_i}+c\sqrt{\frac{\ln N_i}{n_i}},$$ where $w_i$ stands for the number of wins for the…
1
vote
0 answers

RL agent focuses too much on early rewards, even with no discounting

How can I guide my RL agent to solve tasks in the correct order? I'm trying to train an agent using reinforcement learning, similar to MuZero. The goal is to solve 4 tasks, A/B/C/D. Each task involves two actions, X1/X2. Initially, only action A1 is…
1
vote
0 answers

UCB-like algorithms: how do you compute the exploration bonus?

My question concerns Stochastic Combinatorial Multiarmed Bandits. More specifically, the algorithm called CombUCB1 presented in this paper. It is a UCB-like algorithm. Essentially, in each round of the sequential game the learner chooses a super-arm…
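A generic sketch of how a per-base-arm bonus can be computed and summed over a super-arm; the `alpha` constant and the dictionary-based interface are assumptions for illustration, not the exact constant or data structures from the paper:

```python
import math

# UCB-style bonus for one base arm e: mean_reward + sqrt(alpha * ln(t) / pulls).
# Unpulled base arms get an infinite bound so they are tried at least once.
def base_arm_ucb(mean_reward, pulls, t, alpha=1.5):
    if pulls == 0:
        return float('inf')
    return mean_reward + math.sqrt(alpha * math.log(t) / pulls)

# The score of a super-arm is the sum of its base arms' upper bounds;
# the learner then plays the feasible super-arm with the highest score.
def super_arm_score(super_arm, mean_reward, pulls, t):
    return sum(base_arm_ucb(mean_reward[e], pulls[e], t) for e in super_arm)
```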
1
vote
1 answer

Why am I getting better performance with Thompson sampling than with UCB or $\epsilon$-greedy in a multi-armed bandit problem?

I ran a test using 3 strategies for multi-armed bandit: UCB, $\epsilon$-greedy, and Thompson sampling. The results for the rewards I got are as follows: Thompson sampling had the highest average reward, UCB was second, $\epsilon$-greedy was third,…
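A minimal, self-contained Bernoulli-bandit comparison sketch that can reproduce this kind of ranking; the arm means, horizon, exploration constants, and Beta(1, 1) priors are arbitrary assumptions, not the asker's setup:

```python
import math
import random

def run(strategy, means, horizon=10_000, eps=0.1, seed=0):
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    wins = [0] * k
    total = 0
    for t in range(1, horizon + 1):
        if strategy == "ucb1":
            if t <= k:                       # play every arm once first
                a = t - 1
            else:                            # mean + sqrt(2 ln t / n)
                a = max(range(k), key=lambda i: wins[i] / pulls[i]
                        + math.sqrt(2 * math.log(t) / pulls[i]))
        elif strategy == "eps-greedy":
            if t <= k:                       # initialise every arm once
                a = t - 1
            elif rng.random() < eps:
                a = rng.randrange(k)
            else:
                a = max(range(k), key=lambda i: wins[i] / pulls[i])
        else:                                # Thompson sampling, Beta(1, 1) prior
            a = max(range(k), key=lambda i:
                    rng.betavariate(1 + wins[i], 1 + pulls[i] - wins[i]))
        reward = 1 if rng.random() < means[a] else 0
        pulls[a] += 1
        wins[a] += reward
        total += reward
    return total / horizon

means = [0.3, 0.5, 0.7]
for name in ("ucb1", "eps-greedy", "thompson"):
    print(name, round(run(name, means), 3))
```

Over short horizons the ranking tends to depend heavily on the exploration constant, $\epsilon$, and the prior, so a single ordering like the one observed is not evidence that one strategy dominates in general.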