
I notice the following behavior when running experiments with $\epsilon$-greedy and UCB1. If the reward is kept binary (0 or 1), both algorithms perform on par with each other. However, if I make the reward continuous (and bounded in [0, 1]), then $\epsilon$-greedy remains good but UCB1's performance plummets. As an experiment, I simply scaled the reward of 1 by a factor of 1/10, which already hurts performance noticeably.

I have plotted the reward values estimated by the algorithms and see that (due to the confidence interval term) UCB1 largely overestimates the rewards.

Is there a practical trick to fix that? My guess is that the scaling coefficient $c$ in front of the upper confidence bound is meant just for this case. Nevertheless, the difference in performance is staggering to me. How do I know when and what scaling coefficient will be appropriate?

===

Update 1: The reward distribution is very simple. There are 17 arms; for arms 3 and 4 the learning algorithm gets a reward of 1, and the other arms return a reward of 0. There is no stochasticity, and the algorithm runs for 1000 iterations.

If I scale the reward by a factor of 1/10, for instance, then UCB1 takes a very long time to start catching up with $\epsilon$-greedy.
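For concreteness, a minimal sketch of such a setup (the $\epsilon$ and $c$ values here are illustrative placeholders, not necessarily the ones behind the plots above):

```python
# Minimal sketch: 17 deterministic arms, only arms 3 and 4 pay out, 1000 steps.
# epsilon = 0.1 and c = 2 are assumed, illustrative values.
import numpy as np

N_ARMS, STEPS, SCALE = 17, 1000, 0.1          # SCALE = 1/10 reward scaling
true_rewards = np.zeros(N_ARMS)
true_rewards[[3, 4]] = 1.0 * SCALE            # only arms 3 and 4 pay out

def run(select, steps=STEPS, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros(N_ARMS)                      # sample-average value estimates
    n = np.zeros(N_ARMS)                      # pull counts
    total = 0.0
    for t in range(1, steps + 1):
        a = select(q, n, t, rng)
        r = true_rewards[a]
        n[a] += 1
        q[a] += (r - q[a]) / n[a]             # incremental sample average
        total += r
    return total

def eps_greedy(q, n, t, rng, eps=0.1):
    if rng.random() < eps:
        return int(rng.integers(len(q)))      # explore with probability eps
    return int(np.argmax(q))                  # otherwise exploit

def ucb1(q, n, t, rng, c=2.0):
    if np.any(n == 0):                        # pull each arm once first
        return int(np.argmin(n))
    return int(np.argmax(q + c * np.sqrt(np.log(t) / n)))

print("eps-greedy total reward:", run(eps_greedy))
print("UCB1       total reward:", run(ucb1))
```

With `SCALE = 0.1` the confidence-bound bonus dominates the scaled value estimates, so this sketch should reproduce the effect described above, with UCB1 exploring far longer than $\epsilon$-greedy.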

1 Answer


Epsilon-greedy is unaffected by scaling of the rewards: it selects a random action with probability epsilon, and otherwise picks the arm with the highest estimated value, a choice that does not change when every estimate is multiplied by the same positive constant.

On the other hand, if we look at the formulation of UCB (Section 2.7 of Reinforcement Learning, Sutton and Barto):

$$A_t \doteq \underset{a}{\operatorname{argmax}} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

where $Q_t(a) = \frac{R_1 + R_2 + \dots + R_{N_t(a)}}{N_t(a)}$ is the average of the rewards seen so far for that action and $N_t(a)$ is the number of times it has been selected. By scaling the rewards, for example by 1/10, you are essentially scaling $Q_t(a)$ by 1/10, which shifts UCB's behaviour towards the exploratory right-hand term $c\sqrt{\frac{\ln t}{N_t(a)}}$. If you want to offset your scaling of the rewards, you should scale $c$ by the same amount. Theoretically, I believe this should result in the same performance as achieved pre-scaling.
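A quick numerical sanity check of this point (a toy sketch with arbitrary estimates and counts, not the setup from the question):

```python
# Sketch: scaling the value estimates by a factor k and scaling c by the same k
# leaves the arm selected by the UCB1 rule unchanged.
import numpy as np

def ucb_action(q, n, t, c):
    # arm chosen by the UCB1 rule above
    return int(np.argmax(q + c * np.sqrt(np.log(t) / n)))

rng = np.random.default_rng(0)
q = rng.random(17)                # arbitrary value estimates
n = rng.integers(1, 50, size=17)  # arbitrary pull counts
t, c, k = int(n.sum()), 2.0, 0.1  # k is the reward-scaling factor

# same arm is selected either way
assert ucb_action(q, n, t, c) == ucb_action(k * q, n, t, k * c)
```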

In fact, another way of looking at it is that when using UCB, you could dispense with the $c$ constant altogether and simply scale the rewards to tune the exploitation/exploration trade-off. Smaller rewards favour exploration, larger rewards favour exploitation.
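To make that equivalence explicit: for $c > 0$, multiplying the whole bracket by a positive constant does not change the argmax, so

$$\underset{a}{\operatorname{argmax}} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right] = \underset{a}{\operatorname{argmax}} \left[ \frac{1}{c} Q_t(a) + \sqrt{\frac{\ln t}{N_t(a)}} \right],$$

i.e. running UCB with constant $c$ on the original rewards is the same policy as running it with $c = 1$ on rewards scaled by $1/c$.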
