
I'm dealing with a (stochastic) Multi-Armed Bandit (MAB) problem with a large number of arms.

Consider a pizza machine that produces a pizza depending on an input $i$ (equivalent to an arm). The (finite) set of arms $K$ is given by $K=X_1\times X_2 \times X_3\times X_4$, where $X_j$ denotes the set of possible amounts of ingredient $j$.

For example, $X_1=\{$ small, medium, large $\}$ (amount of cheese) or $X_2=\{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$ (slices of salami).

Thus, running the pizza machine with input $i$ is equivalent to pulling arm $i\in K$. Because of the many possible ingredient combinations, the number of arms $|K|$ is very large (between 100,000 and 1,000,000). Depending on the pulled arm $i$, the machine generates a pizza (associated with a reward that indicates how delicious the pizza is). However, the machine's rewards are not deterministic: pulling an arm $i$ generates a reward according to an unknown (arm-specific) distribution $P_i$, with all rewards drawn from $P_i$ being i.i.d. In addition, it is possible to normalize all rewards to the interval $[0,1]$.
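To make the setup concrete, here is a minimal sketch of the arm space as a Cartesian product. The sets $X_3$ and $X_4$ below are hypothetical and chosen only to illustrate how quickly $|K|$ grows:

```python
# Minimal sketch of the combinatorial arm space; X3 and X4 are hypothetical
# ingredient sets chosen purely for illustration.
from itertools import product

X1 = ["small", "medium", "large"]   # amount of cheese
X2 = list(range(11))                # slices of salami
X3 = list(range(20))                # hypothetical third ingredient
X4 = list(range(50))                # hypothetical fourth ingredient

# Each arm i in K is one concrete ingredient combination.
K = list(product(X1, X2, X3, X4))
print(len(K))                       # |K| = 3 * 11 * 20 * 50 = 33,000
```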

The above problem corresponds to the standard stochastic MAB setting, but is characterized by the large number of arms. In the case of the pizza machine, several days of computation time are available to determine the best pizza, so the number of iterations is allowed to be large as well.

In my investigation of MAB algorithms addressing a large number of arms, I came across studies that handle up to a few thousand arms.

Are there algorithms in the MAB domain that specifically deal with large problem instances (e.g. with $|K|>100{,}000$)?


1 Answer


Without knowing the references you came across, I assume the authors were considering common applications of MABs (planning, online learning, etc.) for which the time horizon is usually small. In such applications, we usually cannot afford the large average regret that is inevitable for standard MAB algorithms due to the $\sqrt{|K|}$ factor.
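To make the $\sqrt{|K|}$ issue concrete, here is a minimal sketch of a standard unstructured algorithm (UCB1) over a flat arm set, assuming rewards in $[0,1]$ and a placeholder `pull` function. The initial round-robin alone already costs $|K|$ pulls, and every later round scans all arms to pick the maximizer:

```python
# Sketch of UCB1 over a flat (unstructured) arm set; pull() is a placeholder
# reward model standing in for the pizza machine.
import math
import random

def pull(arm):
    # Placeholder: returns a reward in [0, 1].
    return random.random()

def ucb1(arms, horizon):
    counts = {a: 0 for a in arms}
    means = {a: 0.0 for a in arms}
    for t in range(1, horizon + 1):
        if t <= len(arms):
            arm = arms[t - 1]          # round-robin: |K| pulls before any exploitation
        else:
            arm = max(arms, key=lambda a: means[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
    return max(arms, key=lambda a: means[a])
```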

Depending on the application or additional constraints you can impose on your problem, there are several works that consider structured stochastic MABs, for which much better guarantees are available than the pessimistic $\sqrt{|K|}$ bounds. One variant of structured MABs is learning on graphs [1], where one can obtain regret bounds growing with $\sqrt{\beta(G)}$, where $\beta(G)$ is the independence number of the graph $G$.
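As an illustration of how structure can help (this is *not* the graph-bandit algorithm of [1], just one common alternative structural assumption), the sketch below models the reward as linear in one-hot ingredient features and runs Thompson sampling with a Bayesian linear-regression posterior; under that assumption the statistical difficulty scales with the feature dimension $d$ rather than with $|K|$:

```python
# Hedged sketch: Thompson sampling with a *linear* reward model over one-hot
# ingredient features (an assumed structure, not the method of [1]).
import numpy as np
from itertools import product

# Toy two-ingredient space for illustration, not the full 100,000+ arm problem.
X = [["small", "medium", "large"], list(range(11))]
arms = list(product(*X))

def features(arm):
    # One-hot encode each ingredient level and concatenate the encodings.
    vec = []
    for level, levels in zip(arm, X):
        vec.extend(1.0 if level == l else 0.0 for l in levels)
    return np.array(vec)

d = sum(len(levels) for levels in X)
A = np.eye(d)        # posterior precision matrix
b = np.zeros(d)      # accumulated feature-weighted rewards

def choose_arm():
    # Sample a plausible weight vector from the posterior, then maximize over arms.
    theta_hat = np.linalg.solve(A, b)
    theta_sample = np.random.multivariate_normal(theta_hat, np.linalg.inv(A))
    return max(arms, key=lambda a: features(a) @ theta_sample)

def update(arm, reward):
    global A, b
    x = features(arm)
    A += np.outer(x, x)
    b += reward * x

def pull(arm):
    # Placeholder reward in [0, 1]; in practice this would run the pizza machine.
    return np.random.random()

for t in range(1000):
    arm = choose_arm()
    update(arm, pull(arm))
```

Note that the per-round maximization still enumerates all arms; the point of the structural assumption is that far fewer samples are needed to learn the $d$ weights than to estimate $|K|$ independent means.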

[1] F. Liu, Z. Zheng, N. Shroff, "Analysis of Thompson Sampling for Graphical Bandits Without the Graphs," arXiv, 2018.
