
I am trying to understand whether this problem can be cast both as a bandit problem and as an MDP.

Let's assume we are trying to optimize sales $y_t$ based on investments $x_{1, t}, x_{2, t}$ over some horizon $H$. Sales at timestep $t$ are modeled as $y_t = \beta_0 x_{1, t} + \beta_1 x_{2, t}$, where we have the prior $\beta_0 \sim \text{Beta}(2, 1)$, and $\beta_1$ consists of day-of-week dummies, each distributed as $\text{Beta}(2, 1)$, thus $6$ dummies in total. We can assume the parametrization is correct; however, the priors may be far from the true parameters governing the true sales equation. Hence a typical exploration vs. exploitation dilemma.

Let's assume a normally distributed likelihood with unit variance, $y_t \sim \mathcal{N}(\beta_0 x_{1, t} + \beta_1 x_{2, t},\ 1)$.
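To make the setup concrete, here is a minimal sketch of the assumed data-generating process. The "true" coefficient values, the choice of day $0$ as the dummy reference category, and the function name `sales` are all my own assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters, unknown to the agent;
# the Beta(2, 1) priors may be far from these values.
true_beta0 = 0.9
true_beta1 = rng.beta(2, 1, size=6)  # one coefficient per non-reference day

def sales(x1, x2, day, rng):
    """Draw y_t = beta0 * x1 + beta1[day] * x2 + noise, noise ~ N(0, 1)."""
    # Day 0 is the reference category (dummy contribution 0);
    # days 1..6 each select one of the six dummy coefficients.
    b1 = 0.0 if day == 0 else true_beta1[day - 1]
    mean = true_beta0 * x1 + b1 * x2
    return mean + rng.normal(0.0, 1.0)  # unit-variance Gaussian likelihood
```

The agent would maintain posteriors over $\beta_0$ and the dummies and update them from observed $(a_t, y_t)$ pairs each day.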

At every timestep we are only allowed to play one of the following $3$ actions:

$\text{Action } 1: \{x_1 = 0, x_2 = 1\}, \quad \text{Action } 2: \{x_1 = 1, x_2 = 0\}, \quad \text{Action } 3: \{x_1 = 1, x_2 = 1\}$

Let's assume we want to plan over a horizon of $5$ timesteps.

Every timestep (daily) we take an action, and every timestep we receive a reward.

We have a total action budget $B$ over the whole horizon that we must not exceed.
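One way to see what the budget and the day-of-week dummies do to the problem is to write out what a state would have to contain. This is a sketch under my own assumptions; in particular, I assume each unit of investment consumes one unit of $B$, which the question leaves unspecified:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    """Everything tomorrow's decision depends on.

    Remaining budget is changed by today's action, and day-of-week
    drives the time-varying beta_1; both are carried in the state,
    which is what makes the problem look Markovian rather than a
    plain (stateless) bandit.
    """
    t: int            # timestep, 0..H-1
    day: int          # day of week, 0..6
    budget_left: int  # remaining action budget out of B

ACTIONS = {1: (0, 1), 2: (1, 0), 3: (1, 1)}  # action id -> (x1, x2)

def step(s: State, action: int) -> State:
    x1, x2 = ACTIONS[action]
    cost = x1 + x2  # assumption: cost = total units of investment
    assert s.budget_left >= cost, "action would exceed budget B"
    return State(t=s.t + 1, day=(s.day + 1) % 7,
                 budget_left=s.budget_left - cost)
```

Note that the reward distribution itself only depends on the action and the day, not on past actions; it is the budget bookkeeping that couples decisions across timesteps.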


Question: Should this problem be cast as an MDP or as a bandit, and why?


My intuition tells me that, due to the time-varying parameter $\beta_1$ and the total budget constraint $B$, this should be cast as an MDP, since actions today have an impact on our "physical state" tomorrow. However, I have seen total budgets for bandits before, which makes me doubt. Is the budget constraint really the only thing that makes it impossible to model this as a bandit?

