
I am trying to find research papers addressing a problem that, in my opinion, deserves significant attention. However, I am having difficulty locating relevant information.

To illustrate the problem at hand, consider a multivariate Bayesian regression model $y_t = b_1 \cdot x_{1,t}^{c_1} + b_2 \cdot x_{2,t}^{c_2}$, which estimates rewards $y_t$ from inputs ($x_{1,t}$, $x_{2,t}$) at each time step $t$. The parameters $b_1$ and $b_2$ are assumed to follow a half-normal distribution, while $c_1$ and $c_2$ follow a gamma distribution. I aim to plan inputs over a given horizon $H$ under a specified budget $\left(\sum_t x_{1,t} + \sum_{t} x_{2,t} \le \mbox{budget}\right)$, with the main objective of maximizing cumulative rewards. Achieving this also requires learning about the model parameters, so that future inputs can be allocated more effectively.
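To make the setup concrete, here is a minimal sketch of how such a model could be fit. The choice of PyMC, the Gaussian observation noise, and all prior hyperparameters are illustrative assumptions rather than part of the problem statement, and the historical data below are synthetic.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n = 60                                   # e.g. 60 historical daily observations
x1 = rng.uniform(1.0, 10.0, size=n)      # past spend on input 1
x2 = rng.uniform(1.0, 10.0, size=n)      # past spend on input 2
y = 2.0 * x1**0.5 + 1.0 * x2**0.7 + rng.normal(0.0, 0.5, size=n)  # synthetic sales

with pm.Model() as model:
    b1 = pm.HalfNormal("b1", sigma=2.0)
    b2 = pm.HalfNormal("b2", sigma=2.0)
    c1 = pm.Gamma("c1", alpha=2.0, beta=2.0)
    c2 = pm.Gamma("c2", alpha=2.0, beta=2.0)
    sigma = pm.HalfNormal("sigma", sigma=1.0)  # observation noise (assumption)
    mu = b1 * x1**c1 + b2 * x2**c2
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)
```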

Each day we decide how to allocate a portion of the total budget, and each day we observe a response ($y=\mbox{sales}$). We typically have 30-300 historical daily data points to establish the initial posterior, and the planning horizon is around 30 days.

This problem exhibits a trade-off between exploration and exploitation, which is commonly encountered in communities such as reinforcement learning and dual control. However, the constraint that experiments can only be run against the assumed functional form of the model, rather than through simulations of the real environment, makes it challenging to apply standard reinforcement learning techniques. Elements of this problem also appear in other communities, such as Bayesian experimental design and active learning, but those approaches primarily address the exploration aspect.

An intuitive strategy could combine aspects of active learning, such as maximum information gain, with an $\epsilon$-greedy approach: the mean of the posterior predictive distribution drives exploitation, while maximum information gain proposes new inputs during exploration. I can of course think of more approaches. One is Thompson sampling, where we draw a single sample of the parameters from the posterior and treat it as ground truth in the downstream optimization task. Another is to draw multiple samples and use the mean of the posterior predictive distribution. UCB with some tweaks could also work. Yet another option is to repeatedly draw a Thompson sample, solve the downstream optimization task for each draw, and use the spread of the resulting solutions to gauge their variance, perhaps taking the mean or median in the end.
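For illustration, here is a minimal sketch of the Thompson-sampling idea applied to the downstream allocation. It assumes `idata` holds the posterior from the sketch above, that the budget is split evenly across the horizon (a simplification, not generally optimal), and that SciPy's SLSQP handles the per-day split; all of these are my assumptions.

```python
import numpy as np
from scipy.optimize import minimize

H, budget = 30, 300.0
daily_budget = budget / H

# Draw one posterior sample (Thompson sample) and treat it as ground truth.
post = idata.posterior                       # posterior from the PyMC sketch above
rng = np.random.default_rng(1)
chain = rng.integers(post.sizes["chain"])
draw = rng.integers(post.sizes["draw"])
b1 = float(post["b1"][chain, draw])
b2 = float(post["b2"][chain, draw])
c1 = float(post["c1"][chain, draw])
c2 = float(post["c2"][chain, draw])

def neg_reward(x):
    x1, x2 = x
    return -(b1 * x1**c1 + b2 * x2**c2)

res = minimize(
    neg_reward,
    x0=np.array([daily_budget / 2, daily_budget / 2]),
    bounds=[(1e-6, daily_budget)] * 2,
    constraints=[{"type": "ineq", "fun": lambda x: daily_budget - x.sum()}],
    method="SLSQP",
)
x1_today, x2_today = res.x  # apply this allocation, observe sales, refit the posterior
```

Repeating this each day after updating the posterior with the newly observed sales gives the full Thompson-sampling loop; drawing many samples and solving the optimization for each is the variance-estimation variant mentioned above.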


Although I believe this problem falls within the realm of sequential decision optimization and should have received extensive research attention, I find it challenging to locate specific resources that tackle this type of problem.

Question: In which communities can I find more information about my problem type?

Crossposted: https://or.stackexchange.com/questions/10611/bayesian-regression-model-as-reward-function-in-exploration-vs-exploitation-sett


EDIT

After reading the references provided in the answer below and examining various model-based Bayesian reinforcement learning techniques, several concerns arose, along with potential solution strategies. To address them and demonstrate some techniques applicable to this matter, I started a new thread. The problem formulation there closely resembles the one above, but includes additional context that more closely matches the real problem I am working on. The thread can be read here: Methods for sequential decision optimization problem with nonlinear bayesian reward function. In the current question, I am solely interested in hearing which communities have been trying to attack this problem type.


1 Answer


One community that has very recently been attacking problems of the type posed by your question is the Bayesian sequential optimal experimental design (Bayesian sOED) community. The Bayesian sOED setting generally assumes that there is a current belief state $x_k \in \mathcal{X}$ over an underlying system with some unknown quantity (in your case, unknown parameters) and possible design choices $d_k \in \mathcal{D}$ that may be conditioned on the current belief state. Performing experiment $k \in \{0, 1, 2, \ldots, N-1\}$ is equivalent to choosing a design $d_k$, receiving an observation $y_k \in \mathcal{Y}$ from the underlying system, and earning a utility $g(x_k, d_k, y_k) \in \mathbb{R}$ based on the current belief state $x_k$, design choice $d_k$, and observation $y_k$. The goal is to choose a sequence of designs $(d_0, d_1, d_2, \ldots, d_{N-1})$ that maximizes the expected sum of utility across all experiments. Learning more information about the underlying system (e.g. the unknown parameters) is directly helpful in achieving that goal by producing more informative belief states $x_{k+1}$ after the conclusion of each experiment.
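Restating that objective as a single expression (the policy notation $\pi_k$ and the explicit Bayes-update operator are shorthand on my part; everything else follows the definitions in the paragraph above):

$$\max_{\pi_0, \ldots, \pi_{N-1}} \; \mathbb{E}\!\left[\sum_{k=0}^{N-1} g(x_k, d_k, y_k)\right], \qquad d_k = \pi_k(x_k), \quad x_{k+1} = \mathrm{BayesUpdate}(x_k, d_k, y_k),$$

where the expectation is taken over the observations $y_k$ produced by the underlying system.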

The above experimental design setting is:

  • Bayesian due to incorporating a belief state over the underlying system,
  • sequential because experiments are performed sequentially to use all information about past experiments as opposed to performing all experiments concurrently (batch design),
  • optimal due to maximizing the expected sum of utility across all experiments, as opposed to performing the maximization only over the next experiment (greedy design); a toy sketch of such a greedy loop follows this list.
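To make the contrast with greedy design concrete, here is a purely illustrative toy sketch of a myopic (one-step) information-gain loop. The linear-Gaussian observation model, the grid-based belief state, and the entropy criterion are all assumptions made for illustration; they are not the sOED methodology itself, which optimizes over the whole horizon.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_grid = np.linspace(0.0, 2.0, 201)              # belief state: weights on this grid
belief = np.full(theta_grid.shape, 1.0 / theta_grid.size)
designs = np.linspace(0.1, 1.0, 10)                  # candidate designs d
theta_true, sigma = 1.3, 1.0                         # hidden truth and noise level

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

for k in range(5):
    scores = []
    for d in designs:
        exp_H = 0.0
        for _ in range(50):                          # Monte Carlo over hypothetical outcomes
            th = rng.choice(theta_grid, p=belief)    # sample theta from current belief
            y = rng.normal(th * d, sigma)            # hypothetical observation
            post = belief * norm.pdf(y, loc=theta_grid * d, scale=sigma)
            post /= post.sum()
            exp_H += entropy(post) / 50
        scores.append(exp_H)
    d_star = designs[int(np.argmin(scores))]         # min expected posterior entropy
    y_obs = rng.normal(theta_true * d_star, sigma)   # "run" the chosen experiment
    belief *= norm.pdf(y_obs, loc=theta_grid * d_star, scale=sigma)
    belief /= belief.sum()
    print(f"step {k}: chose d={d_star:.2f}, posterior mean={np.sum(belief * theta_grid):.3f}")
```

A sequential optimal design method would instead account for how today's design affects the value of future belief states, rather than minimizing only the next posterior's entropy.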

Since your exact question is simply asking about relevant research communities, I won't make any specific remarks in this answer regarding how to solve your illustrative problem using Bayesian sOED (feel free to make a follow-up question on that topic, if desired). Instead, I will list some references that I personally have found useful for solving problems in this field, with those most directly related to the Bayesian sOED problem listed first:
