
I want to use Reinforcement Learning to optimize the distribution of energy for a peak-shaving problem given by a thermodynamic simulation. However, I am not sure how to proceed, as the action space is the only thing that really matters, in the following sense:

  • The action space is a $288 \times 66$ matrix of real numbers between $0$ and $1$. The output of the simulation and therefore my reward depend solely on the distribution of this matrix.

  • The state space is therefore absent, as the only thing that matters is the matrix, over which I have total control. At this stage of the simulation, no other variables are taken into consideration.

I am not sure whether this problem falls under tabular RL or requires function approximation. In either case, I was thinking of using a policy gradient algorithm to find the best distribution for the $288 \times 66$ matrix. However, I do not know how to deal with the "absence" of a state space. Instead of a tuple $\langle s,a,r,s' \rangle$, I would just have $\langle a, r \rangle$. Is this even a problem that can be approached with RL? If not, how can I reshape it to make it solvable with RL techniques?


1 Answer


A stateless RL problem can be reduced to a Multi-armed Bandit (MAB) problem. In such a scenario, taking an action does not change the state of the agent.

So, this is the setting of a conventional MAB problem: at each time step, the agent selects an action, either as an exploration or an exploitation move. It then records the reward of the chosen action and updates its estimate of that action's value. It then repeats the procedure (selection, observation, update).
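As a rough illustration, here is a minimal sketch of that loop in Python. All the names are hypothetical: `actions` stands for a small discrete set of candidate control matrices, `reward` wraps the black-box thermodynamic simulation, and `select` is whatever selection strategy you adopt (for example the $\epsilon$-greedy rule described below).

```python
import numpy as np

def run_bandit(actions, reward, n_steps, select):
    """Generic MAB loop: select an action, observe its reward, update estimates."""
    q = np.zeros(len(actions))   # estimated value of each candidate action
    n = np.zeros(len(actions))   # number of times each action has been tried
    for _ in range(n_steps):
        i = select(q)                   # selection: explore or exploit
        r = reward(actions[i])          # observe the reward from the simulation
        n[i] += 1
        q[i] += (r - q[i]) / n[i]       # incremental mean update of the estimate
    return q
```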

To choose between exploration and exploitation moves, MAB agents adopt a strategy. The simplest one is probably $\epsilon$-greedy, in which the agent chooses the most rewarding action most of the time (with probability $1-\epsilon$) and otherwise selects an action uniformly at random (with probability $\epsilon$).
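A minimal sketch of such an $\epsilon$-greedy rule, written so it can be plugged into the hypothetical loop above (the function name and the default value of $\epsilon$ are illustrative choices, not prescribed ones):

```python
import numpy as np

def epsilon_greedy(q, epsilon=0.1, rng=None):
    """Pick a random action with probability epsilon, else the current best one."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))   # explore: uniformly random action
    return int(np.argmax(q))               # exploit: highest estimated value
```

With these pieces, a run would look like `q = run_bandit(candidate_matrices, simulate, n_steps=1000, select=epsilon_greedy)`, where `candidate_matrices` and `simulate` are whatever discretisation of the $288 \times 66$ action space and simulation wrapper fit your setup.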
