
I am trying to apply RL to a control problem, and I intend to use either Deep Q-Learning or SARSA.

I have two heat storage systems and one heating device, and the RL agent is only allowed to heat up one of them in each time slot. How can I do that?

I have two continuous variables $x(t)$ and $y(t)$, where $x(t)$ quantifies the fraction of maximum power used for heating up storage 1 and $y(t)$ quantifies the fraction of maximum power used for heating up storage 2.

Now, if $x(t) > 0$, then $y(t)$ has to be $0$, and vice versa, with $x(t), y(t) \in \{0\} \cup [0.25, 1]$. How can I communicate this constraint to the agent?

One way would be to adjust the actions after the RL agent has chosen them, using a separate control algorithm that overrules the agent's actions (sketched below). I am wondering whether, and how, this can also be done directly. I'd appreciate every comment.
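For illustration, here is a minimal sketch of that post-processing idea, assuming the raw agent outputs for the two storages are available as x_raw and y_raw; the function project_action and the constant MIN_POWER are made-up names for this sketch, not part of any library.

MIN_POWER = 0.25  # assumed lower bound of the feasible nonzero heating range

def project_action(x_raw, y_raw):
    """Project a raw (x, y) pair onto the feasible set {x = 0 or y = 0}."""
    # Keep only the larger of the two requests and zero out the other one.
    if x_raw >= y_raw:
        x, y = x_raw, 0.0
    else:
        x, y = 0.0, y_raw
    # Snap any remaining nonzero value into [MIN_POWER, 1].
    if 0.0 < x < MIN_POWER:
        x = MIN_POWER  # alternatively, round down to 0.0
    if 0.0 < y < MIN_POWER:
        y = MIN_POWER
    return min(x, 1.0), min(y, 1.0)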

Update: Of course I could do this with the reward function. But is there not a more direct way of doing it? This is actually a so-called hard constraint: the agent is not allowed to violate it at all, because violating it is technically infeasible. So it would be better to tell the agent directly not to do this (if that is possible).

Reminder: Can anyone tell me more about this issue? I'd highly appreciate any further comments and will award a bounty for a good answer.

PeterBe

3 Answers


You could just tweak your reward function to include this restriction.

In the simplest case, you could give your agent a reward of $-1$ if $x(t) > 0$ and $y(t) \neq 0$.

The scale of the negative reward depends on your overall reward scaling, of course.
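A minimal sketch of this reward shaping (the penalty value and the function name shaped_reward are illustrative assumptions, not taken from the question):

PENALTY = -1.0  # scale this relative to your normal rewards

def shaped_reward(base_reward, x, y):
    """Add a penalty whenever both storages are heated in the same time slot."""
    if x > 0 and y > 0:
        return base_reward + PENALTY
    return base_reward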

tnfru

I'm not an expert, but as far as I understand, you should use an off-policy algorithm. The difference between the two is:

On-policy: The agent learns the value function from actions derived from the policy it is currently using.

Off-policy: The agent learns the value function from actions derived from a different policy.

This means that you can use another policy to explore. For example, Q-learning (not applicable in your case because of the continuous values of your problem) is an off-policy approach: you can explore with a particular behaviour policy that only selects valid actions, and then update your Q-table with the Q-learning update equation.

In your case you can use a deep off-policy approach. I suggest DDPG/TD3; you can read a brief overview of them here.

The idea is to use an exploration policy that you restrict to select only valid values (the hard constraint), and to store the (state, action, reward, next state) tuples in the replay buffer. The Stable-Baselines library doesn't allow that out of the box, but you could check the original source code of TD3.
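A minimal sketch of that constrained exploration, assuming two continuous outputs whose nonzero power lies in [0.25, 1]; sample_valid_action and MIN_POWER are illustrative names, not any library's API:

import random

MIN_POWER = 0.25  # assumed lower bound of the feasible nonzero heating range

def sample_valid_action():
    """Sample a random exploration action that respects the hard constraint:
    at most one storage is heated, and nonzero power lies in [MIN_POWER, 1]."""
    power = random.uniform(MIN_POWER, 1.0)
    if random.random() < 0.5:
        return (power, 0.0)  # heat storage 1 only
    return (0.0, power)      # heat storage 2 only

# During training, exploration steps call sample_valid_action(), the action is
# applied through the environment's step function, and the resulting
# (state, action, reward, next state) tuple is stored in the replay buffer,
# so the network only ever sees feasible actions.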

Edit 1:

If you look at the Q-learning algorithm, the $\epsilon$-greedy policy consists of selecting, with probability $\epsilon$, $a \gets \text{any action}$, and with probability $1-\epsilon$, $a \gets \arg\max_{a}Q(s,a)$. This $\text{any action}$ is the part of the code where you use your "controller" to select only random (but valid) actions. This is because you want to explore, but only with valid actions. Q-learning can then "exploit" by picking the best action based on the exploration you did before. Now, for your case with continuous actions, you can use DDPG/TD3 to do something similar, but you store these valid actions in a replay buffer, so your neural network can learn from this "data" containing only valid actions.

(Image: Q-learning algorithm pseudocode)
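For the discrete/tabular case, a minimal sketch of such a restricted $\epsilon$-greedy selection could look like this (Q, valid_actions and epsilon are illustrative placeholders):

import random

def epsilon_greedy(Q, state, valid_actions, epsilon=0.1):
    """With probability epsilon pick a random valid action, otherwise the best valid one."""
    if random.random() < epsilon:
        return random.choice(valid_actions)  # explore, but only over valid actions
    return max(valid_actions, key=lambda a: Q[(state, a)])  # exploit over valid actions

# Tiny example: action 0 = "heat storage 1", action 1 = "heat storage 2".
Q = {("s0", 0): 0.2, ("s0", 1): 0.5}
print(epsilon_greedy(Q, "s0", valid_actions=[0, 1]))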

Edit 2:

In your custom environment you can define your action space like:

self.action_space = gym.spaces.Box(low=-1, high=1, shape=(1,))

Now, as you said, in the step function of your environment you can derive x(t) and y(t) from the action:

maxX = 10  # depends on the maximum value of your x(t); I assigned 10 here
maxY = 10  # depends on the maximum value of your y(t); I assigned 10 here
x = 0
y = 0
if action > 0:
    # a positive action heats storage 1 only
    y = 0
    x = action * maxX
elif action < 0:
    # a negative action heats storage 2 only;
    # multiply by -1 because the action is negative
    x = 0
    y = -1 * action * maxY
# do the rest of the code of your controller with x and y

In this way, your RL agent will learn which action (between $-1$ and $1$) gets the best reward, while in the step function you map the action from $[-1, 1]$ to your true values.

Pulse9

When you take a step in the DQL process, you select a move based on the estimated Q-values of each possible action. During that step, you can restrict your selection method so that forbidden actions have probability $0$ of being chosen.
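A minimal sketch of that masking idea for a discretised action set (q_values and valid_mask are illustrative placeholders, not part of any DQN library): invalid actions get a Q-value of minus infinity, so they can never be selected.

import numpy as np

def masked_greedy_action(q_values, valid_mask):
    """Return the index of the best action among the valid ones."""
    masked = np.where(valid_mask, q_values, -np.inf)
    return int(np.argmax(masked))

# Example: actions = [heat storage 1, heat storage 2, heat both, heat none];
# "heat both" violates the hard constraint and is masked out.
q_values = np.array([0.3, 0.8, 1.5, 0.1])
valid_mask = np.array([True, True, False, True])
print(masked_greedy_action(q_values, valid_mask))  # prints 1, never the masked action 2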

nnolte