
I have a control problem for a heating device of a building, with the goal of minimizing the electricity costs for one day under an electricity price that varies every hour (more details can be found here as well: Reinforcement learning applicable to a scheduling problem?).

I also want to test two further goals (minimize peak load and maximize PV self-consumption rate).

My problem also has about 10 constraints that should not be violated.

Here are my two main questions (each with minor follow-up questions) about how to integrate the constraints into the Reinforcement Learning agent:

(1) Basically I have three goals with normalized rewards between 0 and 1 for every time-slot, and I have 10 constraints.

  Should the constraint rewards also be normalized for all 10 constraints? And should I then choose a higher weight for the most important constraint than for all three goals combined, so that violating a constraint is more costly than achieving a better objective value on all three goals?

(2) Is it also possible to tell the Reinforcement Learning agent some rules directly without any constraints?

  E.g. I have two storage systems, and the agent is only allowed to heat up one of them in every time-slot. Further, the agent should not start and stop heating frequently (around four starts of the device per day is desirable).

  Can I explicitly tell these rules to the agent? Or do I have to do this indirectly by calculating a reward for each of these constraints and incorporating the weighted reward into the overall reward function of the agent?

I'd appreciate any suggestions and comments.


1 Answer


Should the constraint rewards also be normalized for all 10 constraints?

You should choose a "natural" balance between rewards where possible.

If you have many separate goals to take account of, ideally you should convert them all into some comparable metric that is meaningful to the success of the agent, such as a financial gain/loss, an energy gain/loss, or similar. You can normalise them after this, but the ratios between the values should be kept the same.
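For example, a minimal sketch of this idea (the conversion factors below are placeholder assumptions, not values from the question) might look like:

```python
# Rough sketch: express the three goals in one monetary unit before
# normalising. The conversion factors below are made-up placeholders.
PEAK_LOAD_EUR_PER_KW = 0.5   # assumed cost attributed to each kW of peak load
PV_VALUE_EUR_PER_KWH = 0.2   # assumed value of each kWh of self-consumed PV
SCALE_EUR = 100.0            # one fixed divisor used only for normalisation

def combined_goal_reward(electricity_cost_eur, peak_load_kw, pv_self_use_kwh):
    """Combine the three goals in euros, then divide by one common scale
    so the ratios between the terms are preserved."""
    reward_eur = (-electricity_cost_eur
                  - PEAK_LOAD_EUR_PER_KW * peak_load_kw
                  + PV_VALUE_EUR_PER_KWH * pv_self_use_kwh)
    return reward_eur / SCALE_EUR
```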

Converting to a natural common metric like this is not always possible with constraints.

For strict constraints, you should ideally ensure that breaking any constraint scores worse than staying within all constraints while scoring very badly on everything else. If your system is gaining positive rewards from operating within bounds (and from the description it seems that it is), then one simple way to achieve this is to terminate the episode early and score $0$ when a constraint is broken. If the constraint relates to an ongoing state measurement, this may be the best option, because having the agent learn how to "escape" from a state it should never reach may not be useful.
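A minimal sketch of this termination pattern, assuming a gym-style step function; the helper methods on env are hypothetical placeholders, not part of any library:

```python
def step(env, action):
    """One environment step with early termination on a hard constraint
    violation. env.simulate, env.violates_hard_constraint, env.reward and
    env.is_end_of_day are hypothetical helpers."""
    next_state = env.simulate(action)
    if env.violates_hard_constraint(next_state):
        # End the episode immediately with reward 0, so any trajectory that
        # stays within bounds scores at least as well as this one.
        return next_state, 0.0, True
    return next_state, env.reward(next_state), env.is_end_of_day(next_state)
```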

For soft constraints, you need to decide a relative cost. For example, your constraint:

the agent should not start and stop heating frequently (around four starts of the device per day is desirable)

looks very much like a soft constraint - it uses words like "around" and "desirable". For something like that, I would probably allow four starts, then add a fairly large penalty for each start beyond that. What that value is should be related back to the natural balance between rewards and to why you want this constraint.
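A sketch of that kind of penalty (the penalty size below is a placeholder to be tuned against the natural reward balance):

```python
MAX_FREE_STARTS = 4    # starts per day allowed without penalty
START_PENALTY = 0.25   # placeholder size; tune against the natural reward balance

def shaped_reward(base_reward, device_started, starts_so_far_today):
    """Subtract a penalty only when a new start exceeds the daily quota."""
    if device_started and starts_so_far_today >= MAX_FREE_STARTS:
        return base_reward - START_PENALTY
    return base_reward
```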

As an aside, in order for the agent to learn about this constraint, you must add the number of starts so far for each device to the state. This is true for all constraints - there must be data in the current state that the agent could use to predict that a constraint is about to come into play. It doesn't need to know the limit you are applying, but it does need to know the current value of any variables used to decide whether a limit should be enforced.

I think you will also want to store what the last action was, or which device is currently on, so that the agent knows to keep it on in order not to waste one of its four uses of the device per day.
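A sketch of such a state vector (the feature names are illustrative, not taken from the question):

```python
import numpy as np

def build_observation(hour, price_forecast, storage_temps,
                      starts_so_far, active_device):
    """State vector exposing the information the agent needs to anticipate
    the starts-per-day constraint. active_device is 0 for 'none', or 1 / 2
    for whichever storage is currently heating."""
    return np.concatenate([
        [hour / 24.0],                  # time of day
        np.asarray(price_forecast),     # upcoming electricity prices
        np.asarray(storage_temps),      # current storage temperatures
        [starts_so_far / 4.0],          # starts used so far today
        np.eye(3)[active_device],       # one-hot encoding of the running device
    ])
```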

Is it also possible to tell the Reinforcement Learning agent some rules directly without any constraints?

Yes, for absolute rules that should prevent the agent from taking a specific action in the first place, and that will also be enforced in any production system. For example, this rule:

I have two storage systems, and the agent is only allowed to heat up one of them in every time-slot.

This can easily be expressed by having three actions:

  • $a_0$ no heating
  • $a_1$ heat device 1
  • $a_2$ heat device 2

Although you may want:

  • $a_0$ do nothing
  • $a_1$ start device 1 (and stop device 2 if running)
  • $a_2$ start device 2 (and stop device 1 if running)
  • $a_3$ stop whichever device is running

This second action set may help prevent the agent from flip-flopping between devices during the early stages of learning, and speed things up a little.
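A sketch of how this second action set could be mapped onto device commands (the bookkeeping of which device is currently running is assumed, not something from the question):

```python
def apply_action(action, active_device):
    """Map the four discrete actions onto which device ends up heating.
    Returns 0 when nothing is heating, or 1 / 2 for the running device;
    real device start/stop calls would go where the values are returned."""
    if action == 0:            # a0: do nothing, keep the current situation
        return active_device
    if action in (1, 2):       # a1 / a2: start that device, stop the other
        return action
    return 0                   # a3: stop whichever device is running
```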

Other action sets could work too. The important thing is that there is nothing to gain by presenting the agent with action choices that you would rule out as impossible anyway; doing so would only mean adding more negative rewards and constraints to what already looks like a complex problem.

Neil Slater