10

I am using policy gradients in my reinforcement learning algorithm, and occasionally my environment provides a severe penalty (i.e. negative reward) when a wrong move is made. I'm using a neural network with stochastic gradient descent to learn the policy. To do this, my loss is essentially the cross-entropy loss of the action distribution multiplied by the discounted rewards, where most often the rewards are positive.
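For concreteness, here is roughly what this looks like. This is only a minimal sketch: the PyTorch framing, the toy network, and the numbers are placeholders, not my actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

policy_net = nn.Linear(4, 3)                        # toy policy: 4-dim state -> 3 actions
states = torch.randn(5, 4)                          # batch of visited states
actions = torch.tensor([0, 2, 1, 1, 0])             # actions that were actually taken
returns = torch.tensor([1.0, 0.5, 2.0, -3.0, 1.0])  # discounted returns; one is a penalty

log_probs = F.log_softmax(policy_net(states), dim=-1)
taken_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

# "Cross-entropy multiplied by the discounted rewards": -log pi(a_t | s_t) * G_t.
# The -3.0 return makes its loss term negative.
loss = (-taken_log_probs * returns).mean()
loss.backward()
```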

But how do I handle negative rewards? Since the loss will occasionally go negative, the optimizer will treat those actions as very good and strengthen the weights in the direction of the penalized actions. Is this correct, and if so, what can I do about it?


Edit:

In thinking about this a little more, SGD doesn't necessarily weaken weights directly; it only strengthens weights in the direction of the gradient, and as a side effect, weights for other states outside the gradient get diminished, correct? So I can simply set reward = 0 when the reward is negative, and those states will be ignored in the gradient update. It still seems unproductive not to account for states that are really bad, and it would be nice to include them somehow, unless I'm misunderstanding something fundamental here.
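In code, the idea would be something like this (again just a sketch, continuing the one above):

```python
# Clamping negative returns to zero makes those transitions drop out of the
# gradient entirely: their loss terms become 0 and contribute no update.
clipped_returns = torch.clamp(returns, min=0.0)      # the -3.0 step becomes 0.0
loss = (-taken_log_probs * clipped_returns).mean()
```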

nbro
jstaker7

2 Answers

7

It depends on your loss function, but you probably need to tweak it.

If you are using an update rule like loss = -log(probabilities) * reward, then your loss is high when you unexpectedly received a large reward: the policy will update to make that action more likely, in order to realize that gain.

Conversely, if you get a negative reward with high probability, the loss will be negative. However, in minimizing this loss, the optimizer will try to make it "even more negative" by making the log probability more negative (i.e. by making that action less likely), so it kind of does what we want.

However, this punishes an improbable action that led to a penalty far more strongly than a probable one (the gradient of -log(p) grows as p approaches 0), when we probably want the opposite: a penalty following a likely action should produce the stronger correction. Hence, loss = -log(1 - probabilities) * |reward| might be more appropriate when the reward is negative; note the magnitude, since multiplying by the raw negative reward would flip the sign again and reinforce the bad action.
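Here is a sketch of what that piecewise loss could look like (PyTorch and the numbers are purely illustrative; the magnitude in the negative branch keeps that term positive):

```python
import torch

p = torch.tensor([0.9, 0.1, 0.8], requires_grad=True)  # probability assigned to each action taken
rewards = torch.tensor([1.0, 0.5, -2.0])                # the last step received a penalty
eps = 1e-8                                              # avoid log(0)

pos_loss = -torch.log(p + eps) * rewards                # reward >= 0: make the action more likely
neg_loss = -torch.log(1 - p + eps) * rewards.abs()      # reward <  0: push p toward 0
loss = torch.where(rewards >= 0, pos_loss, neg_loss).mean()

loss.backward()
print(p.grad)  # the largest gradient is on the probable action that received the penalty
```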

Tahlor
3

The cross-entropy loss will never be negative, because the probability is in the range $(0, 1]$, so $-\ln(p) \geq 0$; any negative loss therefore comes entirely from the reward factor.
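A quick numerical check of that (plain Python, purely illustrative):

```python
import math

# -ln(p) is never negative for p in (0, 1], so any negative value of
# -log(p) * reward can only come from the reward factor.
for p in (0.01, 0.5, 0.99):
    print(p, -math.log(p))  # ~4.61, ~0.69, ~0.01 -- all non-negative
```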

nbro