
I am a newbie to reinforcement learning (RL) and am currently developing my own RL agent using the Stable-Baselines3 implementation of PPO. My work is based on a reward function that incorporates multiple sub-rewards. While tackling the problem, I’ve encountered several challenges for which I couldn’t find clear answers.

I’ve come across some discussions where engineers mentioned that using negative rewards might lead to different agent behavior compared to using positive rewards. Is this true? If so, how might the agent’s behavior differ?

Jay

2 Answers


I think it depends on how you define or scale your reward function and on the nature of the environment (episodic vs. continuing).

In this blog post, which I wrote a few years ago, I show that adding a constant to all rewards gives you essentially the same Bellman equation. This suggests that what matters is not the absolute value of the rewards but their relative values; in particular, the set of optimal policies does not change if you scale all rewards by the same positive constant. In practice, however, episodes are usually finite, and shifting or scaling the rewards can have practical consequences. I cannot say exactly which in all cases, but I show in the blog post that this can lead to reward misspecification.
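As a rough illustration (not code from the blog post), here is a minimal sketch of an affine reward transformation $r' = c \cdot r + b$, assuming a Gymnasium-style environment API; the wrapper name and the `scale`/`shift` parameters are purely illustrative.

```python
# A minimal sketch (not from the blog post) of an affine reward transformation
# r' = scale * r + shift, assuming a Gymnasium-style environment API.
# The class name and the `scale`/`shift` parameters are purely illustrative.
import gymnasium as gym


class AffineRewardWrapper(gym.RewardWrapper):
    """Applies r' = scale * r + shift to every reward the environment emits."""

    def __init__(self, env, scale=1.0, shift=0.0):
        super().__init__(env)
        self.scale = scale
        self.shift = shift

    def reward(self, reward):
        # Scaling by a positive constant preserves the set of optimal policies;
        # adding a constant can matter in finite (episodic) settings, e.g. it
        # can change how attractive it is to terminate an episode early.
        return self.scale * reward + self.shift


# Example usage with Stable-Baselines3 PPO (the environment id is just an example):
# from stable_baselines3 import PPO
# env = AffineRewardWrapper(gym.make("CartPole-v1"), scale=0.1, shift=0.0)
# model = PPO("MlpPolicy", env).learn(total_timesteps=10_000)
```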

Anyway, back in 1999, Ng et al. showed that potential-based reward shaping does not change the set of optimal policies, so you might want to look into this if you have sparse rewards. I also wrote another related answer here.
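To make the idea concrete, here is a rough sketch of potential-based shaping as a Gymnasium-style wrapper; the wrapper itself and the user-supplied potential function `phi` are assumptions for illustration, not code from the paper.

```python
# Rough sketch of potential-based reward shaping (Ng et al., 1999), assuming a
# Gymnasium-style environment and a user-supplied potential function `phi`
# over observations (both are assumptions for illustration).
import gymnasium as gym


class PotentialShapingWrapper(gym.Wrapper):
    """Adds the shaping term F(s, s') = gamma * phi(s') - phi(s) to each reward."""

    def __init__(self, env, phi, gamma=0.99):
        super().__init__(env)
        self.phi = phi        # potential function: observation -> float
        self.gamma = gamma
        self._last_obs = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Potential-based shaping does not change the set of optimal policies,
        # but it can make sparse rewards denser and easier to learn from.
        shaping = self.gamma * self.phi(obs) - self.phi(self._last_obs)
        self._last_obs = obs
        return obs, reward + shaping, terminated, truncated, info
```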

For some tips on how to define the reward functions, see e.g. my answer here.

nbro

Negative rewards are fine in theory, since the agent's objective remains the same, to maximize the total return, and nowhere in the generalized policy iteration process are negative rewards prohibited. Your engineers might be referring to training instability: large accumulated negative rewards during batch or online training can cause exploding or vanishing gradients in gradient-based RL algorithms, or other related numerical issues. In practice, to avoid such issues as much as possible, you should scale your negative rewards to a small range, say $[-1, 0]$.
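As a concrete sketch of that last point, a simple linear squashing into $[-1, 0]$ could look like the following; the `min_reward` bound is an assumed, user-provided estimate of the worst-case penalty, not something this answer specifies.

```python
# Illustrative sketch: linearly map rewards from [min_reward, 0] into [-1, 0].
# `min_reward` is an assumed estimate of the most negative reward the
# environment can emit; it is not something the original answer specifies.
import numpy as np


def scale_negative_reward(reward, min_reward):
    """Map a reward in [min_reward, 0] into [-1, 0] (min_reward must be < 0)."""
    # Clip first so an unexpectedly large penalty cannot fall outside [-1, 0].
    clipped = np.clip(reward, min_reward, 0.0)
    return clipped / abs(min_reward)


# Example: penalties in [-50, 0] become penalties in [-1, 0].
# scale_negative_reward(-25.0, min_reward=-50.0)  # -> -0.5
```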

cinch