
My RL project has positive, continuous rewards at every step, and the goal is to maximize the cumulative (episodic) reward. The problem is that the per-step rewards are all very close together, between 5 and 6, so the differences between good and bad actions are small and reaching the optimal episodic reward is harder.

What scaling methods are recommended (e.g. min-max scaling or reward ** 3)?

How can I emphasize the episodic reward?

nbro

2 Answers


I can't remember where I first saw it, but normalizing the returns has always worked for me. Assuming you have a list of the discounted returns for each action, you subtract the mean of the list from every element and then divide by its standard deviation. In Python with NumPy, that would look like:

import numpy as np
returns -= np.mean(returns)        # center the returns around zero
returns /= np.std(returns) + 1e-8  # scale to unit variance (epsilon avoids division by zero)

This puts the returns into a small, consistent range (roughly zero mean and unit variance), so learning behaves similarly regardless of the scale of the original rewards.
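
For context, here is a minimal sketch of how that normalization might be used on a whole episode, assuming rewards is the list of per-step rewards and gamma is the discount factor (both names are just for illustration, not from the original answer):

import numpy as np

def normalized_returns(rewards, gamma=0.99):
    """Compute discounted returns for one episode, then normalize them."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Work backwards through the episode: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    # Zero mean, unit variance; epsilon avoids division by zero
    returns -= np.mean(returns)
    returns /= np.std(returns) + 1e-8
    return returns

# Example with rewards clustered between 5 and 6, as in the question
print(normalized_returns([5.2, 5.8, 5.1, 5.9]))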

S2673

I ran into a similar issue before: the rewards were all clustered close together, like between 5 and 6, and it made it really hard for the agent to learn anything meaningful. Min-max scaling can help if you want to keep everything in a standard range, but honestly I found that non-linear scaling (like squaring or cubing) made more of a difference in my case. A rough sketch of both is below.
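
Here is a small sketch of those two options, using the 5-to-6 bounds from the question (combining min-max scaling with cubing is just one way to do the non-linear stretch, not the only one):

def min_max_scale(r, r_min=5.0, r_max=6.0):
    # Map a reward from [r_min, r_max] to [0, 1]
    return (r - r_min) / (r_max - r_min)

def cubic_scale(r, r_min=5.0, r_max=6.0):
    # Min-max scale first, then cube to stretch the differences near the top
    return min_max_scale(r, r_min, r_max) ** 3

# 5.5 and 5.9 look almost identical as raw rewards, but not after cubing
print(cubic_scale(5.5), cubic_scale(5.9))  # 0.125 vs ~0.729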

Also, depending on your setup, it might help to focus less on per-step rewards and more on the total episode reward. I sometimes zero out the step rewards and only give a single reward at the end of the episode, which forces the agent to care about the full trajectory; see the sketch below.
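
A minimal sketch of that idea as a Gymnasium reward wrapper; the use of Gymnasium and the wrapper/class names here are my assumptions about your setup, not part of the original answer:

import gymnasium as gym

class EpisodicRewardWrapper(gym.Wrapper):
    # Withholds per-step rewards and returns the accumulated sum at episode end

    def reset(self, **kwargs):
        self.total_reward = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.total_reward += reward
        # Only reveal the cumulative reward when the episode ends
        shaped = self.total_reward if (terminated or truncated) else 0.0
        return obs, shaped, terminated, truncated, info

# Usage (hypothetical environment): env = EpisodicRewardWrapper(gym.make("CartPole-v1"))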

Hope that helps.

Jien Weng