
The following is from page 17 of "Michael Hu, “The Art of Reinforcement Learning: Fundamentals, Mathematics, and Implementations with Python”, Apress, 2023"

https://link.springer.com/book/10.1007/978-1-4842-9606-6

An example of good reward engineering is in the game of Atari Breakout, where the goal of the agent is to clear all the bricks at the top of the screen by bouncing a ball off a paddle. One way to design a reward function for this game is to give the agent a positive reward for each brick it clears and a negative reward for each time the ball passes the paddle and goes out of bounds. However, this reward function alone may not lead to optimal behavior, as the agent may learn to exploit a loophole by simply bouncing the ball back and forth on the same side of the screen without actually clearing any bricks.

This part is not clear:

However, this reward function alone may not lead to optimal behavior, as the agent may learn to exploit a loophole by simply bouncing the ball back and forth on the same side of the screen without actually clearing any bricks.

Why should the agent bounce the ball back and forth on the same side of the screen? Is there a reward in this case?

nbro
DSPinfinity

1 Answer


There may be cases where, for example, the agent clears one brick, collects the positive reward, and then bounces the ball on the same side of the screen forever. That strategy can yield a higher cumulative reward than trying to clear as many bricks as possible, if the latter incurs many negative rewards (for losing the ball) that are large compared to the positive rewards. It's like a conservative player who wins one game and then stops playing altogether. Or consider this analogy: if a downvote on an answer cost as much as an upvote gains, many more people would be reluctant to answer questions at all.

Therefore, appropriate reward engineering is crucial. In certain scenarios, a poorly designed reward function might inadvertently incentivize the agent to adopt suboptimal strategies that maximize cumulative reward without actually achieving the main objective of the task.
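
To make the comparison concrete, here is a minimal sketch of the arithmetic. All numbers are made up for illustration and do not come from the book or from the actual Atari environment: the reward per brick, the penalty per lost ball, the 20-step episode length, and the helper name `discounted_return` are all assumptions.

```python
def discounted_return(rewards, gamma=1.0):
    """Cumulative reward of one episode: sum of gamma**t * r_t."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Hypothetical reward design (assumed values, for illustration only).
BRICK_REWARD = 1.0
EPISODE_LENGTH = 20


def safe_episode():
    """Clear one brick, then bounce the ball safely: no further rewards."""
    return [BRICK_REWARD] + [0.0] * (EPISODE_LENGTH - 1)


def aggressive_episode(lost_ball_penalty):
    """Clear 10 bricks, but lose the ball 3 times along the way."""
    return [BRICK_REWARD] * 10 + [lost_ball_penalty] * 3 + [0.0] * 7


safe = discounted_return(safe_episode())

# Harsh penalty: each lost ball costs more than several bricks are worth.
aggressive_harsh = discounted_return(aggressive_episode(-5.0))

# Mild penalty: each lost ball costs the same as one brick is worth.
aggressive_mild = discounted_return(aggressive_episode(-1.0))

print("safe:", safe)
print("aggressive, harsh penalty:", aggressive_harsh)
print("aggressive, mild penalty:", aggressive_mild)
```

With these assumed numbers, the safe strategy earns 1, while the aggressive strategy earns 10 - 15 = -5 under the harsh penalty and 10 - 3 = 7 under the mild one. So whether the "loophole" is attractive depends on how large the penalty is relative to the brick reward, which is exactly the design choice reward engineering has to get right.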

cinch