
In Sutton and Barto's RL book, the reward hypothesis is stated as

that all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)
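To make the statement concrete, I read it as saying (my own formalization, not a quote from the book) that every goal can be cast as maximizing the expected return built from a scalar reward $R_t$:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad \text{goal} \;\equiv\; \max_\pi \, \mathbb{E}_\pi\!\left[G_t\right], \qquad \gamma \in [0, 1].$$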

Are there examples of tasks where the goals and purposes cannot be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal?

All I can think of are tasks with subjective rewards, like "writing good music", but I am not convinced because maybe this is actually definable (perhaps by some super-intelligent alien) and we just aren't smart enough yet. Thus, I'm especially interested in counterexamples that logically or provably fail the hypothesis.

Bananin

4 Answers


What if a scalar reward is insufficient, or it is unclear how to collapse a multi-dimensional reward into a single dimension? For example, for someone eating a burger, both taste and cost are important. Agents may prioritize taste and cost differently, so it is not clear how to aggregate the two. It is also not clear how a subjective, categorical taste value can be combined with a numerical cost.
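As a hypothetical sketch (the weights and the mapping from a taste category to a number below are arbitrary assumptions, not something the burger example fixes), any attempt to apply the reward hypothesis here forces a scalarization choice:

```python
# Hypothetical sketch: collapsing a multi-dimensional (taste, cost) outcome
# into a single scalar reward requires arbitrary choices that differ per agent.

# Arbitrary assumption: map a categorical taste rating onto a number.
TASTE_SCORE = {"bad": 0.0, "okay": 0.5, "great": 1.0}

def scalar_reward(taste: str, cost: float, taste_weight: float, cost_weight: float) -> float:
    """Weighted-sum scalarization; the weights encode a subjective trade-off."""
    return taste_weight * TASTE_SCORE[taste] - cost_weight * cost

# Two agents eating the same burger get different "rewards"
# because they weigh taste against cost differently.
print(scalar_reward("great", cost=8.0, taste_weight=10.0, cost_weight=0.5))  # 6.0
print(scalar_reward("great", cost=8.0, taste_weight=2.0, cost_weight=1.0))   # -6.0
```

The point is only that the weights and the taste-to-number mapping are extra modelling decisions that the hypothesis itself does not tell you how to make.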

Kumar

I believe that there is no clear answer to your question. It essentially boils down to whether you are a reductionist – whether you believe that quantitative measurements can truly do justice to the complexity of the real world, and that a framework such as expected-reward maximization can losslessly capture what we care about as humans when performing tasks.

From a non-reductionist perspective, one would be aware that almost any mathematical representation of complex real-world goals will necessarily be a proxy rather than the true goal (as many goals are not mathematically formalizable, such as what we perceive as "good music" or "meaning"), and thus the reward hypothesis is at best an approximation. Based on this, a non-reductionist's reward hypothesis could be rephrased as:

that all of what we mean by goals and purposes can be approximately operationalized (albeit at a certain domain-dependent loss) as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)

Clearly the original (stricter) version of the reward hypothesis does apply in some cases, such as purely quantitative domains (e.g. maximizing $ earned on the stock market, or maximizing score in a video game), but as soon as the problem involves enough "complexity" (e.g. humans, or wherever you think the boundary should be), a non-reductionist would say that mathematics is simply not up to the task of truly capturing the intended goal.

More info on the reward hypothesis (as presented by Michael Littman himself) is here. I would have added it as a comment to the question but do not have enough reputation.

mdc

The closest counterexamples I can think of are cases where reward shaping is required to learn a good policy but ends up having unintended consequences.

Reward shaping is usually used when we want to encourage a particular behavior, when the reward is sparse, or when capturing exactly what we want is difficult or infeasible. But it is not good practice to rely on it too much, as it can have unintended consequences. A simple example of this is described here: https://openai.com/blog/faulty-reward-functions/.
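As a toy, made-up illustration (the environment and numbers are assumptions chosen only to make the point, not taken from the linked post), a shaping bonus that can be collected repeatedly can make farming the bonus more rewarding than finishing the task:

```python
# Toy sketch of reward shaping gone wrong: a respawning checkpoint bonus
# makes looping more rewarding than actually reaching the goal.

FINISH_REWARD = 10.0      # sparse reward for completing the task
CHECKPOINT_BONUS = 1.0    # shaping bonus, intended to guide the agent

def episode_return(checkpoints_hit: int, finished: bool) -> float:
    """Total (undiscounted) return for one episode under the shaped reward."""
    return CHECKPOINT_BONUS * checkpoints_hit + (FINISH_REWARD if finished else 0.0)

# Intended behaviour: pass a few checkpoints, then finish.
print(episode_return(checkpoints_hit=3, finished=True))    # 13.0

# Unintended behaviour: circle through respawning checkpoints forever.
print(episode_return(checkpoints_hit=50, finished=False))  # 50.0
```

Note that the shaped objective is still the expected cumulative sum of a scalar reward, so it formally satisfies the hypothesis; it just no longer encodes the goal we actually had in mind.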


The book sets this hypothesis up by laying out a few assumptions:

In reinforcement learning, the purpose or goal of the agent is formulated in terms of a special signal called the reward, passing from the environment to the agent. At each time step, the reward is a simple number.

We could think about what counterexamples to those assumptions might be:

  1. The reward signal originates internally, instead of originating from the environment. (e.g. meditation, or abstract introspection)
  2. The signal is not received every time step, or isn't necessarily expected to be received at all. (e.g. seeking of transcendent experiences)

What might be common for these counterexamples is that the reinforcement learning mechanism itself undergoes spontaneous change. A signal that would have been positive before the spontaneous change might now be negative. The reward landscape itself might be completely different. From the agent's perspective, it might be impossible to evaluate what changed. The agent might have a 'subconscious' secondary algorithm that introduces changes in the learning algorithm itself, in a way that's decoupled from any reward-defined behavior.
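One way to picture this (a made-up sketch, not something taken from the book) is a scalar reward whose landscape flips at some switch time the agent never observes:

```python
# Made-up sketch: the same action that used to be rewarded becomes penalized
# after an unobserved switch time, i.e. the reward landscape changes spontaneously.

def reward(action: str, t: int, switch_time: int = 100) -> float:
    """Scalar reward whose sign flips at switch_time, invisible to the agent."""
    sign = 1.0 if t < switch_time else -1.0
    return sign * (1.0 if action == "introspect" else 0.0)

print(reward("introspect", t=50))   #  1.0 before the change
print(reward("introspect", t=150))  # -1.0 after the change
```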

bey