
If we shift the rewards by any constant (which is a type of reward shaping), the optimal state-action value function (and so the optimal policy) does not change. The proof of this fact can be found here.

If that's the case, then why does a negative reward for every step encourage the agent to quickly reach the goal (which is a specific type of behavior/policy), given that such a reward function has the same optimal policy as the shifted reward function where all rewards are positive (or non-negative)?

More precisely, let $s^*$ be the goal state, then consider the following reward function

$$ r_1(s, a)= \begin{cases} -1, & \text{ if } s \neq s^*\\ 0, & \text{ otherwise} \end{cases} $$

This reward function $r_1$ is supposed to encourage the agent to reach $s^*$ as quickly as possible, so as to avoid being penalized.

Let us now define a second reward function as follows

\begin{align} r_2(s, a) &\triangleq r_1(s, a) + 1\\ &= \begin{cases} 0, & \text{ if } s \neq s^*\\ 1, & \text{ otherwise} \end{cases} \end{align}

This reward function has the same optimal policy as $r_1$, but it does not incentivize the agent to reach $s^*$ as quickly as possible, given that the agent is not penalized for every step. So, in theory, $r_1$ and $r_2$ lead to the same behavior. If that's the case, then why do people say that $r_1$ encourages the agent to reach $s^*$ as quickly as possible? Is there a proof that shows that $r_1$ encourages a different type of behavior than $r_2$ (and how is that even possible, given what I have just said)?

nbro

1 Answer


Your examples are equivalent: the agent maximizes the return, and the only way to do so is by reaching $s^*$. However, it is possible to find a constant shift that yields a different optimal policy.

Now consider $r_3 \triangleq r_1 + 2$:

$$ r_3(s, a)= \begin{cases} 1, & \text{ if } s \neq s^*\\ 2, & \text{ otherwise} \end{cases} $$

With a sufficiently large $\gamma$, moving around indefinitely without ever reaching $s^*$ (and therefore never ending the episode) is now the optimal solution.
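To see where the threshold comes from, here is a worked comparison (a sketch, assuming the episode terminates upon entering $s^*$, that the reward $2$ is collected on that final transition, and that the horizon is otherwise unbounded). If the agent reaches $s^*$ after $t$ steps, versus never reaching it at all,

\begin{align} G_{\text{reach}} &= 1 + \gamma + \dots + \gamma^{t-2} + 2\gamma^{t-1} = \frac{1-\gamma^{t-1}}{1-\gamma} + 2\gamma^{t-1},\\ G_{\text{avoid}} &= 1 + \gamma + \gamma^2 + \dots = \frac{1}{1-\gamma},\\ G_{\text{avoid}} - G_{\text{reach}} &= \gamma^{t-1}\left(\frac{1}{1-\gamma} - 2\right), \end{align}

which is positive exactly when $\gamma > 1/2$, no matter how quickly $s^*$ could be reached.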

For the generic case

$$ r_4(s, a)= \begin{cases} \alpha, & \text{ if } s \neq s^*\\ \beta, & \text{ otherwise} \end{cases} $$

the threshold is found by comparing the discounted return of never reaching $s^*$, $\alpha + \gamma\alpha + \gamma^2\alpha + \dots + \gamma^{t_m-1}\alpha$, where $t_m$ is the maximum episode length, with the discounted return of the fastest policy, $\alpha + \gamma\alpha + \dots + \gamma^{t^*-2}\alpha + \gamma^{t^*-1}\beta$, where $t^*$ is the length of the episode under the fastest policy.
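As a quick numeric check of this comparison (a small Python sketch; the parameter values are just illustrative, and it assumes rewards are collected exactly as in the sums above):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite sequence of rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def compare(alpha, beta, gamma, t_max, t_star):
    """Return of never reaching s* (t_max steps of alpha) vs. the fastest
    policy (t_star - 1 steps of alpha, then beta on entering s*)."""
    avoid = discounted_return([alpha] * t_max, gamma)
    fastest = discounted_return([alpha] * (t_star - 1) + [beta], gamma)
    return avoid, fastest

# r_3 from above: alpha=1, beta=2, goal reachable in 5 steps, generous time limit
for gamma in (0.9, 0.1):
    avoid, fastest = compare(alpha=1.0, beta=2.0, gamma=gamma, t_max=10_000, t_star=5)
    better = "avoiding s*" if avoid > fastest else "reaching s* as fast as possible"
    print(f"gamma={gamma}: avoid={avoid:.4f}, fastest={fastest:.4f} -> {better} wins")
```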

With $r_3$, it is easy to construct cases where the fastest policy isn't optimal. Imagine a race along a corridor: the agent starts on the left, receives either $\alpha$ or $\beta$ at each step depending on the cell it is in, and the episode ends at the goal cell $s^*$ on the right. With $\gamma = 0.9$ and no time limit (infinite episodes), it does not matter how the agent moves, except that in the second-to-last cell it should avoid entering the goal. With $\gamma = 0.1$, moving right is slightly (but strictly) better than moving randomly, and in the second-to-last cell the agent should enter the goal.
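Below is a minimal value-iteration sketch of such a corridor (assuming deterministic moves, a reward collected on entering a cell, and termination at $s^*$; the state/action encoding is just one possible choice). It reports the greedy action in the cell next to the goal for $\gamma = 0.9$ and $\gamma = 0.1$.

```python
import numpy as np

# A 1-D corridor with the goal s* at the right end, using the r_3 rewards:
# +1 for entering any non-goal cell, +2 for entering the goal (which ends
# the episode). Transitions are deterministic; moving off the left edge
# keeps the agent in cell 0.

N_STATES = 6              # cells 0..5; cell 5 is the goal s*
GOAL = N_STATES - 1
ACTIONS = (-1, +1)        # move left / move right

def step(s, a):
    return min(max(s + a, 0), GOAL)

def reward(s_next):
    return 2.0 if s_next == GOAL else 1.0

def value_iteration(gamma, n_sweeps=2000):
    V = np.zeros(N_STATES)                 # V(goal) stays 0: terminal cell
    for _ in range(n_sweeps):
        for s in range(GOAL):              # update non-terminal cells only
            V[s] = max(reward(step(s, a)) + gamma * V[step(s, a)]
                       for a in ACTIONS)
    return V

for gamma in (0.9, 0.1):
    V = value_iteration(gamma)
    s = GOAL - 1                           # the second-to-last cell
    best = max(ACTIONS, key=lambda a: reward(step(s, a)) + gamma * V[step(s, a)])
    choice = "enter the goal" if best == +1 else "avoid the goal"
    print(f"gamma={gamma}: greedy action next to s* -> {choice}; V = {np.round(V, 3)}")
```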


BlueMoon93