4

In the context of a Markov decision process, this paper says

it is well-known that the optimal policy is invariant to positive affine transformation of the reward function

On the other hand, exercise 3.7 of Sutton and Barto gives an example of a robot in a maze:

Imagine that you are designing a robot to run a maze. You decide to give it a reward of +1 for escaping from the maze and a reward of zero at all other times. The task seems to break down naturally into episodes—the successive runs through the maze—so you decide to treat it as an episodic task, where the goal is to maximize expected total reward (3.7). After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?

It seems like the robot is not being rewarded for escaping quickly (escaping in 10 seconds gives it just as much reward as escaping in 1000 seconds). One fix seems to be to subtract 1 from each reward, so that each timestep the robot stays in the maze, it accumulates $-1$ in reward, and upon escape it gets zero reward. This seems to change the set of optimal policies (now there are way fewer policies which achieve the best possible return). In other words, a positive affine transformation $r \mapsto 1 \cdot r - 1$ seems to have changed the optimal policy.

How can I reconcile "the optimal policy is invariant to positive affine transformation of the reward function" with the maze example?

nbro
  • 42,615
  • 12
  • 119
  • 217
IssaRice
  • 181
  • 4

3 Answers3

2

This statement:

(it is well-known that the optimal policy is invariant to positive affine transformation of the reward function).

is, as far as I know, and as you summarise, incorrect* because simple translations to reward signal do affect the optimal policy, and the affine transform of a real number $x$ can be given by $f(x) = mx + c$

It is well known that optimal policy is unaffected by multiplying all rewards by a positive scaling, e.g. $f(x) = mx$ where $m$ is positive.

It is also worth noting that if an optimal policy is derived from Q values using $\pi(s) = \text{argmax}_a Q(s,a)$, then that policy function is invariant to positive affine transformations of action values given by $Q(s,a)$. Perhaps that was what the paper authors meant to write, given that they go on to apply normalisation to Q values.

The impact of the mistake is not relevant to the the rest of the paper as far as I can see (caveat: I have not read the whole paper).


* It is possible to make the statement correct even for episodic problems, if:

  • You model episodic problems with an "absorbing state" and treat it as a continuing problem.

  • You apply the same affine transform to the (usually zero reward) absorbing state.

  • You still account for the infinite repeats of the absorbing state (requiring a value of discount factor $\gamma$ less than one). In practice this means either granting an additional reward of $\frac{b}{1-\gamma}$ for ending the episode, or not ending a simulation in the terminal state, but running learning algorithms over repeated time steps whilst still in the terminal state, so they can collect the non-zero reward.

Neil Slater
  • 33,739
  • 3
  • 47
  • 66
1

In framing the problem as an episodic reinforcement learning problem, the goal is to find a policy that optimizes $\mathbb{E}[\sum_{t=0}^\tau r(s_t)],$ where $\tau$ is the random time at which the robot leaves the maze. This implicitly assigns a reward of 0 to the out-of-maze state, $s_{terminal}$. If you include this state then the transformation $r\rightarrow 1\cdot r - 1$ does not change the optimal policy.

If we rewrite this episodic objective accounting for $s_{terminal}$ (and a horizon $H$) we get the following objective: \begin{align*}\mathbb{E}\left[\sum_{t=0}^\tau r(s_t) + \sum_{\tau}^H r(s_{terminal})\right] &= \mathbb{E}\left[(H-\tau) r(s_{terminal}) + \sum_{t=0}^\tau r(s_t)\right]\\ &= \mathbb{E}\left[(H-\tau) r(s_{terminal}) + r(s_{goal}) + (\tau-1) r(s_{maze}) \right]\end{align*}

Where $s_{goal}$ is the exit state from the maze, the goal, and $s_{maze}$ represents the other states of the maze. In the question, 1 is subtracted from $s_{goal}$ and $s_{maze}$, but not $s_{terminal}$. Thus, this is not a positive affine transformation of the reward function. In effect, this changes the relative value of $s_{terminal}$ from $\min_s r(s)$ to $\max_s r(s)$, and that changes the optimal policy.

-1

Neil answer is wrong, Dylan gives the correct one!

Optimal policies are invariant under positive affine transformations of the reward function. and the reason why it's not the case in your example is explained in Dylan's answer.

Reference: From the book Artificial intelligence a modern approach 4th edition 16.1.3

"the scale of utilities is arbitrary: an affine transformation leaves the optimal decision unchanged. We can replace U(s) by U'(s) = m U(s)+ b where m and b are any constants such that m > 0. It is easy to see, from the definition of utilities as discounted sums of rewards, that a similar transformation of rewards will leave the optimal policy unchanged in an MDP"

** Edit:

  • Since the answer has been updated, the answer is fully correct now :)