
In this video, the lecturer states that $R(s)$, $R(s, a)$ and $R(s, a, s')$ are equivalent representations of the reward function. Intuitively, according to the same lecturer, this is the case because $s$ can be made to represent both the state and the action. Furthermore, the Markov decision process apparently changes depending on whether we use one representation or the other.

I am looking for a formal proof that shows that these representations are equivalent. Moreover, how exactly would the Markov decision process change if we use one representation over the other? Finally, when should we use one representation over the other and why are there three representations? I suppose it is because one representation may be more convenient than another in certain cases: which cases? How do you decide which representation to use?

nbro

3 Answers


In general, the different reward functions $R(s)$, $R(s, a)$ and $R(s, a, s')$ are not mathematically equivalent, so you will not find any formal proof.

It is possible for the functions to resolve to the same value in a specific MDP: if, for instance, you use $R(s, a, s')$ and the value returned only depends on $s$, then $R(s, a, s') = R(s)$. This is not true in general, but, since the reward functions are often under your control, it can be the case quite often.

For instance, in scenarios where the agent's goal is to reach some pre-defined state, as in the grid world example from the video, there is no difference between $R(s, a, s')$ and $R(s)$. Given that is the case, for those example problems you may as well use $R(s)$, as it simplifies the expressions you need to work with for algorithms like Q-learning.
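
As a minimal sketch of that reduction, assuming a small grid world with a made-up goal cell (all names and values below are for illustration only):

```python
# Hypothetical grid world: a reward written with the full R(s, a, s')
# signature that in fact only depends on the current state s.
GOAL_STATE = (3, 3)  # made-up goal cell

def reward_sas(s, a, s_next):
    """R(s, a, s'): takes the action and next state, but ignores both."""
    return 1.0 if s == GOAL_STATE else 0.0

def reward_s(s):
    """R(s): the same reward written as a function of the state only."""
    return 1.0 if s == GOAL_STATE else 0.0

# For every possible (s, a, s') triple, reward_sas(s, a, s') == reward_s(s),
# so in this particular MDP the two representations coincide.
```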

I think the lecturer did not mean "equivalent" in the mathematical sense, but in the sense that future lectures will use one of the functions, and a lot of what you will learn is going to be much the same as if you had used a different reward function.

Finally, when should we use one representation over the other and why are there three representations?

Typically, I don't use any of those representations by default. I tend to use Sutton & Barto's $p(s', r|s, a)$ notation for combined state transitions and rewards. That expression gives the probability of transitioning to state $s'$ and receiving reward $r$ when starting in state $s$ and taking action $a$. For discrete problems, you can rewrite the expectations of the different functions $R$ in terms of this distribution as follows:

$$\mathbb{E}[R(s)] = \sum_{a \in \mathcal{A}(s)}\sum_{s' \in \mathcal{S}}\sum_{r \in \mathcal{R}} r\, p(s', r|s, a)\qquad*$$

$$\mathbb{E}[R(s,a)] = \sum_{s' \in \mathcal{S}}\sum_{r \in \mathcal{R}} r\, p(s', r|s, a)$$

$$\mathbb{E}[R(s,a,s')] = \sum_{r \in \mathcal{R}} r\, \frac{p(s', r|s, a)}{p(s'|s, a)}$$

I think this is one way to see how the functions in the video are closely related.
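
As a rough sketch in code, here are the latter two expectations computed from a tabular model (the dictionary `p` below is a made-up example; `p[(s, a)]` holds `(s', r, probability)` entries summing to 1):

```python
# Hypothetical tabular model: p[(s, a)] lists (s_next, r, prob) triples,
# with the probabilities for each (s, a) summing to 1.
p = {
    ("s0", "left"):  [("s0", 0.0, 0.8), ("s1", 1.0, 0.2)],
    ("s0", "right"): [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
}

def expected_reward_sa(s, a):
    """E[R(s, a)] = sum over s', r of r * p(s', r | s, a)."""
    return sum(r * prob for (_, r, prob) in p[(s, a)])

def expected_reward_sas(s, a, s_next):
    """E[R(s, a, s')] = sum over r of r * p(s', r | s, a) / p(s' | s, a)."""
    marginal = sum(prob for (sn, _, prob) in p[(s, a)] if sn == s_next)
    total = sum(r * prob for (sn, r, prob) in p[(s, a)] if sn == s_next)
    return total / marginal

print(expected_reward_sa("s0", "right"))         # 0.9
print(expected_reward_sas("s0", "right", "s1"))  # 1.0
```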

Which one would you use? It depends on what you are doing. If you want to simplify an equation or code, then use the simplest version of the reward function that fits the reward scheme you set up for the goals of the problem. For instance, if there is one goal state to exit a maze, and an episode ends as soon as this happens, then you don't care how you got to that state or what the previous state was, and you can use $R(s)$.

In practice, if you use a different reward function, what changes is where it appears in expressions like the Bellman equation in theoretical treatments. When you implement model-free methods like Q-learning, $R(s)$ and its variants don't really appear except in the theory; the agent works with individual sampled rewards instead.
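
For example, a tabular Q-learning update only ever touches the sampled reward `r` from an observed transition; none of the $R(\cdot)$ functions is referenced directly. A minimal sketch, with made-up hyperparameters:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99   # made-up learning rate and discount factor
Q = defaultdict(float)     # Q[(state, action)], defaults to 0.0

def q_learning_update(s, a, r, s_next, actions):
    """One Q-learning step: only the observed reward r is used, never R(s), R(s, a), ..."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

# Usage: after observing a transition (s, a, r, s'), e.g.
q_learning_update("s0", "right", 1.0, "s1", actions=["left", "right"])
```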


* This is not technically correct in all cases. The assumption I have made is that $R(s)$ is a reward granted at the point of leaving state $s$, and is independent of how the state is left and where the agent ends up next.

If instead this were a fixed reward for entering state $s'$, regardless of how it was entered, then it could be written in terms of $R(s')$ as follows:

$$\mathbb{E}[R(s')] = \sum_{s \in \mathcal{S}}\sum_{a \in \mathcal{A}(s)}\sum_{r \in \mathcal{R}} r\, p(s', r|s, a)$$

i.e., by summing over the rewards of all transitions that end up in $s'$.

Neil Slater

Let $R(s)$ denote a probability distribution over rewards that our agent may get in some MDP as a reward for entering a state $s$. The easiest case is to demonstrate that we can also choose to write this as $R(s, a)$ or $R(s, a, s')$: simply take $\forall a: R(s, a) = R(s)$, or $\forall a \forall s': R(s, a, s') = R(s)$, as also described in Neil's answer.


Let $R(s, a)$ denote a probability distribution over rewards that our agent may get as a reward for executing action $a$ in state $s$. The easy case of demonstrating equivalence to $R(s, a, s')$ is already handled above, but can we also construct an MDP in which we only use the $R(s)$ notation?

The easiest way I can think of to do so (may not be the cleanest way) would be to construct a new MDP with a bunch of "dummy" states $z(s, a)$, such that executing action $a$ in state $s$ in the original MDP deterministically leads to a dummy state $z(s, a)$ in the new MDP. Note that I write $z(s, a)$ to make the connection back to the original MDP explicit, but this is a completely independent MDP and you should view it as just a state "$z$".

Then, the reward distribution $R(s, a)$ that was associated with the state-action pair $(s, a)$ in the original MDP can be written as $R(z(s, a))$, which is now only a function of the state in the new MDP. In this dummy state $z(s, a)$ in the new MDP, every possible action $\alpha$ should have exactly the same transition probabilities towards new states $s'$ as the original transition probabilities for executing $a$ in $s$ back in the original MDP. This guarantees that the same policy has the same probabilities of reaching certain states in both MDPs; only in our new MDP the agent is forced to transition through these dummy states in between.

If you also have a discount factor $\gamma$ in the original MDP, I guess you should use a discount factor $\sqrt{\gamma}$ in the new MDP, because every step in the original MDP requires two steps (one step into a dummy state, and one step out of it again) in the new MDP.
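
A rough sketch of this construction for a tiny tabular MDP, assuming made-up dictionaries `T` (with `T[(s, a)]` mapping next states to probabilities) and `R` (with `R[(s, a)]` the reward for the pair):

```python
import math

# Hypothetical original MDP with an R(s, a) reward function.
T = {("s0", "go"): {"s1": 1.0}, ("s1", "go"): {"s0": 0.5, "s1": 0.5}}
R = {("s0", "go"): 1.0, ("s1", "go"): 0.0}
GAMMA = 0.9

def build_dummy_state_mdp(T, R, gamma):
    """Rewrite an R(s, a) MDP as an R(state-only) MDP using dummy states z(s, a)."""
    actions = {a for (_, a) in T}
    new_T, new_R = {}, {}
    for (s, a), next_probs in T.items():
        z = ("z", s, a)                 # dummy state standing in for the pair (s, a)
        new_R[z] = R[(s, a)]            # reward now depends on the (dummy) state only
        new_R.setdefault(s, 0.0)        # original states carry no reward of their own
        new_T[(s, a)] = {z: 1.0}        # taking a in s leads deterministically to z
        for alpha in actions:           # every action in z copies the original dynamics
            new_T[(z, alpha)] = dict(next_probs)
    return new_T, new_R, math.sqrt(gamma)  # two new steps per original step

new_T, new_R, new_gamma = build_dummy_state_mdp(T, R, GAMMA)
```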


The final case for $R(s, a, s')$ could be done in a very similar way, but it would get even more complicated to write out formally. The intuition would be the same though. Above, we pretty much "baked" state-action pairs $(s, a)$ from the original MDP into additional dummy states, such that in the new MDP we have states that "carry the same amount of information" as a full state-action pair in the original MDP. For the $R(s, a, s')$ case, you'd need to devise an even uglier solution with even more information "baked into" dummy states, such that you can treat full $(s, a, s')$ triples as single $z(s, a, s')$ states in the new MDP.


Finally, when should we use one representation over the other and why are there three representations? I suppose it is because one representation may be more convenient than another in certain cases: which cases? How do you decide which representation to use?

I would recommend always using the simplest representation that happens to be sufficient to describe how the rewards work in your environment in a natural way.

For example, if you have a two-player zero-sum game where terminal game states give a reward of $-1$, $0$, or $1$ for losses, draws, or wins, it is sufficient to use the $R(s)$ notation; the reward depends on the terminal game state reached, not on how it was reached. Another example would be a maze with a specific goal position, as described in Neil's answer. You could use the more complex $R(s, a)$ or $R(s, a, s')$ notations... but there wouldn't be much of a point in doing so really.

If you have an environment in which both the state and the chosen action influence the reward distribution, then it's much more sensible to just use the $R(s, a)$ notation rather than trying to define a massively overcomplicated MDP with dummy states, as I've tried to do above. An example would be... let's say we're playing a quiz, where $s$ denotes the current question, and the different actions $a$ are the different answers that the agent can give. Then it's natural to model the problem as an MDP where $R(s, a)$ is only positive if $a$ is the correct answer to the question $s$.
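
A tiny sketch of such an $R(s, a)$ reward function (the questions and answers below are invented for illustration):

```python
# Hypothetical quiz MDP: states are questions, actions are candidate answers.
CORRECT_ANSWER = {"q_capital_of_france": "paris", "q_2_plus_2": "4"}

def reward_sa(s, a):
    """R(s, a): positive only if action a is the correct answer to question s."""
    return 1.0 if CORRECT_ANSWER.get(s) == a else 0.0

print(reward_sa("q_2_plus_2", "4"))  # 1.0
print(reward_sa("q_2_plus_2", "5"))  # 0.0
```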

Dennis Soemers

I like to think of the equivalence of these formulations a bit differently. Intuitively, for any reward function $R(s,a,s')$ you can always build an equivalent MDP that uses the $R(s)$ notation, in which rewards arrive one timestep later than in the original MDP; the $R(s,a)$ case then follows naturally.


Formal derivation

Let $M$ be an MDP with reward function $R(s,a,s')$ and transition function $T(s,a,s')$. We will now convert this into an MDP $\hat{M}$ with a reward function of the form $R(s)$ and show that any trajectory in $M$ has the same discounted return in $\hat{M}$.

Let the states of $\hat{M}$ be $\hat{s}_t = (s_{t-1}, a_{t-1}, s_t)$ and let the reward be $\hat{R}(\hat{s}_t)=\frac{1}{\gamma} R(s_{t-1}, a_{t-1}, s_t)$. To handle the edge cases, assume that $\hat{R}(\hat{s}_0)=0$ and add, after every terminal state $s_T$, a transition to an absorbing state $\hat{s}_{T+1}$ so that the final reward can be collected. We then have

$$\sum_{t=0}^{T} \gamma^{t} \hat{R}(\hat{s}_{t}) = \sum_{t=1}^{T} \gamma^{t} \frac{1}{\gamma} R(s_{t-1}, a_{t-1}, s_{t}) = \sum_{t=0}^{T-1} \gamma^{t} R(s_{t}, a_{t}, s_{t+1}).$$

This implies that, under the new MDP $\hat{M}$, all trajectories still have the same discounted return. The only conceptual differences to note are that the reward is collected one timestep later than in $M$ and that the policy now has access to $(s_{t-1}, a_{t-1})$ when making its decision, neither of which leaks any useful or actionable information about what to do in state $s_t$, by the assumption that $T$ and $R$ are Markovian in $M$.
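
A small numerical sketch of this return equivalence, with a made-up trajectory, reward function and discount factor:

```python
GAMMA = 0.9  # made-up discount factor

def R_sas(s, a, s_next):
    """Hypothetical R(s, a, s') of the original MDP M."""
    return 1.0 if (s, a, s_next) == ("s1", "go", "s2") else 0.5

# A made-up trajectory in M: s0 --go--> s1 --go--> s2
trajectory = [("s0", "go", "s1"), ("s1", "go", "s2")]

# Return in M: sum_t gamma^t * R(s_t, a_t, s_{t+1}).
return_M = sum(GAMMA**t * R_sas(s, a, sn) for t, (s, a, sn) in enumerate(trajectory))

# Return in M_hat: the reward for (s_t, a_t, s_{t+1}) is collected one step later,
# at the augmented state (s_t, a_t, s_{t+1}), and rescaled by 1/gamma.
return_M_hat = sum(GAMMA**(t + 1) * R_sas(s, a, sn) / GAMMA
                   for t, (s, a, sn) in enumerate(trajectory))

assert abs(return_M - return_M_hat) < 1e-12  # identical discounted returns
```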

The same argument applies to an $\hat{R}(\hat{s},a)$ reshaping instead, by setting $\hat{R}(\hat{s},a) = \hat{R}(\hat{s})$ as above.


I personally find this way of showing the equivalence more satisfying, because it does not require the rewards to be deterministic given $s$ or $(s,a)$ in the original MDP $M$, as happens when you take the expectation over next states. This way you truly keep the property that, for instance, the way in which you reached state $s'$ changes your overall reward.