
Sutton and Barto define the state–action–next-state reward function, $r(s, a, s')$, as follows (equation 3.6, p. 49):

$$ r(s, a, s^{\prime}) \doteq \mathbb{E}\left[R_{t} \mid S_{t-1}=s, A_{t-1}=a, S_{t}=s^{\prime}\right]=\sum_{r \in \mathcal{R}} r \frac{p(s^{\prime}, r \mid s, a )}{\color{red}{p(s^{\prime} \mid s, a)}} $$

Why is the term $p(s' \mid s, a)$ required in this definition? Shouldn't the correct formula be $\sum_{r \in \mathcal{R}} r p(s^{\prime}, r \mid s, a )$?


2 Answers


The expectation of the reward after taking action $a$ in state $s$ and ending up in state $s'$ would simply be

\begin{equation} r(s, a, s') = \sum_{r \in \mathcal{R}} r \cdot p(r \mid s, a, s') \end{equation}

The problem with this is that Sutton and Barto do not define a probability distribution for the rewards separately; they use the joint distribution $p(s', r \mid s, a)$, which is the probability of ending up in state $s'$ with reward $r$ after taking action $a$ in state $s$. By the product rule, this probability can be separated into two parts:

\begin{equation} p(s', r|s, a) = p(s'|s, a)\cdot p(r|s', s, a) \end{equation}

which are the probability of getting to state $s'$ from $(s, a)$, and then the probability of getting reward $r$ after ending up in $s'$.

If we define reward expectation through the joint distribution, we would have

\begin{align} r(s, a, s') &= \sum_{r \in \mathcal{R}} r \cdot p(s', r \mid s, a)\\ &= \sum_{r \in \mathcal{R}} r \cdot p(s' \mid s, a) \cdot p(r \mid s', s, a) \end{align}

but this would not be correct, since the extra factor $p(s' \mid s, a)$ remains; dividing by it leaves only $p(r \mid s', s, a)$, which is exactly the conditional distribution the expectation requires.

So, in the end we have

\begin{equation} r(s, a, s') = \sum_{r \in \mathcal{R}} r \frac{p(s', r \mid s, a)}{p(s' \mid s, a)} \end{equation}
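To see why the division matters, here is a small numeric sketch for a single fixed $(s, a)$ pair; the joint distribution $p(s', r \mid s, a)$ below is made up purely for illustration:

```python
# Hypothetical joint distribution p(s', r | s, a) for one fixed (s, a):
# keys are (next_state, reward) pairs, values are probabilities (sum to 1).
joint = {
    ("s1", 0.0): 0.2,
    ("s1", 1.0): 0.3,
    ("s2", 0.0): 0.1,
    ("s2", 5.0): 0.4,
}

def p_next(s_next):
    """Marginal p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(p for (sn, _), p in joint.items() if sn == s_next)

def r_sas(s_next):
    """r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)."""
    return sum(r * p for (sn, r), p in joint.items() if sn == s_next) / p_next(s_next)

print(r_sas("s1"))  # 0.3 / 0.5 = 0.6
print(r_sas("s2"))  # 2.0 / 0.5 = 4.0
```

Without the division, the sums would be $0.3$ and $2.0$, which are contributions to the overall expected reward $\mathbb{E}[R_t \mid s, a]$, not conditional expectations given the next state.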


$\frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$ represents the probability of observing reward $r$, given that state $s'$ is the next state transitioned to. The equation allows a probability distribution over rewards $r$ for each next state $s'$, meaning that a different reward might be observed on different transitions from $s$ to $s'$. In the special case where the reward is a deterministic function of $(s, a, s')$, we have $p(s', r \mid s, a) = p(s' \mid s, a)$ for that reward.
