This is for Exercise 3.11 from Sutton and Barto's book.
If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?
The question already has an answer here. My approach is slightly different, as I did not know the tower rule (which David's answer is based on) when I started this exercise. I would appreciate some input on whether my solution is valid.
According to the rules of conditional probability with multiple conditions (worked out in detail in this answer), we have
$$ P(A|B,\theta) = \frac{P(A,B | \theta)}{P(B|\theta)}$$
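For reference, here is a short sketch of why this identity holds, assuming $P(B,\theta) > 0$ so that every conditional below is defined. It only uses the definition of conditional probability, applied once to the joint event and once to each term conditioned on $\theta$:
$$ P(A|B,\theta) = \frac{P(A,B,\theta)}{P(B,\theta)} = \frac{P(A,B|\theta)\,P(\theta)}{P(B|\theta)\,P(\theta)} = \frac{P(A,B|\theta)}{P(B|\theta)} $$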
Applying this to the four-argument function, $p(s',r|s,a) = p(s',r|a,s)$, we have:
$$ p(s',r|s,a) = \frac{p(s',r,a|s)}{p(a|s)} $$
We recognize that the denominator on the RHS above is the policy, $p(a|s) = \pi(a|s)$. Substituting and rearranging, we get: $$ p(s',r,a|s) = p(s',r|s,a) \, \pi(a|s) $$
We get $p(r|s)$ by summing the LHS over $s'$ and $a$:
$$ p(r|s) = \sum_{s' \in S} \sum_{a \in A} p(s',r,a|s) = \sum_{s' \in S} \sum_{a \in A} p(s',r|s,a) \, \pi(a|s) $$
We compute the expectation as $$ \mathbf{E}[R_{t+1}|S_t = s] = \sum_{r \in R} r \, p(r|S_t = s) $$
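The time subscripts follow the book's convention: $R_{t+1}$ is the reward received after acting in state $S_t$. Here $p(r|S_t = s)$ is the same quantity as $p(r|s)$ above.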
Substituting the expression for $p(r|s)$ from the previous equation gives: $$ \mathbf{E}[R_{t+1} | S_t = s] = \sum_{r \in R} r \sum_{s' \in S} \sum_{a \in A} p(s',r|s,a) \, \pi(a|s) $$
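As a sanity check, the result can be compared numerically against the tower-rule form, $\sum_a \pi(a|s) \sum_{s',r} r \, p(s',r|s,a)$. The sketch below is only an illustration on a made-up MDP: the sizes, the reward values, and the random `p` and `pi` arrays are all assumptions of mine, not anything from the book.

```python
import numpy as np

# Hypothetical toy MDP: sizes and reward values are arbitrary assumptions.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
rewards = np.array([-1.0, 0.0, 1.0, 2.0])  # the finite reward set R

# p[s, a, s_next, r_idx] plays the role of p(s', r | s, a);
# normalise so that each (s, a) pair sums to 1 over (s', r).
p = rng.random((n_states, n_actions, n_states, rewards.size))
p /= p.sum(axis=(2, 3), keepdims=True)

# pi[s, a] plays the role of pi(a | s); each row sums to 1.
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)

# p(r | s) = sum_{s'} sum_a p(s', r | s, a) * pi(a | s)
p_r_given_s = np.einsum("sanr,sa->sr", p, pi)

# E[R_{t+1} | S_t = s] = sum_r r * p(r | s)
expected_reward = p_r_given_s @ rewards

# Tower-rule form: sum_a pi(a | s) * sum_{s', r} r * p(s', r | s, a)
r_sa = np.einsum("sanr,r->sa", p, rewards)
expected_reward_tower = (pi * r_sa).sum(axis=1)

print(np.allclose(expected_reward, expected_reward_tower))
```

Both forms are the same triple sum taken in a different order, so they should agree for any valid `p` and `pi`, up to floating-point error.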