9

Problem 1 of the homework for the Berkeley RL class asks you to show that the policy gradient is still unbiased if the subtracted baseline is a function of the state at time step $t$, i.e. that

$$ \nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t)} [b(s_t)] = 0 $$

I am struggling with what the first step of such a proof might be.

Can someone point me in the right direction? My initial thought was to somehow use the law of total expectation to make the expectation of $b(s_t)$ conditional on $T$, but I am not sure.

nbro
Laura C

2 Answers

7

Using the law of iterated expectations, one has:

$\nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t)} [b(s_t)] = \nabla_\theta \sum_{t=1}^T \mathbb{E}_{s_t \sim p(s_t)} \left[ \mathbb{E}_{a_t \sim \pi_\theta(a_t | s_t)} \left[ b(s_t) \right]\right] =$

Written with integrals, and moving the gradient inside (by linearity of the sum and the integrals), you get

$= \sum_{t=1}^T \int_{s_t} p(s_t) \left(\int_{a_t} \nabla_\theta \left[ b(s_t) \pi_\theta(a_t | s_t) \right] da_t \right)ds_t =$

You can now pull $b(s_t)$ out of the gradient and out of the inner integral (it depends neither on $\theta$ nor on $a_t$), and move $\nabla_\theta$ from the inner integral to the outer one (exchanging differentiation and integration):

$= \sum_{t=1}^T \int_{s_t} p(s_t) b(s_t) \nabla_\theta \left(\int_{a_t} \pi_\theta(a_t | s_t) da_t \right)ds_t= $

$\pi_\theta(a_t | s_t)$ is a (conditional) probability density function, so integrating over all $a_t$ for a given fixed state $s_t$ equals $1$:

$= \sum_{t=1}^T \int_{s_t} p(s_t) b(s_t) \nabla_\theta 1 ds_t = $

Now $\nabla_\theta 1 = 0$, which concludes the proof.
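
Not part of the proof above, but as a quick numerical sanity check you can verify the underlying identity $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right] = 0$ by Monte Carlo for a toy softmax policy. The parameterization and the baseline value in this sketch are arbitrary illustrative choices:

```python
import numpy as np

# Quick Monte Carlo sanity check (illustrative, not from the course material):
# for a fixed state s and any baseline value b(s), the expectation of
# grad_theta log pi_theta(a|s) * b(s) over a ~ pi_theta(.|s) is zero,
# because E_a[grad_theta log pi_theta(a|s)] = 0.

rng = np.random.default_rng(0)
n_actions = 4
theta = rng.normal(size=n_actions)  # logits of a softmax policy for one fixed state
b_s = 3.7                           # arbitrary baseline value b(s) for that state

def softmax(z):
    z = z - z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

pi = softmax(theta)

def grad_log_pi(a):
    # For a softmax policy: d/d theta_k log pi(a) = 1{k == a} - pi_k
    g = -pi.copy()
    g[a] += 1.0
    return g

n_samples = 200_000
actions = rng.choice(n_actions, size=n_samples, p=pi)
estimate = np.mean([grad_log_pi(a) * b_s for a in actions], axis=0)

print("Monte Carlo estimate of E[grad log pi * b(s)]:", estimate)
# The printed vector should be close to zero, up to sampling noise.
```

The estimate stays near the zero vector no matter what value you pick for `b_s`, which is exactly the unbiasedness being proved.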

nbro
1

It appears that the homework was due two days before this answer was written, but in case it is still useful, the relevant class notes (which would have been helpful if linked in the question along with the homework) are here.

The first task posed to the student is, "Please show equation 12 by using the law of iterated expectations, breaking $\mathbb{E}_{\tau \sim p_\theta(\tau)}$ by decoupling the state-action marginal from the rest of the trajectory." Equation 12 is this.

$\sum_{t = 1}^{T} \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[\nabla_\theta \log \pi_\theta(a_t|s_t)\, b(s_t)\right] = 0$

The class notes identify $\pi_\theta(a_t|s_t)$ as the state-action marginal. What is sought is not a proof but a sequence of algebraic steps that perform the decoupling and show the degree to which independence from the state-action marginal can be achieved.
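
A minimal sketch of that decoupling, assuming the standard notation $p_\theta(s_t, a_t)$ for the state-action marginal at time $t$ (the notation here is an assumption, not quoted from the notes):

$$
\begin{aligned}
\mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right]
&= \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] \\
&= \mathbb{E}_{s_t \sim p_\theta(s_t)}\!\left[ b(s_t)\, \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]\right] \\
&= \mathbb{E}_{s_t \sim p_\theta(s_t)}\!\left[ b(s_t) \cdot 0 \right] = 0,
\end{aligned}
$$

since $\mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = \int_{a_t} \pi_\theta(a_t \mid s_t)\, \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, da_t = \nabla_\theta \int_{a_t} \pi_\theta(a_t \mid s_t)\, da_t = \nabla_\theta 1 = 0$, which is the same integral argument the other answer spells out.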

This exercise is a preparation for the next step in the homework and draws only on the review of CS189, Berkeley's Introduction to Machine Learning course, which does not contain the Law of Total Expectation in its syllabus or class notes.

All the relevant information is in the class notes linked above, and the exercise requires only intermediate algebra.

Douglas Daseeco