
If I understand it correctly from the following equation

$$U(\theta)=\mathbb{E}_{\tau \sim P(\tau;\theta)}\left [ \sum_{t=0}^{H-1}R(s_t,u_t);\pi_{\theta} \right ]=\sum_{\tau}P(\tau;\theta)R(\tau)$$

from this paper, the utility of a policy parameterized by weights $\theta$ is the total expected reward over (all? a single?) trajectory $\tau$. After rearranging, taking the gradient, and running the policy over $m$ trials, the equation becomes

$$\nabla_{\theta}U(\theta)\approx \hat{g} = \frac{1}{m}\sum_{i=1}^{m}\nabla_{\theta}\log P(\tau^{(i)};\theta)\,R(\tau^{(i)}).$$

My question is: how does this work? I can't understand it in terms of the utility/loss surface. At most, doesn't taking multiple trials just give you a more accurate estimate of the single point $(\theta, U(\theta))$? I don't get how you can obtain the gradient (a sort of local tangent plane of a surface) without evaluating the function at different $\theta$'s. For instance, how do you get the slope at a point, say $(2, 3)$, without at least also evaluating the function at $2.01$, $1.99$, etc.? Is it by virtue of the well-defined geometry of the function $U$ we defined?


1 Answer


From the equation you wrote:

  • $U(\theta)$ is the total expected reward of a policy $\pi_\theta$ (with parameters/weights $\theta$) over all possible trajectories $\tau$, sampled under the environment dynamics while following the policy itself: the term $P(\tau;\theta)$ includes both the environment dynamics (i.e., how the state evolves at each step given an action), usually denoted $P(s'\mid s, a)$ where $s'$ is the next state, and the fact that actions are sampled from the policy, $a\sim \pi_\theta(s)$, in state $s$.
  • For practical purposes, you can't take the expectation over all possible trajectories, so you run the policy over $m$ of them and approximate the expectation (an integral/sum over all trajectories) with an average over a finite number of sampled ones. Thus, $\hat g = \frac{1}{m}\sum_{i=1}^m\nabla_\theta \log P(\tau^{(i)};\theta)R(\tau^{(i)})$ (the factorization of $P(\tau;\theta)$ is spelled out just after this list).
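
To make $P(\tau;\theta)$ and its gradient concrete, here is the usual MDP factorization (this is the standard derivation, not a quote from the paper; I keep the paper's $u_t$ notation for actions):

$$P(\tau;\theta)=P(s_0)\prod_{t=0}^{H-1}\pi_\theta(u_t\mid s_t)\,P(s_{t+1}\mid s_t,u_t) \quad\Rightarrow\quad \nabla_\theta \log P(\tau;\theta)=\sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta(u_t\mid s_t),$$

because the initial-state and dynamics terms do not depend on $\theta$ and drop out of the gradient. So the estimator only ever needs gradients of the policy's own log-probabilities.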

My question is how does this work?

Basically, the formula that estimates the gradient $\hat g$ makes use of what is called a score function estimator (also known as the likelihood-ratio or REINFORCE estimator, after the well-known algorithm). It works by computing the (log-)probability of each sampled trajectory, scoring that trajectory by its total (discounted) reward $R(\tau^{(i)})$, and finally approximating the expectation with a simple average over the $m$ trajectories.
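
The step that makes sampling possible is the log-derivative (likelihood-ratio) trick, $\nabla_\theta P = P\,\nabla_\theta \log P$; applying it to the sum form of $U(\theta)$ above gives (standard derivation, summarized here for completeness):

$$\nabla_\theta U(\theta)=\sum_{\tau}\nabla_\theta P(\tau;\theta)\,R(\tau)=\sum_{\tau}P(\tau;\theta)\,\nabla_\theta \log P(\tau;\theta)\,R(\tau)=\mathbb{E}_{\tau\sim P(\tau;\theta)}\big[\nabla_\theta \log P(\tau;\theta)\,R(\tau)\big].$$

The right-hand side is again an expectation over trajectories sampled from the current policy, so it can be approximated by the sample average $\hat g$ without ever differentiating $U$ numerically.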

The policy gradient estimate is thus a sum of log-probability gradients weighted by the reward: the log-prob term has the intuitive reading that the update increases the probability of trajectories that achieve a high $R(\tau)$, while decreasing the probability of those for which $R(\tau)$ is poor.

doesn't taking multiple trials give you a more accurate guess of the point $(\theta,U(\theta))$ alone?

Increasing $m$ provides a better estimate $\hat g$ of the true gradient. Indeed, as $m\to\infty$ the sample average converges to the full expectation and you recover the true gradient $g$; but that is not feasible in practice.
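
As a side note (standard Monte Carlo reasoning, not something the original answer states explicitly): the estimator is unbiased, and for independently sampled trajectories its variance shrinks linearly in the number of samples,

$$\mathbb{E}[\hat g]=\nabla_\theta U(\theta),\qquad \operatorname{Var}(\hat g_j)=\frac{1}{m}\operatorname{Var}\!\big(\big[\nabla_\theta \log P(\tau;\theta)\,R(\tau)\big]_j\big)\ \text{ for each component } j,$$

which is why in practice $m$ trades off gradient noise against compute per update.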

I don't get how you can get the gradient without evaluating the function at different $\theta$'s

At each update/optimization step the policy is only evaluated at the current $\theta$: those weights define both the point at which the gradient is computed and the policy itself, since the policy is a parametrized function approximator (like a neural network). Because $\log \pi_\theta(u\mid s)$ is a differentiable function of $\theta$, its gradient $\nabla_\theta \log \pi_\theta(u\mid s)$ can be computed analytically (e.g., by backpropagation) at that single $\theta$. So there is no need for finite differences, i.e., no need to know/evaluate $U$ in the surroundings of $\theta$.
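
To make that last point concrete, here is a minimal REINFORCE sketch in PyTorch (my own illustration, not code from the paper or this answer; the environment functions `env_reset`/`env_step`, the network sizes, and the hyperparameters are all hypothetical placeholders). It shows that each update only ever evaluates the policy at the current $\theta$: trajectories are sampled with it, and backpropagation through $\sum_t \log\pi_\theta(u_t\mid s_t)\,R(\tau)$ yields $\hat g$ at that single point, with no finite-difference probing of nearby $\theta$'s.

```python
# Minimal REINFORCE sketch (assumed setup, not the paper's code).
# Point illustrated: the gradient comes from autograd at the CURRENT theta,
# via sum_t log pi_theta(u_t|s_t) * R(tau), never from finite differences.
import torch
import torch.nn as nn

obs_dim, n_actions, horizon, m = 4, 2, 20, 16   # hypothetical sizes

policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def env_reset():
    # Placeholder environment: a real task (e.g. CartPole) would go here.
    return torch.randn(obs_dim)

def env_step(state, action):
    # Placeholder dynamics and reward; they do NOT depend on theta,
    # which is why they drop out of grad log P(tau; theta).
    next_state = torch.randn(obs_dim)
    reward = float(action == 0)                  # dummy reward signal
    return next_state, reward

for update in range(100):
    log_probs, returns = [], []
    for _ in range(m):                           # sample m trajectories with the current theta
        s, traj_log_prob, traj_return = env_reset(), 0.0, 0.0
        for t in range(horizon):
            dist = torch.distributions.Categorical(logits=policy(s))
            a = dist.sample()                    # action sampled from pi_theta(.|s)
            traj_log_prob = traj_log_prob + dist.log_prob(a)  # accumulates sum_t log pi_theta(a_t|s_t)
            s, r = env_step(s, a)
            traj_return += r                     # R(tau): undiscounted sum of rewards here
        log_probs.append(traj_log_prob)
        returns.append(traj_return)

    returns = torch.tensor(returns)
    # Surrogate loss whose gradient equals -g_hat = -(1/m) sum_i grad log P(tau_i; theta) R(tau_i)
    loss = -(torch.stack(log_probs) * returns).mean()

    optimizer.zero_grad()
    loss.backward()                              # autograd: gradient at the single current theta
    optimizer.step()
```

The surrogate loss is chosen precisely so that its autograd gradient equals $-\hat g$; the optimizer then takes a step along $\hat g$, and the next update repeats the whole procedure at the new $\theta$.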

Luca Anzalone