
I am learning about policy gradient methods from the Deep RL Bootcamp by Pieter Abbeel, and I am a bit stumped by the math presented. In the lecture, he derives the gradient of the log-likelihood of a trajectory to be

$$\nabla_{\theta} \log P(\tau^{(i)}; \theta) = \sum_{t=0} \nabla_{\theta} \log \pi(a_t \mid s_t, \theta).$$

Is $\pi(a_t \mid s_t, \theta)$ a distribution or a function? A derivative can only be taken with respect to the input of a function. My understanding is that $\pi(a_t \mid s_t, \theta)$ usually represents a distribution over actions given a state: the input of a policy-gradient neural network would be $s_t$, and the output would be $\pi(a_t \mid s_t)$, computed using the model weights $\theta$.
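Concretely, here is how I picture the setup (a minimal sketch; `PolicyNet` and the sizes are made up for illustration):

```python
import torch
import torch.nn as nn

# A tiny policy network: the state goes in, a distribution over
# actions comes out. The network's weights are the parameters theta.
class PolicyNet(nn.Module):
    def __init__(self, state_dim=4, n_actions=2):  # made-up sizes
        super().__init__()
        self.fc = nn.Linear(state_dim, n_actions)

    def forward(self, s):
        # pi(. | s, theta): probabilities over actions for state s
        return torch.softmax(self.fc(s), dim=-1)
```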


1 Answer


First, the derivative is usually taken with respect to a variable (an input) of a function; hence the notation $\frac{df}{dx}$ for a function $f(x)$.

If you look at your equation more carefully,

$$\nabla_{\theta} \log P(\tau^{(i)}; \theta) = \sum_{t=0} \nabla_{\theta} \log \pi(a_t \mid s_t, \theta),$$

you will see that the gradient is taken with respect to $\theta$ (that is the meaning of $\nabla_{\theta}$), where $\theta$ is the vector of parameters, e.g. of your neural network.

In this case, it doesn't really matter whether $\pi$ represents a distribution or not (for some specific value of $\theta$), but you're right that $\pi$ often represents a probability distribution over the possible actions, given a specific state. In any case, $\pi$ is a function of the parameters $\theta$: in the case of a distribution, $\pi_{\theta}$ is a family of distributions, one for each possible value of $\theta$. If you change $\theta$, the outputs of $\pi$ also change, so you can take the derivative of $\pi$ (or of $\log \pi$) with respect to $\theta$.
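To make this concrete, here is a minimal sketch (the linear-softmax policy and the shapes are just assumptions for illustration) that uses autograd to take exactly this gradient:

```python
import torch

# Assumed toy shapes: a linear-softmax policy with parameters theta.
state_dim, n_actions = 4, 3
theta = torch.randn(n_actions, state_dim, requires_grad=True)

s_t = torch.randn(state_dim)           # an observed state
logits = theta @ s_t                   # the logits are a function of theta
log_pi = torch.log_softmax(logits, 0)  # log pi(. | s_t, theta)

a_t = 1                                # a sampled action (index)
log_pi[a_t].backward()                 # grad_theta log pi(a_t | s_t, theta)

print(theta.grad)  # well-defined, because pi depends on theta
```

For a fixed $\theta$, `log_pi[a_t]` is just a number; its gradient tells you how that number changes as you perturb $\theta$, which is exactly the quantity the policy gradient sums over time steps.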
