
I would like to understand the difference between the standard policy gradient theorem and the deterministic policy gradient theorem. These two theorems are quite different, although the only difference is whether the policy function is deterministic or stochastic. I summarized the relevant steps of the theorems below. The policy function is $\pi$, which has parameters $\theta$.

Standard Policy Gradient

$$ \begin{aligned} \dfrac{\partial V}{\partial \theta} &= \dfrac{\partial}{\partial \theta} \left[ \sum_a \pi(a|s) Q(s,a) \right] \\ &= \sum_a \left[ \dfrac{\partial \pi(a|s)}{\partial \theta} Q(s,a) + \pi(a|s) \dfrac{\partial Q(s,a)}{\partial \theta} \right] \\ &= \sum_a \left[ \dfrac{\partial \pi(a|s)}{\partial \theta} Q(s,a) + \pi(a|s) \dfrac{\partial}{\partial \theta} \left[ R + \gamma \sum_{s'} p(s'|s,a) V(s') \right] \right] \\ &= \sum_a \left[ \dfrac{\partial \pi(a|s)}{\partial \theta} Q(s,a) + \pi(a|s) \, \gamma \sum_{s'} p(s'|s,a) \dfrac{\partial V(s')}{\partial \theta} \right] \end{aligned} $$ When one now expands next period's value function $V(s')$ again, one eventually reaches the final policy gradient: $$ \dfrac{\partial J}{\partial \theta} = \sum_s \rho(s) \sum_a \dfrac{\partial \pi(a|s)}{\partial \theta} Q(s,a), $$ with $\rho$ being the stationary state distribution. What I find particularly interesting is that there is no derivative of $R$ with respect to $\theta$, nor of the transition probability $p(s'|s,a)$ with respect to $\theta$. The derivation of the deterministic policy gradient theorem is different:
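For concreteness, the unrolling step can be sketched as follows (my notation, roughly following the proof in Sutton & Barto, where $\Pr(s \to x, k, \pi)$ denotes the probability of reaching state $x$ from $s$ in $k$ steps under $\pi$): $$ \dfrac{\partial V(s)}{\partial \theta} = \sum_{x} \sum_{k=0}^{\infty} \gamma^k \Pr(s \to x, k, \pi) \sum_a \dfrac{\partial \pi(a|x)}{\partial \theta} Q(x,a), $$ and it is the weights $\gamma^k \Pr(s \to x, k, \pi)$ that give rise to the state distribution $\rho$ in the final formula.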

Deterministic Policy Gradient Theorem

$$ \begin{aligned} \dfrac{\partial V}{\partial \theta} &= \dfrac{\partial}{\partial \theta} Q(s,\pi(s)) \\ &= \dfrac{\partial}{\partial \theta} \left[ R(s, \pi(s)) + \gamma \sum_{s'} p(s'|s,\pi(s)) V(s') \right] \\ &= \dfrac{\partial R(s, a)}{\partial a}\dfrac{\partial \pi(s)}{\partial \theta} + \dfrac{\partial}{\partial \theta} \left[\gamma \sum_{s'} p(s'|s,\pi(s)) V(s') \right] \\ &= \dfrac{\partial R(s, a)}{\partial a}\dfrac{\partial \pi(s)}{\partial \theta} + \gamma \sum_{s'} \left[p(s'|s,\pi(s)) \dfrac{\partial V(s')}{\partial \theta} + \dfrac{\partial \pi(s)}{\partial \theta} \dfrac{\partial p(s'|s,a)}{\partial a} V(s') \right] \\ &= \dfrac{\partial \pi(s)}{\partial \theta} \dfrac{\partial}{\partial a} \left[ R(s, a) + \gamma \sum_{s'} p(s'|s,a) V(s') \right] + \gamma \sum_{s'} p(s'|s,\pi(s)) \dfrac{\partial V(s')}{\partial \theta} \\ &= \dfrac{\partial \pi(s)}{\partial \theta} \dfrac{\partial Q(s, a)}{\partial a} + \gamma \sum_{s'} p(s'|s,\pi(s)) \dfrac{\partial V(s')}{\partial \theta} \end{aligned} $$ Again, one can obtain the final policy gradient by expanding next period's value function. The policy gradient is: $$ \dfrac{\partial J}{\partial \theta} = \sum_s \rho(s) \dfrac{\partial \pi(s)}{\partial \theta} \dfrac{\partial Q(s,a)}{\partial a}\Big|_{a=\pi(s)}. $$ In contrast to the standard policy gradient, the equations contain derivatives of the reward function $R$ and the transition probability $p(s'|s,a)$ with respect to $a$ (evaluated at $a = \pi(s)$).
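For completeness, the analogous unrolling in the deterministic case would look roughly like (again my sketch, mirroring the proof of the deterministic policy gradient theorem of Silver et al., 2014): $$ \dfrac{\partial V(s)}{\partial \theta} = \sum_{x} \sum_{k=0}^{\infty} \gamma^k \Pr(s \to x, k, \pi) \, \dfrac{\partial \pi(x)}{\partial \theta} \dfrac{\partial Q(x,a)}{\partial a}\Big|_{a=\pi(x)}. $$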

Question

Why do the two theorems differ in their treatment of the derivatives of $R$ and the conditional probability? Does determinism in the policy function make such a difference for the derivatives?


1 Answer


In the policy gradient theorem, we don't need to write $r$ as a function of $a$ because the only time we explicitly 'see' $r$ is when we are taking the expectation with respect to the policy. For the first couple of lines of the PG theorem we have \begin{align} \nabla v_\pi(s) &= \nabla \left[ \sum_a \pi(a|s) q_\pi (s,a) \right] \;, \\ &= \sum_a \left[ \nabla \pi(a|s) q_\pi(s,a) + \pi(a|s) \nabla\sum_{s',r} p(s',r|s,a)(r+ v_\pi(s')) \right] \; ; \end{align} you can see that we are taking the expectation of $r$ with respect to the policy, so we don't need to write something like $r(s,\pi(a|s))$ (especially because this notation doesn't really make sense for a stochastic policy). This is why we don't need to take the derivative of $r$ with respect to the policy parameters. Now, the next line of the PG theorem is $$\nabla v_\pi(s) = \sum_a \left[ \nabla \pi(a|s) q_\pi(s,a) + \pi(a|s)\sum_{s'} p(s'|s,a) \nabla v_\pi(s') \right] \; ;$$ this is an equation similar to the Bellman equation, but in terms of the $\nabla v_\pi(s)$'s, so we can unroll it repeatedly and never have to take an explicit derivative of the value function.
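To make this concrete, here is a small numerical sketch (my own toy example, with a made-up 2-armed bandit and a softmax policy, not anything from the question): the exact gradient $\sum_a \nabla_\theta \pi(a)\, r(a)$ is reproduced by the score-function estimate $\mathbb{E}_\pi[\nabla_\theta \log \pi(a)\, r(a)]$, and the reward is only ever sampled, never differentiated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-armed bandit. The reward is a black box of the action only;
# we never differentiate it.
rewards = np.array([1.0, 3.0])

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()                      # pi(a | theta)

theta = np.array([0.2, -0.1])
pi = softmax(theta)

# Exact gradient of E_pi[r]: sum_a (d pi(a)/d theta) * r(a),
# using d softmax_a / d theta = pi_a * (e_a - pi).
exact = sum(pi[a] * (np.eye(2)[a] - pi) * rewards[a] for a in range(2))

# Score-function estimate: E_pi[ grad_theta log pi(a) * r(a) ].
# Only log pi is differentiated -- the sampled reward is just a number.
N = 100_000
actions = rng.choice(2, size=N, p=pi)
grad_log_pi = np.eye(2)[actions] - pi       # grad_theta log softmax, shape (N, 2)
estimate = (grad_log_pi * rewards[actions, None]).mean(axis=0)

print("exact:   ", exact)
print("estimate:", estimate)                # the two should roughly agree
```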

For the deterministic policy gradient, this is a bit different. In general we have $$v_\pi(s) = \mathbb{E}_\pi[Q(s,a)] = \sum_a \pi(a|s) Q(s,a)\;,$$ so for a deterministic policy (denoted by $\pi(s)$, which represents the action taken in state $s$) this becomes $$v_\pi(s) = Q(s,\pi(s))$$ because the deterministic policy assigns probability 0 to all actions except one, which it takes with probability 1.

Now, in the deterministic policy gradient theorem we can write $$\nabla v_\pi(s) = \nabla Q(s,\pi(s)) = \nabla \left(r(s, \pi(s)) + \sum_{s'} p(s'|s,a)v_\pi(s') \right)\;.$$

We now have to write $r$ explicitly as a function of $s$ and $a$, because we are no longer taking an expectation with respect to the actions -- the policy is deterministic. If you replace my $\nabla$ with the derivative notation you used, you will arrive at the same result, and you will see why the chain rule is needed. I believe you already understand that part; your question was really why we don't use the chain rule for the standard policy gradient, and as explained above, it comes down to how an expectation over the action space works with a stochastic policy versus a deterministic one.
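To illustrate the chain rule at work, here is a minimal sketch (a made-up quadratic critic $Q$ and a linear deterministic policy, not the actual DPG algorithm): the derivative of $Q(s, \pi_\theta(s))$ with respect to $\theta$ is exactly $\frac{\partial Q}{\partial a}\big|_{a=\pi(s)} \frac{\partial \pi}{\partial \theta}$, which a finite-difference check confirms.

```python
# Toy continuous problem: deterministic policy a = pi(s) = theta * s
# and a known (made-up) critic Q(s, a) = -(a - 2s)^2, peaked at a = 2s.

def pi(theta, s):
    return theta * s

def Q(s, a):
    return -(a - 2.0 * s) ** 2

def dQ_da(s, a):                      # partial Q / partial a
    return -2.0 * (a - 2.0 * s)

def dpi_dtheta(theta, s):             # partial pi / partial theta
    return s

theta, s = 0.5, 1.3
a = pi(theta, s)

# Deterministic-policy-gradient-style chain rule:
# d/dtheta Q(s, pi_theta(s)) = dQ/da |_{a=pi(s)} * dpi/dtheta
chain_rule = dQ_da(s, a) * dpi_dtheta(theta, s)

# Finite-difference check of the same derivative
eps = 1e-6
finite_diff = (Q(s, pi(theta + eps, s)) - Q(s, pi(theta - eps, s))) / (2 * eps)

print(chain_rule, finite_diff)        # should match closely
```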

Another way to think of this is as follows: the term you're concerned with is obtained by expanding $\nabla q_\pi(s,a) = \nabla \sum_{s', r}p(s',r|s,a)(r + v_\pi(s'))$. By definition of the $Q$ function we have conditioned on knowing $s$ and $a$, so $a$ is completely independent of the policy in this scenario (we could even condition on an action to which the policy assigns probability 0), and thus the derivative of $r$ with respect to the policy parameters is 0.

However, in the deterministic policy gradient we are taking $\nabla q_\pi(s, \pi(s)) = \nabla \left(r(s, \pi(s)) + \sum_{s'} p(s'|s,a) v_\pi(s')\right)$ -- here $r$ clearly depends on the policy parameters, because the action taken is the deterministic action given by the policy in state $s$, and thus the derivative with respect to the policy parameters is not necessarily 0!
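A compact way to summarise the contrast, using the standard log-derivative identity $\nabla_\theta \pi(a|s) = \pi(a|s)\nabla_\theta \log\pi(a|s)$, is to write the two final gradients side by side: $$ \nabla_\theta J = \mathbb{E}_{s \sim \rho,\, a \sim \pi}\big[ \nabla_\theta \log \pi(a|s)\, q_\pi(s,a) \big] \qquad \text{vs.} \qquad \nabla_\theta J = \mathbb{E}_{s \sim \rho}\big[ \nabla_\theta \pi(s)\, \nabla_a Q(s,a)\big|_{a=\pi(s)} \big]. $$ In the first form the policy only reweights sampled actions, so $r$ and $p$ are never differentiated; in the second, the action itself is a differentiable function of $\theta$, so the gradient flows through $Q$ and hence through $r$ and $p$.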
