
In the Berkeley RL class CS294-112 (Fall 2018, lecture of 9/5/18), it is mentioned that the following gradient would be $0$ if the policy were deterministic.

$$ \nabla_{\theta} J(\theta)=E_{\tau \sim \pi_{\theta}(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)\left(\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)\right] $$

Why is that?

jonperl

2 Answers


Here is the gradient that they are discussing in the video:

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left( \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta} (\mathbf{a}_{i, t} \vert \mathbf{s}_{i, t}) \right) \left( \sum_{t = 1}^T r(\mathbf{s}_{i,t}, \mathbf{a}_{i, t}) \right)$$

In this equation, $\pi_{\theta} (\mathbf{a}_{i, t} \vert \mathbf{s}_{i, t})$ denotes the probability of our policy $\pi_{\theta}$ selecting the actions $\mathbf{a}_{i, t}$ that it actually ended up selecting in practice, given the states $\mathbf{s}_{i, t}$ that it encountered during the episode that we're looking at.

In the case of a deterministic policy $\pi_{\theta}$, we know for sure that the probability of it selecting the actions that it did select must be $1$ (and the probability of it selecting any other actions would be $0$, but such a term does not show up in the equation). So, we have $\pi_{\theta} (\mathbf{a}_{i, t} \vert \mathbf{s}_{i, t}) = 1$ for every instance of that term in the above equation. Because $\log 1 = 0$, this leads to:

$$\begin{aligned} \nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \left( \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta} (\mathbf{a}_{i, t} \vert \mathbf{s}_{i, t}) \right) \left( \sum_{t = 1}^T r(\mathbf{s}_{i,t}, \mathbf{a}_{i, t}) \right) \\
&= \frac{1}{N} \sum_{i=1}^N \left( \sum_{t=1}^T \nabla_{\theta} \log 1 \right) \left( \sum_{t = 1}^T r(\mathbf{s}_{i,t}, \mathbf{a}_{i, t}) \right) \\
&= \frac{1}{N} \sum_{i=1}^N \left( \sum_{t=1}^T \nabla_{\theta} 0 \right) \left( \sum_{t = 1}^T r(\mathbf{s}_{i,t}, \mathbf{a}_{i, t}) \right) \\
&= \frac{1}{N} \sum_{i=1}^N 0 \left( \sum_{t = 1}^T r(\mathbf{s}_{i,t}, \mathbf{a}_{i, t}) \right) \\
&= 0
\end{aligned}$$

(i.e. you end up with a sum of terms that are all multiplied by $0$).
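To see this numerically, here is a minimal sketch of my own (a made-up toy setup, not code from the lecture or the course): a softmax policy over three actions in a single state, with arbitrary fixed rewards. As the logits are scaled up and the policy becomes (nearly) deterministic, $\log \pi_{\theta}(\mathbf{a}_t \mid \mathbf{s}_t)$ approaches $0$ for the actions actually taken, and the sample-based gradient estimate vanishes.

```python
import torch

def reinforce_grad_norm(scale: float) -> float:
    # Toy setup (made up for illustration): one state, three actions, fixed reward per action.
    theta = torch.tensor([2.0, 0.5, -1.0], requires_grad=True)   # policy parameters (logits)
    rewards = torch.tensor([1.0, 0.2, -0.5])                      # r(s, a) for each action

    # Larger `scale` makes the softmax sharper, i.e. the policy more deterministic.
    probs = torch.softmax(scale * theta, dim=0)

    # Sample N one-step "trajectories" from the current policy.
    actions = torch.multinomial(probs, num_samples=1000, replacement=True)

    log_probs = torch.log(probs[actions])    # log pi_theta(a_t | s_t) for the sampled actions
    returns = rewards[actions]               # sum_t r(s_t, a_t) (a single step here)

    # Sample-based policy-gradient estimate: differentiate mean(log pi * return) w.r.t. theta.
    (log_probs * returns).mean().backward()
    return theta.grad.norm().item()

for scale in [1.0, 10.0, 100.0]:
    print(f"scale={scale:6.1f}  ||grad|| ~ {reinforce_grad_norm(scale):.6f}")
```

With `scale=1.0` the policy is stochastic and the gradient norm is clearly non-zero; by `scale=100.0` the policy is effectively deterministic and the estimate is numerically $0$.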

Dennis Soemers

Well, I'd rather leave a comment, but I don't yet have that privilege, so here are some remarks.

First, having a deterministic policy inside the log does indeed produce trivial (zero) terms.

Secondly, in my view it makes no sense to use a deterministic policy during optimization in Policy Gradient methods, because you want the policy to keep exploring. In my experience, you only make the policy deterministic (in a PG method) once optimization is finished and you want to test your network, as in the sketch below.
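For what it's worth, here is a tiny sketch of that pattern (my own, with placeholder names such as `policy_logits`, not code from any particular library): sample actions from the stochastic policy while optimizing, and switch to the greedy action only when evaluating the trained network.

```python
import torch

def select_action(policy_logits: torch.Tensor, evaluate: bool = False) -> int:
    """`policy_logits`: unnormalized scores over the actions available in the current state."""
    probs = torch.softmax(policy_logits, dim=-1)
    if evaluate:
        # Deterministic choice, used only once training is done and we test the network.
        return int(torch.argmax(probs).item())
    # Stochastic choice during optimization, so the policy keeps exploring.
    return int(torch.multinomial(probs, num_samples=1).item())
```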

16Aghnar