
In Sutton and Barto's book (http://incompleteideas.net/book/bookdraft2017nov5.pdf), a proof of the policy gradient theorem is provided on p. 269 for the episodic case with the start-state objective function (see the picture below, last three equations).

[Image: the last three equations of the proof of the policy gradient theorem (episodic case) from Sutton and Barto]

Why can we assume that the sum $\sum_s\eta(s)$ is a constant of proportionality? Doesn't it also depend on $\theta$, since it depends on the policy $\pi$?

What could make sense would be to say that $\nabla J(\theta) = \mathbb{E}_{s\sim \eta(s), a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right)\,q_{\pi}(s,a)\right]\propto \mathbb{E}_{s\sim d(s), a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right)\,q_{\pi}(s,a)\right]$.
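To spell that step out, using $d(s) = \eta(s)/\sum_{s'}\eta(s')$ for the on-policy distribution and the identity $\nabla_{\theta}\pi = \pi\,\nabla_{\theta}\log\pi$:
\begin{align}
\nabla J(\theta) &= \sum_{s} \eta(s) \sum_{a} \nabla_{\theta}\,\pi(s,a,\theta)\, q_{\pi}(s,a) \\
&= \Big(\sum_{s'} \eta(s')\Big) \sum_{s} d(s) \sum_{a} \pi(s,a,\theta)\, \nabla_{\theta}\log\left(\pi(s,a,\theta)\right) q_{\pi}(s,a) \\
&= \Big(\sum_{s'} \eta(s')\Big)\, \mathbb{E}_{s\sim d(s),\, a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right) q_{\pi}(s,a)\right],
\end{align}
which identifies the constant of proportionality as $\sum_{s'}\eta(s')$.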

Since the proportionality constant is positive (it is the average episode length), any update direction suggested by $\mathbb{E}_{s\sim d(s), a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right)\,q_{\pi}(s,a)\right]$ is the same as that of $\mathbb{E}_{s\sim \eta(s), a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right)\,q_{\pi}(s,a)\right]$, only with a different magnitude. This, however, shouldn't impact the learning process much, since we multiply the update term by a small learning rate anyway.

Hence, as it is easier to sample states from $d(s)$, we just set $\nabla_{\theta} J = \mathbb{E}_{s\sim d(s), a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right)\,q_{\pi}(s,a)\right]$.
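As a minimal sketch of how that plays out in practice (a hypothetical tabular softmax REINFORCE-style update, not code from the book): states and actions are just those visited while following $\pi$, i.e. samples from $d(s)$ and $\pi$, and the dropped factor $\sum_s \eta(s)$ is silently absorbed into the step size `alpha`:

```python
import numpy as np

def reinforce_update(theta, episode, alpha=0.01):
    """One REINFORCE-style step from a single sampled episode.

    episode: list of (s, a, G) with G the return from that step, an unbiased
    sample of q_pi(s, a). Visiting states by following pi means s is
    effectively drawn from d(s), so this estimates
    E_{s~d, a~pi}[grad log pi * q]; the dropped constant sum_s eta(s)
    only rescales the update and is absorbed into alpha.
    """
    for s, a, G in episode:
        z = np.exp(theta[s] - theta[s].max())   # numerically stable softmax pi(.|s)
        pi_s = z / z.sum()
        grad_log_pi = -pi_s                     # d log pi(a|s) / d theta[s, .]
        grad_log_pi[a] += 1.0
        theta[s] += alpha * G * grad_log_pi
    return theta
```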

Could that serve as plausible explanation?

pythonic833
jwl17

2 Answers


The answer is: $\sum_{s} \eta(s)$ is not a constant with respect to $\theta$. As you already mentioned,
\begin{equation}
\sum_{k=0}^{\infty} \text{Pr}(s_{0}\rightarrow s, k, \pi) = \eta(s).
\end{equation}
But then the question arises: why is the derivation you showed correct? The point is not that $\sum_{s} \eta(s)$ is constant with respect to $\theta$ and therefore unaffected by the $\nabla$ operator, but quite the contrary: it is the result of applying the $\nabla$ operator to the state-value function. If you look at the proof of the policy gradient theorem (episodic case) in Sutton and Barto, at the simplification of $\nabla v_{\pi}(s)$, you will find (not showing all steps here):
\begin{align}
\nabla v_{\pi}(s) &= \nabla \left[ \sum_{a} \pi(a|s)\,q_{\pi}(s, a) \right] && \text{for all } s \in S \\
&= \dots \\
&= \sum_{a}\left[\nabla \pi(a|s)\, q_{\pi}(s, a) + \pi(a|s) \sum_{s'}p(s'|s,a)\,\nabla v_{\pi}(s')\right] && (\text{using recursivity}) \\
&= \sum_{a}\Big[\nabla \pi(a|s)\, q_{\pi}(s, a) + \pi(a|s) \sum_{s'}p(s'|s,a) \sum_{a'}\big[\nabla \pi(a'|s')\,q_{\pi}(s', a') \\
&\qquad\qquad + \pi(a'|s')\sum_{s''}p(s''|s',a')\,\nabla v_{\pi}(s'')\big]\Big] && (\text{further unrolling}) \\
&= \sum_{x \in S}\sum_{k=0}^{\infty} \text{Pr}(s\rightarrow x, k, \pi)\sum_{a}\nabla \pi(a|x)\,q_{\pi}(x,a),
\end{align}
where $\text{Pr}(s\rightarrow x, k, \pi)$ is the probability of moving from state $s$ to state $x$ in $k$ steps, accounting for all intermediate states $s_{1}, s_{2}, \dots, s_{k}$, by following policy $\pi$. So we see that the equation from which you start is the result, NOT the target, of the derivative.
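To see this concretely, the last line can be checked numerically on a small example. The sketch below uses my own toy two-state episodic MDP with a tabular softmax policy (none of it is from the book); it computes $\eta = (I - P_{\pi}^{\top})^{-1} e_{s_{0}}$ and compares $\sum_{s}\eta(s)\sum_{a}\nabla\pi(a|s)\,q_{\pi}(s,a)$ against a finite-difference gradient of $J(\theta) = v_{\pi}(s_{0})$:

```python
import numpy as np

# Toy episodic MDP: 2 non-terminal states, 2 actions; any transition
# probability mass not listed below goes to a terminal state with value 0.
n_states, n_actions, s0 = 2, 2, 0
P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s'] between non-terminal states
P[0, 0] = [0.3, 0.6]
P[0, 1] = [0.1, 0.2]
P[1, 0] = [0.4, 0.1]
P[1, 1] = [0.0, 0.5]
R = np.array([[1.0, 0.5],                        # R[s, a]: expected immediate reward
              [0.0, 2.0]])

def policy(theta):
    """Tabular softmax policy pi(a|s) parametrized by theta[s, a]."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(theta):
    """Return v_pi, q_pi and eta(s) = sum_k Pr(s0 -> s, k, pi)."""
    pi = policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)        # state-to-state transitions under pi
    r_pi = (pi * R).sum(axis=1)
    v = np.linalg.solve(np.eye(n_states) - P_pi, r_pi)
    q = R + np.einsum('sat,t->sa', P, v)
    eta = np.linalg.solve(np.eye(n_states) - P_pi.T, np.eye(n_states)[s0])
    return v, q, eta

theta = np.random.default_rng(0).normal(size=(n_states, n_actions))
v, q, eta = evaluate(theta)
pi = policy(theta)

# Right-hand side of the theorem. For a softmax policy,
# sum_a d pi(a|s)/d theta[s, b] * q(s, a) = pi(b|s) * (q(s, b) - v(s)).
grad_theorem = eta[:, None] * pi * (q - v[:, None])

# Finite-difference gradient of J(theta) = v_pi(s0) for comparison.
grad_numeric = np.zeros_like(theta)
eps = 1e-6
for s in range(n_states):
    for a in range(n_actions):
        up, down = theta.copy(), theta.copy()
        up[s, a] += eps
        down[s, a] -= eps
        grad_numeric[s, a] = (evaluate(up)[0][s0] - evaluate(down)[0][s0]) / (2 * eps)

print(np.allclose(grad_theorem, grad_numeric, atol=1e-5))   # expected: True
```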

pythonic833

Spitballing some ideas here -

As $\pi_\theta$ is updated using some gradient optimization process with small steps in the direction of the gradient, the changes to state visitation are generally small as well. If so, maybe we can assume that the state-visitation frequency $\eta(s)$ will not change significantly. To be precise: while $\eta(s)$ is in fact $\eta(s,\theta)$, is it a valid assumption that, when the step size is small enough, $\eta(s,\theta_t)\approx\eta(s,\theta_{t+1})$ for all $t$?

This may bring to mind a quasistatic process, which is a thermodynamic process that happens slowly enough for the system to remain in equilibrium, but hey, that's just my intuition.

Hadar Sharvit