While studying the proof of the Policy Gradient Theorem, I have come across two different approaches.
The first seems to be the more standard approach, involving "unrolling" across every time step. A good resource discussing this method is section 3.1 of this overview. It begins by writing the objective as the expected state value over the initial state distribution $p_0(s)$, $$J(\theta)=\mathbb E_{s\sim p_0}[V^{\pi_\theta}(s)]=\sum_{s\in \mathcal S}p_0(s)\sum_{a\in\mathcal A}\pi_\theta(a|s)Q^{\pi_\theta}(s,a),$$ and ends with the standard result $$ \begin{aligned}\nabla_\theta J(\theta)&=\eta\sum_{s\in \mathcal S}d^{\pi_\theta}(s)\sum_{a\in \mathcal A}\nabla_\theta\pi_\theta(a|s)Q^{\pi_\theta}(s,a)\\[8pt]&=\eta\,\mathbb E_{s\sim d^{\pi_\theta},a\sim\pi_\theta}\left[\nabla_\theta\log\pi_\theta(a|s)\,Q^{\pi_\theta}(s,a)\right],\end{aligned}$$ for some constant $\eta$.
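To keep things concrete, here is a minimal numerical sketch of this result on a toy MDP (the 2-state/2-action setup, the softmax parametrisation, and all variable names below are my own illustrative choices, not taken from the overview). With the unnormalised discounted visitation $d^{\pi}(s)=\sum_t\gamma^t\mathbb P(s_t=s)$, the constant works out to $\eta=1$, which the finite-difference check confirms:

```python
# Minimal sanity check of the "unrolled" policy gradient result on a toy MDP.
# Everything here (2 states, 2 actions, softmax policy, variable names) is my
# own illustrative setup, not code from the overview.
import numpy as np

nS, nA, gamma = 2, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] = P(s' | s, a)
r = rng.standard_normal((nS, nA))               # r[s, a] = reward of (s, a)
p0 = np.array([0.6, 0.4])                       # initial state distribution p_0(s)
theta = rng.standard_normal((nS, nA))           # softmax policy parameters

def pi(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)     # pi[s, a] = pi(a | s)

def J(theta):
    p = pi(theta)
    P_pi = np.einsum('sa,sab->sb', p, P)        # state transition matrix under pi
    r_pi = (p * r).sum(axis=1)                  # expected one-step reward per state
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return p0 @ V                               # J = E_{s ~ p0}[V(s)]

p = pi(theta)
P_pi = np.einsum('sa,sab->sb', p, P)
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, (p * r).sum(axis=1))
Q = r + gamma * np.einsum('sab,b->sa', P, V)                 # Q(s, a)
d = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, p0)         # sum_t gamma^t P(s_t = s)

# grad J = sum_s d(s) sum_a grad pi(a|s) Q(s, a), using the softmax derivative
# d pi(a|s) / d theta[s, :] = pi(a|s) (e_a - pi(.|s))
grad_formula = np.zeros_like(theta)
for s in range(nS):
    for a in range(nA):
        grad_pi = p[s, a] * (np.eye(nA)[a] - p[s])
        grad_formula[s] += d[s] * grad_pi * Q[s, a]

# Finite-difference check of grad J
grad_fd = np.zeros_like(theta)
eps, J0 = 1e-6, J(theta)
for s in range(nS):
    for a in range(nA):
        t = theta.copy(); t[s, a] += eps
        grad_fd[s, a] = (J(t) - J0) / eps

print(np.allclose(grad_formula, grad_fd, atol=1e-4))   # should print True
```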
The second approach (discussed in this blog post) views the system as a Markov chain in which each vertex represents a state-action pair. This allows us to express the objective in terms of the Markov chain's transition matrix $\Pi_\theta$: $$J(\theta)=\sum_{t=0}^\infty p^0\Pi_\theta^t r,$$ where $r$ is a column vector such that $r_i$ is the reward of state-action pair $i$, and $p^0$ is a row vector representing the initial state-action distribution (note that this is slightly different from the initial state distribution $p_0(s)$ mentioned earlier). The proof concludes with $$\nabla_\theta J(\theta)=d\,\nabla_\theta\Pi_\theta\, q,$$ where $d$ is a row vector of the visitation frequencies of each state-action pair under the policy $\pi_\theta$, and $q$ is a column vector containing all the action values $Q^{\pi_\theta}(s,a)$.
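Continuing the toy sketch above (reusing `P`, `r`, `p0`, `gamma`, `theta`, `pi` and `J`; again my own construction, not the blog's code), one can build $\Pi_\theta$, $p^0$ and $r$ over state-action pairs, using the definition of $\Pi_{ij}$ I spell out further down, and confirm that this formulation gives the same $J(\theta)$:

```python
# State-action Markov chain formulation, reusing the toy MDP defined above.
import numpy as np

p = pi(theta)
idx = [(s, a) for s in range(nS) for a in range(nA)]    # ordering of state-action pairs

# Pi[i, j] = gamma * P(s^j | s^i, a^i) * pi(a^j | s^j)  (discount baked into Pi)
Pi = np.zeros((nS * nA, nS * nA))
for i, (si, ai) in enumerate(idx):
    for j, (sj, aj) in enumerate(idx):
        Pi[i, j] = gamma * P[si, ai, sj] * p[sj, aj]

r_vec = np.array([r[s, a] for s, a in idx])             # reward of each state-action pair
p0_vec = np.array([p0[s] * p[s, a] for s, a in idx])    # initial state-action distribution

# J = sum_t p^0 Pi^t r = p^0 (I - Pi)^{-1} r  (geometric series; spectral radius <= gamma)
J_chain = p0_vec @ np.linalg.solve(np.eye(nS * nA) - Pi, r_vec)
print(np.isclose(J_chain, J(theta)))                    # should print True
```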
My question is this: why are these two results equivalent? More specifically, can you give a step-by-step proof that $$d\,\nabla_\theta\Pi_\theta\, q=\eta\sum_{s\in \mathcal S}d^{\pi_\theta}(s)\sum_{a\in \mathcal A}\nabla_\theta\pi_\theta(a|s)Q^{\pi_\theta}(s,a),$$ making use of the definitions of $\Pi_\theta$ and $d^{\pi_\theta}(s)$?
What I have tried so far: I omit all references to $\theta$ for brevity (e.g. $\pi=\pi_\theta$).
Let each state-action pair have its individual state and action indexed with a superscript (i.e. the state-action corresponding to row/column $i$ of $\Pi$ has state $s^i$ and action $a^i$). Let the state and action at time step $t$ be indexed with subscripts (e.g. $a_t$, $s_t$). Let $\gamma$ be the discount rate (possibly $1$). Then, we have
$$ \begin{aligned} \Pi_{ij}&=\gamma\,\mathbb P(s^j\mid s^i,a^i)\,\pi(a^j|s^j)\\ \implies\nabla\Pi_{ij}&=\gamma\,\mathbb P(s^j\mid s^i,a^i)\,\nabla\pi(a^j|s^j)\\ d_i&=\sum_{t=0}^\infty\gamma^t\,\mathbb P(s_t=s^i,a_t=a^i)\\ \implies d\nabla\Pi q&=\sum_i\sum_jd_i\nabla\Pi_{ij}q_j\\ &=\sum_i\sum_j\sum_t\gamma^t\,\mathbb P(s_t=s^i,a_t=a^i)\,\gamma\,\mathbb P(s^j\mid s^i,a^i)\,\nabla\pi(a^j|s^j)\,q_j\\ &=\sum_j\sum_t\gamma^{t+1}\left(\sum_i\mathbb P(s_t=s^i,a_t=a^i,s_{t+1}=s^j)\right)\nabla\pi(a^j|s^j)\,q_j\\ &=\sum_j\sum_t\gamma^{t+1}\,\mathbb P(s_{t+1}=s^j)\,\nabla\pi(a^j|s^j)\,q_j\\ &=\sum_j\tilde d(s^j)\,\nabla\pi(a^j|s^j)\,q_j\\ &=\sum_{(s,a)\in\mathcal S\times\mathcal A}\tilde d(s)\,\nabla\pi(a|s)\,Q^\pi(s,a)\\ &=\sum_{s\in\mathcal S}\tilde d(s)\sum_{a\in\mathcal A}\nabla\pi(a|s)\,Q^\pi(s,a), \end{aligned} $$ where I have defined $\tilde d(s):=\sum_{t=0}^\infty\gamma^{t+1}\mathbb P(s_{t+1}=s)$. This is almost exactly what I want. The problem is that $$ d^\pi(s)=(1-\gamma)\sum_{t=0}^\infty\gamma^t\,\mathbb P(s_t=s)\not\propto\sum_{t=0}^\infty\gamma^{t+1}\,\mathbb P(s_{t+1}=s)=\tilde d(s). $$ Multiplying $d$ by $\nabla\Pi$ effectively "shifts everything along by one", which means I lose the first ($t=0$) term of the series.
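For what it's worth, a numerical check on the same toy MDP (reusing `Pi`, `p0_vec`, `r_vec`, `idx`, `p`, `P`, `Q` and `grad_fd` from the sketches above) agrees with this algebra: $d\nabla\Pi q$ matches the $\tilde d$ expression exactly, but not the finite-difference gradient of $J$:

```python
# Numerical illustration of the discrepancy, continuing the toy sketches above.
import numpy as np

d_vec = np.linalg.solve(np.eye(nS * nA) - Pi.T, p0_vec)   # d = p^0 (I - Pi)^{-1}
q_vec = np.linalg.solve(np.eye(nS * nA) - Pi, r_vec)      # q = (I - Pi)^{-1} r

# d * grad(Pi) * q, one entry per component of theta[s, b]
grad_chain = np.zeros((nS, nA))
for s in range(nS):
    for b in range(nA):
        dPi = np.zeros_like(Pi)
        for i, (si, ai) in enumerate(idx):
            for j, (sj, aj) in enumerate(idx):
                if sj == s:   # pi(a^j | s^j) only depends on theta[s^j, :]
                    dpi = p[s, aj] * ((aj == b) - p[s, b])     # softmax derivative
                    dPi[i, j] = gamma * P[si, ai, sj] * dpi
        grad_chain[s, b] = d_vec @ dPi @ q_vec

# tilde d(s) = sum_t gamma^{t+1} P(s_{t+1} = s): push d one step through the dynamics
tilde_d = np.zeros(nS)
for i, (si, ai) in enumerate(idx):
    tilde_d += d_vec[i] * gamma * P[si, ai]

# sum_s tilde d(s) sum_a grad pi(a|s) Q(s, a)
grad_tilde = np.zeros((nS, nA))
for s in range(nS):
    for b in range(nA):
        for a in range(nA):
            grad_tilde[s, b] += tilde_d[s] * p[s, a] * ((a == b) - p[s, b]) * Q[s, a]

print(np.allclose(grad_chain, grad_tilde))   # should print True: the algebra checks out
print(np.allclose(grad_chain, grad_fd))      # should print False: the t = 0 term is missing
```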
Where am I going wrong? Is there a small detail I'm neglecting, or a fundamental misunderstanding? Any help would be greatly appreciated!
Other resources I've read:
- https://lilianweng.github.io/posts/2018-04-08-policy-gradient/
- I also found some Reddit discussion of the Markov-chain approach involving the author of the blog post above.