
To derive the policy gradient, we start by writing the equation for the probability of a trajectory (e.g., see the Spinning Up tutorial):

$$ \begin{align} P_\theta(\tau) &= P_\theta(s_0, a_0, s_1, a_1, \dots, s_T, a_T) \\ & = p(s_0) \prod_{i=0}^T \pi_\theta(a_i | s_i) p(s_{i+1} | s_i, a_i) \end{align} $$

The expression is based on the chain rule for probability. My understanding is that applying the chain rule should give this expression:

$$ p(s_0)\prod_{i=0}^T \pi_\theta(a_i|s_i, a_{i-1}, s_{i-1}, a_{i-2}, \dots, s_0, a_0) p(s_{i+1} | s_i, a_i, s_{i-1}, a_{i-1}, \dots, a_0, s_0) $$

Then the Markov property should apply, producing the desired equality: each conditional collapses so that it depends only on the most recent state-action pair. A quick numerical sanity check of this collapse is sketched below.
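To make that collapse concrete, here is a minimal sketch with a hypothetical toy tabular MDP (the sizes, random distributions, and variable names are all made up for illustration). It builds the joint over a short trajectory from the Markov factorization, recovers the full chain-rule conditional $p(s_2 \mid s_0, a_0, s_1, a_1)$ by marginalizing the joint, and checks that it equals the one-step kernel $p(s_2 \mid s_1, a_1)$. This is circular in the sense that the joint is constructed from Markov pieces, but it shows exactly how the full chain-rule conditional reduces to the one-step term.

```python
import numpy as np

# Hypothetical toy tabular MDP: build the joint over (s0, a0, s1, a1, s2)
# from the Markov factorization, then check that the full chain-rule
# conditional p(s2 | s0, a0, s1, a1) collapses to the one-step kernel
# p(s2 | s1, a1) for every history.

rng = np.random.default_rng(0)
nS, nA = 3, 2

p0 = rng.dirichlet(np.ones(nS))                 # p(s0)
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi(a | s), shape (nS, nA)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # p(s' | s, a), shape (nS, nA, nS)

# Joint: p(s0) pi(a0|s0) p(s1|s0,a0) pi(a1|s1) p(s2|s1,a1)
joint = np.einsum('i,ij,ijk,kl,klm->ijklm', p0, pi, P, pi, P)
assert np.isclose(joint.sum(), 1.0)             # a valid probability distribution

# Chain-rule conditional: p(s2 | s0, a0, s1, a1) = joint / sum_{s2} joint
cond = joint / joint.sum(axis=-1, keepdims=True)

# The conditional is identical for every history (s0, a0): it equals p(s2 | s1, a1)
for s0 in range(nS):
    for a0 in range(nA):
        assert np.allclose(cond[s0, a0], P)

print("chain-rule conditional == one-step Markov kernel for all histories")
```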

Here are my questions:

  1. Is this true?

  2. I watched this lecture on policy gradients, and at one point in it Sergey says: "at no point did we use the Markov property when we derived the policy gradient", which left me confused. I assumed that the initial step of writing down the trajectory probability does use the Markov property.

nbro
Gerges

2 Answers


I think the equation doesn't check out: when $i$ runs to $T$, there is no $s_{T+1}$ to plug into $s_{i+1}$ in your first equation. Sorry, it's not a full answer.
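For what it's worth, one common way to make the indices consistent with a trajectory that ends at $a_T$ is to stop the dynamics product at $T - 1$ and keep the final policy factor separate:

$$ P_\theta(\tau) = p(s_0) \left( \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) \right) \pi_\theta(a_T \mid s_T) $$

(Equivalently, one can define the trajectory to end at $s_T$ and run a single product to $T - 1$.)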

Son Do

Sergey Levine's comment does seem confusing, so here is the clarification. The policy gradient derivation itself does not explicitly require the Markov property. Instead, it operates on the probability distribution of trajectories under the parameterized policy $\pi_{\theta}$, and it is that distribution which uses the Markov assumption, since in most RL settings with fully observable states a trajectory is generated by an MDP.
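To see why the derivation itself gets away without the Markov property, here is the standard log-derivative step written out (this is the generic derivation, not a quote from the lecture). Taking the log of the trajectory probability and differentiating with respect to $\theta$:

$$ \begin{align} \nabla_\theta \log P_\theta(\tau) &= \nabla_\theta \left[ \log p(s_0) + \sum_{t} \log \pi_\theta(a_t \mid s_t) + \sum_{t} \log p(s_{t+1} \mid s_t, a_t) \right] \\ &= \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t), \end{align} $$

since the initial-state and dynamics terms do not depend on $\theta$. The same cancellation would go through even if those terms were conditioned on the entire history, which is why the derivation does not need the Markov property; only the factorization of $P_\theta(\tau)$ written in the question relies on it.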

cinch