
In Reinforcement Learning: An Introduction, Section 9.2 (page 199), Sutton and Barto describe the on-policy distribution in episodic tasks, with $\gamma = 1$, as

\begin{equation} \mu(s) = \frac{\eta(s)}{\sum_{k \in S} \eta(k)}, \end{equation}

where

\begin{equation} \eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a|\bar{s})\,p(s|\bar{s},a), \tag{9.2} \end{equation}

is the expected number of time steps spent in state $s$ in a single episode.
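
For concreteness, here is a minimal NumPy sketch of this computation on a small, made-up episodic MDP (all the numbers, states, and actions below are assumptions chosen purely for illustration): it solves the recursion (9.2) as a linear system and then normalizes to get $\mu$.

```python
import numpy as np

# Hypothetical episodic MDP used only for illustration (not from the book).
n_states, n_actions = 3, 2

# p[a, s_bar, s]: probability of moving from s_bar to s under action a,
# restricted to non-terminal states; the leftover mass 1 - sum_s p[a, s_bar, s]
# is the probability of terminating the episode from (s_bar, a).
p = np.zeros((n_actions, n_states, n_states))
p[0] = [[0.1, 0.7, 0.1],
        [0.0, 0.2, 0.6],
        [0.3, 0.0, 0.2]]
p[1] = [[0.5, 0.2, 0.2],
        [0.4, 0.1, 0.3],
        [0.1, 0.1, 0.0]]

# pi[s_bar, a]: policy probabilities; h[s]: episode-start distribution.
pi = np.array([[0.4, 0.6],
               [0.5, 0.5],
               [0.9, 0.1]])
h = np.array([0.6, 0.3, 0.1])

# Policy-induced transition matrix: P_pi[s_bar, s] = sum_a pi(a|s_bar) p(s|s_bar, a).
P_pi = np.einsum('ba,abs->bs', pi, p)

# (9.2) reads eta(s) = h(s) + sum_{s_bar} eta(s_bar) P_pi[s_bar, s],
# i.e. eta = h + P_pi^T eta, so eta = (I - P_pi^T)^{-1} h.
eta = np.linalg.solve(np.eye(n_states) - P_pi.T, h)

# On-policy distribution: normalize eta over the states.
mu = eta / eta.sum()
print("eta =", eta)
print("mu  =", mu, " sums to", mu.sum())
```

The rows of `P_pi` sum to less than one (episodes eventually terminate), which is what makes the linear system solvable.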

Another way to represent this is to set $\eta(s) = \sum_{t=0}^{\infty} d_{j,s}^{(t)}$, the average number of visits to $s$ when starting from $j$, where $d_{j,s}^{(t)}$ is the probability of going from $j$ to $s$ in $t$ steps under the policy $\pi_{\theta}$. In particular, $d_{j,s}^{(1)} = d_{j,s} = \sum_{a \in A}\pi_{\theta}(a|j)P(s|j,a)$. This formulation follows from the proof of the Policy Gradient Theorem (PGT) on page 325 and some basic stochastic-process theory.
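
Since $d_{j,s}^{(t)}$ is just the $(j,s)$ entry of the $t$-th power of the policy-induced transition matrix, the claim that this series reproduces the recursive $\eta$ can be checked numerically. A short sketch, using the (hypothetical) policy-induced matrix of the MDP above written out directly:

```python
import numpy as np

# Hypothetical policy-induced transition matrix P_pi[j, s] = sum_a pi(a|j) p(s|j, a);
# rows sum to less than one because the remaining probability terminates the episode.
P_pi = np.array([[0.34, 0.40, 0.16],
                 [0.20, 0.15, 0.45],
                 [0.28, 0.01, 0.18]])

j = 0                       # start state, so h is the indicator vector of j
e_j = np.eye(3)[j]

# d^{(t)}_{j, .} is the row vector e_j P_pi^t; sum the series, truncated at T.
T = 1000
d_t = e_j.copy()
eta_series = np.zeros(3)
for _ in range(T):
    eta_series += d_t
    d_t = d_t @ P_pi        # advance one step under the policy

# The same quantity from the recursion (9.2): eta = (I - P_pi^T)^{-1} e_j.
eta_recursive = np.linalg.solve(np.eye(3) - P_pi.T, e_j)
print(np.allclose(eta_series, eta_recursive))   # True, up to truncation error
```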

If, instead of setting $\gamma = 1$, we prove the PGT using any $\gamma \in (0,1)$, we get

\begin{equation*} \eta_{\gamma}(s) = \sum_{t=0}^{\infty} \gamma^t d_{j,s}^{(t)} \end{equation*}

This is no longer the average number of visits to $s$. Here comes my first question: would we still get an on-policy distribution $\mu_{\gamma}$ through the same trick as before? That is, would \begin{equation} \mu_{\gamma}(s) = \frac{\eta_{\gamma}(s)}{\sum_{k \in S} \eta_{\gamma}(k)} \end{equation} be the on-policy distribution?
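
For reference, here is how $\eta_{\gamma}$ and the normalized $\mu_{\gamma}$ could be computed numerically (same hypothetical $P_\pi$ as above, plus an assumed start distribution; the sketch only shows that the normalization produces a valid probability distribution, which is the mechanical part of the question):

```python
import numpy as np

gamma = 0.9

# Hypothetical policy-induced transition matrix and start distribution.
P_pi = np.array([[0.34, 0.40, 0.16],
                 [0.20, 0.15, 0.45],
                 [0.28, 0.01, 0.18]])
h = np.array([0.6, 0.3, 0.1])

# eta_gamma(s) = sum_t gamma^t d^{(t)}_{j,s}, with the start state j drawn from h;
# truncate the series at a large T.
d_t, eta_gamma = h.copy(), np.zeros(3)
for t in range(1000):
    eta_gamma += (gamma ** t) * d_t
    d_t = d_t @ P_pi

# Normalizing always gives a probability distribution over the states.
mu_gamma = eta_gamma / eta_gamma.sum()
print("mu_gamma =", mu_gamma, " sums to", mu_gamma.sum())
```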

My second question is related and has to do with the statement on page 199 that says:

If there is discounting ($\gamma <1$) it should be treated as a form of termination, which can be done simply by including a factor of $\gamma$ in the second term of (9.2).

What do the authors mean by "a form of termination"?

As my previous question suggests, my conclusion is that having $\gamma < 1$ does not stop $\mu_{\gamma}$ from being the on-policy distribution, so I don't understand this last comment on page 199.


1 Answer


This question is really getting at the meaning of the discount factor in Markov decision processes. There are actually two equivalent ways of interpreting the discount factor.

The first is probably what you're familiar with: the $i^{th}$ step receives a reward of $\gamma^{i}r_i$ instead of just $r_i$.

The second, equivalent interpretation is that at every step we have a $1-\gamma$ probability of immediately ending the episode. Assuming we don't end the episode, we receive the full reward $r_i$. This is what Sutton and Barto mean by termination. Note that under the second interpretation, we reach the $i^{th}$ step with probability $\gamma^i$, hence the connection to the first interpretation.
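
To make the equivalence concrete, here is a small Monte Carlo sketch (the reward sequence and `gamma` are arbitrary assumptions for illustration): terminating with probability $1-\gamma$ after each step and summing undiscounted rewards matches the discounted return in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
rewards = np.array([1.0, 0.5, 2.0, -1.0, 3.0])   # arbitrary fixed reward sequence

# Interpretation 1: discount the i-th reward by gamma^i.
discounted_return = np.sum(gamma ** np.arange(len(rewards)) * rewards)

# Interpretation 2: full, undiscounted rewards, but after every step the
# episode terminates with probability 1 - gamma.
n_runs = 200_000
total = 0.0
for _ in range(n_runs):
    for r in rewards:
        total += r                    # we reached this step: full reward
        if rng.random() > gamma:      # terminate with probability 1 - gamma
            break
mc_return = total / n_runs

print(discounted_return, mc_return)   # the two agree up to Monte Carlo noise
```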

Your definition of $\eta_{\gamma}(s)$ looks correct to me. Alternatively, we can write it in Sutton and Barto's notation as $$\eta_{\gamma}(s) = h(s) + \sum_{\bar s}\eta_{\gamma}(\bar s)\sum_{a}\gamma\, \pi(a | \bar s)\, p(s | \bar s, a).$$
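
As a small numerical check, you can also estimate $\eta_{\gamma}$ by simulation using the termination interpretation: run episodes that additionally terminate with probability $1-\gamma$ after every transition and count undiscounted visits. The expected visit counts then solve exactly the recursion above (the $P_\pi$, $h$, and episode count below are hypothetical/arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9
P_pi = np.array([[0.34, 0.40, 0.16],    # hypothetical P_pi[s_bar, s]
                 [0.20, 0.15, 0.45],
                 [0.28, 0.01, 0.18]])
h = np.array([0.6, 0.3, 0.1])           # hypothetical start distribution

# Closed form of the recursion above: eta_gamma = (I - gamma * P_pi^T)^{-1} h.
eta_exact = np.linalg.solve(np.eye(3) - gamma * P_pi.T, h)

# Monte Carlo: run episodes that additionally terminate with probability
# 1 - gamma after every transition, and count undiscounted visits to each state.
n_episodes = 50_000
visits = np.zeros(3)
for _ in range(n_episodes):
    s = rng.choice(3, p=h)
    while True:
        visits[s] += 1
        probs = P_pi[s]
        stay = probs.sum()
        if rng.random() > stay:               # environment termination
            break
        s = rng.choice(3, p=probs / stay)     # next state under the policy
        if rng.random() > gamma:              # extra gamma-termination (page 199)
            break
visits /= n_episodes

print(eta_exact)
print(visits)          # close to eta_exact, up to Monte Carlo noise
```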
