
In Reinforcement Learning, it is common to use a discount factor $\gamma$ to give less importance to future rewards when calculating the returns.

I have also seen mention of discounted state distributions. It is mentioned on page 199 of the Sutton and Barto textbook that if there is discounting then (for the state distribution) it should be treated as a form of termination, and it is implied that this can be achieved by adding a factor of $\gamma$ to the state transition dynamics of the MDP, so that now we have

$$\mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')}\;;$$ where $\eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s})\sum_a \pi(a|\bar{s}) \gamma p(s|\bar{s}, a)$ and $h(s)$ is the probability of the episode beginning in state $s$.
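To make the recursion concrete, here is a small numerical sketch (the transition probabilities and start distribution below are made up): writing $P_\pi(\bar{s}, s) = \sum_a \pi(a|\bar{s})\, p(s|\bar{s}, a)$, the recursion is the linear system $\eta = h + \gamma P_\pi^\top \eta$, and its solution coincides with the discounted visitation sum $\sum_{t \ge 0} \gamma^t \Pr(S_t = s)$.

```python
# Hypothetical 3-state MDP under a fixed policy (numbers are made up).
# P_pi[s_bar, s] = sum_a pi(a | s_bar) * p(s | s_bar, a).
import numpy as np

gamma = 0.9
P_pi = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])
h = np.array([1.0, 0.0, 0.0])          # episodes always start in state 0

# Solve eta = h + gamma * P_pi^T @ eta, i.e. (I - gamma * P_pi^T) eta = h.
eta = np.linalg.solve(np.eye(3) - gamma * P_pi.T, h)
mu = eta / eta.sum()                   # normalised discounted state distribution

# Cross-check: eta(s) = sum_t gamma^t * P(S_t = s), truncated at a long horizon.
d_t, eta_check = h.copy(), np.zeros(3)
for t in range(1000):
    eta_check += gamma**t * d_t
    d_t = P_pi.T @ d_t                 # state distribution at the next step
print(np.allclose(eta, eta_check))     # True
print(eta.sum())                       # ~ 1 / (1 - gamma) = 10
```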

In my opinion, the book kind of skips over this and it is not immediately clear to me why we need to discount our state distribution if we have discounting in the episode.

My intuition would suggest that it is because we usually take an expectation of the returns over the state distribution (and action/transition dynamics), but, if we are discounting the (future) rewards, then we should also discount the future states to give them less importance. In Sergey Levine's lectures he provides a brief aside that I think agrees with my intuition but in a rather unsatisfactory way -- he introduces the idea of a 'death state' that we transition into at each step with probability $1-\gamma$ but he does not really provide a rigorous enough justification for thinking of it this way (unless it is just a useful mental model and not supposed to be rigorous).

I am wondering whether someone can provide a more detailed explanation as to why we discount the state distribution.


3 Answers


Not an exhaustive answer, but perhaps this blog post by Alessio Russo may be helpful. In particular, he states how

There is an equivalence between using a discount factor and reaching a terminal state in a Markov Decision Process. [...] Therefore, we simply need to introduce, artificially, the possibility of terminating the trajectory with a certain probability 1-γ.

The topic is not 100% clear to me yet either, but I feel like it makes much more sense now why this "death state" is used as a model to replace the discounted state distribution.
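To convince myself of the quoted equivalence, here is a rough Monte Carlo sketch (the 3-state chain and state rewards are made up, not taken from the blog post): the expected undiscounted return of an episode that is artificially terminated with probability $1-\gamma$ at every step approximately matches the expected discounted return with no termination.

```python
# Monte Carlo check of: discounting  <=>  termination with prob 1 - gamma.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
P_pi = np.array([[0.7, 0.2, 0.1],       # made-up state-to-state kernel under pi
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])
r = np.array([1.0, 0.0, 2.0])           # made-up reward for visiting each state

def discounted_return(T=100):
    s, g = 0, 0.0
    for t in range(T):                  # truncated discounted return, no termination
        g += gamma**t * r[s]
        s = rng.choice(3, p=P_pi[s])
    return g

def terminated_return():
    s, g = 0, 0.0
    while True:
        g += r[s]                       # no discounting inside the episode
        if rng.random() > gamma:        # "die" with probability 1 - gamma
            return g
        s = rng.choice(3, p=P_pi[s])

n = 10000
print(np.mean([discounted_return() for _ in range(n)]))
print(np.mean([terminated_return() for _ in range(n)]))  # approximately equal
```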

Gabriele

If there are three possible next states from the current state, then adding the discount factor effectively introduces a fourth state. It can be a terminal state or some other state that is hidden. The probability of entering that hidden state is $1-\gamma$, and the total probability of the original three next states is $\gamma$.

Suppose you enter the termination state with probability $1-\gamma$. Your policy then faces a $1-\gamma$ chance of "dying" at every step, so it has to account for that possible termination and make a better choice now, i.e. in the current state.
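A minimal sketch of this construction (the kernel numbers are made up): scale every original transition probability by $\gamma$ and send the remaining $1-\gamma$ mass to the new hidden/terminal state, which loops onto itself.

```python
# Augment a policy-induced transition matrix with an absorbing "death" state.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.7, 0.2, 0.1],      # made-up state-to-state kernel under pi
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])
n = P_pi.shape[0]

P_aug = np.zeros((n + 1, n + 1))
P_aug[:n, :n] = gamma * P_pi           # original transitions, scaled by gamma
P_aug[:n, n] = 1.0 - gamma             # enter the hidden state with prob 1 - gamma
P_aug[n, n] = 1.0                      # the hidden state is absorbing

print(P_aug.sum(axis=1))               # every row still sums to 1
```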

DeepQZero

The R. S. Sutton and A. G. Barto book says "Often $\mu(s)$ is chosen to be the fraction of time spent in $s$." But then why is a discount factor needed in the continuing case?

Discounting is practical when working with discounted rewards but not necessary when working with average rewards.

Why do we need to discount the state distribution when working with discounted rewards?

Let $\gamma \in [0, 1)$ be the discount factor.

Let the discounted state measure be defined as $$ \mu(s) = \sum_{t=0}^{\infty} \gamma^t \mathbb{P}^\pi(S_t = s) $$ where $\mathbb{P}^\pi(S_t = s)$ is the probability of being in state $s$ at time $t$.

$$ \int_\mathcal{S} \mu(s) ds = \sum_{s \in \mathcal{S}} \mu(s) = \sum_{s \in \mathcal{S}} \sum_{t=0}^{\infty} \gamma^t \mathbb{P}^\pi(S_t = s) = \sum_{t=0}^{\infty} \gamma^t \sum_{s \in \mathcal{S}} \mathbb{P}^\pi(S_t = s) = \sum_{t=0}^{\infty} \gamma^t = \frac{1}{1-\gamma} $$

where $\mathcal{S}$ is the state space. Thus, to be a probability distribution, $\mu$ has to be multiplied by $1-\gamma$, as in [1].

Note that if $\gamma = 1$, then the total mass of $\mu$ goes to $+\infty$. In this case the distribution $\mu$ is said to be improper.
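As a concrete illustration of the scale (my own numbers, not from [1]): with $\gamma = 0.99$ the total mass is $\sum_{t=0}^{\infty} 0.99^t = 100$, so the normalised distribution is $(1-\gamma)\mu = \mu/100$; as $\gamma \to 1$ the normalising factor $1-\gamma$ vanishes and the total mass diverges, which is exactly the improper behaviour just mentioned.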

Also remark that, $$ \mathbb{P}^\pi(S_t = s, A_t = a) = \mathbb{P}^\pi(S_t = s) \mathbb{P}^\pi(A_t = a | S_t = s) = \mathbb{P}^\pi(S_t = s) \pi(a|s) $$ where $\pi(a|s)$ is the policy.

Now, denote by $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ the reward function and observe that

$$
\begin{aligned}
\mathbb{E}^\pi\left[\sum_{t=0}^{\infty} \gamma^t r(S_t, A_t)\right] &= \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}^\pi[r(S_t, A_t)] \\
&= \sum_{t=0}^{\infty} \gamma^t \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} r(s,a)\, \mathbb{P}^\pi(S_t = s, A_t = a) \\
&= \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \left(\sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}^\pi(S_t = s)\right) \pi(a|s)\, r(s,a) \\
&= \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \mu(s)\, \pi(a|s)\, r(s,a) \\
&= \mathbb{E}_\mu\mathbb{E}_\pi[r(S,A)]
\end{aligned}
$$

Let's break down what happened here. We started by averaging the reward along the trajectory $(S_0, A_0, S_1, A_1, \ldots)$, which is characterised by the distribution $\mathbb{P}^\pi$ over trajectories (that's why I used the subscript $t$ inside the expectation).

On the other hand, the distribution $\mu$ is defined on the state space $\mathcal{S}$ and characterises the probability that a state is visited (that's why I do not use the subscript $t$ inside the expectation: it marks the time-independence of the random variable).

Consequently, $\mathbb{P}^\pi$ and $\mu$ are two different distributions defined on different spaces! The distribution $\mu$ inherently averages over time. This kind of change of measure appears often in the RL literature and is not always obvious to understand.
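Here is a small numerical check of this change of measure (a minimal sketch with made-up transition, policy and reward numbers, not a general implementation): the trajectory-based quantity $\sum_t \gamma^t\, \mathbb{E}^\pi[r(S_t, A_t)]$ and the measure-based quantity $\sum_{s,a} \mu(s)\, \pi(a|s)\, r(s,a)$ agree.

```python
# Verify: E_pi[ sum_t gamma^t r(S_t, A_t) ] == sum_{s,a} mu(s) pi(a|s) r(s,a).
import numpy as np

gamma = 0.9
p = np.array([[[0.9, 0.1, 0.0],        # p[s, a, s']: made-up dynamics,
               [0.2, 0.2, 0.6]],       # 3 states and 2 actions
              [[0.0, 0.5, 0.5],
               [0.8, 0.1, 0.1]],
              [[0.3, 0.3, 0.4],
               [0.1, 0.0, 0.9]]])
pi = np.array([[0.5, 0.5],             # pi[s, a]: made-up policy
               [0.2, 0.8],
               [0.9, 0.1]])
r = np.array([[1.0, 0.0],              # r[s, a]: made-up rewards
              [0.0, 2.0],
              [0.5, 0.5]])
h = np.array([1.0, 0.0, 0.0])          # start distribution

P_pi = np.einsum('sa,sat->st', pi, p)  # state-to-state kernel under pi
r_pi = (pi * r).sum(axis=1)            # expected one-step reward per state

# Trajectory side: sum_t gamma^t E[r(S_t, A_t)], truncated at a long horizon.
d_t, lhs = h.copy(), 0.0
for t in range(1000):
    lhs += gamma**t * (d_t @ r_pi)
    d_t = P_pi.T @ d_t

# Measure side: mu(s) = sum_t gamma^t P(S_t = s), then average r against mu and pi.
mu = np.linalg.solve(np.eye(3) - gamma * P_pi.T, h)
rhs = (mu[:, None] * pi * r).sum()

print(lhs, rhs, np.isclose(lhs, rhs))  # the two sides agree
```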

Introducing the discounted measure makes it possible, in practice, to evaluate the objective function by sampling state-action pairs $(S, A) \in \mathcal{S} \times \mathcal{A}$ instead of relying on trajectory-based Monte Carlo methods. This is very convenient and, for instance, makes it possible to use a replay buffer.
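One standard recipe for such sampling (a sketch using the same made-up MDP as above; this is one way to do it, not the only one): draw a time index $T$ with $\Pr(T = t) = (1-\gamma)\gamma^t$, roll the trajectory out to time $T$, and keep $(S_T, A_T)$. These pairs are then distributed according to $(1-\gamma)\,\mu(s)\,\pi(a|s)$.

```python
# Sample state-action pairs from the normalised discounted measure.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
p = np.array([[[0.9, 0.1, 0.0], [0.2, 0.2, 0.6]],   # p[s, a, s'], made up
              [[0.0, 0.5, 0.5], [0.8, 0.1, 0.1]],
              [[0.3, 0.3, 0.4], [0.1, 0.0, 0.9]]])
pi = np.array([[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]])  # pi[s, a], made up

def sample_state_action():
    T = rng.geometric(1 - gamma) - 1    # P(T = t) = (1 - gamma) * gamma**t
    s = 0                               # episodes start in state 0
    for _ in range(T):
        a = rng.choice(2, p=pi[s])
        s = rng.choice(3, p=p[s, a])
    return s, rng.choice(2, p=pi[s])

pairs = [sample_state_action() for _ in range(10000)]
# Empirical frequencies of these pairs approximate (1 - gamma) * mu(s) * pi(a|s).
```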

Going back to your question, we have now seen the importance of discounting in the continuing case (infinite trajectories).

Why does multiplying by $\gamma$ introduce a termination condition?

Take any probability distribution over states, say a normalised state distribution $\nu(\cdot)$ or the transition kernel $\mathcal{P}(\cdot \mid x, a)$ for some $x \in \mathcal{S}$ and $a \in \mathcal{A}$. Multiplying it by $\gamma$ gives $$ \int_\mathcal{S} \gamma\, \nu(s)\, ds = \gamma \int_\mathcal{S} \nu(s)\, ds = \gamma = \gamma \int_\mathcal{S} \mathcal{P}(s \mid x, a)\, ds. $$

So neither $\gamma \nu$ nor $\gamma \mathcal{P}$ is a probability distribution anymore: they are not normalised to 1. One possible way to recover a probability distribution is to renormalise, OR to introduce an absorbing state $s_{\text{end}}$, with reward 0, that is reached with probability $1-\gamma$ at every step. One then works with the new MDP constructed by augmenting the state space with $s_{\text{end}}$.
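Spelling the construction out (my own notation for the sentence above): the augmented MDP keeps the original rewards on $\mathcal{S}$ and uses the kernel $$ \tilde{\mathcal{P}}(s' \mid s, a) = \gamma\, \mathcal{P}(s' \mid s, a), \qquad \tilde{\mathcal{P}}(s_{\text{end}} \mid s, a) = 1 - \gamma, \qquad \tilde{\mathcal{P}}(s_{\text{end}} \mid s_{\text{end}}, a) = 1, \qquad \tilde{r}(s_{\text{end}}, a) = 0. $$ An undiscounted trajectory in this augmented MDP survives to time $t$ with probability $\gamma^t$, so its expected undiscounted return equals the expected discounted return in the original MDP.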


For mathematicians: to work rigorously with $\mu$, it is better to use its truncated version (the partial sums up to some horizon $T$) and pass to the limit, since limits of measures are still measures when the limit exists.

[1]: A. Agarwal et al. - Reinforcement Learning: Theory and Algorithms