I have read about the concept of ergodicity in the safe RL paper by Moldovan (section 3.2) and in the RL book by Sutton (chapter 10.3, 2nd paragraph).

The first one says that "a belief over MDPs is ergodic if and only if any state is reachable from any other state via some policy or, equivalently, if and only if":

$$\forall s, s', \exists \pi_r \text{ such that } E_\beta E_{s, \pi_r}^P [B_{s'}] = 1$$

where:

  • $B_{s'}$ is an indicator random variable of the event that the system reaches state $s'$ at least once, i.e., $B_{s'} = 1 \{ \exists t < \infty \text{ such that } s_t = s'\}$
  • $E_\beta E_{s, \pi_r}^P[B_{s'}]$ is the expected value of $B_{s'}$ under the belief $\beta$ over the MDP dynamics, the policy $\pi_r$, and the transition measure $P$ (a reachability check along these lines is sketched below).
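
For intuition, here is how I picture checking the reachability part of this condition on a small, fully known MDP (the transition tensor below is made up, and I ignore the belief $\beta$): draw a directed edge $s \to s''$ whenever some action reaches $s''$ with positive probability, and require that every state can reach every other state in that graph.

```python
import numpy as np

# Made-up MDP: 3 states, 2 actions. P[a, s, s2] = Pr(s2 | s, a).
P = np.array([
    [[0.9, 0.1, 0.0],   # action 0
     [0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0]],
    [[0.2, 0.0, 0.8],   # action 1
     [1.0, 0.0, 0.0],
     [0.0, 0.3, 0.7]],
])
n_states = P.shape[1]

# Edge s -> s2 if *some* action moves to s2 with positive probability.
adj = (P > 0).any(axis=0)

def reachable(adj, s):
    """Set of states reachable from s under some sequence of actions (graph search)."""
    seen, frontier = {s}, [s]
    while frontier:
        u = frontier.pop()
        for v in np.flatnonzero(adj[u]):
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen

# Moldovan-style condition for a known MDP: every state is reachable
# from every other state via some policy.
print(all(reachable(adj, s) == set(range(n_states)) for s in range(n_states)))
```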

The second one says "$\mu_\pi$ is the steady-state distribution, which is assumed to exist for any $\pi$ and to be independent of $s_0$. This assumption about the MDP is known as ergodicity.". They define $\mu_\pi$ as:

$$\mu_\pi(s) \doteq \lim_{t \to \infty} \Pr\{s_t=s \vert a_{0:t-1} \sim \pi\}$$

  • i.e., $\mu_\pi(s)$ is the long-run probability of being in state $s$ when actions are selected according to policy $\pi$.
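
To make that concrete for myself, here is a small sketch with a made-up transition matrix for the chain induced by some fixed policy $\pi$: if the assumption holds, the distribution of $s_t$ converges to the same $\mu_\pi$ regardless of the starting state.

```python
import numpy as np

# Made-up policy-induced chain: P_pi[s, s2] = Pr(s2 | s) under a fixed policy pi.
P_pi = np.array([
    [0.1, 0.9, 0.0],
    [0.0, 0.2, 0.8],
    [0.5, 0.0, 0.5],
])

def state_distribution(P_pi, s0, t):
    """Distribution of s_t when starting in s0 and following the policy."""
    d = np.zeros(P_pi.shape[0])
    d[s0] = 1.0
    for _ in range(t):
        d = d @ P_pi
    return d

# Sutton's assumption: the limit exists and is the same for every start state s0.
for s0 in range(3):
    print(s0, state_distribution(P_pi, s0, t=1000).round(4))
```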

I noticed that the first definition only requires that, for each $(s, s')$ pair, at least one policy exists under which $s'$ is reached from $s$, for the MDP to be ergodic. The second definition, however, requires that every policy eventually visits all the states of the MDP, which seems to be a stricter definition.

Then, I came across the ergodicity definition for Markov chains:

A state $i$ is said to be ergodic if it is aperiodic and positive recurrent. In other words, a state $i$ is ergodic if it is recurrent, has a period of $1$, and has finite mean recurrence time. If all states in an irreducible Markov chain are ergodic, then the chain is said to be ergodic.

This leads me to believe that the second (stricter) definition is the most appropriate one, considering that the ergodicity definition for an MDP derives from the definition for a Markov chain. Since an MDP is basically a Markov chain with choices (actions), ergodicity should mean that, independently of the actions taken, all states are visited, i.e., every policy ensures ergodicity.

Am I correct in assuming these are different definitions? Can both still be called "ergodicity"? If not, which one is the most correct?

1 Answer

In short, the class of MDPs that guarantees the existence of a unique stationary state distribution for every deterministic stationary policy is that of unichain MDPs (Puterman 1994, Sect. 8.3). However, the unichain assumption does not mean that every policy will eventually visit every state. I believe your confusion arises from the difference between unichain MDPs and the more constrained ergodic MDPs.

Puterman defines that an MDP is (emphasis in the following mine):

  • recurrent or ergodic if the state transition matrix corresponding to every deterministic stationary policy consists of a single recurrent class, and
  • unichain, if the state transition matrix corresponding to every deterministic stationary policy is unichain, that is, it consists of a single recurrent class and a possibly empty set of transient states.
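
As a concrete (made-up) illustration of these two cases, below are transition matrices as they might be induced by some deterministic stationary policy in two different MDPs: in the first, all three states form a single recurrent class; in the second, state 0 can be left but never re-entered, so it is transient and the chain is not irreducible.

```python
import numpy as np

# Two made-up policy-induced transition matrices.
P_recurrent = np.array([    # single recurrent class containing all states
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
])
P_unichain = np.array([     # state 0 is transient: once left, it is never re-entered
    [0.2, 0.8, 0.0],
    [0.0, 0.4, 0.6],
    [0.0, 0.7, 0.3],
])

def irreducible(P):
    """True iff every state can reach every other state (single class, no transient states)."""
    n = P.shape[0]
    reach = np.linalg.matrix_power(np.eye(n) + (P > 0), n - 1) > 0
    return bool(reach.all())

print("recurrent example irreducible:", irreducible(P_recurrent))  # True
print("unichain example irreducible: ", irreducible(P_unichain))   # False
```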

Before unpacking this further, let's first recap what recurrent and transient states in a Markov chain are. The following definitions can be found in Puterman, Appendix A.2. For a state $s$, associate two random variables $\nu_s$ and $\tau_s$ that represent the number of visits and the time of the first visit (or first return if the chain starts in $s$) to state $s$. A state $s$ is

  • recurrent (sometimes also called ergodic), if $P_s(\tau_s < \infty) = 1$, that is, the state is eventually visited (or returned to), and
  • transient, if $\mathbb{E}[\tau_s] = \infty$ and $P_s(\tau_s < \infty) < 1$.

It is also true that $s$ is recurrent if and only if $\mathbb{E}[\nu_s] = \infty$, i.e., it is visited infinitely often, and $s$ is transient if and only if $\mathbb{E}[\nu_s] < \infty$, i.e., it is visited only finitely often (and thereby never again after some finite time).
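
As a rough numerical illustration of these definitions (using the made-up unichain matrix from above, in which state 0 is transient and states 1 and 2 are recurrent), the return probability $P_s(\tau_s < \infty)$ and the number of visits $\nu_s$ can be estimated by simulating the chain over a long but finite horizon:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same made-up unichain chain as above: state 0 transient, states 1 and 2 recurrent.
P = np.array([
    [0.2, 0.8, 0.0],
    [0.0, 0.4, 0.6],
    [0.0, 0.7, 0.3],
])

def run_from(P, s, horizon):
    """Simulate from s; return (was s returned to?, number of visits to s incl. the start)."""
    visits, returned, cur = 1, False, s
    for _ in range(horizon):
        cur = rng.choice(len(P), p=P[cur])
        if cur == s:
            visits += 1
            returned = True
    return returned, visits

for s in range(3):
    results = [run_from(P, s, horizon=1000) for _ in range(200)]
    p_return = np.mean([r for r, _ in results])     # estimate of P_s(tau_s < inf)
    mean_visits = np.mean([v for _, v in results])  # (truncated) estimate of E[nu_s]
    print(f"state {s}: P(return) ~ {p_return:.2f}, mean visits ~ {mean_visits:.1f}")
```

For the transient state the estimated return probability comes out well below $1$ and the visit count stays small, while for the recurrent states the return probability is (approximately) $1$ and the visit count grows with the horizon.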

So let's now return to the two types of MDPs above. Consider an arbitrary deterministic stationary policy $\pi$ which maps any state $s$ to an action $a = \pi(s)$. If the MDP is ergodic, then the stationary distribution $\mu_\pi(s)$ exists and is unique, because the Markov chain over states induced by any policy has a single recurrent class (it does not matter in which state $s_0$ the chain starts, the same stationary distribution is reached). There is a single class of recurrent states, i.e., all states are recurrent, therefore any $s$ is visited infinitely often and $\mu_\pi(s) > 0$.

Now, if the MDP is unichain, then once again the stationary distribution $\mu_\pi(s)$ exists and is unique, because the Markov chain over states induced by any policy has a single recurrent class. But, importantly, there may exist a policy $\pi$ for which the set of transient states in the induced Markov chain is not empty. Because any transient state $s$ will only be visited a finite number of times, in the (infinite horizon) stationary distribution $\mu_\pi(s)=0$!
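
This can be checked numerically on the kind of unichain example above (state 0 transient): computing $\mu_\pi$ as the normalized left eigenvector of the policy-induced transition matrix for eigenvalue $1$ puts zero mass on the transient state.

```python
import numpy as np

# Same made-up unichain chain: state 0 transient, states 1 and 2 recurrent.
P = np.array([
    [0.2, 0.8, 0.0],
    [0.0, 0.4, 0.6],
    [0.0, 0.7, 0.3],
])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized to sum to 1.
vals, vecs = np.linalg.eig(P.T)
mu = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
mu = mu / mu.sum()
print(mu.round(4))  # approx. [0.0, 0.5385, 0.4615]: zero mass on the transient state
```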

So indeed, if the MDP is of the stricter ergodic type, every policy will eventually visit every state. This is not true for unichain MDPs, however.

A final remark: some authors (e.g., Kearns & Singh, 2002) define a policy as ergodic if the resulting Markov chain over states is ergodic (i.e., has a unique stationary distribution). A unichain MDP is then a type of MDP in which every policy is ergodic.

References:

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming.

Kearns, M., & Singh, S. (2002). Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning, 49, 209–232.
