
Puterman defines an MDP as ergodic

if the transition matrix corresponding to every deterministic stationary policy consists of a single recurrent class.

If the transition matrix is recurrent, it means that there is a single aperiodic communicating class of states (the whole state space).

However, this definition seems extremely restrictive because of "aperiodic".

Consider a simple chainworld where every state is connected to every other state, so the whole state space communicates. I could define a policy that always moves to one state (say X) and stays there forever. For example, the policy would induce the following trajectory: A, B, C, X, X, X, ... Clearly, I visit A, B, and C once, but only X is recurrent.

Am I interpreting the definition incorrectly? Should I consider only the states visited in the limit, i.e., those given by the stationary state distribution?

Am I correct in assuming that the transition matrix is $P(s' \mid s) = \sum_a \mathcal{P}(s' \mid s, a) \, \pi(a \mid s)$, where $\mathcal{P}$ is the MDP dynamics matrix and $\pi$ is the policy?
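To make the formula concrete, here is a minimal sketch in plain Python. The 2-state, 2-action MDP and the names `dynamics`, `policy`, and `induced_transition_matrix` are made up for illustration and are not taken from Puterman.

```python
def induced_transition_matrix(dynamics, policy):
    """Return P with P[s][s2] = sum_a dynamics[s][a][s2] * policy[s][a].

    dynamics[s][a][s2] is the probability of reaching s2 from s under action a.
    policy[s][a] is the probability that the policy picks action a in state s.
    """
    n_states = len(dynamics)
    P = [[0.0] * n_states for _ in range(n_states)]
    for s in range(n_states):
        for a, pi_sa in enumerate(policy[s]):
            for s2 in range(n_states):
                P[s][s2] += pi_sa * dynamics[s][a][s2]
    return P


# Hypothetical MDP: action 0 stays in the current state, action 1 switches state.
dynamics = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0
    [[0.0, 1.0], [1.0, 0.0]],  # from state 1
]
# Hypothetical policy: uniform in state 0, always "stay" in state 1.
policy = [[0.5, 0.5], [1.0, 0.0]]

P_pi = induced_transition_matrix(dynamics, policy)
```

Each row of `P_pi` is a probability distribution over next states, so the rows should sum to 1.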

I found this related question but I didn't understand the answer.

This question is also related but does not seem to address my concern about "aperiodic".

Simon

1 Answer


The apparent issue here is that you are treating every state visited along a trajectory as part of the recurrent class, which is incorrect. Only the limiting behavior of the policy matters, and that is described by the stationary distribution induced by the policy. Your transition matrix is fine, but the states that form the recurrent class are those that are visited infinitely often under the stationary policy, i.e., those that carry positive probability under the stationary state distribution.
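As a rough illustration of this point, the sketch below encodes the A, B, C, X trajectory from the question as a transition matrix and checks which states are recurrent. It assumes a finite chain, where a state is recurrent exactly when every state it can reach can also reach it back. The helper names (`reachability`, `recurrent_states`, `step`) are hypothetical.

```python
def reachability(P):
    """reach[i][j] is True if j can be reached from i in zero or more steps."""
    n = len(P)
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):  # Floyd-Warshall style transitive closure
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = True
    return reach


def recurrent_states(P):
    """Finite chain: i is recurrent iff every state reachable from i leads back to i."""
    n = len(P)
    reach = reachability(P)
    return [i for i in range(n)
            if all(reach[j][i] for j in range(n) if reach[i][j])]


def step(dist, P):
    """One step of the chain: returns the row vector dist @ P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


names = ["A", "B", "C", "X"]
# Policy-induced chain from the question: A -> B -> C -> X -> X -> ...
P = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]

recurrent = [names[i] for i in recurrent_states(P)]

# Start from a uniform distribution and iterate; the mass drains out of A, B, C.
dist = [0.25, 0.25, 0.25, 0.25]
for _ in range(10):
    dist = step(dist, P)
```

Here `recurrent` contains only X, and the iterated distribution puts all of its mass on X, so A, B, and C are transient even though the trajectory passes through them.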

Finally, you are correct that aperiodicity is restrictive. Aperiodicity is defined per state within the recurrent class of the transition matrix, and ergodicity of the whole chain requires that all states be aperiodic with finite mean recurrence time.
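For reference, the period of a state is the greatest common divisor of the step counts at which a return to that state is possible, and a state is aperiodic when that value is 1. The following is a minimal sketch of that definition; the function name `period` and the default search horizon of `3 * n` steps are assumptions made for this illustration, intended for small finite chains.

```python
from math import gcd


def period(P, state, horizon=None):
    """gcd of all k <= horizon such that a return to `state` in exactly k steps is possible.

    Returns 0 if no return is found within the horizon (e.g. a state that is
    never revisited). The default horizon of 3 * n is an assumption of this sketch.
    """
    n = len(P)
    if horizon is None:
        horizon = 3 * n
    current = {state}  # states reachable in exactly k steps
    g = 0
    for k in range(1, horizon + 1):
        current = {j for i in current for j in range(n) if P[i][j] > 0}
        if state in current:
            g = gcd(g, k)
    return g


# A deterministic two-state cycle: returns are only possible at even times.
two_cycle = [[0.0, 1.0], [1.0, 0.0]]

# The A -> B -> C -> X -> X chain from the question (X is index 3).
chain = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]

cycle_period = period(two_cycle, 0)
x_period = period(chain, 3)
```

In this sketch the two-state cycle has period 2, while the absorbing state X has period 1 because of its self-loop.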

cinch