I am thinking about episodic MDPs. Usually, in episodic MDPs, there is a fixed finite horizon per episode and no discount factor. In that case, a very intuitive notion of regret after $T$ episodes is the sum, over episodes, of the difference between the optimal expected return and the expected return actually achieved.
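
For concreteness (the notation here is my own), if $H$ is the fixed horizon, $s_{k,1}$ is the initial state of episode $k$, and $\pi_k$ is the policy played in episode $k$, I mean something like

$$\mathrm{Regret}(T) = \sum_{k=1}^{T} \left( V^{*}_{1}(s_{k,1}) - V^{\pi_k}_{1}(s_{k,1}) \right),$$

where $V^{*}_{1}$ and $V^{\pi_k}_{1}$ denote the expected return over the $H$ steps of an episode under the optimal policy and under $\pi_k$, respectively.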

I was wondering about notions of regret for infinite horizon discounted MDPs. It is not clear to me what a reasonable notion of regret for this setting would be, and I am also not aware of any standard definition of regret in this setting.

Perhaps relevant here, as a justification for using a discount factor in the infinite-horizon setting, is this quote from Littman's paper "Markov games as a framework for multi-agent reinforcement learning":

As in MDP's, the discount factor can be thought of as the probability that the game will be allowed to continue after the current move. It is possible to define a notion of undiscounted rewards [Schwartz, 1993], but not all Markov games have optimal strategies in the undiscounted case [Owen, 1982]. This is because, in many games, it is best to postpone risky actions indefinitely. For current purposes, the discount factor has the desirable effect of goading the players into trying to win sooner rather than later.
