
I am confused about the theoretical framework of reinforcement learning. For supervised learning, there seems to be a clear theoretical framework, e.g. as described by Wikipedia here, but I am unclear whether a similar framework exists for RL.

It seems that MDPs are baked into typical introductory RL courses and into theoretical descriptions of RL in the literature. However, certain problems that are also considered RL don't seem to fit this: the multi-armed bandit problem, for example (or, more practically, RLHF), is often treated as an RL problem but doesn't fit the MDP model well (although I suppose it can be modelled as an MDP with just one timestep), and some RL problems, like 2-player games, don't seem to fit the typical RL MDP setup at all.

Is there a theoretical RL framework that encapsulates everything?

nbro
Joe C.

3 Answers


RL is not limited to MDPs, but standard textbooks will probably start with them because the most common RL algorithms (like Q-learning) assume an MDP. An MDP is just a way to model the environment; RL is the way you learn in that environment. RL is essentially a trial-and-error approach to learning, where you get feedback (reinforcement) for your actions. Note that you can also solve MDPs with dynamic programming (see policy iteration).
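
For concreteness, here is a minimal sketch of policy iteration on a tiny made-up MDP (the transition probabilities and rewards below are arbitrary, just to show the structure of the dynamic-programming approach mentioned above):

```python
import numpy as np

# A made-up MDP with 2 states and 2 actions (all numbers are arbitrary).
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1
])
R = np.array([
    [1.0, 0.0],                 # rewards in state 0
    [0.0, 2.0],                 # rewards in state 1
])
gamma = 0.95

def policy_iteration(P, R, gamma):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)        # start with an arbitrary policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[np.arange(n_states), policy]     # (n_states, n_states)
        r_pi = R[np.arange(n_states), policy]     # (n_states,)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to v.
        q = R + gamma * P @ v                     # (n_states, n_actions)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy

policy, v = policy_iteration(P, R, gamma)
print("optimal policy:", policy, "state values:", v)
```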

You can model multi-agent systems with Markov Games (aka Stochastic Games), which are a simple generalisation of MDPs. See my blog post.
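
Roughly, a Markov game just replaces the single action set of an MDP with one action set and one reward function per agent, so (modulo notation) the tuple looks like

$$\langle \mathcal{S}, \{\mathcal{A}_i\}_{i=1}^{n}, P, \{R_i\}_{i=1}^{n}, \gamma \rangle,$$

where the transition $P(s' \mid s, a_1, \dots, a_n)$ depends on the joint action and each agent $i$ has its own reward $R_i$; with $n = 1$ you recover an ordinary MDP.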

There are also other models, like POMDPs, which generalise MDPs to partially observable environments, i.e. settings where you don't know exactly which state you're in.
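
In the same spirit, a POMDP (again modulo notation) extends the MDP tuple with an observation space and an observation model,

$$\langle \mathcal{S}, \mathcal{A}, P, R, \Omega, O, \gamma \rangle,$$

where the agent never sees the state $s$ directly but only an observation $o \in \Omega$ drawn from $O(o \mid s', a)$, and typically has to act on a belief over states.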

So, you can model your problem in different ways. Some problems may be modelled as MDPs even though the states are not really fully observable; it's just an approximation, and sometimes it works. Take a look at the original DQN paper, where they approximated states with a sequence of images.
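
As a rough illustration of that idea (this is not the actual DQN code, and the stack size and environment are placeholders), you can approximate a state by stacking the last few observations:

```python
from collections import deque

import numpy as np

def make_frame_stacker(k=4):
    """Return a function that maps single observations to a stack of the last k."""
    frames = deque(maxlen=k)

    def stack(obs):
        if not frames:                   # on the first call, fill the buffer with copies
            frames.extend([obs] * k)
        else:
            frames.append(obs)
        return np.stack(frames, axis=0)  # the approximate "state": shape (k, *obs.shape)

    return stack

# Usage with any environment that returns image-like observations:
# stack = make_frame_stacker(k=4)
# state = stack(env.reset())        # the agent conditions on the last 4 frames
# next_state = stack(env.step(a)[0])
```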

Anyway, you may be interested in multi-agent RL. This book looks good, but I have only skimmed through it.

nbro

Multi-armed bandits are MDPs with a single timestep and a single state, so they do fit the RL formulation, though that formulation allows for more general settings.
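
To make that correspondence concrete, here is a minimal sketch (the arm payoff probabilities are made up) of a $k$-armed bandit treated as a single-state, one-timestep MDP and solved with $\epsilon$-greedy action-value estimation:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.8])      # made-up Bernoulli payoff probability of each arm

Q = np.zeros(len(p))               # action-value estimates for the single state
N = np.zeros(len(p))               # pull counts
eps = 0.1

for t in range(10_000):
    # Each "episode" is one timestep: pick an arm, receive a reward, episode ends.
    a = int(rng.integers(len(p))) if rng.random() < eps else int(Q.argmax())
    r = float(rng.random() < p[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]      # incremental sample-average update

print("estimated arm values:", Q.round(2), "best arm:", int(Q.argmax()))
```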

RL does not (naively) deal with multi-agent settings, but it can be used to approximate solutions to them. In such settings, however, you usually back the approach with Markov Games, Decentralized POMDPs or Mean Field Games instead of MDPs.

Alberto

MDPs do an excellent job of modeling most problems.

Multi-armed bandits are MDPs with a single state, where every action returns to that state. The natural extension, contextual bandits, has multiple states chosen stochastically by the environment and looks even more like an MDP.

Similar to a contextual bandit, multi-player games can often be modeled as an MDP in which the player has only limited control over the next state. From one player's perspective, the other players live inside the transition dynamics: the player chooses an action, and the other players determine the resulting state. In an MDP, this shows up as a probability distribution over the possible next states.
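
As a sketch of that folding (the toy game and the opponent's fixed policy below are invented purely for illustration), the opponent's move can live inside the environment's step function, so the learning player sees nothing but an ordinary stochastic transition:

```python
import numpy as np

rng = np.random.default_rng(0)

class TwoPlayerAsMDP:
    """Toy turn-based game on a line of states. The opponent's (fixed, stochastic)
    reply is folded into step(), so the learning player faces an ordinary MDP."""

    def __init__(self, n_states=5):
        self.n = n_states
        self.s = n_states // 2

    def reset(self):
        self.s = self.n // 2
        return self.s

    def step(self, action):
        # The player's move: 0 pushes the state left, 1 pushes it right.
        self.s = int(np.clip(self.s + (1 if action == 1 else -1), 0, self.n - 1))
        # The opponent's move, drawn from a fixed policy -- from the player's
        # perspective this is just part of the transition probabilities.
        self.s = int(np.clip(self.s + rng.choice([-1, 1], p=[0.7, 0.3]), 0, self.n - 1))
        reward = 1.0 if self.s == self.n - 1 else 0.0
        done = self.s in (0, self.n - 1)
        return self.s, reward, done

# A tabular agent (e.g. Q-learning) can now treat this like any other single-agent MDP.
```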

Our hope in RL is that the input can somehow help an agent maximize its reward in an environment. MDPs can be a convenient way to consider how this might happen, but they are not required.

foreverska