5

I'm trying to get a better understanding of Multi-Arm Bandits, Contextual Multi-Arm Bandits and Markov Decision Process.

Basically, Multi-Arm Bandits is a special case of Contextual Multi-Arm Bandits where there is no state(features/context). And Contextual Multi-Arm Bandits is a special case of Markov Decision Process, where there is only one state (features, but no transitions).

However, since MDP has Markov property, I wonder if every MDP problem can also be converted into a Contextual Multi-Arm Bandits problem, if we simply treat each state as a different input context (features)?

nbro
  • 42,615
  • 12
  • 119
  • 217
peidaqi
  • 151
  • 2

1 Answers1

4

The main difference between an MDP and contextual bandit setting is time steps and state progression. If those are important to the problem you want to solve, then it is not possible to convert.

Essentially MDPs are a strict generalisation of contextual bandits. You can model a CB as an MDP but not vice-versa.

In some very specific cases you can convert MDP to CB - any one of these situations means that the MDP can be simplified to CB, and then you could use bandit solving algorithms to optimise it:

  • When there is only one time step for an episodic problem.

  • When the discount factor is zero.

  • When state transition rules are completely independent of action choice, but reward is not.

Neil Slater
  • 33,739
  • 3
  • 47
  • 66