
I've read Sutton and Barto's introductory RL book. They define a policy as a mapping from states to probabilities of selecting each possible action. If the agent is following policy $\pi$ at time $t$, then $\pi(a|s)$ is the probability of taking action $A_t = a$ when the current state is $S_t = s$. This definition reflects the Markov assumption, which is why the policy depends only on the current state.

When discussing the standard k-armed bandit problem, they write $\pi(a)$ to denote the probability of taking action $a$, since there are no states. However, when designing the agent, it clearly needs to keep track of the past rewards for each lever, so it must maintain either a summary statistic for each lever or the entire history of actions and rewards.
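To make the "summary statistic" point concrete, here is a minimal sketch of such an agent, following the incremental sample-average update from the book's bandit chapter (the class and parameter names are just illustrative):

```python
import numpy as np

class SampleAverageBandit:
    """Epsilon-greedy agent that keeps only a per-arm summary statistic
    (pull count and running mean reward) instead of the full history."""

    def __init__(self, k, epsilon=0.1, rng=None):
        self.k = k
        self.epsilon = epsilon
        self.rng = rng if rng is not None else np.random.default_rng()
        self.counts = np.zeros(k, dtype=int)  # N(a): times each arm was pulled
        self.values = np.zeros(k)             # Q(a): running mean reward per arm

    def select_action(self):
        # Explore with probability epsilon, otherwise exploit current estimates.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.k))
        return int(np.argmax(self.values))

    def update(self, action, reward):
        # Incremental mean: Q(a) <- Q(a) + (R - Q(a)) / N(a).
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]
```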

Is the k-armed bandit problem then an MDP? Why isn't the notation $\pi(a \mid A_0, R_1, A_1, \ldots, R_T)$ for some sequence $A_0, R_1, A_1, \ldots, R_T$?

Snowball

1 Answer


The bandit problem is an MDP. You can make the same argument about needing data to learn in the stateful MDP setting. The key point is that the data you need (the past rewards in this case) is drawn i.i.d. (conditioned on the arm) and is not actually a trajectory. For instance, once you have learned an optimal policy, you no longer need to gather data, and the sequence of past results doesn't influence your policy.
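As a small illustration of the i.i.d. point (the reward distribution here is made up), an arm's value estimate is just the sample mean of i.i.d. draws, so it is invariant to the order of the history:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reward history for one arm, drawn i.i.d. from that arm's distribution.
rewards = rng.normal(loc=1.0, scale=0.5, size=1000)

# The value estimate is a sample mean, so any permutation of the history
# yields (numerically) the same estimate: the ordering of the data is irrelevant.
q_original = rewards.mean()
q_shuffled = rng.permutation(rewards).mean()
assert np.isclose(q_original, q_shuffled)
```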

harwiltz