
In the past I have studied different algorithms, e.g. DQN, DDQN, REINFORCE, A3C, PPO, TRPO, and so on. This summer I am doing an internship where I have to use a multi-armed bandit (MAB). I am a bit confused about the difference between MAB and the algorithms listed above.

What are the major differences between MAB and REINFORCE, for instance? What are the major differences between MAB and the other well-known algorithms (DQN, A3C, PPO, etc.)?

EDIT

@Kostya's answer is fine, but I would be interested in a deeper answer to this question. I am still a bit confused.

Question: do we use the return formula from a Markov reward process, $$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$ in the same way in a multi-armed bandit problem as in a DQN problem?

nbro

2 Answers


You should start with the general definition of a reinforcement learning problem, and with what a Markov Decision Process (MDP) is.

DQN, A3C, PPO and REINFORCE are algorithms for solving reinforcement learning problems. These algorithms have their strengths and weaknesses depending on the details of the underlying problem.

Multi-Armed Bandit is not even an algorithm - it is a subclass of reinforcement learning problems, where your environment (usually) doesn't have any state transitions and your actions are just a single choice from a (usually) fixed and finite set of choices.

Multi-Armed Bandit is used as an introductory problem for reinforcement learning, because it illustrates some basic concepts of the field: the exploration-exploitation tradeoff, policies, targets and estimates, learning rates, and gradient optimization. All of these concepts are basic vocabulary in RL. I recommend reading chapter two of the Sutton and Barto book (and, very importantly, doing all the exercises) to get familiar with them.
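
To make those concepts concrete, here is a minimal epsilon-greedy sketch on a toy 10-armed Bernoulli bandit, in the spirit of chapter two of Sutton and Barto (the arm probabilities and the hyperparameters are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# 10-armed Bernoulli bandit: each arm pays 1 with a fixed (hidden) probability.
true_probs = rng.uniform(0.1, 0.9, size=10)

epsilon = 0.1                 # exploration rate
q_estimates = np.zeros(10)    # action-value estimates Q(a)
counts = np.zeros(10)         # number of pulls per arm

for t in range(10_000):
    # Exploration-exploitation tradeoff: random arm with probability epsilon,
    # otherwise the greedy arm under the current estimates.
    if rng.random() < epsilon:
        a = int(rng.integers(10))
    else:
        a = int(np.argmax(q_estimates))

    reward = float(rng.random() < true_probs[a])   # sample the chosen arm

    # Incremental sample-average update: Q(a) <- Q(a) + (1/N(a)) * (target - estimate).
    counts[a] += 1
    q_estimates[a] += (reward - q_estimates[a]) / counts[a]

print("best arm (true):   ", int(np.argmax(true_probs)))
print("best arm (learned):", int(np.argmax(q_estimates)))
```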

Edit: since the answer got popular, I'll address the comments and the question edit.

Being a special, simplified subset of Markov Decision Processes, Multi-Armed Bandit problems allow for a deeper theoretical understanding. For example (as per @NeilSlater's comment), the optimal policy would be to always go for the best arm. So it makes sense to introduce the "regret" $\rho$ - the difference between the potential optimal reward and the reward actually collected by an agent following your strategy:

$$\rho(T) = \mathbb{E}\left[T\mu^* -\sum_{t=1}^T\mu(a_t)\right]$$

One can then study the asymptotic behavior of this regret as a function of $T$ and devise strategies with different asymptotic properties. As you can see, the reward here is not discounted ($\gamma=1$) - we can usually study its behavior as a function of $T$ without this regularization.
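
As an illustration, one way to see this is to estimate the regret empirically: simulate a fixed strategy and accumulate $\mu^* - \mu(a_t)$ over $T$ steps. The sketch below does this for a constant-$\epsilon$ greedy strategy on a toy three-armed Bernoulli bandit (the arms, seed, and hyperparameters are my own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([0.2, 0.5, 0.8])   # true mean reward of each arm (illustrative)
mu_star = mu.max()
T = 5_000
epsilon = 0.1

q, n = np.zeros(3), np.zeros(3)
regret = np.zeros(T)             # rho(t): expected regret accumulated up to step t
cum = 0.0

for t in range(T):
    a = int(rng.integers(3)) if rng.random() < epsilon else int(np.argmax(q))
    r = float(rng.random() < mu[a])
    n[a] += 1
    q[a] += (r - q[a]) / n[a]
    # Expected regret of this pull is mu* - mu(a_t); summing over t gives rho(T).
    cum += mu_star - mu[a]
    regret[t] = cum

# With a constant epsilon the expected regret grows linearly in T;
# schedules like a decaying epsilon or UCB achieve sublinear (logarithmic) regret.
print("regret at T/10, T/2, T:", regret[T // 10 - 1], regret[T // 2 - 1], regret[-1])
```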

There is, however, one famous result that uses discounted rewards - the Gittins index policy (note, though, that it uses $\beta$ instead of $\gamma$ to denote the discount factor).

Kostya

A MAB is fundamentally different from an MDP problem, which is what the well-known RL algorithms above are trying to solve. The point is that in a MAB the policy is exactly the action: the Q function takes only the action (policy) $a$ as its input. In an MDP, by contrast, the policy is a function; the Q function depends not only on the state-action pair $(s, a)$ but also on the policy parameters $\theta$, which makes policy evaluation and policy improvement tangled together.

In other words, in a MAB, if one has a good evaluation of a policy (action), then the optimal policy is given directly. But in an MDP, even with a good evaluation of a policy, the optimal policy is still unclear. That is why one needs an algorithm like DQN to do policy evaluation and policy improvement together.
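
A minimal sketch of that structural difference, using toy shapes of my own choosing: in a bandit the value table is indexed by the action alone, so a good estimate yields the optimal policy immediately, whereas in an MDP it is indexed by state and action, and $Q^\pi$ itself changes with the policy:

```python
import numpy as np

rng = np.random.default_rng(0)

# MAB: values depend on the action only -> Q has shape (n_actions,).
q_bandit = rng.random(4)
optimal_arm = int(np.argmax(q_bandit))      # a good estimate gives the policy directly

# MDP: values depend on state and action -> Q has shape (n_states, n_actions),
# and Q^pi changes when the policy pi (e.g. its parameters theta) changes,
# so evaluation and improvement must be interleaved, as in DQN's iterative updates.
n_states, n_actions = 5, 4
q_mdp = rng.random((n_states, n_actions))
greedy_policy = np.argmax(q_mdp, axis=1)    # one greedy action per state

print("bandit greedy arm:", optimal_arm)
print("MDP greedy policy (per state):", greedy_policy)
```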

As for policy gradient methods like PPO, they take another road: they directly optimize the policy along a more promising gradient direction based on the reward feedback.
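
For a concrete toy illustration, the same one-step bandit setting can also be solved with the gradient bandit algorithm from chapter two of Sutton and Barto, which is the simplest relative of policy gradient methods like REINFORCE and PPO; the bandit, seed, and hyperparameters below are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(2)

mu = np.array([0.2, 0.5, 0.8])   # true mean rewards (illustrative)
h = np.zeros(3)                  # action preferences (policy parameters)
alpha = 0.1                      # step size
baseline, t = 0.0, 0

for _ in range(5_000):
    # Softmax policy over the preferences.
    pi = np.exp(h - h.max())
    pi /= pi.sum()
    a = int(rng.choice(3, p=pi))
    r = float(rng.random() < mu[a])

    # Running average reward as a baseline (variance reduction).
    t += 1
    baseline += (r - baseline) / t

    # Gradient bandit update: move preferences along the policy gradient direction.
    one_hot = np.zeros(3)
    one_hot[a] = 1.0
    h += alpha * (r - baseline) * (one_hot - pi)

print("learned policy:", np.round(pi, 3))   # should concentrate on the best arm
```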

zhixin