I have seen this happen in implementations of state-of-the-art RL algorithms, where the model converges to a single action over time after multiple training iterations. Are there common pitfalls or reasons why this kind of behavior is exhibited?
1 Answer
Why do RL implementations converge on one action?
If the optimal policy shouldn't always select the same action in the same state, i.e., if the optimal policy isn't deterministic (e.g., in rock paper scissors, the optimal policy cannot be deterministic, because any intelligent player would quickly memorize your deterministic policy and, after a while, you would always lose against that player), then there are a few things you can do to make your policy more stochastic.
Change the reward function. If your agent always ends up selecting the same action and you don't want that, it's probably because you're not giving it the right reinforcement signal (given that the agent selects the action that it estimates will give it the highest reward in the long run).
Try to explore more during training. So, if you're using a behavior policy like $\epsilon$-greedy, you may want to increase your $\epsilon$ (i.e. the probability of selecting a random action); see the first sketch after this list.
If you estimated the state-action value function (e.g. with Q-learning), maybe you derived the policy from it by always selecting the best action, but, of course, that makes the policy deterministic. You may want to use e.g. a softmax to derive the policy from the state-action value function (i.e. the probability of selecting an action is proportional to the exponential of its value), although keep in mind that Q-learning assumes the target policy is greedy with respect to the state-action value function; the second sketch after this list illustrates this.
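As a minimal sketch of the exploration point, here is what $\epsilon$-greedy action selection over a Q-table could look like; the `q_values` array, the function name and the specific $\epsilon$ values are purely illustrative, not taken from any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: highest estimated value

# Hypothetical Q-values for one state with 3 actions.
q_values = np.array([1.0, 0.5, 0.2])

# Raising epsilon from 0.1 to 0.5 makes the agent try the non-greedy actions far more often.
for eps in (0.1, 0.5):
    actions = [epsilon_greedy_action(q_values, eps) for _ in range(10_000)]
    print(eps, np.bincount(actions, minlength=3) / len(actions))
```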
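And here is a sketch of deriving a stochastic policy from estimated Q-values with a softmax instead of an argmax; again, the `q_values` and the temperature values are hypothetical, and the temperature controls how close the resulting policy is to greedy.

```python
import numpy as np

def softmax_policy(q_values, temperature=1.0):
    """Turn Q-values into action probabilities; lower temperature -> closer to greedy."""
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()                      # subtract the max for numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

q_values = [1.0, 0.5, 0.2]
print(softmax_policy(q_values, temperature=1.0))   # roughly [0.49, 0.30, 0.22]
print(softmax_policy(q_values, temperature=0.1))   # almost all probability on action 0

# Sample an action from the softmax distribution instead of taking the argmax.
rng = np.random.default_rng(0)
action = rng.choice(len(q_values), p=softmax_policy(q_values))
print(action)
```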
If the optimal policy is supposed to be deterministic, then, if you find the optimal policy (which probably isn't the case), you will end up with an agent that always selects the same action. In that case, it's obviously not a problem that the RL agent always selects the same optimal action.