In the given map, for example, the agent needs to take the downward action twice to reach the reward. Since all actions are equally likely at the start, the probability of reaching the reward by chance is very low. If the agent never encounters the reward during training, the policy never receives a learning signal and will never improve.
In such scenarios, doesn't it make sense to give a negative reward every time an undesirable action is taken, to deter the agent from taking that action again?
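
To make the question concrete, here is a rough sketch of both ideas. It assumes four equally likely actions (up, down, left, right), which is an assumption about the map rather than something stated above. The function names and the penalty value of -0.01 are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def prob_specific_sequence(n_actions: int, n_required: int) -> Fraction:
    """Probability that a uniform random policy picks one specific
    action n_required times in a row (e.g. "down" twice)."""
    return Fraction(1, n_actions) ** n_required


def shaped_reward(reached_goal: bool,
                  goal_reward: float = 1.0,
                  step_penalty: float = -0.01) -> float:
    """Hypothetical shaped reward: a small negative reward on every
    step that does not reach the goal, instead of a flat zero."""
    return goal_reward if reached_goal else step_penalty


# Assumed 4 actions, 2 required "down" moves.
p = prob_specific_sequence(n_actions=4, n_required=2)
print(p)  # 1/16
print(shaped_reward(reached_goal=False))  # -0.01
print(shaped_reward(reached_goal=True))   # 1.0
```

Under these assumptions the chance of going straight to the reward in the first two steps is 1/16, and it shrinks geometrically as the required sequence gets longer. The penalty version gives the agent some feedback on every step, but whether that actually steers it toward the reward is exactly what I am asking about.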
