In the frozen lake environment of Gymnasium, why aren't the holes negatively rewarded?

Question

In this given map, for example, the agent needs to perform the downward action twice to reach the reward. Considering that initially all actions are equally likely, the probability of reaching the reward is really low. If the agent never encounters the reward during the training period, the policy will never improve.

In such scenarios, doesn't it make sense to provide a negative reward every time an undesirable action takes place, to deter the agent from taking that action again?

score 6 · Accepted Answer · answered Jan 13 '25 at 23:45

In your specific grid world there is only one short, non-repeating path to the goal, which requires two consecutive downward moves; every other path either leads to a hole or is much longer. To balance exploration and exploitation, you would therefore do well to shape the reward: add an intermediate reward near the goal, and use different penalties for holes and for ordinary steps. This assumes you're using a model-free method such as Q-learning (which is off-policy) or an on-policy alternative.

Provide a significant positive reward (e.g., +10) for reaching the goal state, and a somewhat smaller reward (e.g., +5) for the two valid states adjacent to the goal. Assign a substantial negative reward (e.g., -10) for falling into any hole state and a small negative reward (e.g., -1) for all remaining states, so that the agent is encouraged to find the shortest path rather than wander aimlessly.
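The scheme above can be sketched as a plain reward function. This is only an illustration: the 3x3 `GRID` is a hypothetical layout (the map from the question is not reproduced here), `shaped_reward` is a made-up helper name, and "adjacent to the goal" is assumed to mean orthogonally adjacent.

```python
# Hypothetical 3x3 layout, NOT the map from the question.
# S = start, F = frozen, H = hole, G = goal.
GRID = [
    "SHF",
    "FHF",
    "GFF",
]

# Example values taken from the answer text.
GOAL_REWARD = 10.0
NEAR_GOAL_REWARD = 5.0
HOLE_REWARD = -10.0
STEP_REWARD = -1.0


def shaped_reward(grid, row, col):
    """Reward for entering cell (row, col) under the shaping scheme."""
    tile = grid[row][col]
    if tile == "G":
        return GOAL_REWARD
    if tile == "H":
        return HOLE_REWARD
    # A non-hole cell orthogonally adjacent to the goal gets the
    # intermediate reward (assumed meaning of "around the goal").
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == "G":
            return NEAR_GOAL_REWARD
    return STEP_REWARD


# Print the shaped reward for every cell of the hypothetical grid.
for row in range(len(GRID)):
    print([shaped_reward(GRID, row, col) for col in range(len(GRID[0]))])
```

One thing to watch with this scheme: if the +5 is paid on every entry into a near-goal state, stepping in and out of that state can earn more than finishing the episode, so it may be safer to pay it only once per episode.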

score 3 · Answer 2 · answered Jan 13 '25 at 23:08

Whether or not something should be negatively rewarded is a question of reward shaping. A negative reward might speed up convergence, but it is not necessary for the agent to find a policy. The only requirement is that undesirable behavior is rewarded less than desirable behavior.
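As a worked illustration of why the ordering alone is enough, assume (these are assumptions, not details from the question) deterministic moves, a discount factor $\gamma \in (0, 1)$, and a reward of 1 for entering the goal and 0 otherwise. For an action $a_{\text{hole}}$ that steps into a hole and an action $a_{\text{goal}}$ that begins a shortest path reaching the goal in $k$ steps:

$$Q^*(s, a_{\text{hole}}) = 0, \qquad Q^*(s, a_{\text{goal}}) = \gamma^{k-1} > 0.$$

Under these assumptions the greedy policy already prefers the goal-bound action, with no negative reward involved.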

Whether or not this environment is tough comes down to algorithm details. For instance, if the Q table is initialized to zeros and exploration is totally random, it might be pretty tough. But if the Q table is optimistically initialized (10 instead of 0), learning is likely to be a lot quicker, because the values of actions leading to holes shrink to their true values, which lie below those of paths leading to the goal.
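A minimal sketch of optimistic initialization with tabular Q-learning follows. It does not use Gymnasium: the 3x3 `GRID`, the deterministic `step` function, and the hyperparameters are all assumptions chosen for illustration, with the reward kept at 1 for the goal and 0 everywhere else (holes included).

```python
import random

# Hypothetical 3x3 map, NOT the one from the question.
# S = start, F = frozen, H = hole, G = goal.
GRID = ["SHF",
        "FHF",
        "GFF"]
# Action numbering follows FrozenLake: 0=left, 1=down, 2=right, 3=up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def step(state, action):
    """Deterministic (non-slippery) move; returns (next_state, reward, done)."""
    row, col = state
    d_row, d_col = MOVES[action]
    # Moves off the edge leave the agent where it is.
    row = min(max(row + d_row, 0), len(GRID) - 1)
    col = min(max(col + d_col, 0), len(GRID[0]) - 1)
    tile = GRID[row][col]
    # Reward is 1 at the goal and 0 elsewhere; holes are not penalized.
    return (row, col), float(tile == "G"), tile in "GH"


def train(initial_q, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with every Q-value starting at `initial_q`."""
    rng = random.Random(seed)
    q = {(r, c): [initial_q] * 4 for r in range(3) for c in range(3)}
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(100):  # step cap per episode
            if rng.random() < epsilon:
                action = rng.randrange(4)  # explore
            else:
                action = q[state].index(max(q[state]))  # exploit
            next_state, reward, done = step(state, action)
            # Terminal states (hole or goal) contribute no future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


# Optimistic start: every action looks worth 10 until it has been tried.
q_optimistic = train(initial_q=10.0)
print([round(v, 3) for v in q_optimistic[(0, 0)]])
```

Because every untried action starts out looking better than anything the agent has actually experienced, the greedy choice itself drives exploration: the value of stepping into a hole is pulled down toward 0 each time it is tried, while the goal-bound action settles at a higher value. Calling `train(initial_q=0.0)` runs the same loop with a zero-initialized table for comparison.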
