
I am a bit new to reinforcement learning, so I apologize if I am asking something obvious. I have written a small piece of code to find the optimal policy for a 5x5 grid problem.

  • Scenario 1: The agent is given only two choices (Up, Right). I believe I am getting the optimal policy.
  • Scenario 2: The agent is given four choices (Up, Right, Down, Left). I am getting a wrong answer.

I have represented actions with numbers:

0 - Right
1 - Up
2 - Down
3 - Left

When the action Up is chosen, the agent moves up with probability 0.9 and right with probability 0.1, and vice versa for Right. Similarly, when the action Down is chosen, the agent moves down with probability 0.9 and left with probability 0.1, and vice versa for Left.
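In code, these slip dynamics could be sampled like the sketch below (the movement helpers are defined further down; the SLIP pairing and the step function are illustrative names, not part of my original code):

import random

SLIP = {0: 1, 1: 0, 2: 3, 3: 2}  # Right<->Up and Down<->Left slip pairs

def step(state, action):
    # intended move with probability 0.9, paired slip move with probability 0.1
    moves = [take_right, take_up, take_down, take_left]
    a = action if random.random() < 0.9 else SLIP[action]
    return moves[a](state)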

I did not use any convergence criterion; instead, I let it run for a sufficient number of iterations. I have confirmed that my state values and policy do converge, but to the wrong numbers. I am attaching the code below:

# n is the grid width (5), defined elsewhere; states are numbered so that
# state = column*n + row, i.e. +n moves one column right and +1 moves one row up
def take_right(state):
    if (state/n < n-1): state = state + n
    return state

def take_up(state):
    if (state%n != n-1): state = state + 1
    return state

def take_left(state):
    if (state/n > 0): state = state - n
    return state

def take_down(state):
    if (state%n > 0): state = state - 1
    return state

Scenario 1 result:

[image: optimal policy when given only two choices]

Scenario 2 result:

[image: policy when given four choices]

The green cell has a reward of 100 and the blue cell has a penalty of 100. The rest of the states have a penalty of 1. The discount factor is 0.5.
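For reference, a standard value-iteration loop over these helpers would look roughly like the sketch below. This is an illustration only, not my exact code: the GOAL and TRAP indices and the reward-on-landing convention are assumptions made for the example.

n = 5
gamma = 0.5
GOAL, TRAP = 24, 4                 # assumed cells for green (+100) and blue (-100)
moves = [take_right, take_up, take_down, take_left]   # action codes 0..3
SLIP = {0: 1, 1: 0, 2: 3, 3: 2}                       # Right<->Up, Down<->Left

def reward(s):
    # assumed convention: reward is received for landing on state s
    if s == GOAL:
        return 100
    if s == TRAP:
        return -100
    return -1

V = [0.0] * (n * n)
for _ in range(1000):              # fixed sweep count, no convergence check
    V_new = []
    for s in range(n * n):
        best = float("-inf")
        for a in range(4):
            s1 = moves[a](s)        # intended move (prob 0.9)
            s2 = moves[SLIP[a]](s)  # slip move (prob 0.1)
            q = 0.9 * (reward(s1) + gamma * V[s1]) + 0.1 * (reward(s2) + gamma * V[s2])
            best = max(best, q)
        V_new.append(best)
    V = V_new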

Edit:

This was a really silly question. The problem with my code was more Pythonic than RL-related. Check the comments for the clue.


1 Answer


Reinforcement learning is really fun because the agent will find any bug in your implementation and exploit it.

>>> take_left(0)
0
>>> take_left(1)
-4

The agent figured out the bug that produces negative state indices and exploits Python's negative indexing to reach the target faster.
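Concretely, the culprit is Python 3 division: state / n is float division, so state / n > 0 is true for every positive state, including states in the leftmost column. A minimal fix is floor division (take_right uses the same pattern but happens to compare correctly against n-1):

def take_left(state):
    # state // n is the column index; only move if not in the leftmost column
    if (state // n > 0): state = state - n
    return state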
