
I am a bit new to reinforcement learning, so I apologize if I am asking something obvious. I have written a small piece of code to find the optimal policy for a 5x5 grid problem.

  • Scenario 1: The agent is given only two choices (Up, Right). I believe I am getting the optimal policy.
  • Scenario 2: The agent is given four choices (Up, Right, Down, Left). I am getting a wrong answer.

I have represented actions with numbers:

0 - Right
1 - Up
2 - Down
3 - Left

When the action Up is chosen, the agent moves up with probability 0.9 and right with probability 0.1, and vice versa for Right. Similarly, when the action Down is chosen, the agent moves down with probability 0.9 and left with probability 0.1, and vice versa for Left.
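In code, these slip dynamics could be sampled like the sketch below (the movement helpers are defined further down; the SLIP pairing and the step function are illustrative names, not part of my original code):

import random

SLIP = {0: 1, 1: 0, 2: 3, 3: 2}  # Right<->Up and Down<->Left slip pairs

def step(state, action):
    # intended move with probability 0.9, paired slip move with probability 0.1
    moves = [take_right, take_up, take_down, take_left]
    a = action if random.random() < 0.9 else SLIP[action]
    return moves[a](state)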

I did not use any convergence criterion; instead, I let it run for a sufficient number of iterations. I have confirmed that my state values and policy do converge, but to the wrong numbers. I am attaching the code below:

# n is the grid width (5), defined elsewhere; states are numbered so that
# state = column*n + row, i.e. +n moves one column right and +1 moves one row up
def take_right(state):
    if (state/n < n-1): state = state + n
    return state

def take_up(state):
    if (state%n != n-1): state = state + 1
    return state

def take_left(state):
    if (state/n > 0): state = state - n
    return state

def take_down(state):
    if (state%n > 0): state = state - 1
    return state

Scenario 1 result:

[image: optimal policy when given only two choices]

Scenario 2 result:

[image: policy when given four choices]

The green cell has a reward of 100 and the blue cell has a penalty of 100. The rest of the states have a penalty of 1. The discount factor is 0.5.
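For reference, a standard value-iteration loop over these helpers would look roughly like the sketch below. This is an illustration only, not my exact code: the GOAL and TRAP indices and the reward-on-landing convention are assumptions made for the example.

n = 5
gamma = 0.5
GOAL, TRAP = 24, 4                 # assumed cells for green (+100) and blue (-100)
moves = [take_right, take_up, take_down, take_left]   # action codes 0..3
SLIP = {0: 1, 1: 0, 2: 3, 3: 2}                       # Right<->Up, Down<->Left

def reward(s):
    # assumed convention: reward is received for landing on state s
    if s == GOAL:
        return 100
    if s == TRAP:
        return -100
    return -1

V = [0.0] * (n * n)
for _ in range(1000):              # fixed sweep count, no convergence check
    V_new = []
    for s in range(n * n):
        best = float("-inf")
        for a in range(4):
            s1 = moves[a](s)        # intended move (prob 0.9)
            s2 = moves[SLIP[a]](s)  # slip move (prob 0.1)
            q = 0.9 * (reward(s1) + gamma * V[s1]) + 0.1 * (reward(s2) + gamma * V[s2])
            best = max(best, q)
        V_new.append(best)
    V = V_new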

Edit:

This was a really silly question. The problem with my code was more Pythonic than RL-related. Check the comments for the clue.


1 Answer


Reinforcement learning is really fun because the agent will find any bug in your implementation and exploit it.

>>> take_left(0)
0
>>> take_left(1)
-4

The agent figured out the bug that produces negative state indices and exploits Python's negative indexing to reach the target faster.
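Concretely, the culprit is Python 3 division: state / n is float division, so state / n > 0 is true for every positive state, including states in the leftmost column. A minimal fix is floor division (take_right uses the same pattern but happens to compare correctly against n-1):

def take_left(state):
    # state // n is the column index; only move if not in the leftmost column
    if (state // n > 0): state = state - n
    return state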
