
I have the following code (below), where an agent uses Q-learning (RL) to play a simple game.

What seems questionable to me in that code is the fixed learning rate. When it's set low, the update always favours the old Q-value over the newly learnt Q-value (which is the case in this code example), and, vice versa, it always favours the new value when it's set high.

My thinking was: shouldn't the learning rate be dynamic? It should start high, because at the beginning there are no values in the Q-table (just zeros) and the agent is simply choosing the best actions it encounters, so we should favour the new Q-values over the existing ones. Over time (say, every n episodes), we would ideally decrease the learning rate to reflect the fact that the values in the Q-table are becoming more and more accurate (with the help of the Bellman equation updates). Lowering the learning rate then starts to favour the existing value in the Q-table over the new one. I'm not sure if my logic has gaps and flaws, but I'm putting it out there to get feedback from experienced people and experts.
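To make the idea concrete, here is roughly what I have in mind (just a sketch of the idea, mirroring the exploration rate decay already in my code; the variable names and constants are made up for illustration):

num_episodes = 10000                 # as in the code below
initial_learning_rate = 0.9          # start high: trust new estimates while the table is all zeros
min_learning_rate = 0.05             # never go below this
learning_rate_decay = 0.5            # multiply by this every n episodes
decay_every_n_episodes = 1000

learning_rate = initial_learning_rate
for episode in range(num_episodes):
    if episode > 0 and episode % decay_every_n_episodes == 0:
        learning_rate = max(min_learning_rate, learning_rate * learning_rate_decay)
    # ... run the episode and apply the usual Q-update with the current learning_rate ...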

Just to make things easier, the line to refer to in my original code below (the one that updates the Q-value using the learning rate) is under the comment # Update Q-table for Q(s,a) with learning rate.

import numpy as np
import gym
import random
import time
from IPython.display import clear_output

env = gym.make("FrozenLake-v0")

action_space_size = env.action_space.n
state_space_size = env.observation_space.n

q_table = np.zeros((state_space_size, action_space_size))

num_episodes = 10000
max_steps_per_episode = 100

learning_rate = 0.1
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

rewards_all_episodes = []

for episode in range(num_episodes):
    # initialize new episode params
    state = env.reset()
    done = False
    rewards_current_episode = 0

    for step in range(max_steps_per_episode):

        # Exploration-exploitation trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state, :])
        else:
            action = env.action_space.sample()

        new_state, reward, done, info = env.step(action)

        # Update Q-table for Q(s,a) with learning rate
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
            learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

        state = new_state
        rewards_current_episode += reward

        if done == True:
            break

    # Exploration rate decay
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

    rewards_all_episodes.append(rewards_current_episode)

# Calculate and print the average reward per thousand episodes
rewards_per_thousands_episodes = np.array_split(np.array(rewards_all_episodes), num_episodes / 1000)
count = 1000
print("******* Average reward per thousands episodes ************")
for r in rewards_per_thousands_episodes:
    print(count, ": ", str(sum(r / 1000)))
    count += 1000

# Print updated Q-table
print("\n\n********* Q-table *************\n")
print(q_table)

Hazzaldo
1 Answer


Yes, you can decay the learning rate in Q-learning, and yes, this should result in more accurate Q-values in the long term for many environments.

However, this is something that is harder to manage than in supervised learning, and might not be as useful or necessary. The issues you need to be concerned about are:

Non-stationarity

The target values in value-based optimal control methods are non-stationary. The TD target derived from the Bellman equation is never quite a sample from the optimal action value until the very final stages of learning. This is due both to the iterative policy improvements and to the bootstrapping nature of TD learning.
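Concretely, the tabular update in the question's code is

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\big(\underbrace{r + \gamma \max_{a'} Q(s',a')}_{\text{TD target}}\big)$$

with $\alpha$ the learning rate and $\gamma$ the discount rate. The TD target bootstraps from the current estimates $Q(s',a')$, so it keeps moving as the table (and the greedy policy derived from it) changes.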

Reducing the learning rate before you reach optimal control could delay finding the optimal policy. In general, you want the learning rate to be just low enough that inaccuracies due to over- or undershooting the correct value don't prevent or delay differentiating between actions under whatever the interim policy is.

Policy indifference to accurate action values

Q-learning's policy is based on $\pi(s) = \text{argmax}_{a} Q(s,a)$. To obtain an optimal policy, what you care about is identifying the highest-valued action $a$. As such, it is possible in some environments to achieve optimal control whilst the Q-values are still far from accurate, provided the highest-valued action also has the highest estimate of expected return.
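As a toy illustration (the numbers here are made up, not taken from FrozenLake):

import numpy as np

# Hypothetical action values for a single state (made-up numbers)
true_q      = np.array([0.70, 0.50, 0.20, 0.10])   # true expected returns
estimated_q = np.array([0.31, 0.22, 0.05, 0.02])   # inaccurate estimates

# The estimates are far from the true values, but the greedy policy
# argmax_a Q(s, a) selects the same (optimal) action in both cases
assert np.argmax(true_q) == np.argmax(estimated_q)   # both pick action 0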

Whether or not you need highly accurate Q-values to determine a policy depends on the environment, and in particular on how similar the actions are to each other in value. For long-term forecasts (i.e. $\gamma \approx 1$), where individual time steps make little difference, this matters more, as there will be only minor differences between action values.

If your goal is to get highly accurate predictions of action values, then the above does not necessarily apply.

Some deterministic environments can use a high learning rate

This only applies to some simpler environments, but you should definitely bear it in mind when working with tabular solvers.

For instance, it is possible to apply tabular Q-learning to Tic Tac Toe with a learning rate of $1.0$ - essentially replacing each estimate with the latest one - and it works just fine.
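With $\alpha = 1$, the update shown earlier reduces to

$$Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a')$$

i.e. the old estimate is simply thrown away, which is harmless when the transitions and rewards are deterministic.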

In other, more complex environments, this would be a problem and the algorithm would not converge. But clearly, adding a learning rate decay is not a general requirement.

The learning rate decay schedule is a hyper parameter

There is no generic schedule that would apply to all environments and be equally effective in all of them. For an optimal approach, you would need to run a search over possible decay schedules, and the most efficient schedule you found would strictly apply only to the environment you tested it on. It would, however, likely transfer well enough to similar environments that knowing it could be an advantage in future.
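As a rough sketch of what such a search might look like (the candidate schedules and constants here are arbitrary examples, not recommendations):

import numpy as np

# Candidate learning rate schedules as functions of the episode number
candidate_schedules = {
    "constant":    lambda episode: 0.1,
    "exponential": lambda episode: max(0.01, np.exp(-0.001 * episode)),
    "step":        lambda episode: max(0.01, 0.5 * 0.5 ** (episode // 2000)),
}

# A hyperparameter search would train one agent per schedule, setting
# learning_rate = schedule(episode) at the start of each episode, and
# compare average reward over the last few thousand episodes.
print({name: round(schedule(5000), 4) for name, schedule in candidate_schedules.items()})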

In complex environments, there are other working solutions

In practice, for environments where neural networks are used as function approximators, the standard approach is Deep Q-Networks (DQN). It is common to use a conservative/low learning rate with it, due to the stability issues of combining neural networks with off-policy reinforcement learning.

In addition, you will quite often see an adaptive gradient optimiser - e.g. Adam or RMSProp - used with DQN. These already adjust gradient step sizes based on recent gradients observed in the neural network during training, so there is less need to add a dynamic learning rate schedule on top.
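For example, in a PyTorch DQN this often looks like the following (the q_network here is just a placeholder module for illustration, not something from the question's code):

import torch

# Placeholder Q-network; in practice this would map state observations
# to one value per action
q_network = torch.nn.Linear(4, 2)

# A conservative, fixed learning rate combined with an adaptive optimiser;
# Adam/RMSProp already rescale step sizes per parameter, so an explicit
# learning rate decay schedule is often omitted
optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-4)
# optimizer = torch.optim.RMSprop(q_network.parameters(), lr=2.5e-4)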

Supervised learning challenges such as ImageNet do sometimes add a learning rate schedule on top of adaptive optimisers, resulting in further improvements. So it might help with DQN too, but the other caveats above could still apply.

Neil Slater