For the decaying $\epsilon$-greedy strategy used in the Q-learning algorithm, when should $\epsilon$ be decayed: after each episode (from the initial state to a terminal state), or after each step (a single state transition) within an episode?
2 Answers
While both methods are theoretically possible, decaying epsilon at each update step until it reaches a minimum value is the more common choice in practice. It gives a single global, stable (typically linear) decay schedule that gradually reduces exploration over the whole course of training.
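As a concrete illustration, here is a minimal tabular Q-learning sketch with per-step decay. It assumes a Gymnasium FrozenLake-v1 environment, and the decay parameters are purely illustrative rather than taken from any particular benchmark:

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))

alpha, gamma = 0.1, 0.99                      # learning rate and discount factor
epsilon, eps_min, eps_step = 1.0, 0.05, 1e-4  # linear per-step decay, clamped at eps_min

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # standard Q-learning update
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
        # decay happens here, once per environment step (state transition)
        epsilon = max(eps_min, epsilon - eps_step)
```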
In the seminal DQN paper, "Human-level Control through Deep Reinforcement Learning" (Mnih et al., 2015), $\epsilon$ is decayed globally after each environment step (with frame skipping), linearly over the first million frames:
The behaviour policy during training was $\epsilon$-greedy with $\epsilon$ annealed linearly from 1.0 to 0.1 over the first million frames, and fixed at 0.1 thereafter.
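In other words, the published schedule can be written as a simple function of the global frame counter. The sketch below is a paraphrase, not the paper's code; the function name and arguments are placeholders:

```python
def dqn_epsilon(frame_idx, eps_start=1.0, eps_final=0.1, anneal_frames=1_000_000):
    """Linear anneal from eps_start to eps_final over anneal_frames frames, then constant."""
    fraction = min(frame_idx / anneal_frames, 1.0)
    return eps_start + fraction * (eps_final - eps_start)
```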
In code, many OpenAI Gym benchmark implementations and related DQN examples use the same step-level $\epsilon$ decay. For example, the OpenAI Baselines DQN implementation builds a global, step-level linear schedule:
from baselines.common.schedules import LinearSchedule

# Anneal epsilon linearly from initial_p to final_p, then keep it constant.
exploration = LinearSchedule(schedule_timesteps=int(exploration_fraction * total_timesteps),
                             initial_p=1.0,
                             final_p=exploration_final_eps)
This lets the decay schedule be specified relative to the total number of training timesteps, so if you change total_timesteps you keep a similar exploration-exploitation balance. After exploration_fraction × total_timesteps steps, epsilon stays constant at the final value. The Baselines code uses this approach to make the exploration schedule adaptable to different training regimes.
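As a usage sketch (assuming the exploration object, total_timesteps, and exploration_final_eps defined above), the schedule is simply queried once per global step via its value() method:

```python
for t in range(total_timesteps):
    epsilon = exploration.value(t)  # anneals linearly, then stays at exploration_final_eps
    # ... pick an epsilon-greedy action with this epsilon and do the usual DQN update ...
```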
A question to reflect on first: why must $\epsilon$ decay at all? Thinking this through helps clarify its role and effects.
But to directly answer:
If epsilon starts low enough, it may not need to decay at all. But if it does, the choice of when to decay shapes how the agent learns. Per-episode decay keeps the exploration rate constant within each trajectory and reduces it only between episodes, favouring steady refinement across episodes; per-step decay reduces exploration even within a single trajectory, shifting the agent towards exploitation more quickly. The right choice depends on what you want the learning process to emphasize, as the sketch below illustrates.
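A toy sketch (values purely illustrative, not from either answer) makes the difference visible: with the same decay factor, per-step decay has already cut epsilon substantially by the time the second episode starts, while per-episode decay has barely moved it.

```python
eps_min, decay, steps_per_episode = 0.05, 0.99, 200

# (a) per-episode decay: epsilon changes only between trajectories
eps, per_episode = 1.0, []
for episode in range(100):
    per_episode.append(eps)              # the whole episode uses this epsilon
    eps = max(eps_min, eps * decay)

# (b) per-step decay: epsilon also shrinks inside a single trajectory
eps, per_step = 1.0, []
for episode in range(100):
    for step in range(steps_per_episode):
        per_step.append(eps)
        eps = max(eps_min, eps * decay)

print(f"epsilon at the start of episode 2: "
      f"per-episode = {per_episode[1]:.3f}, per-step = {per_step[steps_per_episode]:.3f}")
```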