
I am new to the topic of reinforcement learning. I tried to re-train this DDPG control example using the exact same configuration (except for the maximum number of training episodes, which I set to 500), and this is the result I got:

[my training progress plot]

It is quite different from the provided screen capture (below), even though the random seed was fixed for reproducibility purposes.

[provided training progress plot]

Even though my training was successful (the evaluation statistic reached 2000), the rewards of the last training episodes still didn't match those of the evaluation episodes. How should I interpret this result? Is this normal?

Naraghazi

1 Answer


Training involves some amount of randomness, so no two training runs are likely to be identical.

During evaluation, the policy is followed deterministically (without the exploration noise), so evaluation can outperform training in total reward. Exploration noise often produces suboptimal actions, but the policy may already be close enough to optimal to score perfectly in evaluation.

Imagine playing a game where your input is sometimes replaced by a random one. You may know the best action, yet still score low because of ill-timed random actions. On the flip side, if you don't know the best action, this randomness may help you try something you haven't tried before.
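The gap between training and evaluation rewards can be illustrated with a toy sketch (not from the DDPG example itself; the reward function and noise scale here are invented for illustration). A policy that already outputs the optimal action scores perfectly when followed deterministically, but its average reward drops as soon as Gaussian exploration noise is added to its actions:

```python
import random

def reward(action, target=1.0):
    # Toy quadratic reward: highest (zero) when action == target.
    return -(action - target) ** 2

def policy(observation):
    # Assume training has already converged: the policy returns the optimal action.
    return 1.0

random.seed(0)
episodes = 1000

# Training-style rollout: exploration noise is added to every action.
noisy = sum(reward(policy(None) + random.gauss(0, 0.3))
            for _ in range(episodes)) / episodes

# Evaluation-style rollout: the policy is followed deterministically.
deterministic = sum(reward(policy(None)) for _ in range(episodes)) / episodes

print(f"average reward with exploration noise: {noisy:.3f}")
print(f"average reward, deterministic:         {deterministic:.3f}")
```

The deterministic average is exactly the optimal reward, while the noisy average is strictly lower, mirroring why the last training episodes can lag behind the evaluation episodes even after the policy has converged.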

foreverska