
This is the current result (loss and score per episode) of my RL model in a simple two-player game:

[Figure: training loss and score per episode]

I use DQN with CNNs as the policy and target networks. I train the model with the Adam optimizer and compute the loss with Smooth L1 (Huber) loss.
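Roughly, my setup looks like the following sketch (the architecture, input shape and hyperparameters here are only illustrative placeholders, not my exact ones):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Illustrative CNN Q-network; the real architecture/input shape may differ.
class QNetwork(nn.Module):
    def __init__(self, in_channels: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),    # keeps the head independent of the board size
            nn.Flatten(),
            nn.Linear(64, n_actions),   # one Q-value per action
        )

    def forward(self, x):
        return self.net(x)

policy_net = QNetwork(in_channels=3, n_actions=4)
target_net = QNetwork(in_channels=3, n_actions=4)
target_net.load_state_dict(policy_net.state_dict())  # target starts as a copy
target_net.eval()

optimizer = optim.Adam(policy_net.parameters(), lr=1e-4)
loss_fn = nn.SmoothL1Loss()  # Huber-style loss on the TD error

# Schematic update on a batch (states, actions, rewards, next_states, dones):
# q = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
# with torch.no_grad():
#     target = rewards + gamma * target_net(next_states).max(1).values * (1 - dones)
# loss = loss_fn(q, target)
```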

In a normal supervised-learning situation, I would conclude that my model is overfitting, and I can think of several methods to tackle this problem (e.g. a dropout layer, regularization, a smaller learning rate, early stopping; a sketch of these tweaks follows the questions below).

  • But would those solutions also work in an RL problem?
  • Or are there any better solutions to handle overfitting in RL?
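For concreteness, the kind of SL-style tweaks I have in mind would look roughly like this (all values here are untuned placeholders):

```python
import torch.nn as nn
import torch.optim as optim

# Placeholder values, only to make the question concrete; nothing here is tuned.
q_net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3), nn.ReLU(),
    nn.Dropout2d(p=0.1),                 # dropout on conv feature maps
    nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Dropout(p=0.5),                   # dropout before the output layer
    nn.Linear(64, 4),                    # 4 = number of actions (placeholder)
)

optimizer = optim.Adam(
    q_net.parameters(),
    lr=5e-5,             # smaller learning rate
    weight_decay=1e-5,   # L2 regularization
)
```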
Archana David
malioboro

2 Answers


The accepted answer does not provide a good definition of over-fitting, which does exist and is a well-defined concept in reinforcement learning too. For example, the paper Quantifying Generalization in Reinforcement Learning focuses entirely on this issue. Let me give you more details.

Over-fitting in supervised learning

In supervised learning (SL), over-fitting is defined as the difference (or gap) between the performance of the ML model (such as a neural network) on the training dataset and its performance on the test dataset. If the model performs significantly better on the training dataset than on the test dataset, then the ML model has over-fitted the training data. Consequently, it has not generalized (well enough) to data other than the training data (i.e. the test data). The relationship between over-fitting and generalization should now be clearer.
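As a rough formalisation (the notation here is only illustrative), this gap can be written as

$$\text{gap}(f) = \frac{1}{|D_{\text{test}}|}\sum_{(x,y)\in D_{\text{test}}} L(f(x), y) \;-\; \frac{1}{|D_{\text{train}}|}\sum_{(x,y)\in D_{\text{train}}} L(f(x), y),$$

where $f$ is the trained model, $L$ the loss function, and $D_{\text{train}}, D_{\text{test}}$ the training and test datasets; a large positive gap indicates over-fitting.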

Over-fitting in reinforcement learning

In reinforcement learning (RL) (you can find a brief recap of what RL is here), you want to find an optimal policy or value function (from which the policy can be derived), which can be represented by a neural network (or another model). A policy $\pi$ is optimal in environment $E$ if it leads to the highest cumulative reward in the long run in that environment $E$, which is often mathematically modelled as a (partially or fully observable) Markov decision process.
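To make this concrete (the notation is only a sketch, assuming a discounted, episodic setting), an optimal policy maximises the expected return in $E$:

$$\pi^* \in \arg\max_{\pi} \; \mathbb{E}_{\pi, E}\!\left[\sum_{t=0}^{T} \gamma^{t} R_{t+1}\right],$$

where $R_{t+1}$ is the reward received at time step $t+1$ and $\gamma \in [0, 1]$ is the discount factor.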

In some cases, you are also interested in knowing whether your policy $\pi$ can be used in an environment different from the one it has been trained in, i.e. you're interested in knowing whether the knowledge acquired in the training environment $E$ can be transferred to a different (but typically related) environment (or task) $E'$. For example, you may only be able to train your policy in a simulated environment (because of resource/safety constraints), and you then want to transfer the learned policy to the real world. In those cases, you can define over-fitting in a similar way to how it is defined in SL. The only difference is that you may say the learned policy has over-fitted the training environment (rather than saying the ML model has over-fitted the training dataset); but, given that the environment provides the data, you could even say in RL that your policy has over-fitted the training data.
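By analogy with the SL definition above (again, only a sketch of the idea), the over-fitting of a policy $\pi$ trained in $E$ could be quantified as the gap in expected return between the two environments,

$$\text{gap}(\pi) = J_{E}(\pi) - J_{E'}(\pi), \qquad \text{where } J_{E}(\pi) = \mathbb{E}_{\pi, E}\!\left[\sum_{t=0}^{T} \gamma^{t} R_{t+1}\right].$$

A policy that achieves a much higher return in the training environment $E$ than in the test environment $E'$ has over-fitted $E$; this is roughly the kind of train/test comparison used in Quantifying Generalization in Reinforcement Learning.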

Catastrophic forgetting

There is also the issue of catastrophic forgetting (CF) in RL, i.e., while learning, your RL agent may forget what it has previously learned, and this can even happen in the same environment. Why am I talking about CF? Because what is happening in your case is probably CF, i.e., while learning, the agent performs well for a while, then its performance drops (although I have read a paper that strangely defines CF differently in RL). You could also say that over-fitting is happening in your case, but, if you are continuously training and the performance changes, then CF is probably what you need to investigate. So, you should reserve the term over-fitting in RL for cases where you're interested in transfer learning (i.e. the training and test environments do not coincide).

nbro

Overfitting refers to a model being stuck in a local minimum while trying to minimise a loss function. In reinforcement learning, the aim is to learn an optimal policy by maximising or minimising a non-stationary objective function that depends on the action policy, so overfitting is not exactly like in the supervised scenario, but you can definitely talk about sub-optimal policies.

If we think of a specific task like avoiding stationary objects, a simple sub-optimal policy would be to just stay still without moving at all, or to move in circles if the reward function was designed to penalise a lack of movement.

The way to prevent an agent from learning sub-optimal policies is to find a good compromise between exploitation, i.e. always selecting the next action based on the maximum expected reward, and exploration, i.e. selecting the next action at random regardless of the rewards. Here's a link to an introduction to the topic: Exploration and Exploitation in Reinforcement Learning
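For example, a common way to implement this compromise in practice is $\epsilon$-greedy action selection with a decaying $\epsilon$; a minimal sketch (the decay schedule and values are arbitrary):

```python
import random

def epsilon_greedy_action(q_values, epsilon, n_actions):
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                        # exploration
    return max(range(n_actions), key=lambda a: q_values[a])       # exploitation

# Example: decay epsilon from 1.0 towards 0.05 over training (arbitrary schedule).
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
q_values = [0.1, 0.5, -0.2, 0.0]   # dummy Q-values for illustration
for step in range(3):
    action = epsilon_greedy_action(q_values, epsilon, n_actions=len(q_values))
    epsilon = max(eps_min, epsilon * eps_decay)
```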

It is worth mentioning, though, that sometimes an agent can actually outsmart humans; some examples are reported in the paper The Surprising Creativity of Digital Evolution. I particularly like the story of the insect agent trained to learn to walk while minimising contact with the floor surface. The agent surprisingly managed to learn to walk without touching the ground at all. When the authors checked what was going on, they discovered that the insect had learned to flip itself over and then walk on its fake 'elbows' (Fig. 7 in the linked paper). I add this story just to point out that, most of the time, the design of the reward function is even more important than tuning exploration and exploitation.

Edoardo Guerriero