
I want to understand roughly what is happening in the neural network of an RL agent when it is exploring vs. exploiting. For example, are the network weights not being updated when the agent is exploiting? Or somehow being updated to a lesser degree?

If this question is algorithm-dependent, I'm mostly interested in PPO, but any insight is appreciated.

Vladimir Belik

1 Answer


Typically, the NN is trained the same way regardless of whether an action was chosen to explore or to exploit. Look at the objective (a.k.a. loss) function of any algorithm you're interested in and you'll probably find that it doesn't contain a term distinguishing exploration from exploitation.
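For example, PPO's clipped surrogate objective only sees the sampled actions' old and new log-probabilities and their advantage estimates; nothing in it records whether a given action happened to be exploratory. Here is a minimal sketch (assuming PyTorch; the function and argument names are just illustrative):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # PPO's clipped surrogate loss. Every sampled transition is treated
    # identically -- there is no flag marking an action as "exploratory".
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # negated because we maximize the surrogate
```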

Instead, the exploration vs. exploitation choice, and the tradeoff between them, is about how the data used to train the neural network are generated.
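To make that concrete, here is a sketch of epsilon-greedy action selection as used by value-based methods like DQN (PPO instead explores by sampling from its stochastic policy, but the point is the same): the explore/exploit decision only determines which transitions end up in the training batch, not how the update that consumes them is computed.

```python
import random

def select_action(q_values, epsilon):
    # The explore/exploit coin flip happens here, at data-collection time.
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit: greedy action

# The transition produced by either branch is stored and later fed into the
# exact same loss computation; the update rule never sees which branch ran.
```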

So, in a rough sense, what's happening to the NN is that during exploration it is being trained on novel data (not always, but more likely), while during exploitation it is being trained on well-worn paths.

Also, even though the gradient formula is the same, the gradient may be smaller and the weights may change less during exploitation: the loss is typically lower when the agent chooses the action it currently believes is best, and with some loss functions (like MSE, but not L1 loss) a larger loss produces a larger gradient.
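As a small illustration of that last point (assuming PyTorch, with made-up scalar errors): the MSE gradient grows with the size of the error, while the L1 gradient always has magnitude 1.

```python
import torch

for err in (0.1, 1.0, 10.0):
    pred = torch.tensor(0.0, requires_grad=True)
    target = torch.tensor(err)

    mse = (pred - target) ** 2        # d(mse)/d(pred) = 2 * (pred - target)
    mse.backward()
    mse_grad = pred.grad.item()

    pred.grad = None                  # reset before the second backward pass
    l1 = (pred - target).abs()        # d(l1)/d(pred) = sign(pred - target)
    l1.backward()
    l1_grad = pred.grad.item()

    print(f"error {err:>4}: MSE grad {mse_grad:+.1f}, L1 grad {l1_grad:+.1f}")
```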

Lee Reeves