
I ask because if RL is indeed an MDP, then there should be absolutely no problem with training an agent on any available episode roll-out data, right? Because an MDP implies that, for any state S, the optimal action to take depends entirely on that state.

So while I can understand the argument from prioritized experience replay that some data is more valuable for learning than other data, I cannot understand how old data would be harmful.

Further Clarification:

The emphasis on an MDP is perhaps misleading. I don't mean to imply that optimal control should necessarily depend on previous states... Similarly, the emphasis on off-policy vs. on-policy is perhaps better placed on online vs. offline RL.

Performance Gap:

This [offline RL] setting is harder as it removes the ability for agents to explore the environment and collect additional feedback.

Consider a simple example with Q-table learning. It seems to me that online RL is analogous to deleting perfectly valid entries from the Q-table in order to reap better performance. That looks like a paradox to me, and that's what confuses me.

profPlum

1 Answer


there should be absolutely no problem with training an agent on any available episode roll-out data. That is because an MDP implies that for any state S, the optimal action to take is entirely dependent on the state.

The desired end-state of the trained model is that it can identify the optimal action. When comparing reinforcement learning (RL) methods, you are looking at how they use data from experience. This data does not directly identify optimal actions, and all RL control approaches perform significant processing of the raw data in order to learn about the environment and make this identification.

Any rollout may contain useful data that can be learned from for control, but it requires processing in order to drive a meaningful policy. When comparing on-policy and off-policy, you are looking at two different approaches to process this data.
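To make that distinction concrete, here is a minimal sketch (tabular setting; the names, sizes, and step sizes are illustrative, not from the question) of how an on-policy update (SARSA) and an off-policy update (Q-learning) would process the same logged transition:

```python
# Illustrative sketch only: how the same logged transition (s, a, r, s_next, a_next)
# is turned into an update target by an on-policy vs. an off-policy method.
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy (SARSA): the target uses a_next, the action the behaviour policy
    # actually took, so it only estimates the value of that behaviour policy.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy (Q-learning): the target takes a max over actions, so it estimates
    # the greedy target policy regardless of which policy generated the data.
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
```

The on-policy target is tied to the action the data-collecting policy chose next; the off-policy target is not, which is exactly what lets it make use of data generated by other policies.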

The differences between on-policy and off-policy methods have to do with the effectiveness of the learning algorithm in particular scenarios; they are not related to using MDPs as a framework. Instead, the existence of the two approaches (and other ways to solve reinforcement learning problems) is a logical consequence of the MDP framework.

I cannot understand how old data would be harmful.

When learning for optimal control, the policy must change in order to improve. Data collected under old policies has a different distribution, and different results, from the data expected under the current policy. You can often adjust for this and still learn something (which may or may not be useful for the rest of the task) - that is exactly what off-policy learning does.
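One standard form of that adjustment is importance sampling. A minimal sketch, assuming a one-step return and hypothetical action probabilities (a real agent computes these ratios from its own policy representations):

```python
# Illustrative importance-sampling correction for off-policy data.
def is_weighted_update(value, ret, pi_target_prob, pi_behaviour_prob, alpha=0.1):
    # A ratio > 1 up-weights data the current target policy would produce more
    # often than the old behaviour policy did; a ratio < 1 down-weights it.
    rho = pi_target_prob / pi_behaviour_prob
    return value + alpha * rho * (ret - value)

# Example: the old behaviour policy took this action 50% of the time,
# but the current target policy would take it 80% of the time.
v = is_weighted_update(value=0.0, ret=1.0, pi_target_prob=0.8, pi_behaviour_prob=0.5)
```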

Any data can be harmful in machine learning if your learning model makes incorrect assumptions about it. That is the case when an on-policy learning algorithm attempts to use data collected under a different policy. This also applies to many other types of machine learning - if an old data set differs from your target distribution, the trained model will be biased towards the old data and make less accurate predictions on new examples.

if optimal control were indeed an MDP, then there should be absolutely no problem with training an agent on any available episode roll-out data

This is not true in general. Even off-policy algorithms need there to be some useful relationship to the current target policy. Data from completely random rollouts in complex problems is rarely useful for learning, except as an initial bootstrap when nothing is known about the environment.

Assuming there is some useful data that could be extracted, there is no problem training an agent, provided that the agent adjusts correctly for the difference in policy. That requires off-policy agents, because on-policy agents do not have the ability to make that adjustment.
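As a rough illustration of that last point (the transitions, sizes, and loop length below are made up, not taken from the question): an off-policy update rule such as Q-learning can keep consuming old transitions from a replay buffer, because its target does not depend on which policy chose the next action.

```python
# Illustrative sketch: Q-learning consuming transitions logged by older policies.
import random
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

# Pretend these (s, a, r, s_next) tuples were collected under earlier, different policies.
replay_buffer = [(0, 1, 0.0, 1), (1, 0, 1.0, 2), (2, 1, 0.0, 3)]

for _ in range(1000):
    s, a, r, s_next = random.choice(replay_buffer)
    # Off-policy target: independent of the behaviour policy that logged the data.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```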

Neil Slater