
I’m developing an RL agent. This isn’t exactly my case, but I’ll use a metaphor to explain my situation (gotta protect my research’s originality, haha).

Let’s say I’m working on a self-driving car algorithm using RL (not a vehicle engineer, though). In the real world, drivers might override the self-driving mode. For example, if my agent is trained to minimize travel time, it might end up compromising the driver’s comfort with sudden acceleration or braking. As a result, drivers may intervene and override the system.

If my environment consists of (i) a car and (ii) a driver who can intervene, my RL agent might struggle to explore the full action space during training due to these overrides. I expect that, eventually, the agent will learn to interact with the driver and optimize its policy to maximize rewards, but… that could take a really long time.
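To make the setup concrete, here is a rough sketch of how I picture the override problem, using a gymnasium-style wrapper with a made-up "comfort" threshold (the base environment and threshold are just stand-ins for my real setup):

```python
# Rough sketch: a simulated driver overrides any action that exceeds a
# hypothetical comfort threshold, so the agent never sees the outcome of
# its more extreme actions. Threshold and base env are illustrative only.
import gymnasium as gym
import numpy as np


class DriverOverrideWrapper(gym.Wrapper):
    def __init__(self, env, comfort_limit=0.5):
        super().__init__(env)
        self.comfort_limit = comfort_limit  # assumed max action magnitude the driver tolerates

    def step(self, action):
        action = np.asarray(action, dtype=np.float32)
        overridden = np.abs(action).max() > self.comfort_limit
        if overridden:
            # Driver intervenes: the executed action is clipped, so exploration
            # beyond the comfort region is effectively hidden from the agent.
            action = np.clip(action, -self.comfort_limit, self.comfort_limit)
        obs, reward, terminated, truncated, info = self.env.step(action)
        info["driver_override"] = overridden
        return obs, reward, terminated, truncated, info


# Usage: any sampled action with |a| > comfort_limit gets replaced before execution.
env = DriverOverrideWrapper(gym.make("MountainCarContinuous-v0"), comfort_limit=0.5)
obs, _ = env.reset(seed=0)
obs, r, term, trunc, info = env.step(env.action_space.sample())
print(info["driver_override"])
```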

I was wondering—are there any established approaches to deal with situations like this? Have there been discussions on handling cases where external interventions limit an RL agent’s exploration?

Jay

1 Answer


If your trained agent needs constant manual interventions, it defeats the purpose of RL. One strategy is simply to treat external interventions such as driver overrides as safety constraints for the agent. Algorithms like Constrained Policy Optimization (CPO) and other safe RL methods explicitly incorporate such constraints into the optimization process. By penalizing behaviors that lead to interventions, or even preventing them during training, the agent learns to operate within “safe” bounds without being constantly overridden. You can further read Achiam et al. (2017), "Constrained Policy Optimization", and García and Fernández (2015), "A Comprehensive Survey on Safe Reinforcement Learning".
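To give a flavour of the constrained view, here is a minimal sketch, not CPO itself but a simple Lagrangian-style penalty: each override is treated as a constraint cost and the penalty weight is adapted so the long-run override rate stays under a budget. The wrapper name, budget, and step size are illustrative assumptions.

```python
# Simplified Lagrangian-style stand-in for constrained RL: penalize the reward
# whenever the driver overrides, and raise the penalty weight (dual variable)
# if the override rate exceeds the allowed budget. Assumes the environment
# reports overrides via info["driver_override"], as sketched in the question.
import gymnasium as gym


class InterventionPenaltyWrapper(gym.Wrapper):
    def __init__(self, env, cost_budget=0.05, lr_lambda=0.01):
        super().__init__(env)
        self.cost_budget = cost_budget  # allowed average override rate (assumed)
        self.lr_lambda = lr_lambda      # step size for the dual variable
        self.lam = 1.0                  # Lagrange multiplier (penalty weight)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        cost = float(info.get("driver_override", False))
        # Penalized reward that the policy actually optimizes.
        shaped_reward = reward - self.lam * cost
        # Dual ascent: increase the penalty if overrides exceed the budget.
        self.lam = max(0.0, self.lam + self.lr_lambda * (cost - self.cost_budget))
        info["constraint_cost"] = cost
        return obs, shaped_reward, terminated, truncated, info
```

Proper CPO enforces the constraint inside the policy update rather than through reward shaping, but this captures the basic idea of trading off task reward against an intervention budget.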

Interventions implicitly encode driver preferences, such as comfort, that the agent hasn’t learned, so you can also use Inverse Reinforcement Learning (IRL) on the intervention data to infer the human’s latent reward function (e.g., penalizing jerky motions) and retrain the agent with that reward. You can further refer to Hadfield-Menell et al. (2016), "Cooperative Inverse Reinforcement Learning", and Saunders et al. (2017), "Trial without Error: Towards Safe Reinforcement Learning via Human Intervention".
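A heavily simplified sketch of this IRL-flavoured idea: use logged interventions as labels, learn which action features the driver dislikes, and reuse the learned score as a penalty term when retraining. The feature names and the toy data below are assumptions, and this is a stand-in for proper IRL/CIRL rather than the method from those papers.

```python
# Learn P(intervention | action features) from logged driver overrides with a
# small logistic regression, then use it as an inferred comfort penalty.
import numpy as np


def fit_intervention_model(features, intervened, lr=0.1, steps=2000):
    """Fit logistic regression predicting whether the driver overrides."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad_w = features.T @ (p - intervened) / len(intervened)
        grad_b = np.mean(p - intervened)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def inferred_penalty(action_features, w, b):
    """Penalty proportional to the predicted probability of an override."""
    return 1.0 / (1.0 + np.exp(-(action_features @ w + b)))


# Hypothetical logged data: features = [|acceleration|, |jerk|], label = overridden?
X = np.array([[0.1, 0.0], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
y = np.array([0.0, 1.0, 0.0, 1.0])
w, b = fit_intervention_model(X, y)
# Retraining reward: r_total = r_env - beta * inferred_penalty(phi(a), w, b)
```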

Finally, you can view the agent and the human as two agents with hierarchical objectives and decompose the policy into a high-level “planner” and a low-level “controller”. The high-level policy proposes maneuvers while the low-level controller ensures smooth execution within safe parameters. Such architectures are common in robotics and autonomous driving research and can help minimize the situations that trigger a human override. You can further read Merel et al. (2019), "Hierarchical Visuomotor Control of Humanoids", and Jain et al. (2019), "Hierarchical Reinforcement Learning for Quadruped Locomotion".
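A minimal sketch of that decomposition: a high-level policy picks a maneuver (here just a target speed), and a fixed low-level controller tracks it with bounded acceleration so execution stays inside the comfort envelope. The planner, controller gains, and limits are illustrative assumptions, not a specific architecture from those papers.

```python
# High-level planner proposes a target speed; low-level controller tracks it
# smoothly, keeping acceleration within an assumed comfort limit.
import numpy as np


class SmoothSpeedController:
    """Low-level controller: track a target speed with bounded acceleration."""

    def __init__(self, max_accel=0.3, gain=0.5):
        self.max_accel = max_accel  # assumed comfort limit on acceleration
        self.gain = gain            # proportional tracking gain

    def act(self, current_speed, target_speed):
        accel = self.gain * (target_speed - current_speed)
        return float(np.clip(accel, -self.max_accel, self.max_accel))


def high_level_planner(observation):
    """Placeholder for a learned high-level policy choosing a maneuver."""
    return 25.0  # hypothetical target speed in m/s


controller = SmoothSpeedController()
speed = 20.0
for _ in range(5):
    target = high_level_planner(observation=None)
    accel = controller.act(speed, target)  # smooth, comfort-respecting execution
    speed += accel                         # simplistic one-step integration
    print(round(speed, 2))
```

Because the low-level controller enforces smoothness by construction, the high-level RL policy can explore maneuvers without producing the abrupt accelerations that provoke overrides.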

cinch