I have a reinforcement learning environment with sparse rewards. Current methods such as PPO and DQN both fail to learn a sufficiently good policy. I may have a way to find satisfactory trajectories without using a neural network. Would it be possible to put those trajectories in the replay buffer of a DQN or PPO and then update the neural network a few times? This would initialise the neural network roughly in the right direction and would be a bit like imitation learning. Does anyone know if this is a viable idea, or if there is another way of doing this?
1 Answer
PPO is an on-policy algorithm, so you must use trajectories generated by the current policy.
DQN is an off-policy algorithm, so you could add these trajectories to the replay buffer, but you also need "bad" trajectories so that you cover at least part of the state-action space. Of course you can't cover the full space, but the buffer needs both good and bad examples.
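To make that concrete, here is a minimal sketch (plain Python, no particular RL library assumed) of pre-filling a DQN-style replay buffer with demonstration transitions and then continuing to add transitions from the exploring policy. The `load_demo_trajectories` function is a hypothetical stand-in for however you obtain your non-neural "good" trajectories.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)

def add_transition(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))

def load_demo_trajectories():
    # Hypothetical placeholder: return a list of trajectories, each a list of
    # (state, action, reward, next_state, done) tuples from your non-neural method.
    return []

# 1) Seed the buffer with demonstration transitions (the "good" examples).
for trajectory in load_demo_trajectories():
    for (s, a, r, s_next, done) in trajectory:
        add_transition(s, a, r, s_next, done)

# 2) During training, keep adding transitions from the current epsilon-greedy
#    policy so the buffer also contains "bad" examples, then sample minibatches
#    uniformly as usual.
def sample_batch(batch_size=32):
    return random.sample(replay_buffer, batch_size)
```

Note that plain DQN will eventually overwrite the demonstrations as the buffer fills; if you want to keep them around permanently, approaches like DQfD keep demonstrations in a separate, protected part of the buffer.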
If you have a way of generating "good" trajectories, then maybe try some imitation learning algorithms like behaviour cloning or DAgger.
You could also try using imitation learning only for pre-training, and once your policy is good enough to generate "reasonable" trajectories, you can train further with PPO.
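As a sketch of that pre-training step, here is behaviour cloning in PyTorch (assumed dependency): train a policy network with cross-entropy on (state, action) pairs from the demonstration trajectories, then reuse those weights to initialise the actor of your PPO implementation. The dimensions and `demo_dataset` below are placeholders for your environment.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 8, 4                      # placeholders for your environment

policy = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, N_ACTIONS),                    # logits over discrete actions
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# demo_dataset: list of (state, action) pairs taken from your "good" trajectories.
# Dummy random data here, purely for illustration.
demo_dataset = [(torch.randn(STATE_DIM), int(torch.randint(N_ACTIONS, (1,))))
                for _ in range(256)]

states = torch.stack([s for s, _ in demo_dataset])
actions = torch.tensor([a for _, a in demo_dataset])

# Behaviour cloning: supervised learning of the demonstrated actions.
for epoch in range(50):
    optimizer.zero_grad()
    logits = policy(states)
    loss = loss_fn(logits, actions)
    loss.backward()
    optimizer.step()

# Afterwards, copy these weights into the PPO actor (same architecture) and
# continue training on-policy with PPO.
torch.save(policy.state_dict(), "bc_pretrained_policy.pt")
```

The key point is that the pre-trained weights only serve as an initialisation; once PPO starts, it collects its own on-policy trajectories as usual.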