
I have a reinforcement learning environment with sparse rewards. Current methods such as PPO and DQN both fail to learn a sufficiently good policy. I may have a way to find satisfactory trajectories without using a neural network. Could I then put those trajectories in the replay buffer of a DQN or PPO and update the neural network a few times? This would initialise the neural network roughly in the right direction and would be a bit like imitation learning. Does anyone know if this is a viable idea, or if there is another way of doing this?


1 Answer


PPO is an on-policy algorithm, so it can only learn from trajectories generated by the current policy; you can't simply feed it externally collected trajectories.

DQN is an off-policy algorithm, so you could add these trajectories to the replay buffer, but you also need "bad" trajectories so that the buffer covers the state-action space reasonably well. Of course you can't cover the full space, but you need both good and bad examples.
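
A minimal sketch of that idea, in plain Python with no particular RL library assumed: seed the replay buffer with your externally generated trajectories before normal environment interaction starts, then let the agent keep adding its own transitions on top. The transition format `(s, a, r, s_next, done)` and the `demo_trajectories` variable are placeholders for whatever your non-neural method produces.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling; convert to list so random.sample stays O(k) per call.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer()

# demo_trajectories: list of trajectories, each a list of (s, a, r, s_next, done)
# transitions produced by your non-neural method (hypothetical variable).
for trajectory in demo_trajectories:
    for (s, a, r, s_next, done) in trajectory:
        buffer.push(s, a, r, s_next, done)

# From here on, train DQN as usual: the agent keeps appending its own
# (often "bad") transitions, so the buffer ends up containing both kinds of data.
```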

If you have a way of generating "good" trajectories, then maybe try some imitation learning algorithms like behaviour cloning or DAgger.
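
Behaviour cloning is just supervised learning on the demonstration (state, action) pairs. A short PyTorch sketch, assuming a discrete action space; `obs_dim`, `n_actions`, `demo_states` and `demo_actions` are placeholders you would replace with your own data:

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2  # placeholders: use your environment's dimensions

policy = nn.Sequential(
    nn.Linear(obs_dim, 128), nn.ReLU(),
    nn.Linear(128, n_actions),          # logits over discrete actions
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# demo_states: (N, obs_dim) float tensor of visited states
# demo_actions: (N,) long tensor of the actions taken in those states
dataset = torch.utils.data.TensorDataset(demo_states, demo_actions)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)

for epoch in range(20):
    for states, actions in loader:
        logits = policy(states)
        loss = loss_fn(logits, actions)   # imitate the demonstrated action
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```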

You could also try using imitation learning only for pre-training, and once your policy is good enough to generate "reasonable" trajectories, you can train further with PPO.
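
One hedged way to wire the two phases together is with stable-baselines3: pre-train the PPO actor by maximising the log-probability of the demonstrated actions, then call `learn()` as usual. This sketch relies on SB3's `ActorCriticPolicy` internals (`evaluate_actions`, `policy.optimizer`), which may differ between versions; `demo_states` and `demo_actions` are the same placeholder tensors as above, and CartPole stands in for your own sparse-reward environment.

```python
import gymnasium as gym
import torch
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")  # replace with your own sparse-reward environment
model = PPO("MlpPolicy", env, verbose=1)

# Phase 1: behaviour-cloning pre-training of the PPO policy by maximising
# the log-probability of the demonstrated actions under the current policy.
obs = torch.as_tensor(demo_states, dtype=torch.float32, device=model.device)
acts = torch.as_tensor(demo_actions, device=model.device)  # long tensor for discrete actions
for _ in range(200):  # a few pre-training epochs; tune for your problem
    _, log_prob, _ = model.policy.evaluate_actions(obs, acts)
    loss = -log_prob.mean()
    model.policy.optimizer.zero_grad()
    loss.backward()
    model.policy.optimizer.step()

# Phase 2: standard on-policy PPO fine-tuning from the pre-trained weights.
model.learn(total_timesteps=200_000)
```

Note that the value head is untouched by the pre-training, so the first PPO updates may be noisy until the critic catches up.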
