I have a reinforcement learning environment with sparse rewards. Current methods such as PPO and DQN both fail to learn a sufficiently good policy. I may have a way to find satisfactory trajectories without using a neural network. Would it be possible to put those trajectories in the replay buffer of a DQN or PPO and then update the neural network a few times? This would initialise the neural network roughly in the right direction and would be a bit like imitation learning. Does anyone know if this is a viable idea, or if there is another way of doing this?
1 Answer
PPO is an on-policy algorithm, so you must use trajectories generated by the current policy.
DQN is an off-policy algorithm, so you could add these trajectories to the replay buffer, but you also need "bad" trajectories so that you cover at least part of the state-action space. Of course you can't cover the full space, but the buffer needs both good and bad examples.
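To make that concrete, here is a minimal sketch (plain Python, no particular RL library assumed) of pre-filling a DQN-style replay buffer with demonstration transitions and then continuing to add transitions from the exploring policy. The `load_demo_trajectories` function is a hypothetical stand-in for however you obtain your non-neural "good" trajectories.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)

def add_transition(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))

def load_demo_trajectories():
    # Hypothetical placeholder: return a list of trajectories, each a list of
    # (state, action, reward, next_state, done) tuples from your non-neural method.
    return []

# 1) Seed the buffer with demonstration transitions (the "good" examples).
for trajectory in load_demo_trajectories():
    for (s, a, r, s_next, done) in trajectory:
        add_transition(s, a, r, s_next, done)

# 2) During training, keep adding transitions from the current epsilon-greedy
#    policy so the buffer also contains "bad" examples, then sample minibatches
#    uniformly as usual.
def sample_batch(batch_size=32):
    return random.sample(replay_buffer, batch_size)
```

Note that plain DQN will eventually overwrite the demonstrations as the buffer fills; if you want to keep them around permanently, approaches like DQfD keep demonstrations in a separate, protected part of the buffer.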
If you have a way of generating "good" trajectories, then maybe try some imitation learning algorithms like behaviour cloning or DAgger.
You could also try using imitation learning only for pre-training, and once your policy is good enough to generate "reasonable" trajectories, you can train further with PPO.
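As a sketch of that pre-training step, here is behaviour cloning in PyTorch (assumed dependency): train a policy network with cross-entropy on (state, action) pairs from the demonstration trajectories, then reuse those weights to initialise the actor of your PPO implementation. The dimensions and `demo_dataset` below are placeholders for your environment.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 8, 4                      # placeholders for your environment

policy = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, N_ACTIONS),                    # logits over discrete actions
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# demo_dataset: list of (state, action) pairs taken from your "good" trajectories.
# Dummy random data here, purely for illustration.
demo_dataset = [(torch.randn(STATE_DIM), int(torch.randint(N_ACTIONS, (1,))))
                for _ in range(256)]

states = torch.stack([s for s, _ in demo_dataset])
actions = torch.tensor([a for _, a in demo_dataset])

# Behaviour cloning: supervised learning of the demonstrated actions.
for epoch in range(50):
    optimizer.zero_grad()
    logits = policy(states)
    loss = loss_fn(logits, actions)
    loss.backward()
    optimizer.step()

# Afterwards, copy these weights into the PPO actor (same architecture) and
# continue training on-policy with PPO.
torch.save(policy.state_dict(), "bc_pretrained_policy.pt")
```

The key point is that the pre-trained weights only serve as an initialisation; once PPO starts, it collects its own on-policy trajectories as usual.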