
I am working on a scheduling problem that has inherent randomness. The action and state spaces have dimensions 1 and 5, respectively.

I am using DDPG, but it seems extremely unstable, and so far it isn't showing much learning. I've tried to

  1. adjust the learning rate,
  2. clip the gradients,
  3. change the size of the replay buffer,
  4. try different neural network architectures, with both SGD and Adam,
  5. change the $\tau$ for the soft-update (sketched below).
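
For clarity, by the soft update I mean the usual Polyak averaging of the target-network parameters. Here is a minimal PyTorch-style sketch of the update that $\tau$ controls (the function and variable names are just illustrative, not my actual code):

```python
import torch

def soft_update(target_net: torch.nn.Module, online_net: torch.nn.Module, tau: float) -> None:
    """Polyak-average the online parameters into the target network."""
    with torch.no_grad():
        for target_param, online_param in zip(target_net.parameters(), online_net.parameters()):
            # theta_target <- tau * theta_online + (1 - tau) * theta_target
            target_param.mul_(1.0 - tau).add_(tau * online_param)
```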

So, I'd like to know what people's experience with this algorithm is, both for the environments it was tested on in the paper and for other environments. What hyperparameter values worked for you? What else did you do? How cumbersome was the fine-tuning?

I don't think my implementation is incorrect, because I pretty much replicated this, and every other implementation I found did exactly the same.

(Also, I am not sure this is necessarily the best website to post this kind of question, but I decided to give it a shot.)


1 Answer


Below are some tweaks that helped me accelerate the training of DDPG on a Reacher-like environment:

  • Reducing the neural network size, compared to the original paper. Instead of:

2 hidden layers with 400 and 300 units respectively

I used 128 units for both hidden layers. I see that your implementation uses 256; maybe you could try reducing this (see the sketch after this list).

  • As suggested in the paper, I added batch normalization:

... to manually scale the features so they are in similar ranges across environments and units. We address this issue by adapting a recent technique from deep learning called batch normalization (Ioffe & Szegedy, 2015). This technique normalizes each dimension across the samples in a minibatch to have unit mean and variance.

The implementation you used does not seem to include this; the sketch after this list shows one way to add it.

  • Reducing the value of $\sigma$, which is a parameter of the Ornstein-Uhlenbeck process used to enable exploration. Originally it was $0.2$; I used $0.05$. (I can't find where this parameter is set in your code; the noise process is also sketched below.)

I am not entirely sure whether these tweaks will help in your environment, but they should give you some ideas.
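
To make the first two points concrete, here is a minimal PyTorch-style sketch of an actor with two 128-unit hidden layers and batch normalization. The state and action dimensions are taken from your question; everything else (class name, activations, action bound) is illustrative rather than my actual code:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: two hidden layers of 128 units, each followed by batch norm."""

    def __init__(self, state_dim: int = 5, action_dim: int = 1, action_bound: float = 1.0):
        super().__init__()
        self.action_bound = action_bound
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.BatchNorm1d(128),   # normalizes each feature across the minibatch
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Tanh(),             # squash to [-1, 1], rescaled below
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.action_bound * self.net(state)
```

One caveat with `BatchNorm1d`: put the actor in `eval()` mode when selecting a single action, otherwise normalizing a batch of size 1 in training mode will raise an error.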
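
For the exploration noise, here is a sketch of the Ornstein-Uhlenbeck process with the reduced $\sigma$. The $\theta$ and $dt$ values are the ones commonly paired with DDPG, not something taken from your code:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process for temporally correlated exploration noise."""

    def __init__(self, action_dim: int = 1, mu: float = 0.0,
                 theta: float = 0.15, sigma: float = 0.05, dt: float = 1e-2):
        self.mu = mu * np.ones(action_dim)
        self.theta = theta
        self.sigma = sigma              # reduced from the paper's 0.2
        self.dt = dt
        self.reset()

    def reset(self) -> None:
        self.state = self.mu.copy()

    def sample(self) -> np.ndarray:
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.state) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape))
        self.state = self.state + dx
        return self.state
```

The noise from `sample()` is added to the deterministic action at each step; decaying `sigma` over training is another common variant.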

PS: Here is a link to the code I followed to build DDPG and here is a plot of rewards per episode.
