
I'm trying to build a DQN to replicate the DeepMind results. I'm starting with a simple DQN for the moment, but it isn't learning properly: after 5000+ episodes, it can't get more than 9-10 points. Each episode has a limit of 5000 steps, but it never reaches more than 500-700. I think the problem is in the replay function, which is:

def replay(self, replay_batch_size, replay_batcher):
    j = 0
    k = 0
    replay_action = []
    replay_state = []
    replay_next_state = []
    replay_reward = []
    replay_superbatch = []

    if len(memory) < replay_batch_size:
        replay_batch = random.sample(memory, len(memory))
        replay_batch = np.asarray(replay_batch)
        replay_state_batch, replay_next_state_batch, reward_batch, replay_action_batch = replay_batcher(replay_batch)
    else:
        replay_batch = random.sample(memory, replay_batch_size)
        replay_batch = np.asarray(replay_batch)
        replay_state_batch, replay_next_state_batch, reward_batch, replay_action_batch = replay_batcher(replay_batch)

    for j in range(len(replay_batch) - len(replay_batch) % 4):

        if k <= 4:
            k = k + 1
            replay_state.append(replay_state_batch[j])
            replay_next_state.append(replay_next_state_batch[j])
            replay_reward.append(reward_batch[j])
            replay_action.append(replay_action_batch[j])

        if k >= 4:
            k = 0
            replay_state = np.asarray(replay_state)
            replay_state.shape = shape
            replay_next_state = np.asarray(replay_next_state)
            replay_next_state.shape = shape
            replay_superbatch.append((replay_state, replay_next_state, replay_reward, replay_action))

            replay_state = []
            replay_next_state = []
            replay_reward = []
            replay_action = []

    states, target_future, targets_future, fit_batch = [], [], [], []

    for state_replay, next_state_replay, reward_replay, action_replay in replay_superbatch:

        target = reward_replay
        if not done:
            target = (reward_replay + self.gamma * np.amax(self.model.predict(next_state_replay)[0]))

        target_future = self.model.predict(state_replay)

        target_future[0][action_replay] = target
        states.append(state_replay[0])
        targets_future.append(target_future[0])
        fit_batch.append((states, targets_future))

    history = self.model.fit(np.asarray(states), np.array(targets_future), epochs=1, verbose=0)

    loss = history.history['loss'][0]

    if self.exploration_rate > self.exploration_rate_min:
        self.exploration_rate -= (self.exploration_rate_decay / 1000000)

    return loss

What I'm doing is taking 4 experiences (states), concatenating them, and feeding them into the CNN with shape (1, 210, 160, 4). Am I doing something wrong? If I implement a DDQN (double deep Q-network), should I obtain results similar to the DeepMind Breakout video? Also, I'm using the Breakout-v0 environment from OpenAI Gym.
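Roughly, the stacking I have in mind looks like this (a minimal sketch with a hypothetical stack_frames helper, assuming each frame has already been converted to a single-channel (210, 160) array; my real code builds the batch inside replay()):

import numpy as np

def stack_frames(frames):
    """Stack 4 consecutive single-channel frames of shape (210, 160) into a (1, 210, 160, 4) batch."""
    stacked = np.stack(frames, axis=-1)  # -> (210, 160, 4)
    return stacked[np.newaxis, ...]      # -> (1, 210, 160, 4), a batch of one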

Edit

Am I doing this properly? I implemented an identical CNN; I update the target every 100 steps by copying the weights from the model CNN to the target_model CNN. Should this improve the learning? In any case, I'm getting a low loss.

for state_replay, next_state_replay, reward_replay, action_replay in replay_superbatch:
    target = reward_replay
    if not done:
        target = (reward_replay + self.gamma * np.amax(self.model.predict(next_state_replay)[0]))

    if steps % 100 == 0:
        target_future = self.target_model.predict(state_replay)

        target_future[0][action_replay] = target
        states.append(state_replay[0])
        targets_future.append(target_future[0])
        fit_batch.append((states, targets_future))
        agent.update_net()

history = self.model.fit(np.asarray(states), np.array(targets_future), epochs=1, verbose=0)

loss = history.history['loss'][0]
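(For completeness, agent.update_net() is just the weight copy I mentioned above; roughly this sketch, assuming Keras models:)

def update_net(self):
    # Copy the online network's weights into the target network.
    self.target_model.set_weights(self.model.get_weights())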

Edit 2

So, as far as I understand, this code should work. Am I right?

if not done:
    target = (reward_replay + self.gamma * np.amax(self.target_model.predict(next_state_replay)[0]))
    target.shape = (1, 4)

    target[0][action_replay] = target
    target_future = target
    states.append(state_replay[0])
    targets_future.append(target_future[0])
    fit_batch.append((states, targets_future))

if step_counter % 1000 == 0:
    target_future = self.target_model.predict(state_replay)

    target_future[0][action_replay] = target
    states.append(state_replay[0])
    targets_future.append(target_future[0])
    fit_batch.append((states, targets_future))
    agent.update_net()

history = self.model.fit(np.asarray(states), np.array(targets_future), epochs=1, verbose=0)


1 Answer


It looks like on each step, you're calling both self.model.predict and self.model.fit. If you do this, you're going to run into stability problems, since your learning target is moving as you train.

The way the DQN paper gets around this problem is by using 2 Q-networks, $Q$ and $\hat{Q}$, where $\hat{Q}$ is called the target network. The target network's parameters are frozen, and its outputs are used to compute the learning targets for $Q$ (targets_future in your code). Every $C$ training steps (where $C$ is a hyperparameter), the target network $\hat{Q}$ is updated with the weights of $Q$. See Algorithm 1 on Page 7 of the DQN paper for the details of this swap.
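Concretely, a minimal Keras sketch of the target-network mechanics (hypothetical helper names, not your full agent; model is the online network $Q$, target_model is $\hat{Q}$) could look like this:

import numpy as np
from tensorflow import keras

def make_target_network(model):
    """Create a copy of the online network to serve as the frozen target network."""
    target_model = keras.models.clone_model(model)
    target_model.set_weights(model.get_weights())
    return target_model

def train_step(model, target_model, state, action, reward, next_state, done, gamma):
    """One DQN update: the target is computed from target_model, gradients only update model."""
    target = model.predict(state)  # current Q-value estimates for all actions
    if done:
        target[0][action] = reward
    else:
        # Bootstrap from the frozen target network, not the online network.
        target[0][action] = reward + gamma * np.amax(target_model.predict(next_state)[0])
    model.fit(state, target, epochs=1, verbose=0)

def sync_target_network(model, target_model):
    """Copy the online weights into the target network."""
    target_model.set_weights(model.get_weights())

In the training loop, you would call sync_target_network(model, target_model) every $C$ steps (e.g. if step % C == 0: ...), and use target_model only for computing targets, never for fitting.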