
I have a question regarding the loss function used with target networks and current (online) networks. I understand the action value function. What I am unsure about is why we seek to minimise the difference between the Q value for the next state predicted by the target network and the Q value for the current state predicted by the local (online) network. The Nature paper by Mnih et al. is well written; however, I am not getting from it the purpose of the above. Here is the training portion of a script I am running:

for state, action, reward, next_state, done in minibatch:
    # current Q-value estimates for this state from the online network
    target_qVal = self.model.predict(state)

    if done:
        # terminal transition: the target is just the immediate reward
        target_qVal[0][action] = reward
    else:
        # bootstrapped target: reward plus discounted max Q value for the
        # next state, predicted by the target network
        pred = self.target_model.predict(next_state)[0]
        target_qVal[0][action] = reward + self.gamma * np.amax(pred)

    # fit the online network towards the updated target for this sample
    self.model.fit(np.array(state),
                   np.array(target_qVal),
                   batch_size=batch_size,
                   verbose=0,
                   shuffle=False,
                   epochs=1)

I understand that the expected return is the immediate reward plus the discounted sum of future rewards collected from the next state $s'$ onwards when following a given policy (correct me if I'm wrong in my understanding).
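To make my understanding explicit, this is the standard textbook definition of the return and action value I have in mind (not anything specific to my script):

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = r_{t+1} + \gamma G_{t+1}, \qquad Q^\pi(s,a) = \mathbb{E}_\pi\left[G_t \mid s_t = s, a_t = a\right].$$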

My fundamental misunderstanding is the loss equation:

$$L = \left[ r + \gamma \max_{a'} Q(s',a'; \theta') - Q(s,a; \theta) \right]^2,$$

where $\theta'$ and $\theta$ are the weights of the target and online neural networks, respectively.

Why do we aim to minimise the difference between a target built from the Q value of the next state in the target model and the Q value of the current state in the online model?

A bonus question: in order to collect $Q(s,a)$ values for dimensionality reduction (as in the t-SNE plot in Mnih et al.), would I simply collect the target_qVal[0] values during training, appending them to a list after each step to accumulate the Q values over time?
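For reference, this is roughly the collection step I have in mind (a minimal sketch; sklearn.manifold.TSNE and the buffer name q_history are my own choices, not anything from the paper):

import numpy as np
from sklearn.manifold import TSNE

q_history = []  # accumulates one Q-value vector (one entry per action) per step

# inside the training loop, after computing target_qVal for a sample:
#     q_history.append(target_qVal[0].copy())

# after training, embed the collected Q vectors in 2D with t-SNE
def embed_q_values(q_history):
    q_matrix = np.array(q_history)              # shape: (num_steps, num_actions)
    tsne = TSNE(n_components=2, perplexity=30)  # illustrative settings
    return tsne.fit_transform(q_matrix)         # shape: (num_steps, 2)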


1 Answer


The loss function is designed to approximate the Bellman optimality equation for $Q^*(s,a)$. Under an optimal policy $\pi^*$, $Q^*(s,a)$ satisfies $$Q^*(s,a) = r(s) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q^*(s',a').$$

At convergence, the highest value I can get by taking action $a$ in state $s$ is equal to the immediate reward plus the discounted expectation of the maximum $Q$ value at the next state.
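As a tiny illustration of this fixed point, here is a sketch of tabular Q-value iteration on a made-up two-state, two-action MDP (the transition probabilities and rewards are invented purely for the example):

import numpy as np

# made-up MDP: P[s, a, s'] is the transition model, r[s] the reward
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
r = np.array([1.0, 0.0])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    # Bellman optimality backup: Q(s,a) = r(s) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
    Q = r[:, None] + gamma * P @ Q.max(axis=1)

# at the fixed point the backup leaves Q unchanged, i.e. Q satisfies the equation above
assert np.allclose(Q, r[:, None] + gamma * P @ Q.max(axis=1))
print(Q)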

You can see in the loss function that the DQN tries to make $Q(s,a;\theta)$ closely approximate the right-hand side of the equation above: the term $r + \gamma \max_{a'} Q(s',a';\theta')$ is a single-sample estimate of that right-hand side, so minimising its difference from $Q(s,a;\theta)$ drives the online network towards the Bellman optimality fixed point.
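Concretely, the target in your training loop is exactly such a single-sample estimate. A minimal NumPy sketch of the quantity being minimised for one transition (function and variable names are placeholders, not from your script):

import numpy as np

def squared_td_error(online_q, target_q, transition, gamma=0.99):
    # online_q / target_q: callables mapping a state to a vector of Q values,
    # standing in for self.model.predict and self.target_model.predict
    state, action, reward, next_state, done = transition

    # bootstrapped target: r + gamma * max_a' Q(s', a'; theta'), or just r if terminal
    target = reward if done else reward + gamma * np.max(target_q(next_state))

    # online network's current estimate Q(s, a; theta)
    prediction = online_q(state)[action]

    # minimising this pushes Q(s, a; theta) towards the bootstrapped target
    return (target - prediction) ** 2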

On a side note, in the model-based RL setting, the Bellman optimality equation for $Q^*$ is expressed as $$Q^*(s,a) = r(s) + \gamma \sum_{s'} P(s' \mid s,a) V^*(s').$$ $V^*(s')$ is used in the model-based case because, by definition, it is the highest value attainable at state $s'$ when following $\pi^*$. In the model-free setting, $V^*$ is replaced by $Q^*$ because computing $V^*$ is not useful for model-free control without a transition model $P(s' \mid s,a)$.
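The two forms are connected by the standard identity (a textbook relation, not specific to DQN), which can be substituted into the model-based equation to recover the model-free one:

$$V^*(s') = \max_{a'} Q^*(s',a').$$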
