
I was reading the NFSP paper from D. Silver, and I'm somewhat confused by the algorithm:

[image: the NFSP algorithm pseudocode from the paper]

In particular, given that we sample an action according to the best response ($\sigma = \epsilon\text{-greedy}(Q)$), we also insert this transition into $\mathcal{M}_{RL}$, over which we will then estimate the gradient for the policy $\pi$. However, since this action has not been sampled from the policy $\pi$, it biases the gradient, which would usually be corrected by an importance sampling ratio.
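
Concretely, this is the step I am referring to, as I read the pseudocode (a toy tabular sketch; the names `M_RL`, `M_SL`, `epsilon` and the stand-in environment are mine, not the paper's):

```python
# Toy sketch of the sampling/storage step as I read the pseudocode
# (tabular Q, random stand-in environment; not the paper's code).
import random

n_states, n_actions = 5, 3
Q = [[0.0] * n_actions for _ in range(n_states)]   # Q estimate
M_RL, M_SL = [], []                                # the two memories
epsilon = 0.1

def step_env(s, a):
    """Hypothetical environment step: returns (reward, next_state)."""
    return random.random(), random.randrange(n_states)

s = 0
for _ in range(10):
    # sigma = epsilon-greedy(Q): sample the action from the best response
    if random.random() < epsilon:
        a = random.randrange(n_actions)
    else:
        a = max(range(n_actions), key=lambda act: Q[s][act])
    r, s_next = step_env(s, a)
    M_RL.append((s, a, r, s_next))   # transition inserted into M_RL
    M_SL.append((s, a))              # the chosen (s, a) also goes into M_SL
    s = s_next
```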

What am I missing?

Alberto

1 Answer


> over which then we will estimate the gradient for the policy $\pi$ ...

It seems like you might be misunderstanding the type of algorithm. NFSP doesn't use the return under $\Pi$ as a loss. If Heinrich & Silver were using a policy gradient method, you would indeed encounter the issue of the gradient being off-policy.

This is actually strictly a Q-learning algorithm. The behaviour policy of the MDP is $\epsilon$-greedy with respect to our Q estimator. The policy network $\Pi$ is an estimator for $\epsilon\text{-greedy}(Q)$, and is trained with supervised learning using a negative log-likelihood loss.
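
As a rough illustration (not the authors' code; the network shapes, optimisers and the toy batches standing in for samples from $\mathcal{M}_{RL}$ / $\mathcal{M}_{SL}$ are my own assumptions), the two updates look like a standard Q-learning step plus a cross-entropy/NLL imitation step:

```python
# Minimal sketch of the two NFSP updates, assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_features, n_actions = 8, 4
q_net = nn.Linear(n_features, n_actions)      # online Q-network Q(s, .)
q_target = nn.Linear(n_features, n_actions)   # target network Q'(s, .)
q_target.load_state_dict(q_net.state_dict())
pi_net = nn.Linear(n_features, n_actions)     # average-policy network Pi(s, .) -> logits

opt_q = torch.optim.SGD(q_net.parameters(), lr=0.1)
opt_pi = torch.optim.SGD(pi_net.parameters(), lr=0.1)
gamma = 0.99

def q_learning_step(s, a, r, s_next, done):
    """Standard (off-policy) Q-learning update on a batch from M_RL."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * q_target(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    opt_q.zero_grad(); loss.backward(); opt_q.step()

def supervised_step(s, a):
    """Negative log-likelihood on (state, action) pairs from M_SL:
    Pi simply imitates the actions chosen by epsilon-greedy(Q),
    so no importance-sampling correction is involved."""
    loss = F.cross_entropy(pi_net(s), a)
    opt_pi.zero_grad(); loss.backward(); opt_pi.step()

# toy batches standing in for minibatches sampled from the two memories
s = torch.randn(32, n_features)
a = torch.randint(n_actions, (32,))
r, done = torch.randn(32), torch.zeros(32)
s_next = torch.randn(32, n_features)
q_learning_step(s, a, r, s_next, done)
supervised_step(s, a)
```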

One thing you would need to be careful about when implementing NFSP is when you clear the replay memories. If you update your Q-network parameters $\theta^{Q} \leftarrow \theta^{Q'}$ (an easy mistake) and continue sampling, your trajectory distribution will be different. However, the algorithm described in the paper is entirely on-policy.
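
For example, the usual DQN-style periodic refresh goes in the direction $\theta^{Q'} \leftarrow \theta^{Q}$; a tiny sketch (the names, toy sizes and refresh period below are my own assumptions):

```python
# Sketch of the periodic target-network refresh for the Q-learning part.
import torch.nn as nn

q_net = nn.Linear(8, 4)       # online Q-network, theta^Q
q_target = nn.Linear(8, 4)    # target network, theta^{Q'}
sync_every = 1000

for step in range(10_000):
    # ... act epsilon-greedily w.r.t. q_net, store transitions, train ...
    if step % sync_every == 0:
        # theta^{Q'} <- theta^Q: copy the online parameters into the target.
        # Assigning the other way (theta^Q <- theta^{Q'}) would overwrite the
        # online network and change the behaviour policy you keep sampling with.
        q_target.load_state_dict(q_net.state_dict())
```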

Venna Banana