I was reading about the NFSP agent from Heinrich and Silver's paper, and I'm somewhat confused by the algorithm:
In particular: we sample an action according to the best response ($\sigma = \epsilon\text{-greedy}(Q)$), and we also insert this transition into $\mathcal{M}_{RL}$, which is then used to estimate the gradient for the policy $\pi$. However, since this action was not sampled from the policy $\pi$, it biases the gradient estimate, which would normally be corrected with an importance sampling ratio.
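To make the confusion concrete, here is a minimal sketch of how I read the per-step sampling and memory updates (Algorithm 1 in the paper). All names and interfaces here are placeholders of mine, not the paper's code:

```python
import random

def nfsp_step(state, env_step, greedy_action, sample_avg_action, legal_actions,
              m_rl, m_sl, eta=0.1, epsilon=0.06):
    """Pick one action, step the environment, and store the experience."""
    # With probability eta, play the best response sigma = epsilon-greedy(Q);
    # otherwise play the average policy pi.
    if random.random() < eta:
        if random.random() < epsilon:
            action = random.choice(legal_actions(state))  # explore
        else:
            action = greedy_action(state)                 # exploit Q
        m_sl.append((state, action))  # best-response behaviour feeds the SL memory
    else:
        action = sample_avg_action(state)

    reward, next_state = env_step(state, action)

    # Every transition is stored in M_RL, whichever policy produced the action.
    # This is the step I am unsure about: samples drawn from sigma end up in the
    # data used to fit pi, with no importance sampling ratio pi(a|s)/sigma(a|s).
    m_rl.append((state, action, reward, next_state))

    return next_state
```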
What am I missing?
