3

This comes from CS285 2023 Fall, hw3. (The attached image contains the homework prompt: if $\hat{Q}$ is an unbiased estimate of $Q$, is $B\hat{Q}$ an unbiased estimate of $BQ$, where $B$ is the Bellman backup operator?)

In my opinion, if $\hat{Q}$ is an unbiased estimate of $Q$, then
$$
\begin{align}
\mathbb{E}_{D \sim P}[B_{D}\hat{Q} - B_{D}Q] &= \mathbb{E}_{D \sim P}\big[r(s,a) + \gamma \max_{a'}\hat{Q}(s', a') - r(s,a) - \gamma \max_{a'}Q(s', a')\big]\\
&= \mathbb{E}_{D \sim P}\big[\gamma \max_{a'}\hat{Q}(s', a') - \gamma \max_{a'}Q(s', a')\big]\\
&= 0.
\end{align}
$$
So $B_{D}\hat{Q}$ is an unbiased estimate of $B_{D}Q$ and the answer is yes. Is this right?

nbro
yeebo xie

2 Answers

4

No, $B\hat{Q}$ is not an unbiased estimator. Say that $Q(s, a) = 0$ for every action, and $\hat{Q}(s, a) \sim N(0, 1)$. Then $\mathbb{E}[\max_a Q(s, a)] = 0$, but $\mathbb{E}[\max_a \hat{Q}(s, a)] > 0$ (approximately 1.5387 if you have 10 different actions).
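A minimal NumPy sketch of this counterexample (assuming 10 actions and a fixed seed for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 200_000

# True Q(s, a) = 0 for every action, so max_a Q(s, a) = 0.
# Each trial draws unbiased estimates Q_hat(s, a) ~ N(0, 1), one per action.
q_hat = rng.standard_normal((n_trials, n_actions))

# The mean of the per-trial maxima estimates E[max_a Q_hat(s, a)],
# which comes out near 1.539 for 10 actions, not 0.
print(q_hat.max(axis=1).mean())
```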

user1502040
3

Given your provided definition of an unbiased estimator at each input, even though $\hat{Q}$ is an unbiased estimator of the true action value $Q$ at each state-action pair, applying the Bellman backup operator defined in your reference, which involves a $\max$ operator, yields a biased $B\hat{Q}$ that does not equal $BQ$ in expectation at each state-action pair. The maximum of $n$ i.i.d. random variables does not in general follow the same distribution as the individual variables (for large $n$ it is often approximately Gumbel-distributed), so it need not have the same expectation as the original distribution, as the other answer points out.
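One standard way to see the direction of this bias: since $\max$ is a convex function, Jensen's inequality gives
$$
\mathbb{E}\Big[\max_{a'} \hat{Q}(s', a')\Big] \;\ge\; \max_{a'} \mathbb{E}\big[\hat{Q}(s', a')\big] = \max_{a'} Q(s', a'),
$$
typically with strict inequality when the estimates are noisy, so $B\hat{Q}$ overestimates $BQ$ in expectation.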

Of course, in practice off-policy TD Q-learning cannot guarantee that such an unbiased estimator $\hat{Q}$ exists at every state-action input.

cinch