This problem bothered me as well, and I don't think the answer by Philip Raeisghasem above is satisfactory. Reducing variance is a desirable property, but one also has to show that the final result is correct.
Consider the general form of the TD update
$$
Q_{t+n}(S_t,A_t) = Q_{t+n-1}(S_t,A_t) + \alpha \Delta
$$
The desired property for $\Delta$ is that under the behavior policy $\mu$ we have
$$
\mathbb E_\mu[\Delta] = 0
$$
so that the updates converge even though the data is generated by the behavior policy, and so that the expression is meaningful for the problem at hand.
Now consider
$$
\mathbb E_\mu[ \rho_{t+1:t+n} G_{t:t+n}] = \mathbb E_\pi[ G_{t:t+n}]
$$
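To see why this identity holds (a sketch, with everything implicitly conditioned on $S_t, A_t$, assuming discrete spaces and coverage, i.e. $\mu(a \mid s) > 0$ whenever $\pi(a \mid s) > 0$): let $\tau = (R_{t+1}, S_{t+1}, A_{t+1}, \dots, R_{t+n}, S_{t+n}, A_{t+n})$ denote the trajectory after $(S_t, A_t)$. The transition probabilities in $\Pr_\mu(\tau)$ are the same as in $\Pr_\pi(\tau)$, and the ratio replaces each factor $\mu(A_k \mid S_k)$ by $\pi(A_k \mid S_k)$, so
$$
\mathbb E_\mu\!\left[ \rho_{t+1:t+n} G_{t:t+n} \right]
= \sum_\tau \Pr_\mu(\tau) \prod_{k=t+1}^{t+n} \frac{\pi(A_k \mid S_k)}{\mu(A_k \mid S_k)}\, G_{t:t+n}
= \sum_\tau \Pr_\pi(\tau)\, G_{t:t+n}
= \mathbb E_\pi\!\left[ G_{t:t+n} \right]
$$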
We also want to learn the state-action values of the target policy, so we would like
$$
\mathbb E_\pi[ G_{t:t+n}] = \mathbb E_\pi[Q_{t+n-1}(S_t,A_t)]
$$
However, expressed as an expectation under the behavior policy, this is
$$
\mathbb E_\pi[Q_{t+n-1}(S_t,A_t)]= \mathbb E_\mu[ \rho_{t+1:t+n-1} Q_{t+n-1}(S_t,A_t)]
$$
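Note also that a single importance-sampling factor, here $\rho_{t+n} = \pi(A_{t+n} \mid S_{t+n}) / \mu(A_{t+n} \mid S_{t+n})$, has unit mean under the behavior policy:
$$
\mathbb E_\mu\!\left[ \rho_{t+n} \,\middle|\, S_{t+n} \right]
= \sum_a \mu(a \mid S_{t+n}) \frac{\pi(a \mid S_{t+n})}{\mu(a \mid S_{t+n})}
= \sum_a \pi(a \mid S_{t+n})
= 1
$$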
Extra factors of importance sampling therefore do not affect the mean, and even though they can increase variance, the resulting expression is often simpler. If we add the extra factor $\rho_{t+n}$ to the state-action value term, both terms carry the same ratio $\rho_{t+1:t+n}$, and we obtain the desired update rule with
$$
\Delta = \rho_{t+1:t+n} \bigl( G_{t:t+n} - Q_{t+n-1}(S_t,A_t) \bigr)
$$
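As a concrete illustration, here is a minimal tabular sketch of this update (essentially off-policy n-step Sarsa from the same chapter of Sutton and Barto). The function name, array shapes, and the specific step-size, discount, and example numbers are my own illustrative assumptions, not part of the derivation:

```python
import numpy as np

def n_step_update(Q, states, actions, rewards, pi, mu, alpha=0.1, gamma=0.9):
    """Apply Q(S_t,A_t) += alpha * rho_{t+1:t+n} * (G_{t:t+n} - Q(S_t,A_t)).

    states  : [S_t, ..., S_{t+n}]     -- n+1 state indices
    actions : [A_t, ..., A_{t+n}]     -- n+1 action indices
    rewards : [R_{t+1}, ..., R_{t+n}] -- n rewards
    pi, mu  : (n_states, n_actions) arrays of action probabilities
    """
    n = len(rewards)

    # rho_{t+1:t+n}: ratio over the actions taken after (S_t, A_t)
    rho = 1.0
    for k in range(1, n + 1):
        rho *= pi[states[k], actions[k]] / mu[states[k], actions[k]]

    # n-step return G_{t:t+n}, bootstrapping from Q(S_{t+n}, A_{t+n})
    G = sum(gamma**i * r for i, r in enumerate(rewards))
    G += gamma**n * Q[states[n], actions[n]]

    # Delta = rho_{t+1:t+n} * (G_{t:t+n} - Q(S_t, A_t))
    Q[states[0], actions[0]] += alpha * rho * (G - Q[states[0], actions[0]])


# Illustrative call with arbitrary numbers (n = 2, 5 states, 2 actions)
Q = np.zeros((5, 2))
pi = np.full((5, 2), 0.5)          # target policy (uniform, just for the example)
mu = np.array([[0.8, 0.2]] * 5)    # behavior policy
n_step_update(Q, states=[0, 3, 4], actions=[1, 0, 1],
              rewards=[1.0, 0.5], pi=pi, mu=mu)
```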
Of course, there are modifications of this update rule that remove the importance-sampling factors occurring in the future of the individual rewards in the return; these are discussed in Section 7.4 of Sutton and Barto.