
I am stuck on the proof of the contraction of variance for the distributional Bellman operator from the paper, in which it is defined as

[image: definition of the distributional Bellman operator from the paper]

and the proof is stated as

[image: the variance contraction proof from the paper]

In its second part, how is the variance of the target distribution equal to the expectation of the variance terms over the next state-action pairs?

Magi Feeney

2 Answers


The confusing part is how the variance is defined in terms of $P^{\pi}Z(x, a)$, which includes three sources of randomness: $X'$, $A'$, and $Z$. However, if we define the variance as

\begin{align*} \mathbb{V}(P^{\pi}Z(x, a)) & \triangleq \int P(X' = x', A' = a')\, \mathbb{V}(Z \mid x', a')\, dx'\, da' \\ & = \mathbb{E}_{X', A'}[\mathbb{V}(Z \mid X', A')] \end{align*}

Then the result clearly follows, as both $P^{\pi}Z_{1}$ and $P^{\pi}Z_{2}$ evolve under the same dynamics, i.e. the same distribution over $(X', A')$; by linearity of expectation, the claim follows.
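To make this concrete, here is a minimal numeric sketch (my own, with made-up transition probabilities, policy, and conditional variances, not from the paper): under the definition above, both variances are weighted averages with the same weights $P(x'|x,a)\,\pi(a'|x')$, so their difference passes through the expectation exactly.

```python
# Minimal sketch: with V(P^pi Z(x,a)) := E_{X',A'}[V(Z|X',A')], the
# variance of the target is a fixed weighted average of per-(x',a')
# variances, so linearity of expectation gives
#   V(P^pi Z1) - V(P^pi Z2) = E[ V(Z1|X',A') - V(Z2|X',A') ].
# All numbers below (dynamics, policy, variances) are invented.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(x'|x,a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi(a'|x')

var_Z1 = rng.uniform(0.0, 2.0, size=(n_states, n_actions))  # V(Z1|x',a')
var_Z2 = rng.uniform(0.0, 2.0, size=(n_states, n_actions))  # V(Z2|x',a')

x, a = 0, 1  # an arbitrary current state-action pair

# Joint weight on each (x', a') under the dynamics and the policy.
w = P[x, a][:, None] * pi                       # shape (n_states, n_actions)

lhs = (w * var_Z1).sum() - (w * var_2 := var_Z2).sum() if False else \
      (w * var_Z1).sum() - (w * var_Z2).sum()   # V(P^pi Z1) - V(P^pi Z2)
rhs = (w * (var_Z1 - var_Z2)).sum()             # E[ V(Z1|.) - V(Z2|.) ]
assert np.isclose(lhs, rhs)
print(lhs, rhs)
```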

Magi Feeney

This is a common trick in the reinforcement learning literature: by the law of large numbers, one can work with the sampled variables $X$ and $A$ instead of $x$ and $a$. Say we know the probability density $p(x)$ of a continuous variable $x$; then we can compute its expected value as \begin{align} \mathbb{E}_{x \sim p}[x] &= \int p(x)\, x \, dx \\ &= \lim_{N \to \infty}\frac{1}{N}\sum_{N} X \end{align} In the last line we did not assume any knowledge of $p(x)$; we simply averaged over samples drawn from $p(x)$. This is often the only way to estimate the expected value of an unknown system, e.g. the average speed of a driver whose exact driving behaviour we do not know. We can likewise express the variance in terms of samples: \begin{align} \mathbb{V}(x) = \int p(x)\, (x - \mathbb{E}_{x \sim p}[x])^2 \, dx = \lim_{N \to \infty} \frac{1}{N} \sum_{N} (X-\mathbb{E}[X])^2 \end{align} From here, two things could happen, since the notation in the paper is not quite clear (a small numeric check follows the list).

  1. Define $\mathbb{V}[X] = (X-\mathbb{E}[X])^2$; then \begin{equation} \lim_{N \to \infty} \frac{1}{N} \sum_{N} (X-\mathbb{E}[X])^2 = \lim_{N \to \infty} \frac{1}{N} \sum_{N} \mathbb{V}[X] = \mathbb{E}[\mathbb{V}[X]]. \end{equation} But this would be rather confusing notation, since $(X-\mathbb{E}[X])^2$ is only a single sample whose expected value is the variance.
  2. Since the variance is itself an expected value (the expected squared deviation from the mean), we can wrap it in an additional expectation, which changes nothing: \begin{equation} \lim_{N \to \infty} \frac{1}{N} \sum_{N} (X-\mathbb{E}[X])^2 = \mathbb{V}[X] = \mathbb{E}[\mathbb{V}[X]] = \lim_{N \to \infty} \frac{1}{N} \sum_{N} \mathbb{V}[X] = \lim_{N \to \infty} \frac{1}{N}\, N\, \mathbb{V}[X] = \mathbb{V}[X]. \end{equation} I lean towards this explanation, although introducing a redundant expectation is admittedly strange.
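Here is the small numeric check promised above (my own sketch, with an arbitrarily chosen Gaussian $p$): the sample average of $(X - \mathbb{E}[X])^2$ converges to $\mathbb{V}[X]$ (reading 1), while averaging the constant $\mathbb{V}[X]$ over samples (reading 2) trivially returns $\mathbb{V}[X]$.

```python
# Sanity check of the two readings above (my own sketch, not from the
# paper): for X ~ N(mu, sigma^2), the sample average of (X - E[X])^2
# converges to V(X), while averaging the constant V[X] over samples
# trivially returns V[X] itself.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, N = 1.0, 2.0, 1_000_000

X = rng.normal(mu, sigma, size=N)

# Reading 1: treat each (X - E[X])^2 as one sample; its average
# estimates the variance.
reading_1 = np.mean((X - mu) ** 2)

# Reading 2: the "extra" expectation over the constant V[X] = sigma^2
# changes nothing.
reading_2 = np.mean(np.full(N, sigma ** 2))

print(reading_1)  # ~ 4.0 = sigma^2, up to Monte Carlo error
print(reading_2)  # exactly 4.0
```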

Now let's use this to explain why \begin{align} & \mathbb{V}(P^{\pi}Z_{1}(x, a)) - \mathbb{V}(P^{\pi}Z_{2}(x, a)) \\ &= \int \pi(a')\, p(x') \left(Z_{1}(x', a') - \mathbb{E}[Z_{1}(X', A')]\right)^{2} dx'\, da' - \int \pi(a')\, p(x') \left(Z_{2}(x', a') - \mathbb{E}[Z_{2}(X', A')]\right)^{2} dx'\, da' \\ &= \lim_{N \to \infty} \frac{1}{N} \sum_{N} \left[\left(Z_{1}(X', A') - \mathbb{E}[Z_{1}(X', A')]\right)^{2} - \left(Z_{2}(X', A') - \mathbb{E}[Z_{2}(X', A')]\right)^{2}\right] \\ &= \mathbb{E}\left[\mathbb{V}(Z_{1}(X', A')) - \mathbb{V}(Z_{2}(X', A'))\right], \end{align} where in the last step one can use either of the two possibilities given above. Here I abbreviated $p(x') = p(x'|x, a)$ and $\pi(a') = \pi(a'|x')$.
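As a sanity check, the same chain can be run numerically (my own sketch, under the convention $\mathbb{V}(P^{\pi}Z(x,a)) = \mathbb{E}_{X',A'}[\mathbb{V}(Z|X',A')]$ from the other answer; the dynamics, policy, and conditional variances are invented). Sampling $(X', A')$ and averaging the conditional variance differences is the Monte Carlo counterpart of the limit in the derivation above.

```python
# Monte Carlo counterpart of the derivation: sample (X', A') from
# p(x'|x,a) and pi(a'|x'), then average V(Z1|X',A') - V(Z2|X',A');
# this converges to the exact weighted sum over (x', a').
import numpy as np

rng = np.random.default_rng(7)
n_states, n_actions, N = 3, 2, 100_000

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(x'|x,a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi(a'|x')

var_Z1 = rng.uniform(0.5, 2.0, size=(n_states, n_actions))  # V(Z1|x',a')
var_Z2 = rng.uniform(0.5, 2.0, size=(n_states, n_actions))  # V(Z2|x',a')

x, a = 0, 0
Xp = rng.choice(n_states, size=N, p=P[x, a])                   # X' ~ p(.|x,a)
Ap = np.array([rng.choice(n_actions, p=pi[xp]) for xp in Xp])  # A' ~ pi(.|X')

# Sample average of V(Z1|X',A') - V(Z2|X',A') over the drawn (X', A').
mc = np.mean(var_Z1[Xp, Ap] - var_Z2[Xp, Ap])

# Exact value E_{X',A'}[V(Z1|X',A') - V(Z2|X',A')] from the joint weights.
w = P[x, a][:, None] * pi
exact = (w * (var_Z1 - var_Z2)).sum()

print(mc, exact)  # agree up to Monte Carlo error
```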

pythonic833