
I was going through David Silver's lecture on reinforcement learning (lecture 4). At 51:22 he says that Monte Carlo (MC) methods have high variance and zero bias. I understand the zero-bias part: it is because the estimate is based on complete sampled returns, whose expected value is the true value function. However, I don't understand the high-variance part. Can someone enlighten me?

nbro
Bhuwan Bhatt

1 Answer


When a term like "high" is used for the variance, it is in comparison to other methods, mainly to TD learning, which bootstraps across single time steps.

It is worth spelling out what the variance applies to and where it comes from: namely, the distribution of the Monte Carlo return $G_t$, which is calculated as follows:

$$G_t = \sum_{k=0}^{T-t-1}\gamma^k R_{t+k+1}$$

By the way, I am using a common convention here: capital letters such as $G_t$ stand for random variables, and lower case letters such as $g_t$ stand for samples drawn from those random variables' distributions. The convention is not always strictly followed, especially in pseudo-code, but it is important to make clear for this answer.
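To make that concrete, here is a minimal Python sketch (the function name and numbers are mine, not from the lecture) of computing one sample $g_t$ from the rewards recorded after time $t$ in a single episode:

```python
# Minimal sketch (function name and numbers are my own, not from the lecture):
# one sample g_t of G_t, computed from the rewards r_{t+1}, ..., r_T recorded
# after time t in a single episode.
def sampled_return(rewards, gamma=0.9):
    """g_t = sum_k gamma^k * r_{t+k+1} for one recorded trajectory."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(sampled_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```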

So a value function, such as the action value function $q_{\pi}(s,a)$, can be written:

$$q_{\pi}(s,a) = \mathbb{E}_{\pi} [ \sum_{k=0}^{T-t-1}\gamma^k R_{t+k+1} | S_t=s, A_t=a] = \mathbb{E}_{\pi} [ G_t | S_t=s, A_t=a]$$

i.e. the mean of the Monte Carlo return, $\mathbb{E}[G_t]$, from any specific starting point is the same as the value function from that starting point, by definition (see Sutton & Barto's or David Silver's definitions), so individual samples $g_t$ are unbiased estimates of it.
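If it helps, you can check this numerically. The sketch below uses a toy construction of my own: a 3-step episode in which the reward on step $k$ is drawn from a known normal distribution, with any policy and transition randomness folded into those distributions, so that the true value of the start point is exactly the discounted sum of the mean rewards. The average of many sampled returns lands on that value:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Toy construction (mine): a 3-step episode where the reward on step k is drawn
# from Normal(mu[k], sigma[k]), with policy/transition randomness folded into
# those distributions, so q_pi(start) = sum_k gamma^k * mu[k] exactly.
mu = np.array([1.0, 2.0, 0.5])
sigma = np.array([1.0, 1.0, 1.0])
discounts = gamma ** np.arange(3)
true_q = np.sum(discounts * mu)

# Sample many returns g_t and compare their mean with the true value.
episode_rewards = rng.normal(mu, sigma, size=(100_000, 3))
sampled_returns = episode_rewards @ discounts

print(f"true value: {true_q:.3f}, mean of sampled returns: {sampled_returns.mean():.3f}")
# The two numbers agree closely: each g_t is an unbiased sample of q_pi(start).
```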

The variance $\text{Var}(G_t)$ can be written

$$\text{Var}(G_t) = \text{Var}\left(\sum_{k=0}^{T-t-1}\gamma^k R_{t+k+1}\right) = \sum_{k=0}^{T-t-1}\gamma^{2k} \text{Var}(R_{t+k+1}) + 2\sum_{0 \le j < k \le T-t-1}\gamma^{j+k} \text{Cov}(R_{t+j+1}, R_{t+k+1})$$

If the rewards on different time steps were sampled independently, the covariance terms would vanish and the variance would simply be the sum of the discounted per-step variances. In practice $R_{t+j+1}$ and $R_{t+k+1}$ can be correlated, and often are, so technically we do have to account for the covariances between the reward distributions on different time steps. The maths for tracking them in a general MDP is fiddly, and I won't show it here.

What you can say intuitively is that, barring unusual negative correlation between steps, the variance of $G_t$ accumulates with the number of time steps (and possible rewards) remaining in the return. Each time step contributes somewhere between nothing, when its reward adds no randomness beyond what the trajectory so far already determines, and its full discounted variance $\gamma^{2k}\text{Var}(R_{t+k+1})$, when it is completely independent of the earlier steps.*

The amount of covariance involved depends on the details of the policy, the state transition function and the reward function. In general, the policy during RL will include some stochastic behaviour, such as $\epsilon$-greedy action selection, which adds to the variance on each time step. The state transition and reward functions might add yet more. In all but contrived cases (see the footnote below):

$$\text{Var}(G_t) \ge \text{Var}(R_{t+1})$$

and in practice, when $t+1 \neq T$, the gap in this inequality can be large, although that is highly dependent on the details of the MDP. In fact, for off-policy Monte Carlo with basic (ordinary) importance sampling, the variance can even be unbounded.
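To illustrate how the variance accumulates (this sketch does not cover the unbounded importance-sampling case), here is the same kind of toy construction with independent per-step rewards, so that $\text{Var}(G_t)$ should match $\sum_k \gamma^{2k}\sigma^2$ exactly. The empirical variance grows with the number of remaining time steps:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, sigma = 0.9, 1.0

# Same kind of toy construction: independent per-step rewards with standard
# deviation sigma, so there is no covariance between steps and Var(G_t) should
# equal sum_k gamma^(2k) * sigma^2. It grows with the number of remaining steps.
for steps in (1, 2, 5, 10, 50):
    rewards = rng.normal(1.0, sigma, size=(100_000, steps))
    returns = rewards @ (gamma ** np.arange(steps))
    predicted = np.sum(gamma ** (2 * np.arange(steps))) * sigma ** 2
    print(f"{steps:3d} steps: empirical Var(G) = {returns.var():.3f}, "
          f"predicted = {predicted:.3f}")
```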

This variance might be considered high, compared to the variance of typical single-step TD returns:

$$G_{t:t+1} = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1})$$

where $\hat{q}$ is the current estimate of $q_{\pi}$, bootstrapped from the next state and action. The variance of this target is limited to that inherent in a single time step (one reward sample, one state transition and one policy decision). The problem, of course, is that

$$\mathbb{E}[G_{t:t+1}] \neq \mathbb{E}[G_{t}]$$

. . . unless $\hat{q}(s, a) = q_{\pi}(s, a)$ everywhere. Which is why we can say that TD returns are biased (whilst $\hat{q}$ has not converged, the expectation of the TD target does not equal the true value), but Monte Carlo returns have higher variance (when the number of time steps to the end of the trajectory is more than one, the effects of random factors in the policy, state transition and reward functions accumulate).
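As a final numerical illustration (again a toy construction of mine, not anything from the lecture), the sketch below compares Monte Carlo targets with one-step TD targets for the start state of a simple reward chain, using a bootstrap estimate that is deliberately off by $0.5$. The MC targets are centred on the true value but widely spread; the TD targets are tightly clustered but centred on the wrong value:

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, steps, mu, sigma = 0.9, 10, 1.0, 1.0

# Toy chain (mine): a single fixed action per step, so action values equal state
# values; each step's reward is an independent Normal(mu, sigma) sample.
v_start = mu * np.sum(gamma ** np.arange(steps))        # true value of start state
v_next = mu * np.sum(gamma ** np.arange(steps - 1))     # true value one step later
v_hat_next = v_next + 0.5                               # deliberately imperfect estimate

rewards = rng.normal(mu, sigma, size=(100_000, steps))
mc_targets = rewards @ (gamma ** np.arange(steps))      # full returns g_t
td_targets = rewards[:, 0] + gamma * v_hat_next         # one-step TD targets

print(f"true value : {v_start:.3f}")
print(f"MC targets : mean {mc_targets.mean():.3f} (unbiased), var {mc_targets.var():.3f} (high)")
print(f"TD targets : mean {td_targets.mean():.3f} (biased by gamma*0.5), var {td_targets.var():.3f} (low)")
```

With $\gamma = 0.9$ and 10 steps you should see the MC variance around 4.6 versus 1.0 for TD, and the TD mean offset from the true value by about 0.45.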


* You could explicitly construct an environment with negative covariance between time steps that reduces the overall variance of the MDP's return, e.g. the state includes the reward accumulated so far, and future rewards are manipulated so that the return fits a certain distribution. This would be a special case where the variance of the return from some starting points could be lower than the variance of the first step's reward, but it is an unusual set-up, and it would likely be impossible for an agent to learn optimal control in it.

Neil Slater