
Why is a neural network being a deterministic mapping not always considered a good thing?

I'm excluding models like VAEs, since those aren't entirely deterministic. I keep thinking about this, and my conclusion is that neural networks are often used to model real-world phenomena, which often have some stochasticity. Since neural networks are deterministic, if they are not trained on enough examples covering the variance that outputs can have in relation to inputs, they cannot generalize well. Are there other reasons this is not a good thing?

nbro
user8714896

1 Answer


Your intuition is right. The main reason why a deterministic function can be undesirable (or even dangerous, as I will explain below with an example) is that we may not have enough data to learn the correct function, so we may end up learning the incorrect one. Right now, no other reason, from a theoretical point of view, comes to my mind, but below I will mention a few applications/cases where a deterministic function may not be desirable.

If we had all data pairs $\{(x_i, y_i)\}$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$ are, respectively, an input and the corresponding output of the unknown function that you want to learn, $f: \mathcal{X} \rightarrow \mathcal{Y}$, i.e. $f(x_i) = y_i$, then you could reconstruct $f$: whenever $x_i$ is given, you just need to return $y_i$.
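As a minimal sketch of this idea, assuming a small finite domain (the function `f` below is a hypothetical stand-in used only to generate the pairs), memorising every pair is enough to reproduce the function on that domain, and nothing more:

```python
def reconstruct(pairs):
    """Build a lookup-table 'model' from a complete set of (x, y) pairs."""
    table = dict(pairs)

    def f_hat(x):
        # Raises KeyError for any x that was never observed.
        return table[x]

    return f_hat


# Hypothetical "unknown" function, used here only to generate the data.
def f(x):
    return x % 3


domain = range(10)
f_hat = reconstruct((x, f(x)) for x in domain)

# On the fully observed domain, the table agrees with f everywhere.
print(all(f_hat(x) == f(x) for x in domain))
```

The table says nothing about inputs outside the observed set, which is exactly the situation in practice.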

Of course, in reality, we almost never have a large enough (training) dataset to approximate our desired (but usually unknown) function well. If we learn only one (deterministic) function, then, in principle, we can fail catastrophically, i.e. our approximation of $f$, denoted as $f_\theta$ (where $\theta$ are the parameters of the neural network or any other model), can produce outputs that are completely wrong.

Let me try to give you a simple example. Let's say that $f$ is defined as follows

$$f: \mathbb{N} \rightarrow \{0, 1\}$$

You are given a training labelled dataset $$D = \{(4, 1), (11, 0), (8, 1), (31, 0), (16, 1), (7, 0) \}.$$

At first glance, the unknown function appears to be

\begin{align} h_1(x)= \begin{cases} 1, &x \bmod 2 = 0\\ 0, &\text{otherwise} \end{cases}\tag{1}\label{1} \end{align}

Given that $D$ is small, your neural network, $f_\theta$, can easily overfit $D$, i.e. learn to output $1$ when $x$ is even and $0$ otherwise.

However, what if $f$ is not the function in equation \ref{1}, and we just happened to collect a dataset that doesn't represent $f$ well enough? If you look at $D$ more carefully, you will see that another possible hypothesis for $f$ is the following

\begin{align} h_2(x)= \begin{cases} 1, &x \bmod 4 = 0\\ 0, &\text{otherwise} \end{cases}\tag{2}\label{2} \end{align}

However, given that your neural network can only compute one of these functions at a time, it could compute the wrong one. Let's say that $f_\theta \approx h_1$. Then it should produce $1$ when $x = 6$ (an even number). If the correct unknown function were $h_2$, i.e. $f = h_2$, then $f_\theta(6) = 1$ would be wrong (because $6$ is not a multiple of $4$).
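The toy example can be checked directly. The short sketch below (the helper name `consistent` is only illustrative) confirms that both hypotheses fit every pair in $D$ and still disagree on the unseen input $x = 6$:

```python
# Training set D from the example.
D = [(4, 1), (11, 0), (8, 1), (31, 0), (16, 1), (7, 0)]


def h1(x):
    """Hypothesis 1: output 1 for even numbers."""
    return 1 if x % 2 == 0 else 0


def h2(x):
    """Hypothesis 2: output 1 for multiples of 4."""
    return 1 if x % 4 == 0 else 0


def consistent(h, data):
    """True if h reproduces every label in the dataset."""
    return all(h(x) == y for x, y in data)


print(consistent(h1, D), consistent(h2, D))  # True True
print(h1(6), h2(6))                          # 1 0
```

So the data alone cannot tell the two hypotheses apart, yet a deterministic model has to commit to one answer at $x = 6$.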

Of course, this is just a toy example. However, the same thing can happen in many real applications, and it is especially undesirable in areas such as healthcare, medicine or self-driving cars, where a wrong prediction can lead to catastrophic consequences, such as the death of a person.

If we maintain a probability distribution over the possible functions that are consistent with the data observed so far, we can (partially) avoid this issue. Continuing with the example above, this probability distribution over functions should be highly uncertain about whether $x = 6$ maps to $0$ or $1$, because the label for $x = 6$ has never been observed. A medical doctor or a human driver could then intervene in the case of (high) uncertainty.
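Here is a minimal sketch of that idea for the toy example. It assumes a small, hypothetical hypothesis class (output $1$ when $x$ is a multiple of $k$, for $k = 1, \dots, 8$), a uniform prior, and noise-free labels, so the posterior is simply uniform over the hypotheses that fit $D$. None of these choices come from a particular method in the literature.

```python
D = [(4, 1), (11, 0), (8, 1), (31, 0), (16, 1), (7, 0)]


def make_hypothesis(k):
    """Hypothesis h_k: output 1 when x is a multiple of k (assumed class)."""
    return lambda x: 1 if x % k == 0 else 0


# Assumed hypothesis class: k = 1, ..., 8, with a uniform prior.
hypotheses = {k: make_hypothesis(k) for k in range(1, 9)}

# With noise-free labels, the posterior is uniform over hypotheses that fit D.
posterior_support = [
    k for k, h in hypotheses.items() if all(h(x) == y for x, y in D)
]


def predictive_prob_one(x):
    """Posterior predictive probability that the label of x is 1."""
    votes = [hypotheses[k](x) for k in posterior_support]
    return sum(votes) / len(votes)


print(posterior_support)        # [2, 4]
print(predictive_prob_one(6))   # 0.5
print(predictive_prob_one(12))  # 1.0
```

The surviving hypotheses agree on $x = 12$ but split evenly on $x = 6$, so the prediction there is flagged as maximally uncertain instead of being silently wrong.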

For this reason, in the last decade, people have started to incorporate uncertainty estimation in neural networks. Neural networks that model uncertainty (to some degree) are often called Bayesian neural networks (BNNs), and there are different approaches (such as variational BNNs, MC dropout or Monte Carlo-based approaches). If you are interested in this topic, the paper Weight Uncertainty in Neural Networks (2015) is a good start, especially if you are already familiar with VAEs. Given that this is a relatively new research area, the current solutions are still not very satisfactory. For example, you can find examples in the literature that report that MC dropout can produce very bad estimates of uncertainty (I also observed this in my master's thesis), i.e. it can be highly certain when it should be highly uncertain.
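To give a rough feel for the mechanics of MC dropout, here is a pure-Python sketch. The tiny network and its hand-picked weights are entirely hypothetical and untrained. The only point is that keeping dropout active at prediction time turns one deterministic output into a set of samples whose spread can be read as an (often imperfect) uncertainty estimate.

```python
import random
import statistics

# Hypothetical weights for a 1-input, 4-hidden-unit, 1-output ReLU network.
W1 = [0.5, -1.0, 0.8, 0.3]
B1 = [0.0, 0.5, -0.2, 0.1]
W2 = [1.0, -0.7, 0.4, 0.9]


def forward(x, rng, p_drop):
    """One stochastic forward pass with dropout applied to the hidden units."""
    out = 0.0
    for w1, b1, w2 in zip(W1, B1, W2):
        h = max(0.0, w1 * x + b1)  # ReLU activation
        if rng.random() >= p_drop:  # keep this unit with probability 1 - p_drop
            out += w2 * h / (1.0 - p_drop)  # inverted-dropout rescaling
    return out


def mc_dropout_predict(x, n_samples=200, p_drop=0.5, seed=0):
    """Return the mean and standard deviation over stochastic forward passes."""
    rng = random.Random(seed)
    samples = [forward(x, rng, p_drop) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)


mean, std = mc_dropout_predict(2.0)
print(f"mean={mean:.3f}, std={std:.3f}")
```

With `p_drop=0.0` every pass is identical and the standard deviation collapses to zero, which recovers the usual deterministic network.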

nbro