
The Wikipedia article on the universal approximation theorem cites a version of the theorem for Lebesgue-measurable functions from this conference paper. However, the paper does not include a proof of the theorem. Does anybody know where the proof can be found?

Leroy Od

3 Answers


There are multiple papers on the topic because there have been multiple attempts to prove that neural networks are universal (i.e. that they can approximate any continuous function) from slightly different perspectives and under slightly different assumptions (e.g. about which activation functions are used). Note that these proofs tell you that neural networks can approximate any continuous function, but they do not tell you how to train your neural network so that it actually approximates your desired function. Moreover, most papers on the topic are quite technical and mathematical, so, if you do not have a solid knowledge of approximation theory and related fields, they may be difficult to read and understand. Nonetheless, below are links to some possibly useful articles and papers.
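As a concrete illustration of the gap between "such a network exists" and "training finds it", here is a minimal sketch (assuming only NumPy; the target function, width, and learning rate are arbitrary choices of mine, not taken from any of the papers below) that fits a one-hidden-layer tanh network to a simple 1D function by plain gradient descent.

    import numpy as np

    # Minimal sketch: the theorems say a good one-hidden-layer network exists;
    # here we try to *find* one by full-batch gradient descent on the MSE loss.
    rng = np.random.default_rng(0)

    f = lambda x: np.sin(np.pi * x)                   # target continuous function
    x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)    # training inputs in [-1, 1]
    y = f(x)                                          # training targets

    n_hidden = 50                                     # width of the single hidden layer
    W1 = rng.normal(0.0, 1.0, (1, n_hidden))          # input-to-hidden weights
    b1 = np.zeros((1, n_hidden))
    W2 = rng.normal(0.0, 0.1, (n_hidden, 1))          # hidden-to-output weights
    b2 = np.zeros((1, 1))

    lr = 0.05
    for step in range(20000):
        h = np.tanh(x @ W1 + b1)                      # hidden activations
        pred = h @ W2 + b2                            # network output
        grad_pred = 2.0 * (pred - y) / len(x)         # d(MSE)/d(pred)
        grad_W2 = h.T @ grad_pred
        grad_b2 = grad_pred.sum(axis=0, keepdims=True)
        grad_pre = (grad_pred @ W2.T) * (1.0 - h**2)  # backpropagate through tanh
        grad_W1 = x.T @ grad_pre
        grad_b1 = grad_pre.sum(axis=0, keepdims=True)
        W1 -= lr * grad_W1; b1 -= lr * grad_b1
        W2 -= lr * grad_W2; b2 -= lr * grad_b2

    approx = np.tanh(x @ W1 + b1) @ W2 + b2
    print("max |f - network| on the training grid:", np.abs(approx - y).max())

Whether gradient descent actually reaches the error that the theorems guarantee to be achievable is a separate question that the universality results do not answer.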

The article A visual proof that neural nets can compute any function (by Michael Nielsen) should give you some intuition behind the universality of neural networks, so this is probably the first article you should read.
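To make that intuition a bit more concrete, here is a toy sketch (my own construction, not code from the article): two steep sigmoid units form an approximate "bump" (indicator of a small interval), and a weighted sum of such bumps approximates a continuous 1D function. The steepness and the number of bumps are arbitrary choices.

    import numpy as np

    # Toy illustration of Nielsen's "visual proof" idea: pairs of steep sigmoid
    # units make approximate bumps, and a weighted sum of bumps approximates f.
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def bump(x, left, right, steepness=100.0):
        """Approximate indicator of [left, right] built from two sigmoid units."""
        return sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right))

    f = lambda x: np.sin(np.pi * x)                   # target continuous function
    x = np.linspace(-1.0, 1.0, 1000)

    n_bumps = 50                                      # more bumps -> better approximation
    edges = np.linspace(-1.0, 1.0, n_bumps + 1)
    approx = np.zeros_like(x)
    for left, right in zip(edges[:-1], edges[1:]):
        height = f(0.5 * (left + right))              # output weight = target value at midpoint
        approx += height * bump(x, left, right)

    print("max |f - approx| on the grid:", np.abs(f(x) - approx).max())

Increasing the number of bumps (i.e. the number of hidden units) drives the error down, which is exactly the hand-waving version of the universality argument.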

Then you should probably read the paper Approximation by Superpositions of a Sigmoidal Function (1989), by G. Cybenko, who proves that multi-layer perceptrons (i.e. feed-forward neural networks with at least one hidden layer) can approximate any continuous function. However, he assumes that the neural network uses sigmoid activation functions, which, nowadays, have been replaced in many scenarios by ReLU activation functions. Other works (e.g. [1], [2]) showed that sigmoid activation functions are not strictly necessary: only certain restricted classes of activation functions fail to make neural networks universal.
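Paraphrasing the statement: Cybenko shows that finite sums of the form

$$G(x) \;=\; \sum_{j=1}^{N} \alpha_j \, \sigma\!\left(w_j^{\top} x + b_j\right),$$

with $\sigma$ a continuous sigmoidal function, are dense in $C([0,1]^n)$ with respect to the supremum norm: for every continuous $f$ and every $\varepsilon > 0$ there exist $N$, $\alpha_j$, $w_j$, $b_j$ such that $\sup_{x \in [0,1]^n} |f(x) - G(x)| < \varepsilon$.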

The universality property (i.e. the ability to approximate any continuous function) has also been proved for convolutional neural networks. For example, see Universality of Deep Convolutional Neural Networks (2020), by Ding-Xuan Zhou, which shows that convolutional neural networks can approximate any continuous function to arbitrary accuracy when the depth of the network is large enough. See also Refinement and Universal Approximation via Sparsely Connected ReLU Convolution Nets (by A. Heinecke et al., 2020).

See also page 632 of Recurrent Neural Networks Are Universal Approximators (2006), by Schäfer et al., which shows that recurrent neural networks are universal function approximators. See also On the computational power of neural nets (1992, COLT) by Siegelmann and Sontag. This answer could also be useful.

For graph neural networks, see Universal Function Approximation on Graphs (by Rickard Brüel Gabrielsson, 2020, NeurIPS).

nbro

"Modern" Guarantees for Feed-Forward Neural Networks

My answer will complement nbro's above, which gave a very nice overview of universal approximation theorems for different types of commonly used architectures, by focusing on recent developments specifically for feed-forward networks. I'll try to emphasize depth over breadth (sometimes called width) as much as possible. Enjoy!


Part 1: Universal Approximation

Here I've listed a few recent universal approximation results that come to mind. Remember, universal approximation asks whether feed-forward networks (or some other architecture type) can approximate any (in this case continuous) function to arbitrary accuracy (I'll focus on the uniform-on-compacts sense).
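Spelled out for functions from $\mathbb{R}^p$ to $\mathbb{R}^m$ (the setting of most of the results below), "universal approximation, uniformly on compacts" means

$$\forall f \in C(\mathbb{R}^p, \mathbb{R}^m),\ \forall K \subset \mathbb{R}^p \text{ compact},\ \forall \varepsilon > 0,\ \exists \hat{f} \in \mathcal{NN}: \ \sup_{x \in K} \big\| f(x) - \hat{f}(x) \big\| < \varepsilon,$$

where $\mathcal{NN}$ is the class of networks under consideration (the notation $\mathcal{NN}$ is mine).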

Let me mention that there are two types of guarantees: quantitative ones and qualitative ones. The latter are akin to Hornik's results (Neural Networks - 1989), which simply state that some neural network can approximate a given (continuous) function to arbitrary precision. The former quantify the number of parameters required for a neural network to actually perform the approximation and are akin to the breakthrough results of Barron's now-classical paper (IEEE - 1993).
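To give a flavour of the quantitative side (paraphrasing Barron's result from memory; see the paper for the exact hypotheses and constants): if $C_f := \int_{\mathbb{R}^d} \|\omega\|\, |\hat{f}(\omega)| \, d\omega < \infty$, then for every $n$ there is a one-hidden-layer sigmoidal network $f_n$ with $n$ hidden units such that

$$\int_{B_r} \big( f(x) - f_n(x) \big)^2 \, \mu(dx) \;\le\; \frac{(2 r C_f)^2}{n}$$

for any probability measure $\mu$ on the ball $B_r$ of radius $r$, i.e. an $L^2$ error of order $C_f/\sqrt{n}$, with the input dimension entering only through $C_f$.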

  1. Shallow Case: If you want quantitative results only for shallow networks, then J. Siegel and J. Xu (Neural Networks - 2020) will do the trick. (Note: the authors deal with the Sobolev case, but you get the continuous case immediately via the Sobolev-Morrey embedding theorem.)

  2. Deep (not narrow) ReLU Case: If you want a quantitative proof for deep (but not too narrow) networks with the ReLU activation function, then Dmitry Yarotsky's result (COLT - 2018) will do the trick!

  3. Deep and Narrow: To the best of my knowledge, the first quantitative proof for deep and narrow neural networks with general input and output spaces has recently appeared here: https://arxiv.org/abs/2101.05390 (preprint - 2021).
    The article is a constructive version of P. Kidger and T. Lyons' recent deep and narrow universal approximation theorem (COLT - 2020) (qualitative) for functions from $\mathbb{R}^p$ to $\mathbb{R}^m$, and of A. Kratsios and E. Bilokopytov's recent Non-Euclidean Universal Approximation Theorem (NeurIPS - 2020).


Part 2: Memory Capacity

A related concept is that of the "memory capacity" of a deep neural network.
These results seek to quantify the number of parameters needed for a deep network to learn (exactly) the assignment of some input data $\{x_n\}_{n=1}^N$ to some output data $\{y_n\}_{n=1}^N$. For example, you may want to take a look here:

  1. Memory Capacity of Deep ReLU networks: R. Vershynin's very recent publication Memory Capacity of Neural Networks with Threshold and Rectified Linear Unit Activations (SIAM's SIMODS - 2020).
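To make the notion precise (in my own words): writing $f_\theta$ for the function realized by the network with parameter vector $\theta$, the network memorizes the data set when

$$\exists\, \theta \ \text{such that} \ f_\theta(x_n) = y_n \quad \text{for all } n = 1, \dots, N,$$

and the memory capacity of an architecture is, informally, the largest $N$ for which such exact interpolation is possible for generic data. Results such as Vershynin's bound the number of units/parameters needed as a function of $N$.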
AB_IM

Just wanted to add that the new text Deep Learning Architectures: A Mathematical Approach mentions this result, but I'm not sure if it gives a proof. It does mention an improved result by Hanin (http://arxiv.org/abs/1708.02691), for which I think it does give at least a partial proof. The original paper by Hanin seems to omit some proofs as well, but the published version (https://www.mdpi.com/2227-7390/7/10/992/htm) may be more complete.