
Cybenko showed that if $\sigma$ is a sigmoidal, continuous function, then for any $\varepsilon > 0$, for any continuous function $f: [0, 1]^d \to \mathbb{R}$, there exists a function of the form $g:x \mapsto \sum\limits_{i = 1}^n a_i\sigma\left( \langle w_i, x \rangle + b_i \right) + b$ such that $\forall x \in [0, 1]^d, \left| f(x) - g(x) \right| \leq \varepsilon$.

So every continuous function on a compact set can be uniformly approximated by a neural network with one hidden layer, using $\sigma$ as the activation function for that layer and the identity for the output layer.
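For concreteness, here is a minimal NumPy sketch of a function of this form. The dimensions and parameter values are arbitrary illustrations (not a fitted approximator), just to show the architecture the theorem refers to:

```python
import numpy as np

def sigma(z):
    # Logistic sigmoid, a standard choice of continuous sigmoidal function.
    return 1.0 / (1.0 + np.exp(-z))

def g(x, W, b_hidden, a, b_out):
    """One-hidden-layer network of the form in Cybenko's theorem:
    g(x) = sum_i a_i * sigma(<w_i, x> + b_i) + b.
    x: (d,), W: (n, d), b_hidden: (n,), a: (n,), b_out: scalar."""
    return a @ sigma(W @ x + b_hidden) + b_out

# Illustrative parameters for d = 2 inputs and n = 3 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
b_hidden = rng.normal(size=3)
a = rng.normal(size=3)
b_out = 0.1

x = np.array([0.5, 0.25])
print(g(x, W, b_hidden, a, b_out))
```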

My question is: how can we deduce from this result that, for any fixed $k \geq 1$, any continuous function on $[0, 1]^d$ can be approximated by neural networks with $k$ hidden layers, still using $\sigma$ (and the identity at the end)?

Basically, how do we go from one hidden layer to $k$ hidden layers?

nbro
JackEight

2 Answers


To extend Cybenko's famous result from a single hidden layer to multiple hidden layers, you essentially need to show that any continuous function on a compact domain can be approximately decomposed, recursively up to any fixed $k$ as you specified, into "smaller" continuous functions on compact domains (not trivially the identity function), each of which is already proved to be universally approximable with a single hidden layer.

The Stone–Weierstrass theorem can be helpful here.

In mathematical analysis, the Weierstrass approximation theorem states that every continuous function defined on a closed interval [a, b] can be uniformly approximated as closely as desired by a polynomial function... The Stone–Weierstrass theorem generalizes the Weierstrass approximation theorem in two directions: instead of the real interval [a, b], an arbitrary compact Hausdorff space X is considered, and instead of the algebra of polynomial functions, a variety of other families of continuous functions on $X$ are shown to suffice.

Since continuous functions on a compact domain can be universally approximated by the algebra of polynomials, or by similar families of Lipschitz-continuous primitive recursive functions, they could in turn be approximately represented as compositions of primitive recursive functions. However, this may not always be possible, and there might be a limit on $k$, since such a non-trivial recursive decomposition cannot necessarily be continued indefinitely. Bear in mind also that many practical advantages of a deep MLP over a single fat hidden layer are, so far, empirical findings without rigorous proof, coming from deep learning of feature-rich, highly nonlinear functions that are intuitively approximately decomposable into many layers.
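As a small side illustration of the Weierstrass part of the quoted theorem (not of the layer decomposition itself), here is a NumPy sketch, assuming an arbitrary continuous test function on $[0, 1]$ and a least-squares Chebyshev fit; the maximum error on a dense grid shrinks as the degree grows, consistent with the theorem:

```python
import numpy as np

# Approximate a continuous function on [0, 1] by polynomials of growing degree.
f = lambda x: np.sin(2 * np.pi * x) * np.exp(-x)

xs = np.linspace(0.0, 1.0, 2001)
for degree in (3, 7, 15):
    cheb = np.polynomial.Chebyshev.fit(xs, f(xs), deg=degree, domain=[0, 1])
    max_err = np.max(np.abs(f(xs) - cheb(xs)))
    print(f"degree {degree:2d}: max |f - p| on grid = {max_err:.2e}")
```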

cinch

Just set all the subsequent layers to implement the identity map.

Either construct them explicitly by saturating the weights, or note that each extra layer is itself a 1-hidden-layer MLP, which is a universal approximator, so it can learn the identity.
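Here is a minimal NumPy sketch of that idea, assuming the logistic sigmoid and random illustrative weights: an extra sigmoid layer with tiny input weights, followed by an affine rescaling (which would be absorbed into the next layer's weights in an actual MLP), acts as a near-identity map, so stacking such layers barely changes the shallow network's output:

```python
import numpy as np

def sigma(z):
    # Logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

def near_identity(h, eps=1e-3):
    # sigma(eps * t) ~ 1/2 + eps * t / 4 for small eps, hence
    # (sigma(eps * t) - 1/2) * 4 / eps ~ t on any bounded set.
    # In a real MLP the affine rescaling is folded into the next layer.
    return (sigma(eps * h) - 0.5) * 4.0 / eps

# A random, purely illustrative 1-hidden-layer network g on [0, 1]^2.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(8, 2)), rng.normal(size=8)
a, c = rng.normal(size=8), 0.2
g = lambda x: a @ sigma(W @ x + b) + c

x = np.array([0.3, 0.7])
shallow = g(x)

# "Deepen" the network: pass the hidden activations through two extra
# near-identity sigmoid layers before the original output layer.
h = sigma(W @ x + b)
h = near_identity(near_identity(h))
deep = a @ h + c

print(shallow, deep, abs(shallow - deep))  # the two outputs agree closely
```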

Alberto