Cybenko showed that if $\sigma$ is a continuous sigmoidal function, then for any continuous function $f: [0, 1]^d \to \mathbb{R}$ and any $\varepsilon > 0$, there exists a function of the form $g:x \mapsto \sum\limits_{i = 1}^n a_i\sigma\left( \langle w_i, x \rangle + b_i \right) + b$ such that $\forall x \in [0, 1]^d, \left| f(x) - g(x) \right| \leq \varepsilon$.
So every continuous function on a compact set can be uniformly approximated by neural networks with $1$ hidden layer, using $\sigma$ as the activation function on that layer and the identity on the output layer.
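For reference, in composition notation (my own notation, not taken from Cybenko's paper), the one-hidden-layer case reads
$$ g = A_2 \circ \sigma \circ A_1, \qquad A_1(x) = Wx + c \in \mathbb{R}^n, \quad A_2(h) = \langle a, h \rangle + b \in \mathbb{R}, $$
where $\sigma$ is applied componentwise, $W$ is the $n \times d$ matrix whose rows are the $w_i$, $c = (b_1, \dots, b_n)$, and $a = (a_1, \dots, a_n)$.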
My question is: how can we deduce from this result that, for any fixed $k \geq 1$, any continuous function on $[0, 1]^d$ can be approximated by neural networks with $k$ hidden layers, using $\sigma$ on each hidden layer (and the identity on the output layer)?
Basically, how do we go from one hidden layer to $k$ hidden layers?
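To make the target precise, by a network with $k$ hidden layers I mean (standard notation, as I understand it) a function of the form
$$ g = A_{k+1} \circ \sigma \circ A_k \circ \cdots \circ \sigma \circ A_1, $$
where each $A_j$ is an affine map between spaces of the appropriate dimensions, $\sigma$ is applied componentwise after each of the first $k$ affine maps, and the final map $A_{k+1}$ takes values in $\mathbb{R}$ (identity activation on the output).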