This setup is deliberately general: I don't know exactly how deep learning is applied to harder problems, so I wanted the statement to hold in as general a setting as possible.

  • Let \(0\leq n\in\mathbb Z\).
  • Let \(V_k\) be a \(d_k\)-dimensional inner product space with orthonormal basis \(e_1^{(k)},\dots,e_{d_k}^{(k)}\), for \(k=0,\dots,n+1\).
  • Let \(b_k\in V_k\) and \(W_k\in\mathcal L(V_{k-1},V_k)\) for \(k=1,\dots,n+1\).
  • Let \(\sigma:\mathbb R\to\mathbb R\) be such that
    • it is continuous,
    • monotonically increasing (not necessarily strictly),
    • absolutely continuous, with \(\sigma'(t)\leq 1\) wherever the derivative exists. Monotonicity alone already guarantees that the derivative exists (Lebesgue) almost everywhere; absolute continuity is what lets us recover \(\sigma\) by integrating \(\sigma'\) (without it, the Cantor function, which is continuous, increasing, and has \(\sigma'=0\) a.e., would break the Lipschitz bound below).
  • For \(v\in V_k\) with \(v=c_1e^{(k)}_1+\cdots+c_{d_k}e^{(k)}_{d_k}\), define \(\sigma(v):=\sigma(c_1)e^{(k)}_1+\cdots+\sigma(c_{d_k})e^{(k)}_{d_k}\).
  • Let \(x_0\in V_0\) be an input vector and define \(x_k = \sigma(W_kx_{k-1}+b_k)\) for \(k=1,\dots,n+1\).
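
To make the setup concrete, here is a minimal numerical sketch (my own illustration, not part of the question), assuming \(V_k=\mathbb R^{d_k}\) with the standard basis and \(\sigma=\tanh\), which is continuous, increasing, absolutely continuous, and satisfies \(\sigma'(t)=1-\tanh^2(t)\leq 1\). The widths and weights below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths d_0, ..., d_{n+1}; here n = 2, chosen arbitrarily.
dims = [4, 8, 8, 3]

# b_k in R^{d_k} and W_k : R^{d_{k-1}} -> R^{d_k}, for k = 1, ..., n+1.
Ws = [rng.standard_normal((dims[k], dims[k - 1])) for k in range(1, len(dims))]
bs = [rng.standard_normal(dims[k]) for k in range(1, len(dims))]

def forward(x0):
    # x_k = sigma(W_k x_{k-1} + b_k), with sigma = tanh applied componentwise.
    x = x0
    for W, b in zip(Ws, bs):
        x = np.tanh(W @ x + b)
    return x
```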

The point is that the network map \(x_0\mapsto x_{n+1}\) is then Lipschitz, with constant at most the product of operator norms \(\|W_{n+1}\|\cdots\|W_1\|\):
  • \(|\sigma(s)-\sigma(t)|\leq|s-t|\), since absolute continuity gives \(|\sigma(s)-\sigma(t)|=\bigl|\int_t^s\sigma'(u)\,du\bigr|\leq|s-t|\). Because \(\sigma\) acts componentwise in an orthonormal basis, it follows that \(\|\sigma(u)-\sigma(v)\|\leq\|u-v\|\) for all \(u,v\in V_k\).
  • If \(\|\tilde x_0 - x_0\| < \varepsilon\), and \(\tilde x_k := \sigma(W_k\tilde x_{k-1}+b_k)\) denotes the layer outputs for the perturbed input, then

    \begin{align*} \|\tilde x_{n+1}-x_{n+1}\| &= \|\sigma(W_{n+1}\tilde x_n + b_{n+1}) - \sigma(W_{n+1}x_n+b_{n+1})\| \\
    &\leq \|W_{n+1}\tilde x_n + b_{n+1} - W_{n+1}x_n - b_{n+1}\| \\
    &= \|W_{n+1}(\tilde x_n - x_n)\| \\
    &\leq \|W_{n+1}\|\|\tilde x_n - x_n\|\\
    &\leq \cdots \\
    &\leq \|W_{n+1}\|\cdots\|W_1\|\|\tilde x_0-x_0\| \\
    &\leq \|W_{n+1}\|\cdots\|W_1\|\varepsilon \end{align*}
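
Continuing the sketch above, the bound can be checked numerically. Here \(\|W_k\|\) is the operator norm induced by the inner-product norms, i.e. the largest singular value of the matrix of \(W_k\), which `np.linalg.norm(W, 2)` computes.

```python
# Lipschitz constant from the derivation: ||W_{n+1}|| ... ||W_1||,
# each factor being the spectral norm (largest singular value) of W_k.
lipschitz_bound = np.prod([np.linalg.norm(W, 2) for W in Ws])

eps = 1e-3
x0 = rng.standard_normal(dims[0])
direction = rng.standard_normal(dims[0])
# Perturb x0 by a vector of norm 0.9 * eps, so ||x0_tilde - x0|| < eps.
x0_tilde = x0 + 0.9 * eps * direction / np.linalg.norm(direction)

deviation = np.linalg.norm(forward(x0_tilde) - forward(x0))
assert deviation <= lipschitz_bound * eps  # the derived bound holds
```

In practice the observed deviation is typically far below this bound, since the product of operator norms ignores both the contraction from \(\sigma\) and any cancellation between layers.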