Theorem: Convergence of Gradient Descent

Conditions:

  1. Convexity: The cost function \(f(x)\) is convex, meaning for any two points \(x\) and \(y \) in the domain of \(f\), the following inequality holds for all \(\alpha \in [0, 1]\):

    \[f(\alpha x + (1 - \alpha) y) \leq \alpha f(x) + (1 - \alpha) f(y).\]

For a differentiable convex function, this is equivalent to the first-order condition:

\[f(x) + \nabla f(x)(y - x) \leq f(y).\]
  2. Lipschitz Continuity of the Gradient: The gradient of \(f(x)\), denoted \(\nabla f(x)\), is Lipschitz continuous, meaning there exists a constant \(L > 0\) such that:

    \[\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|\]

    for all \(x\) and \(y\) in the domain of \(f\).

From the Lipschitz continuity of the gradient, together with the fundamental theorem of calculus, it can be shown that the cost function satisfies the quadratic upper bound

\[ f(y) \leq f(x) + \nabla f(x)(y-x) + \frac{L}{2}||y-x||^2. \]
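As a quick numerical sanity check, the sketch below verifies the first-order convexity condition, the Lipschitz bound on the gradient, and the resulting quadratic upper bound for a hypothetical convex quadratic \(f(x) = \frac{1}{2}x^\top A x\), whose gradient \(Ax\) is Lipschitz with constant \(L = \lambda_{\max}(A)\).

```python
import numpy as np

# Hypothetical example: f(x) = 0.5 * x^T A x with A symmetric positive definite;
# its gradient A x is Lipschitz with constant L = largest eigenvalue of A.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)          # symmetric positive definite matrix
L = np.linalg.eigvalsh(A).max()  # Lipschitz constant of the gradient

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x, y = rng.standard_normal(3), rng.standard_normal(3)

# First-order convexity condition: f(x) + grad f(x)(y - x) <= f(y)
assert f(x) + grad(x) @ (y - x) <= f(y) + 1e-12

# Lipschitz continuity of the gradient
assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-12

# Quadratic upper bound: f(y) <= f(x) + grad f(x)(y - x) + (L/2) ||y - x||^2
assert f(y) <= f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2 + 1e-12
```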

Theorem: Convergence Analysis of Gradient Descent for Convex and Smooth Functions

Let \( f: \mathbb{R}^n \rightarrow \mathbb{R} \) be a convex and continuously differentiable function. Assume the gradient of \( f \), denoted by \( \nabla f(x) \), is Lipschitz continuous with Lipschitz constant \( L > 0 \). If the learning rate \( \alpha \) satisfies \( 0 < \alpha \leq \frac{1}{L} \), then the sequence \( \{x_k\} \) generated by the gradient descent update rule

\[ x_{k+1} = x_k - \alpha \nabla f(x_k) \]

converges to the global minimum of \( f \), \(x^*=\arg \min_x f(x)\), in the sense that \(f(x_k) \rightarrow f(x^*)\) at rate \(O(1/k)\).
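A minimal implementation sketch of this update rule is given below; the quadratic objective, its gradient, and the starting point are hypothetical choices used only to illustrate the iteration with \(\alpha = 1/L\).

```python
import numpy as np

def gradient_descent(grad, x0, alpha, num_iters=1000):
    """Iterate x_{k+1} = x_k - alpha * grad(x_k) and return the trajectory."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(num_iters):
        xs.append(xs[-1] - alpha * grad(xs[-1]))
    return xs

# Hypothetical example: minimize f(x) = 0.5 * x^T A x - b^T x,
# whose gradient A x - b is Lipschitz with constant L = lambda_max(A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()

xs = gradient_descent(lambda x: A @ x - b, x0=[5.0, -5.0], alpha=1.0 / L)
x_star = np.linalg.solve(A, b)           # exact minimizer, for comparison
print(np.linalg.norm(xs[-1] - x_star))   # distance to x* after 1000 steps (small)
```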

Proof:

The convexity of \( f \) ensures that any local minimum is a global minimum, and the first-order condition applied at \(x_k\) and \(x^*\) gives

\[ f(x_k) \leq f(x^*) + \nabla f(x_k)(x_k - x^*). \]

The Lipschitz continuity condition implies that for any \( x, y \) in the domain of \( f \), the following inequality holds:

\[ \|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \]

which, combined with the quadratic upper bound derived above and the update rule \(x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)\) (taking \(\alpha = 1/L\)), gives:

\[ f(x_{k+1}) \leq f(x_k) + \nabla f(x_k)(x_{k+1}-x_k)+\frac{L}{2}||x_{k+1}-x_{k}||^2 \]
\[ f(x_{k+1}) \leq f(x_k) -\frac{1}{L}|| \nabla f(x_k)||^2+\frac{1}{2L}||\nabla f(x_k)||^2 \]
\[ f(x_{k+1}) \leq f(x_k) - \frac{1}{2L}||\nabla f(x_k)||^2. \]

Hence the sequence of function values is monotonically non-increasing. Next, combine this with the convexity bound \(f(x_k) \leq f(x^*) + \nabla f(x_k)(x_k - x^*)\) established above to relate the decrease to the minimizer \(x^*\):

\[ f(x_{k+1}) \leq f(x^*) + \nabla f(x_k)(x_{k}-x^*) - \frac{1}{2L}||\nabla f(x_k)||^2 \]

Completing the square in the last two terms gives the exact identity

\[ \nabla f(x_k)(x_{k}-x^*) - \frac{1}{2L}||\nabla f(x_k)||^2 = \frac{L}{2}||x_{k}-x^*||^2 - \frac{L}{2}||x_{k}-x^*-\frac{1}{L}\nabla f(x_{k})||^2, \]

and since \(x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)\), the vector inside the last norm is exactly \(x_{k+1}-x^*\), so

\[ f(x_{k+1}) \leq f(x^*) + \frac{L}{2}||x_{k}-x^*||^2 - \frac{L}{2}||x_{k+1}-x^*||^2. \]

Summing this inequality for \(k=0,\ldots,n-1\), the right-hand side telescopes, giving

\[ \sum_{k=0}^{n-1}\big(f(x_{k+1})-f(x^*)\big)\leq \frac{L}{2}\big(||x_{0}-x^*||^2- ||x_{n}-x^*||^2 \big)\leq \frac{L}{2}||x_0-x^* ||^2.\]

As the sequence of function values is non-increasing, every term in the sum satisfies

\[ f(x_n)-f(x^*)\leq f(x_{k+1})-f(x^*) \]

for \(k+1 \leq n\), so the sum is at least \(n\big(f(x_n)-f(x^*)\big)\), which gives

\[ f(x_n)-f(x^*)\leq \frac{L}{2n}||x_0-x^* ||^2. \]

Hence \(f(x_n) \rightarrow f(x^*)\) at rate \(O(1/n)\), which completes the proof.
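The \(O(1/n)\) bound can also be observed numerically; the sketch below checks \(f(x_n)-f(x^*) \leq \frac{L}{2n}||x_0-x^*||^2\) along the iterates for a hypothetical quadratic objective.

```python
import numpy as np

# Hypothetical quadratic: f(x) = 0.5 * x^T A x - b^T x with minimizer x* = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()
f = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.solve(A, b)
x0 = np.array([5.0, -5.0])

x = x0.copy()
for n in range(1, 201):
    x = x - (A @ x - b) / L                       # gradient step with alpha = 1/L
    bound = L * np.linalg.norm(x0 - x_star) ** 2 / (2 * n)
    assert f(x) - f(x_star) <= bound + 1e-12      # the O(1/n) bound holds at step n
```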

Proof Sketch:

  1. Convexity: First, note that the convexity of \( f \) ensures that any local minimum is a global minimum.

  2. Lipschitz Continuity of Gradient: The Lipschitz continuity condition implies that for any \( x, y \) in the domain of \( f \), the following inequality holds:

\( \|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \)

  3. Objective Function Decrease: Using Lipschitz continuity, one can show that the gradient descent update decreases the function value:

\( f(x_{k+1}) \leq f(x_k) - \alpha \langle \nabla f(x_k), \nabla f(x_k) \rangle + \frac{\alpha^2L}{2}\|\nabla f(x_k)\|^2 \)

  4. Bounding the Decrease: By exploiting the Lipschitz continuity of \( \nabla f \) together with the step-size condition \( \alpha \leq 1/L \), one can further bound the decrease in the objective function (a numerical check follows after this list):

\( f(x_{k+1}) - f(x_k) \leq -\frac{\alpha}{2} \|\nabla f(x_k)\|^2 \)

  5. Convergence: Using the bound on the decrease in the objective function, it can be shown that the sequence \( \{f(x_k)\} \) is monotonically decreasing and bounded from below, implying convergence. Furthermore, since \( \|\nabla f(x_k)\| \) approaches zero as \( k \) increases, the sequence \( \{x_k\} \) converges to a point where the gradient is zero, which is a stationary point of \( f \). Due to convexity, this stationary point is the global minimum.
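The decrease bound of steps 3 and 4 can be checked numerically for several step sizes \(0 < \alpha \leq 1/L\); the quadratic below is again a hypothetical example.

```python
import numpy as np

# Hypothetical quadratic with L-Lipschitz gradient; check the decrease bound
# f(x_{k+1}) - f(x_k) <= -(alpha / 2) * ||grad f(x_k)||^2 for step sizes alpha <= 1/L.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

for alpha in [1.0 / L, 0.5 / L, 0.1 / L]:
    x = np.array([5.0, -5.0])
    for _ in range(100):
        x_next = x - alpha * grad(x)
        assert f(x_next) - f(x) <= -0.5 * alpha * np.linalg.norm(grad(x)) ** 2 + 1e-12
        x = x_next
```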

Stochastic Gradient Descent (SGD)

Let \( f: \mathbb{R}^n \rightarrow \mathbb{R} \) be a convex function whose gradient is Lipschitz continuous with constant \(L > 0\), and let \(x^*=\arg \min_x f(x)\). Consider an instance of SGD where the stochastic gradient estimates \(v_i\) are unbiased and have bounded variance: for all \(i \geq 0\), \(\mathrm{Var}(v_i) \leq \sigma^2\).

Then, for any \(k > 1\), SGD with step size (learning rate) \(t \leq 1/L\) satisfies

\[ E[f(\bar{x}_k)]\leq f(x^*) +\frac{||x_0 - x^*||_2^2}{2tk}+ \frac{t\sigma^2}{2},\]

where \(\bar{x}_k = \frac{1}{k}(x_1 + \cdots + x_k)\). In particular, \(k = (\sigma^2 + L||x_0 - x^*||^2)/\epsilon^2\) iterations suffice to find, in expectation, a \(2\epsilon\)-approximate optimal point by setting \(t = 1/\sqrt{k}\).
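A minimal SGD sketch with iterate averaging is given below; the least-squares objective, the sampling scheme, and the step size \(t = 1/\sqrt{k}\) are hypothetical choices used only to illustrate the averaged iterate \(\bar{x}_k\).

```python
import numpy as np

# Hypothetical example: least-squares f(x) = (1 / 2m) * sum_i (a_i^T x - y_i)^2.
# A single uniformly sampled row gives an unbiased gradient estimate v_i.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 5))
x_true = rng.standard_normal(5)
y = A @ x_true + 0.1 * rng.standard_normal(500)   # noisy targets

def sgd_average(k, t):
    """Run k SGD steps with constant step size t and return the averaged iterate."""
    x = np.zeros(5)
    running_sum = np.zeros(5)
    for _ in range(k):
        i = rng.integers(len(A))
        v = A[i] * (A[i] @ x - y[i])   # stochastic gradient estimate v_i
        x = x - t * v
        running_sum += x
    return running_sum / k             # x_bar_k = (x_1 + ... + x_k) / k

k = 10_000
x_bar = sgd_average(k, t=1.0 / np.sqrt(k))
x_star = np.linalg.lstsq(A, y, rcond=None)[0]     # exact least-squares minimizer
print(np.linalg.norm(x_bar - x_star))             # averaged iterate is close to x*
```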