Theorem: Convergence of Gradient Descent

Conditions:

  1. Convexity: The cost function \(f(x)\) is convex, meaning for any two points \(x\) and \(y \) in the domain of \(f\), the following inequality holds for all \(\alpha \in [0, 1]\):

    \[f(\alpha x + (1 - \alpha) y) \leq \alpha f(x) + (1 - \alpha) f(y).\]

For a differentiable convex function, this is equivalent to the first-order condition:

\[f(x) + \nabla f(x)(y - x) \leq f(y).\]
  2. Lipschitz Continuity of the Gradient: The gradient of \(f(x)\), denoted \(\nabla f(x)\), is Lipschitz continuous, meaning there exists a constant \(L > 0\) such that:

    \[\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|\]

    for all \(x\) and \(y\) in the domain of \(f\).

From the Lipschitz continuity of the gradient, together with the fundamental theorem of calculus, it can be shown that the cost function satisfies the quadratic upper bound

\[ f(y) \leq f(x) + \nabla f(x)(y-x) + \frac{L}{2}||y-x||^2. \]
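As a quick numerical sanity check, the sketch below verifies the first-order convexity condition, the Lipschitz bound on the gradient, and the resulting quadratic upper bound for a hypothetical convex quadratic \(f(x) = \frac{1}{2}x^\top A x\), whose gradient \(Ax\) is Lipschitz with constant \(L = \lambda_{\max}(A)\).

```python
import numpy as np

# Hypothetical example: f(x) = 0.5 * x^T A x with A symmetric positive definite;
# its gradient A x is Lipschitz with constant L = largest eigenvalue of A.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)          # symmetric positive definite matrix
L = np.linalg.eigvalsh(A).max()  # Lipschitz constant of the gradient

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x, y = rng.standard_normal(3), rng.standard_normal(3)

# First-order convexity condition: f(x) + grad f(x)(y - x) <= f(y)
assert f(x) + grad(x) @ (y - x) <= f(y) + 1e-12

# Lipschitz continuity of the gradient
assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-12

# Quadratic upper bound: f(y) <= f(x) + grad f(x)(y - x) + (L/2) ||y - x||^2
assert f(y) <= f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2 + 1e-12
```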

Theorem: Convergence Analysis of Gradient Descent for Convex and Smooth Functions

Let \( f: \mathbb{R}^n \rightarrow \mathbb{R} \) be a convex and continuously differentiable function. Assume the gradient of \( f \), denoted by \( \nabla f(x) \), is Lipschitz continuous with Lipschitz constant \( L > 0 \). If the learning rate \( \alpha \) satisfies \( 0 < \alpha \leq \frac{1}{L} \), then the sequence \( \{x_k\} \) generated by the gradient descent update rule

\[ x_{k+1} = x_k - \alpha \nabla f(x_k) \]

converges to the global minimum of \( f \), \(x^*=\arg \min_x f(x)\), in the sense that \(f(x_k) \rightarrow f(x^*)\) at rate \(O(1/k)\).
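A minimal implementation sketch of this update rule is given below; the quadratic objective, its gradient, and the starting point are hypothetical choices used only to illustrate the iteration with \(\alpha = 1/L\).

```python
import numpy as np

def gradient_descent(grad, x0, alpha, num_iters=1000):
    """Iterate x_{k+1} = x_k - alpha * grad(x_k) and return the trajectory."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(num_iters):
        xs.append(xs[-1] - alpha * grad(xs[-1]))
    return xs

# Hypothetical example: minimize f(x) = 0.5 * x^T A x - b^T x,
# whose gradient A x - b is Lipschitz with constant L = lambda_max(A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()

xs = gradient_descent(lambda x: A @ x - b, x0=[5.0, -5.0], alpha=1.0 / L)
x_star = np.linalg.solve(A, b)           # exact minimizer, for comparison
print(np.linalg.norm(xs[-1] - x_star))   # distance to x* after 1000 steps (small)
```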

Proof:

The convexity of \( f \) ensures that any local minimum is a global minimum, and the first-order condition applied at \(x_k\) and \(x^*\) gives

\[ f(x_k) \leq f(x^*) + \nabla f(x_k)(x_k - x^*). \]

The Lipschitz continuity condition implies that for any \( x, y \) in the domain of \( f \), the following inequality holds:

\[ \|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \]

which, combined with the quadratic upper bound derived above and the update rule \(x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)\) (taking \(\alpha = 1/L\)), gives:

\[ f(x_{k+1}) \leq f(x_k) + \nabla f(x_k)(x_{k+1}-x_k)+\frac{L}{2}||x_{k+1}-x_{k}||^2 \]
\[ f(x_{k+1}) \leq f(x_k) -\frac{1}{L}|| \nabla f(x_k)||^2+\frac{1}{2L}||\nabla f(x_k)||^2 \]
\[ f(x_{k+1}) \leq f(x_k) - \frac{1}{2L}||\nabla f(x_k)||^2. \]

Hence the sequence of function values is monotonically non-increasing. Next, combine this with the convexity bound \(f(x_k) \leq f(x^*) + \nabla f(x_k)(x_k - x^*)\) established above to relate the decrease to the minimizer \(x^*\):

\[ f(x_{k+1}) \leq f(x^*) + \nabla f(x_k)(x_{k}-x^*) - \frac{1}{2L}||\nabla f(x_k)||^2 \]

Completing the square in the last two terms gives the exact identity

\[ \nabla f(x_k)(x_{k}-x^*) - \frac{1}{2L}||\nabla f(x_k)||^2 = \frac{L}{2}||x_{k}-x^*||^2 - \frac{L}{2}||x_{k}-x^*-\frac{1}{L}\nabla f(x_{k})||^2, \]

and since \(x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)\), the vector inside the last norm is exactly \(x_{k+1}-x^*\), so

\[ f(x_{k+1}) \leq f(x^*) + \frac{L}{2}||x_{k}-x^*||^2 - \frac{L}{2}||x_{k+1}-x^*||^2. \]

Summing this inequality for \(k=0,\ldots,n-1\), the right-hand side telescopes, giving

\[ \sum_{k=0}^{n-1}\big(f(x_{k+1})-f(x^*)\big)\leq \frac{L}{2}\big(||x_{0}-x^*||^2- ||x_{n}-x^*||^2 \big)\leq \frac{L}{2}||x_0-x^* ||^2.\]

As the sequence of function values is non-increasing, every term in the sum satisfies

\[ f(x_n)-f(x^*)\leq f(x_{k+1})-f(x^*) \]

for \(k+1 \leq n\), so the sum is at least \(n\big(f(x_n)-f(x^*)\big)\), which gives

\[ f(x_n)-f(x^*)\leq \frac{L}{2n}||x_0-x^* ||^2. \]

Hence \(f(x_n) \rightarrow f(x^*)\) at rate \(O(1/n)\), which completes the proof.
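The \(O(1/n)\) bound can also be observed numerically; the sketch below checks \(f(x_n)-f(x^*) \leq \frac{L}{2n}||x_0-x^*||^2\) along the iterates for a hypothetical quadratic objective.

```python
import numpy as np

# Hypothetical quadratic: f(x) = 0.5 * x^T A x - b^T x with minimizer x* = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()
f = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.solve(A, b)
x0 = np.array([5.0, -5.0])

x = x0.copy()
for n in range(1, 201):
    x = x - (A @ x - b) / L                       # gradient step with alpha = 1/L
    bound = L * np.linalg.norm(x0 - x_star) ** 2 / (2 * n)
    assert f(x) - f(x_star) <= bound + 1e-12      # the O(1/n) bound holds at step n
```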

Proof Sketch:

  1. Convexity: First, note that the convexity of \( f \) ensures that any local minimum is a global minimum.

  2. Lipschitz Continuity of Gradient: The Lipschitz continuity condition implies that for any \( x, y \) in the domain of \( f \), the following inequality holds:

\( \|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \)

  3. Objective Function Decrease: Using Lipschitz continuity, one can show that the gradient descent update decreases the function value:

\( f(x_{k+1}) \leq f(x_k) - \alpha \langle \nabla f(x_k), \nabla f(x_k) \rangle + \frac{\alpha^2L}{2}\|\nabla f(x_k)\|^2 \)

  4. Bounding the Decrease: By exploiting the Lipschitz continuity of \( \nabla f \) together with the step-size condition \( \alpha \leq 1/L \), one can further bound the decrease in the objective function (a numerical check follows after this list):

\( f(x_{k+1}) - f(x_k) \leq -\frac{\alpha}{2} \|\nabla f(x_k)\|^2 \)

  5. Convergence: Using the bound on the decrease in the objective function, it can be shown that the sequence \( \{f(x_k)\} \) is monotonically decreasing and bounded from below, implying convergence. Furthermore, since \( \|\nabla f(x_k)\| \) approaches zero as \( k \) increases, the sequence \( \{x_k\} \) converges to a point where the gradient is zero, which is a stationary point of \( f \). Due to convexity, this stationary point is the global minimum.
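The decrease bound of steps 3 and 4 can be checked numerically for several step sizes \(0 < \alpha \leq 1/L\); the quadratic below is again a hypothetical example.

```python
import numpy as np

# Hypothetical quadratic with L-Lipschitz gradient; check the decrease bound
# f(x_{k+1}) - f(x_k) <= -(alpha / 2) * ||grad f(x_k)||^2 for step sizes alpha <= 1/L.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

for alpha in [1.0 / L, 0.5 / L, 0.1 / L]:
    x = np.array([5.0, -5.0])
    for _ in range(100):
        x_next = x - alpha * grad(x)
        assert f(x_next) - f(x) <= -0.5 * alpha * np.linalg.norm(grad(x)) ** 2 + 1e-12
        x = x_next
```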

Stochastic Gradient Descent (SGD)

Let \( f: \mathbb{R}^n \rightarrow \mathbb{R} \) be a convex function whose gradient is Lipschitz continuous with constant \(L > 0\), and let \(x^*=\arg \min_x f(x)\). Consider an instance of SGD where the stochastic gradient estimates \(v_i\) are unbiased and have bounded variance: for all \(i \geq 0\), \(\mathrm{Var}(v_i) \leq \sigma^2\).

Then, for any \(k > 1\), SGD with step size (learning rate) \(t \leq 1/L\) satisfies

\[ E[f(\bar{x}_k)]\leq f(x^*) +\frac{||x_0 - x^*||_2^2}{2tk}+ \frac{t\sigma^2}{2},\]

where \(\bar{x}_k = \frac{1}{k}(x_1 + \cdots + x_k)\). In particular, \(k = (\sigma^2 + L||x_0 - x^*||^2)/\epsilon^2\) iterations suffice to find, in expectation, a \(2\epsilon\)-approximate optimal point by setting \(t = 1/\sqrt{k}\).
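A minimal SGD sketch with iterate averaging is given below; the least-squares objective, the sampling scheme, and the step size \(t = 1/\sqrt{k}\) are hypothetical choices used only to illustrate the averaged iterate \(\bar{x}_k\).

```python
import numpy as np

# Hypothetical example: least-squares f(x) = (1 / 2m) * sum_i (a_i^T x - y_i)^2.
# A single uniformly sampled row gives an unbiased gradient estimate v_i.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 5))
x_true = rng.standard_normal(5)
y = A @ x_true + 0.1 * rng.standard_normal(500)   # noisy targets

def sgd_average(k, t):
    """Run k SGD steps with constant step size t and return the averaged iterate."""
    x = np.zeros(5)
    running_sum = np.zeros(5)
    for _ in range(k):
        i = rng.integers(len(A))
        v = A[i] * (A[i] @ x - y[i])   # stochastic gradient estimate v_i
        x = x - t * v
        running_sum += x
    return running_sum / k             # x_bar_k = (x_1 + ... + x_k) / k

k = 10_000
x_bar = sgd_average(k, t=1.0 / np.sqrt(k))
x_star = np.linalg.lstsq(A, y, rcond=None)[0]     # exact least-squares minimizer
print(np.linalg.norm(x_bar - x_star))             # averaged iterate is close to x*
```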