Gradient Descent

Gradient descent is a technique for iteratively minimizing a convex function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\) by repeatedly taking steps in the direction of its negative gradient. We define the gradient of \(f\) to be the unique function \(\nabla f\) that satisfies:

\[\lim_{p \rightarrow 0} \frac{f(x+p) - f(x) - \nabla f(x)^{T}p}{\|p\|} = 0\]
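As a quick check of this definition, take \(f(x) = \frac{1}{2}\|x\|^{2}\) and candidate gradient \(\nabla f(x) = x\). Then

\[f(x+p) - f(x) - x^{T}p = \tfrac{1}{2}\|p\|^{2},\]

so the quotient in the limit is \(\tfrac{1}{2}\|p\|\), which indeed goes to \(0\) as \(p \rightarrow 0\).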

The gradient of \(f\) is a vector which points in the direction of maximum increase of \(f\). As a corollary, the negative gradient \(- \nabla f(x)\) points in the direction of maximum decrease of \(f\). Intuitively, the gradient descent procedure minimizes \(f\) by repeatedly taking steps in this direction of maximum decrease. Formally we pick a starting point \(x_0\) and a step size \(\alpha\) and iterate the following procedure:

\[x_{t+1} = x_t - \alpha \, \nabla f(x_t)\]

If \(\nabla f\) is Lipschitz continuous with constant \(L\) and \(\alpha < 2/L\), then \(f(x_{t+1}) \leq f(x_t)\), with strict decrease whenever \(\nabla f(x_t) \neq 0\).
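The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a production optimizer; the quadratic objective and the step size are assumptions chosen so the iteration converges quickly.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, steps):
    """Iterate the update x_{t+1} = x_t - alpha * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - alpha * grad_f(x)
    return x

# Example: f(x) = (1/2)||x||^2, so grad_f(x) = x and the minimum is at 0.
# grad_f here is 1-Lipschitz, so any alpha < 2 guarantees descent.
x_min = gradient_descent(lambda x: x, x0=[3.0, -2.0], alpha=0.1, steps=200)
```

Each step shrinks the iterate by a factor of \(1 - \alpha\), so `x_min` ends up very close to the minimizer at the origin.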

Gradient Descent as Euler’s Method

There is another perspective from which we can derive the gradient descent procedure. Say we have a differential equation of the form \(\frac{dx}{dt} = g(x)\). We can use Euler’s method to solve this equation by choosing a starting point \(x_0\) and iteratively applying the following approximation for a step size \(\alpha\) until we reach a steady state \(x_s\) such that \(g(x_s) \simeq 0\):

\[x_{t+\alpha} \simeq x_t + \alpha \, \frac{dx}{dt} = x_t + \alpha \, g(x_t)\]

Now say \(g(x) = -\nabla f(x)\). Applying Euler’s method to this equation yields:

\[x_{t+\alpha} = x_{t} - \alpha \, \nabla f(x_t)\]

This is exactly the update step for gradient descent! Note that this differential equation reaches a steady state at \(x_s\) such that \(\nabla f(x_s) = 0\). By definition, such a point is a critical point of \(f\), and since \(f\) is convex it is a global minimum.
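The equivalence can be seen directly in code. Below is a generic Euler integrator; applying it to \(g(x) = -\nabla f(x)\) performs exactly the gradient descent update at every step. The quadratic \(f\) and step size are illustrative assumptions, as before.

```python
import numpy as np

def euler(g, x0, alpha, steps):
    """Euler's method for dx/dt = g(x): x_{t+alpha} ≈ x_t + alpha * g(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x + alpha * g(x)
    return x

# For f(x) = (1/2)||x||^2 we have grad_f(x) = x. Setting g = -grad_f,
# each Euler step is x - alpha * grad_f(x): the gradient descent update.
grad_f = lambda x: x
x_s = euler(lambda x: -grad_f(x), x0=[3.0, -2.0], alpha=0.1, steps=200)
```

The iteration settles at the steady state `x_s` where \(g(x_s) = -\nabla f(x_s) \simeq 0\), i.e. the minimum of \(f\).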