
Gradient Descent vs L-BFGS-B

Gradient Descent: We saw the basic details of gradient descent in the previous example. Gradient descent is a first-order iterative optimization algorithm for finding a minimum of a differentiable function.
The mathematical definition of the gradient descent update is: $$ x_{n+1} = x_n - \alpha \, \frac{df}{dx}(x_n) $$

$\alpha$ is the learning rate. The value of $x_{n+1}$ is computed iteratively until it converges to a minimum.
It is called a first-order method because it uses only the first-order derivative of the function.
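
To make the update rule concrete, here is a minimal plain-Python sketch of gradient descent applied to $y = x^2$ (whose derivative is $2x$); the starting point, learning rate, and iteration count are arbitrary choices for illustration:

```python
# Minimal gradient descent sketch for f(x) = x**2, with f'(x) = 2*x.
def gradient_descent(df, x0, alpha=0.1, iterations=100):
    """Repeatedly apply x_{n+1} = x_n - alpha * f'(x_n)."""
    x = x0
    for _ in range(iterations):
        x = x - alpha * df(x)
    return x

x_min = gradient_descent(df=lambda x: 2 * x, x0=8.0)
print(x_min)  # approaches 0, the minimizer of x**2
```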

Newton’s method uses the second-order derivatives of the function to converge to the minimum faster. To decide the next point, Newton’s method uses the Hessian matrix.
The Hessian matrix of a function $f(x)$ is defined as:

$$H_f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}\\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n}\\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

or, using indices $i$ and $j$:

$$(H_f)_{i,j} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

Newton’s method performs the following iteration to compute the next point: $$ x_{n+1} = x_n - \alpha H_f^{-1} \nabla f(x_n) $$ where $\nabla f(x_n)$ is the gradient of $f$ at $x_n$.
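
As a sketch of how the Hessian enters the update, the following NumPy example applies the Newton step to a hypothetical quadratic $f(x, y) = x^2 + 3y^2$ (chosen only for illustration); with $\alpha = 1$, a single Newton step lands exactly on the minimum of a quadratic:

```python
import numpy as np

# Hypothetical example: f(x, y) = x**2 + 3*y**2.
def grad(p):
    x, y = p
    return np.array([2.0 * x, 6.0 * y])

def hessian(p):
    # The Hessian of this quadratic is constant.
    return np.array([[2.0, 0.0],
                     [0.0, 6.0]])

p = np.array([4.0, -3.0])  # starting point
alpha = 1.0                # full Newton step
for _ in range(5):
    p = p - alpha * np.linalg.inv(hessian(p)) @ grad(p)
print(p)  # [0. 0.] -- the exact minimum, reached on the first step
```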

The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is a quasi-Newton method. Quasi-Newton methods are an alternative to Newton’s method: the expensive operation in Newton’s method is computing the inverse of the Hessian matrix ($H^{-1}$), so quasi-Newton methods instead build an approximation to it from successive gradient evaluations. However, BFGS still stores this approximation as an $n \times n$ matrix. If the machine learning model has a large number of parameters (hundreds of thousands to millions), this matrix becomes too large to fit in memory.

Limited-memory BFGS (L-BFGS) is an improvement over BFGS that optimizes memory usage. L-BFGS uses the same Hessian-approximation approach, but instead of storing a large $n \times n$ matrix it stores only a small set of vectors from recent iterations. This reduces memory usage significantly, so L-BFGS can be used as the optimization algorithm for building large machine learning models.
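
For instance, SciPy's L-BFGS-B implementation exposes the number of stored correction vector pairs through the `maxcor` option (10 by default), so memory grows roughly linearly with $n$ rather than quadratically. A minimal sketch on a high-dimensional quadratic (the dimension and starting point are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

n = 100_000  # far too many parameters for a dense n x n Hessian approximation
result = minimize(
    fun=lambda x: np.sum(x ** 2),   # objective: sum of squares
    x0=np.full(n, 5.0),             # starting point
    jac=lambda x: 2.0 * x,          # gradient
    method="L-BFGS-B",
    options={"maxcor": 10},         # number of stored correction pairs
)
print(result.fun)  # close to 0
```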

L-BFGS-B extends L-BFGS with support for bound constraints on the variables. We will use the L-BFGS-B algorithm provided by the SciPy library to minimize functions and compare it with gradient descent from the TensorFlow library.
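
As a rough sketch of what that comparison looks like in code (assuming SciPy and TensorFlow 2.x; the starting point, learning rate, and iteration count are arbitrary), the same function $y = x^2$ can be minimized with both approaches:

```python
import tensorflow as tf
from scipy.optimize import minimize

# L-BFGS-B via SciPy, with the bound constraint x in (-10, 10).
result = minimize(
    fun=lambda x: x[0] ** 2,        # objective
    x0=[8.0],                       # starting point
    jac=lambda x: [2.0 * x[0]],     # gradient
    method="L-BFGS-B",
    bounds=[(-10.0, 10.0)],
)
print("L-BFGS-B:", result.x, "iterations:", result.nit)

# Gradient descent via TensorFlow's SGD optimizer.
x = tf.Variable(8.0)
opt = tf.keras.optimizers.SGD(learning_rate=0.1)
for _ in range(100):
    with tf.GradientTape() as tape:
        loss = x * x
    opt.apply_gradients([(tape.gradient(loss, x), x)])
print("Gradient descent:", x.numpy())
```

On simple functions like these, L-BFGS-B typically needs far fewer iterations than plain gradient descent, which is what the animations below illustrate.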

The animations below show the difference between gradient descent and L-BFGS-B in reaching the minimum for two functions:

Figure 1: Gradient Descent vs L-BFGS-B on the function $y = x^2$ where $x \in (-10, 10)$


Figure 2: Gradient Descent vs L-BFGS-B on the function $y = e^{-x/2}$ where $x \in (-10, 0)$

References:

- Implementation of Gradient Descent and L-BFGS-B for $y = x^2$
- Implementation of Gradient Descent and L-BFGS-B for $y = e^{-x/2}$


