Figure 1: Gradient Descent with momentum on a convex function
Figure 2: Gradient Descent with momentum on a non-convex function
We saw how we can use Gradient Descent to find the minimum of a function. Gradient Descent with Momentum is a variation that can speed up convergence and behave better on non-convex functions.
This algorithm is analogous to rolling a ball down a slope: the ball gradually accumulates momentum and rolls faster and faster.
Momentum can help the optimizer cross some of the local minima when the function is non-convex, as we can see in Figure 2.
Mathematically, the algorithm can be written as: $$ v_{n+1} = \beta \, v_n + \frac{df}{dx}(x_n) $$ $$ x_{n+1} = x_n - \alpha \, v_{n+1} $$
Here $\beta$ is the momentum term and $\alpha$ is the learning rate.
The momentum term $\beta$ is usually set to 0.9 or a nearby value. If we set $\beta$ to 0, we recover the plain Gradient Descent update.
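To make the update rule concrete, here is a minimal sketch in Python. The function name `momentum_gd` and the default parameter values are illustrative choices, not from the original; the loop simply applies the two equations above to a one-dimensional function.

```python
def momentum_gd(grad, x0, alpha=0.1, beta=0.9, n_steps=100):
    """Minimize a 1-D function via gradient descent with momentum."""
    x, v = x0, 0.0
    for _ in range(n_steps):
        v = beta * v + grad(x)   # v_{n+1} = beta * v_n + f'(x_n)
        x = x - alpha * v        # x_{n+1} = x_n - alpha * v_{n+1}
    return x

# Example: f(x) = x^2 has gradient 2x and its minimum at x = 0.
x_min = momentum_gd(grad=lambda x: 2 * x, x0=5.0)
print(x_min)  # approaches 0; with beta=0 this reduces to plain gradient descent
```

Note that the velocity `v` carries information from previous gradients, which is exactly what lets the iterate keep moving through flat regions or shallow local minima.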