
I know this question may sound silly, but I cannot prove it.

In a Stanford slide (page 17), they define the formula for SGD with momentum like this:

$$ v_{t}=\rho v_{t-1}+\nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t}, $$

where:

  • $v_{t}$ is the momentum (velocity) value
  • $\rho$ is the friction coefficient, let's say it equals 0.9
  • $\nabla f(x_{t-1})$ is the gradient of the objective function at iteration $t-1$
  • $x_t$ are the parameters
  • $\alpha$ is the learning rate

However, this paper and many other documents define the equation like this:

$$ v_{t}=\rho v_{t-1}+\alpha \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}- v_{t}, $$

where $\rho$ and $\alpha$ still have the same values as in the previous formula.

I think it should be

$$v_{t}=\alpha \rho v_{t-1}+\alpha \nabla f(x_{t-1})$$

if we want to move the learning rate inside the velocity update.

In some other documents (this), or the common form of momentum, it is defined like this:

$$ v_{t}= \rho v_{t-1}+ (1- \rho) \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t} $$

I cannot understand how they can prove those equations are equivalent. Can someone help me?

CuCaRot

1 Answer


The first two equations are equivalent. The last equation can be equivalent if you scale $\alpha$ appropriately.

Equation 1

Consider the equation from the Stanford slide:

$$ v_{t}=\rho v_{t-1}+\nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t}, $$

Let's evaluate the first few $v_t$ (assuming $v_0 = 0$) so that we can arrive at a closed form solution:

$v_0 = 0 \\ v_1 = \rho v_0 + \nabla f(x_0) = \nabla f(x_0)\\ v_2 = \rho v_1 + \nabla f(x_1) = \rho \nabla f(x_0) + \nabla f(x_1)\\ v_3 = \rho v_2 + \nabla f(x_2) = \rho^2 \nabla f(x_0) + \rho \nabla f(x_1) + \nabla f(x_2)\\ \dots \\ v_t = \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} \nabla f(x_i) $

So the closed form update is:

$$x_t = x_{t-1} - \alpha \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} \nabla f(x_i)$$
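As a quick sanity check (a sketch, not from the slide — the quadratic objective $f(x) = x^2$ is my own choice), we can run the recursive update and confirm that $v_t$ matches the closed form sum:

```python
# Recursive momentum update from the Stanford slide:
#   v_t = rho * v_{t-1} + grad(x_{t-1});  x_t = x_{t-1} - alpha * v_t
# We verify v_t equals the closed form sum_{i=0}^{t-1} rho^(t-1-i) * grad(x_i).
rho, alpha = 0.9, 0.1
grad = lambda x: 2.0 * x          # gradient of the toy objective f(x) = x^2

x, v = 5.0, 0.0                   # v_0 = 0, as in the derivation
grads = []                        # store grad(x_i) for the closed-form check
for _ in range(5):
    grads.append(grad(x))
    v = rho * v + grads[-1]       # momentum accumulation
    x = x - alpha * v             # parameter update

t = len(grads)
v_closed = sum(rho ** (t - 1 - i) * g for i, g in enumerate(grads))
assert abs(v - v_closed) < 1e-9   # recursion and closed form agree
```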

Equation 2

Now consider the equation from the paper: $$ v_{t}=\rho v_{t-1}+\alpha \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}- v_{t}, $$

We again evaluate the first few $v_t$ to arrive at a closed form solution:

$v_0 = 0 \\ v_1 = \rho v_0 + \alpha \nabla f(x_0) = \alpha \nabla f(x_0)\\ v_2 = \rho v_1 + \alpha \nabla f(x_1) = \rho \alpha \nabla f(x_0) + \alpha \nabla f(x_1)\\ v_3 = \rho v_2 + \alpha \nabla f(x_2) = \rho^2 \alpha \nabla f(x_0) + \rho \alpha \nabla f(x_1) + \alpha \nabla f(x_2)\\ \dots \\ v_t = \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} \alpha \nabla f(x_i) $

So the closed form update is:

$$x_t = x_{t-1} - \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} \alpha \nabla f(x_i)$$

As you can see, this is equivalent to the previous closed form update. The only difference is whether $\alpha$ sits inside or outside the summation, and since $\alpha$ is a constant (assuming it does not change across iterations), it can be factored out, so the two updates are identical.
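To see this concretely, here is a small sketch (the quadratic objective is my own choice) running both formulations side by side; by induction, the second velocity is just $\alpha$ times the first, so the parameter trajectories coincide:

```python
# Formulation 1: v_t = rho*v + grad;       x_t = x - alpha*v_t
# Formulation 2: v_t = rho*v + alpha*grad; x_t = x - v_t
# With a constant alpha, both produce identical parameter iterates.
rho, alpha = 0.9, 0.1
grad = lambda x: 2.0 * x          # gradient of the toy objective f(x) = x^2

x1, v1 = 5.0, 0.0                 # formulation 1: alpha outside the velocity
x2, v2 = 5.0, 0.0                 # formulation 2: alpha inside the velocity
for _ in range(10):
    v1 = rho * v1 + grad(x1)
    x1 = x1 - alpha * v1
    v2 = rho * v2 + alpha * grad(x2)
    x2 = x2 - v2

assert abs(x1 - x2) < 1e-9        # same iterates; only v is scaled by alpha
```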

Equation 3

As for the last equation

$$ v_{t}= \rho v_{t-1}+ (1- \rho) \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t} $$

Let's do the same thing:

$v_0 = 0 \\ v_1 = \rho v_0 + (1-\rho) \nabla f(x_0) = (1-\rho) \nabla f(x_0)\\ v_2 = \rho v_1 + (1-\rho) \nabla f(x_1) = \rho (1-\rho) \nabla f(x_0) + (1-\rho) \nabla f(x_1)\\ v_3 = \rho v_2 + (1-\rho) \nabla f(x_2) = \rho^2 (1-\rho) \nabla f(x_0) + \rho (1-\rho) \nabla f(x_1) + (1-\rho) \nabla f(x_2)\\ \dots \\ v_t = \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} (1-\rho) \nabla f(x_i) $

And so the closed form update is:

$$x_t = x_{t-1} - \alpha \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} (1-\rho) \nabla f(x_i)$$

This equation is equivalent to the other two as long as you scale $\alpha$ by a factor of $\displaystyle \frac{1}{1-\rho}$.

user3667125