
I know this question may sound silly, but I cannot prove it.

In a Stanford slide (page 17), they define the formula for SGD with momentum like this:

$$ v_{t}=\rho v_{t-1}+\nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t}, $$

where:

  • $v_{t}$ is the momentum (velocity) value
  • $\rho$ is the friction coefficient, let's say it equals 0.9
  • $\nabla f(x_{t-1})$ is the gradient of the objective function at iteration $t-1$
  • $x_t$ are the parameters
  • $\alpha$ is the learning rate

However, this paper and many other documents define the equation like this:

$$ v_{t}=\rho v_{t-1}+\alpha \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}- v_{t}, $$

where $\rho$ and $\alpha$ still have the same values as in the previous formula.

I think it should be

$$v_{t}=\alpha \rho v_{t-1}+\alpha \nabla f(x_{t-1})$$

if we want to move the learning rate inside the velocity update.

In some other documents (this), or the common form of momentum, it is defined like this:

$$ v_{t}= \rho v_{t-1}+ (1- \rho) \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t} $$

I cannot understand how they can prove those equations are equivalent. Can someone help me?

CuCaRot

1 Answer


The first two equations are equivalent. The last equation can be equivalent if you scale $\alpha$ appropriately.

Equation 1

Consider the equation from the Stanford slide:

$$ v_{t}=\rho v_{t-1}+\nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t}, $$

Let's evaluate the first few $v_t$ (assuming $v_0 = 0$) so that we can arrive at a closed form solution:

$v_0 = 0 \\ v_1 = \rho v_0 + \nabla f(x_0) = \nabla f(x_0)\\ v_2 = \rho v_1 + \nabla f(x_1) = \rho \nabla f(x_0) + \nabla f(x_1)\\ v_3 = \rho v_2 + \nabla f(x_2) = \rho^2 \nabla f(x_0) + \rho \nabla f(x_1) + \nabla f(x_2)\\ \dots \\ v_t = \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} \nabla f(x_i) $

So the closed form update is:

$$x_t = x_{t-1} - \alpha \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} \nabla f(x_i)$$
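As a quick sanity check (a sketch, not from the slide — the quadratic objective $f(x) = x^2$ is my own choice), we can run the recursive update and confirm that $v_t$ matches the closed form sum:

```python
# Recursive momentum update from the Stanford slide:
#   v_t = rho * v_{t-1} + grad(x_{t-1});  x_t = x_{t-1} - alpha * v_t
# We verify v_t equals the closed form sum_{i=0}^{t-1} rho^(t-1-i) * grad(x_i).
rho, alpha = 0.9, 0.1
grad = lambda x: 2.0 * x          # gradient of the toy objective f(x) = x^2

x, v = 5.0, 0.0                   # v_0 = 0, as in the derivation
grads = []                        # store grad(x_i) for the closed-form check
for _ in range(5):
    grads.append(grad(x))
    v = rho * v + grads[-1]       # momentum accumulation
    x = x - alpha * v             # parameter update

t = len(grads)
v_closed = sum(rho ** (t - 1 - i) * g for i, g in enumerate(grads))
assert abs(v - v_closed) < 1e-9   # recursion and closed form agree
```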

Equation 2

Now consider the equation from the paper: $$ v_{t}=\rho v_{t-1}+\alpha \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}- v_{t}, $$

We again evaluate the first few $v_t$ to arrive at a closed form solution:

$v_0 = 0 \\ v_1 = \rho v_0 + \alpha \nabla f(x_0) = \alpha \nabla f(x_0)\\ v_2 = \rho v_1 + \alpha \nabla f(x_1) = \rho \alpha \nabla f(x_0) + \alpha \nabla f(x_1)\\ v_3 = \rho v_2 + \alpha \nabla f(x_2) = \rho^2 \alpha \nabla f(x_0) + \rho \alpha \nabla f(x_1) + \alpha \nabla f(x_2)\\ \dots \\ v_t = \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} \alpha \nabla f(x_i) $

So the closed form update is:

$$x_t = x_{t-1} - \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} \alpha \nabla f(x_i)$$

As you can see, this is equivalent to the previous closed form update. The only difference is whether $\alpha$ sits inside or outside the summation, and since $\alpha$ is a constant (assuming it does not change across iterations), it can be factored out, so the two updates are identical.
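To see this concretely, here is a small sketch (the quadratic objective is my own choice) running both formulations side by side; by induction, the second velocity is just $\alpha$ times the first, so the parameter trajectories coincide:

```python
# Formulation 1: v_t = rho*v + grad;       x_t = x - alpha*v_t
# Formulation 2: v_t = rho*v + alpha*grad; x_t = x - v_t
# With a constant alpha, both produce identical parameter iterates.
rho, alpha = 0.9, 0.1
grad = lambda x: 2.0 * x          # gradient of the toy objective f(x) = x^2

x1, v1 = 5.0, 0.0                 # formulation 1: alpha outside the velocity
x2, v2 = 5.0, 0.0                 # formulation 2: alpha inside the velocity
for _ in range(10):
    v1 = rho * v1 + grad(x1)
    x1 = x1 - alpha * v1
    v2 = rho * v2 + alpha * grad(x2)
    x2 = x2 - v2

assert abs(x1 - x2) < 1e-9        # same iterates; only v is scaled by alpha
```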

Equation 3

As for the last equation

$$ v_{t}= \rho v_{t-1}+ (1- \rho) \nabla f(x_{t-1}) \\ x_{t}=x_{t-1}-\alpha v_{t} $$

Let's do the same thing:

$v_0 = 0 \\ v_1 = \rho v_0 + (1-\rho) \nabla f(x_0) = (1-\rho) \nabla f(x_0)\\ v_2 = \rho v_1 + (1-\rho) \nabla f(x_1) = \rho (1-\rho) \nabla f(x_0) + (1-\rho) \nabla f(x_1)\\ v_3 = \rho v_2 + (1-\rho) \nabla f(x_2) = \rho^2 (1-\rho) \nabla f(x_0) + \rho (1-\rho) \nabla f(x_1) + (1-\rho) \nabla f(x_2)\\ \dots \\ v_t = \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} (1-\rho) \nabla f(x_i) $

And so the closed form update is:

$$x_t = x_{t-1} - \alpha \displaystyle \sum_{i=0}^{t-1} \rho^{t-1-i} (1-\rho) \nabla f(x_i)$$

This equation is equivalent to the other two as long as you scale $\alpha$ by a factor of $\displaystyle \frac{1}{1-\rho}$.

user3667125