I'm seeing different ways to define momentum, and I'm not sure whether there is a significant difference between them.

As far as I can tell, they do a similar thing both mathematically and in practice, but I'm curious to know if there's a significant difference that I'm not seeing.

1st Definition

Momentum: $v\theta_{t} = \beta * v\theta_{t-1} + (1 - \beta) * ∂\theta_t$

Update: $\theta = \theta - \alpha * v\theta_t$

2nd Definition

Momentum: $v\theta_t = \beta * v\theta_{t-1} - \alpha *∂\theta_t$

Update: $\theta = \theta + v\theta_t$

3rd Definition

Momentum: $v\theta_t = \beta * v\theta_{t-1} + ∂\theta_t$

Update: $\theta = \theta - \alpha * v\theta_t$

From my understanding, the first definition explicitly uses the exponentially weighted average with the $\beta$ hyperparameter, which then can increase the stability of the model when using momentum and makes $\beta$ more intuitive to tune.

I'm not sure why I'd want to use the 2nd or 3rd definition over the 1st, or whether they differ in any significant way.

vxnuaj

1 Answer

All three are variants of the same idea, an exponentially weighted average of the gradients, so they do not differ significantly, although each may suit different cases.

Nowadays the standard definition of momentum is usually either your 1st or your 3rd definition, and those two are the most similar. The main difference is the scaling factor $(1-\beta)$ in the 1st definition, which reduces the contribution of the current gradient relative to the past accumulated gradients. This makes the 1st definition a bit more stable and less sensitive to the choice of $\beta$, which in practice is commonly set to $0.9$. The same kind of averaging is used in the popular and more adaptive Adam algorithm, which also keeps a running average of squared gradients (as in RMSProp).
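
To make the role of that scaling factor concrete, here is a short sketch. It assumes $v\theta_0 = 0$ and writes $g_k$ (a symbol introduced here, not in the question) for the gradient $∂\theta_k$ at step $k$. Unrolling the 1st and 3rd recursions gives:

$$v\theta_t^{(1)} = (1-\beta)\sum_{k=1}^{t}\beta^{t-k} g_k, \qquad v\theta_t^{(3)} = \sum_{k=1}^{t}\beta^{t-k} g_k$$

Under these assumptions the 1st definition's buffer is just the 3rd's multiplied by $(1-\beta)$. The weights in the 1st sum add up to $1-\beta^t$, which tends to $1$, so it is a true weighted average of gradients. The weights in the 3rd add up to $(1-\beta^t)/(1-\beta)$, which tends to $1/(1-\beta)$, or $10$ for $\beta = 0.9$, so the same $\alpha$ produces a roughly ten times larger step.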

Your 2nd definition is the original momentum formulation from Polyak's 1964 article on solving functional equations, and it can easily be converted to your 3rd definition. Substituting $v\theta_t = -\alpha u_t$ (with a constant $\alpha$) gives $u_t = \beta u_{t-1} + ∂\theta_t$ and $\theta = \theta - \alpha u_t$, which is exactly your 3rd definition with the same $\beta$. The two only behave differently when $\alpha$ changes during training, because the 2nd definition folds the learning rate into the velocity itself.
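
As a rough numerical check of how the three definitions relate, here is a minimal NumPy sketch. The toy quadratic loss, the starting point, the helper `run`, and the hyperparameter values are all assumptions made for illustration, not something from the question. It compares the parameter trajectories, giving the 1st definition a learning rate of $\alpha / (1 - \beta)$ to offset its $(1-\beta)$ factor.

```python
import numpy as np

# Toy quadratic loss 0.5 * theta^T A theta (an assumption, chosen only for illustration).
A = np.diag([1.0, 10.0])


def grad(theta):
    """Gradient of the toy quadratic loss."""
    return A @ theta


def run(definition, alpha, beta, steps=50):
    """Run one of the three momentum variants and return the parameter path."""
    theta = np.array([1.0, -2.0])  # arbitrary starting point
    v = np.zeros_like(theta)       # velocity buffer starts at zero
    path = []
    for _ in range(steps):
        g = grad(theta)
        if definition == 1:    # exponentially weighted average of gradients
            v = beta * v + (1 - beta) * g
            theta = theta - alpha * v
        elif definition == 2:  # learning rate folded into the velocity
            v = beta * v - alpha * g
            theta = theta + v
        elif definition == 3:  # plain accumulated gradient
            v = beta * v + g
            theta = theta - alpha * v
        else:
            raise ValueError("definition must be 1, 2 or 3")
        path.append(theta.copy())
    return np.array(path)


alpha, beta = 0.01, 0.9

path1 = run(1, alpha / (1 - beta), beta)  # rescaled learning rate
path2 = run(2, alpha, beta)
path3 = run(3, alpha, beta)

# Largest absolute gap between trajectories; both should be at rounding-error level.
print(np.abs(path1 - path3).max(), np.abs(path2 - path3).max())
```

With a constant learning rate, the choice between the definitions therefore mostly comes down to how you want $\alpha$ and $\beta$ to interact when tuning: in the 1st definition changing $\beta$ leaves the effective step size roughly unchanged, while in the 2nd and 3rd a larger $\beta$ also means larger steps.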

cinch