
Based on some preliminary research, the conjugate gradient method looks almost exactly like gradient descent, except that each new search direction is chosen to be conjugate (roughly, orthogonal with respect to the curvature of the objective) to the previous ones, rather than simply following the negative gradient.

From what I've read, the conjugate gradient method is generally considered better than regular gradient descent, so if that's the case, why is regular gradient descent used at all?

Additionally, I know that algorithms such as Powell's method use conjugate directions to find minima, but I also know Powell's method is computationally expensive per parameter update because it is derivative-free: it can be run on an arbitrary black-box function without computing partial derivatives through a computational graph. More specifically, when gradient descent is run on a neural network, the backward pass gives the gradient with respect to every single parameter, whereas, as I understand it, Powell's method only ever evaluates the function itself. (See scipy's minimize: you could technically pass an entire neural network's loss into it and it would be optimized, but there's no world where this is faster than backpropagation.)
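For reference, here is a hedged sketch of the kind of scipy call I mean. The toy loss function and starting point are made up for illustration; the only point is that Powell's method needs nothing beyond function evaluations.

```python
# Illustrative only: minimizing a small quadratic "loss" with Powell's
# derivative-free method via scipy. No gradients are supplied or computed.
import numpy as np
from scipy.optimize import minimize

def loss(w):
    # any black-box function of the parameter vector works here
    return np.sum((w - np.array([1.0, -2.0, 0.5])) ** 2)

result = minimize(loss, x0=np.zeros(3), method="Powell")
print(result.x)  # approaches [1.0, -2.0, 0.5] using function evaluations only
```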

However, given how similar gradient descent is to the conjugate gradient method, could we not replace the per-parameter gradient updates with updates that are conjugate to the previous ones? Would that not be faster?
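To make the comparison concrete, here is a minimal sketch of steepest descent versus conjugate gradient on a toy quadratic. The matrix, vector, starting point, and step counts are all made up for the demo.

```python
# Toy quadratic f(x) = 0.5 x^T A x - b^T x with A symmetric positive definite.
# Steepest descent follows -grad at every step; conjugate gradient reuses the
# previous direction so that successive directions are A-conjugate.
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # SPD matrix defining the quadratic
b = np.array([1.0, 1.0])

def steepest_descent(x, steps):
    for _ in range(steps):
        g = A @ x - b                     # gradient of the quadratic
        alpha = (g @ g) / (g @ A @ g)     # exact line search along -g
        x = x - alpha * g
    return x

def conjugate_gradient(x, steps):
    r = b - A @ x                         # residual = negative gradient
    d = r.copy()                          # first direction is steepest descent
    for _ in range(steps):
        if r @ r < 1e-12:                 # already converged
            break
        alpha = (r @ r) / (d @ A @ d)     # exact step along d
        x = x + alpha * d
        r_new = r - alpha * (A @ d)
        beta = (r_new @ r_new) / (r @ r)  # Fletcher-Reeves style coefficient
        d = r_new + beta * d              # new direction is A-conjugate to d
        r = r_new
    return x

x_star = np.linalg.solve(A, b)            # true minimizer for reference
print(steepest_descent(np.zeros(2), 2), conjugate_gradient(np.zeros(2), 2), x_star)
```

On an n-dimensional quadratic, conjugate gradient reaches the exact minimizer in at most n steps, while steepest descent generally does not.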

nbro
Recessive

2 Answers


When dealing with optimization problems, a fundamental distinction is whether the objective is a (deterministic) function, or an expectation of some function. I will refer to these cases as the deterministic and stochastic setting respectively.

Machine learning problems are almost always in the stochastic setting. Plain (full-batch) gradient descent is not used there, and indeed it performs poorly; instead, stochastic gradient descent, or more specifically mini-batch stochastic gradient descent (SGD), is the "vanilla" algorithm. In practice, however, methods such as ADAM (or related methods such as AdaGrad or RMSprop) or SGD with momentum are preferred over plain SGD.
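As an illustration, here is a minimal, entirely toy PyTorch loop showing how mini-batch SGD, SGD with momentum, and ADAM are swapped in as optimizers. The model, data, and hyperparameters are placeholders, not recommendations.

```python
# Toy regression problem trained three times with different optimizers.
import torch

X = torch.randn(256, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)

def train(optimizer_factory, epochs=20, batch_size=32):
    model = torch.nn.Linear(10, 1)
    opt = optimizer_factory(model.parameters())
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        perm = torch.randperm(len(X))
        for i in range(0, len(X), batch_size):
            idx = perm[i:i + batch_size]       # a random mini-batch
            opt.zero_grad()
            loss = loss_fn(model(X[idx]), y[idx])
            loss.backward()                    # stochastic gradient estimate
            opt.step()
    return loss.item()

print(train(lambda p: torch.optim.SGD(p, lr=0.05)))                 # vanilla mini-batch SGD
print(train(lambda p: torch.optim.SGD(p, lr=0.05, momentum=0.9)))   # SGD with momentum
print(train(lambda p: torch.optim.Adam(p, lr=0.01)))                # ADAM
```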

The deterministic case should be thought of separately, since the algorithms used there are completely different. It's interesting to note that the deterministic algorithms are much more complicated than their stochastic counterparts. Conjugate gradient will, on average, certainly beat gradient descent; however, it is quasi-Newton methods, such as BFGS (and its variants such as L-BFGS-B), or truncated Newton methods, that are currently considered state of the art.
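As a sketch of the deterministic setting, here is how one could call these methods through scipy on a deterministic test function with an exact gradient; the choice of the Rosenbrock function is purely for illustration.

```python
# Compare a few deterministic optimizers on the (deterministic) Rosenbrock function.
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(5)
for method in ["CG", "BFGS", "L-BFGS-B"]:
    res = minimize(rosen, x0, jac=rosen_der, method=method)
    print(method, res.nfev, res.fun)   # function evaluations used and final value
```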

Here's a NIPS paper that says CG doesn't generalize well. There are similar results for quasi-Newton methods. If you want something better than SGD, you should look into a method like ADAM, which was designed for the stochastic setting. Both CG and ADAM use information from past gradient directions to improve the current search direction, but CG is formulated assuming that the past gradients are exact, whereas ADAM is formulated assuming that they are noisy gradient estimates, which is exactly the situation in the stochastic setting.
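For concreteness, here is a bare-bones version of the ADAM update in numpy, showing how past (noisy) gradient estimates enter the current step through exponential moving averages. The function name and the toy usage below it are made up for illustration; only the update rule follows the ADAM paper's defaults.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # moving average of gradient estimates
    v = beta2 * v + (1 - beta2) * grad ** 2       # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
    return w, m, v

# Toy usage: noisy gradients of sum((w - 1)^2), standing in for mini-batch gradients.
w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    noisy_grad = 2 * (w - 1.0) + 0.1 * np.random.randn(3)
    w, m, v = adam_step(w, noisy_grad, m, v, t, lr=0.05)
print(w)   # drifts toward the minimizer [1, 1, 1] despite the noise
```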

Taw

The fundamental issue is that one doesn't really want to find an optimum of the optimization problem. We are really interested in generalization, not optimality, and we still poorly understand how and why neural models generalize so well.

Now, it looks like the generalization properties of neural models have something to do with the structure of their optimization landscape and some particular properties of their well-generalizing minima. Empirically, the SGD class of optimizers is better at finding such generalizing minima.

This paper illustrates these ideas by talking about "wide flat" minima and showing how one can use SGD with stochastic weight averaging to improve generalization and convergence.
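As a hedged sketch (not the paper's exact setup), here is the general SGD-plus-stochastic-weight-averaging recipe using PyTorch's torch.optim.swa_utils. The model, data loader, and schedule lengths are placeholders.

```python
# SGD with stochastic weight averaging (SWA): keep a running average of the
# weights visited late in training and use that average as the final model.
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

model = torch.nn.Linear(10, 1)                     # stand-in for a real network
loader = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(10)]
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
swa_model = AveragedModel(model)                   # running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.01)      # SWA learning-rate schedule
loss_fn = torch.nn.MSELoss()

for epoch in range(20):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()
    if epoch >= 10:                                # start averaging after a warm-up phase
        swa_model.update_parameters(model)
        swa_scheduler.step()

update_bn(loader, swa_model)                       # recompute BatchNorm statistics, if any
```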

Kostya