
There is a popular story about a back-of-the-envelope calculation performed by the British physicist G. I. Taylor. He used dimensional analysis to estimate the energy released by a nuclear explosion, simply by analyzing photographs published in a magazine at the time.
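For context (this derivation is the standard account of Taylor's estimate, not part of the original question): dimensional analysis says the blast radius $R$ at time $t$, in air of density $\rho$, can depend on the released energy $E$ only through

$$ R(t) \sim \left(\frac{E\,t^2}{\rho}\right)^{1/5} \quad\Longrightarrow\quad E \sim \frac{\rho R^5}{t^2}, $$

so reading $R$ and $t$ off the published photographs gives an estimate of $E$.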

I believe many of you know some nice back-of-the-envelope calculations performed in machine learning (more specifically, in neural networks). Can you please share them?

Charles

3 Answers


I think a nice back-of-the-envelope calculation is the intuition for exploding/vanishing gradients in RNNs:

Simplifications

  • a diagonalisable recurrent weight matrix $W$ (and input weights $U$)
  • no non-linearities
  • 1 layer

This gives a hidden state $h_t$ at timestep $t$ for input $x_t$: $h_t = W\cdot h_{t-1} + U\cdot x_t$

Let $L_t$ be the loss at timestep $t$ and $L = \sum_t L_t$ the total loss. Then (cf. eqs. 3–5 in the paper linked below)

$$ \frac{\partial L_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W} = \sum_{k=1}^{t}\frac{\partial h_t}{\partial h_k}\times\alpha_{t, k} $$

Setting aside the terms regrouped in $\alpha_{t, k}$, the interesting factor is:

$$ \frac{\partial h_t}{\partial h_k} = \prod_{k<i\leq t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{k<i\leq t} W = \prod_{k<i\leq t} PDP^{-1} = PD^{t-k}P^{-1} $$

So you can easily see$^1$ that if the eigenvalues of $W$ (the entries of the diagonal matrix $D$) have magnitude larger than $1$, the gradient will explode with time, and if they all have magnitude smaller than $1$, it will vanish.

More detailed derivations can be found in On the difficulty of training recurrent neural networks (Pascanu et al., 2013).


$^1$ Remember that $\lim_{n \to +\infty}|x^n| = +\infty$ if $|x|>1$, and $\lim_{n \to +\infty}|x^n| = 0$ if $|x| < 1$.
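As a quick numerical illustration (a sketch of my own, not from the answer; the eigenvalues and the random basis $P$ are arbitrary choices), you can watch $\|W^{t-k}\|$ grow or shrink with the number of steps:

```python
import numpy as np

rng = np.random.default_rng(0)

def power_norm(eigenvalues, steps):
    # Build a diagonalisable W = P D P^{-1} with the given eigenvalues
    # and a random (almost surely invertible) basis P, then take its power.
    P = rng.standard_normal((len(eigenvalues), len(eigenvalues)))
    W = P @ np.diag(eigenvalues) @ np.linalg.inv(P)
    return np.linalg.norm(np.linalg.matrix_power(W, steps))

for steps in (1, 10, 50):
    exploding = power_norm([1.1, 0.9], steps)  # largest |eigenvalue| > 1
    vanishing = power_norm([0.9, 0.8], steps)  # all |eigenvalues| < 1
    print(f"steps={steps:2d}  exploding={exploding:10.2f}  vanishing={vanishing:.5f}")
```

After 50 steps the first norm grows roughly like $1.1^{50} \approx 117$, while the second decays roughly like $0.9^{50} \approx 0.005$: exactly the exploding/vanishing behaviour derived above.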

ted

One neat thing is estimating GPU memory usage when training an ML model, which is relevant when training or fine-tuning large models. If inference with one instance takes $N$ GB of VRAM, then one training step with batch size $B$ requires roughly $2 \times B \times N$ GB of VRAM.


Note: the factor of $2$ is there because the forward pass computes and stores activations, while the backward pass needs both the stored activations and the gradients to compute parameter updates through backpropagation. The exact memory usage varies with the network architecture and the specific operations involved, but a rule of thumb is that training takes approximately $2\times$ the memory of the forward pass.
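A minimal sketch of this rule of thumb (the function name and the example numbers are mine, purely illustrative):

```python
def training_vram_estimate_gb(inference_gb: float, batch_size: int) -> float:
    """Back-of-the-envelope training VRAM: 2 * B * N.

    inference_gb: VRAM (GB) to run inference on a single instance (N).
    batch_size:   training batch size (B).
    """
    return 2 * batch_size * inference_gb

# e.g. a model that needs 14 GB of VRAM for single-instance inference,
# trained with batch size 8:
print(training_vram_estimate_gb(14.0, 8))  # 224.0 GB
```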

Ron

I have one to share. It is not a formula, but a general pattern I have noticed.

The number of layers and neurons should be proportionate, in some way, to the complexity of the classification problem.

Although this is fairly basic and widely known, it has helped me many times to consider one thing: what is the minimum number the network needs?

FreezePhoenix