
I am having a closer look at the BitNet paper (arXiv:2310.11453v1).

To quantize the activations to b bits, they use absmax quantization to map them into the range $[-Q_b, +Q_b]$.

So,

$$ \tilde{x} = \mathrm{Quant}(x) = \mathrm{Clip}\left(x \times \frac{Q_b}{\gamma},\, -Q_b + \epsilon,\, Q_b - \epsilon\right) $$

where $\mathrm{Clip}(x, a, b) = \max(a, \min(b, x))$ and $\gamma = \lVert x \rVert_\infty$.
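
For reference, this is how I read that formula in code (a rough NumPy sketch; the function name and the epsilon value are my own choices, and I'm taking $Q_b = 2^{b-1}$ as defined in the paper):

```python
import numpy as np

def absmax_quant(x, b=8, eps=1e-5):
    """Absmax quantization of activations to b bits, as I read the equation above.

    Scales x by Q_b / gamma (gamma = max |x|, i.e. the infinity norm)
    and clips to [-Q_b + eps, Q_b - eps].
    """
    Qb = 2 ** (b - 1)
    gamma = np.abs(x).max()      # gamma = ||x||_inf
    x_scaled = x * (Qb / gamma)
    return np.clip(x_scaled, -Qb + eps, Qb - eps)
```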

This is all good, but then they write (at the end of page 3): "For the activations before the non-linear functions (e.g., ReLU), we scale them into the range $[0, Q_b]$."

This I don't get. As I understand it, after the ReLU the activations are all non-negative; before the ReLU they can be positive or negative. So I would expect the activations before the ReLU to be quantized to $[-Q_b, +Q_b]$ and those after the ReLU to $[0, +Q_b]$.

I'm obviously missing something. Can someone please explain?

ahron
