Questions tagged [quantization]

5 questions
1
vote
2 answers

How do quantized models manage to be fast while still being quantized in memory?

As I understand it, modern CPUs and GPUs are highly optimized for the following calculations: arithmetic on floating-point numbers (8, 16, 32, or 64 bits); arithmetic on integers (8, 16, 32, or 64 bits). So all arithmetic is optimized for bytes, not…
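A minimal sketch of the usual answer, assuming a weight-only int8 scheme (names and shapes here are illustrative, not any particular library's API): weights stay int8 in memory, so loading them costs roughly 4x less bandwidth than fp32, and they are dequantized on the fly inside the matmul. Real kernels fuse the dequantization into the GEMM tile loop.

```python
import torch

def quantize_weights(w: torch.Tensor):
    # per-output-channel absmax scale mapping weights into [-127, 127]
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def int8_linear(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor):
    # dequantize just-in-time; only int8 data crossed the memory bus
    return x @ (q.to(x.dtype) * scale).t()

w = torch.randn(1024, 1024)
q, scale = quantize_weights(w)
x = torch.randn(1, 1024)
err = (int8_linear(x, q, scale) - x @ w.t()).abs().max()
print(f"max abs error vs fp32: {err:.4f}")  # small rounding error
```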
0
votes
0 answers

Activations quantization in BitNet paper

I am having a closer look at the BitNet paper (arXiv:2310.11453v1). To quantize the activations to $b$ bits, they use absmax quantization into the range $[-Q_b, +Q_b]$. So, $$\tilde{x} = \mathrm{Quant}(x) = \mathrm{Clip}\big(x \times \tfrac{Q_b}{\gamma},\ -Q_b + \epsilon,\ …$$
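A short sketch of that absmax quantization, assuming $Q_b = 2^{b-1}$, $\gamma = \max|x|$, and a clip upper bound of $Q_b - \epsilon$ (the upper bound is cut off in the excerpt above; this assumes the symmetric form used in the BitNet paper):

```python
import torch

def absmax_quantize(x: torch.Tensor, b: int = 8, eps: float = 1e-5):
    Qb = 2 ** (b - 1)
    gamma = x.abs().max()                          # absmax scale
    # clip range assumed symmetric: [-Qb + eps, Qb - eps]
    x_tilde = torch.clamp(x * Qb / gamma, -Qb + eps, Qb - eps)
    return x_tilde, gamma

x = torch.randn(8)
x_q, gamma = absmax_quantize(x)
x_back = x_q * gamma / 2 ** 7                      # approximate dequantization
```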
0
votes
1 answer

What types of quantization will improve LLM inference latency and throughput?

Quantization is the mapping of values in a high-precision representation to a low-precision one. I observed that either the weights of the model or the activation values, or both, can be quantized by different techniques. While quantization is…
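A sketch contrasting the two regimes named in the question (all names here are illustrative): weight-only quantization (e.g. W8A16) shrinks memory traffic, which mainly helps bandwidth-bound decode latency; quantizing activations as well (e.g. W8A8) additionally lets the GEMM itself run on integer units, which helps the compute-bound prefill/throughput case.

```python
import torch

def dynamic_quant(t: torch.Tensor):
    # per-tensor absmax quantization to int8
    scale = t.abs().amax() / 127.0
    q = torch.clamp((t / scale).round(), -127, 127).to(torch.int8)
    return q, scale

wq, ws = dynamic_quant(torch.randn(256, 256))   # weights: quantized once
x = torch.randn(1, 256)
xq, xs = dynamic_quant(x)                        # activations: per call

# int8 x int8 with int32 accumulation, rescaled back to float afterwards
y = (xq.to(torch.int32) @ wq.to(torch.int32).t()).to(torch.float32) * (xs * ws)
```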
0
votes
0 answers

Is a quantized Mistral LLM slower than a non-quantized Mistral LLM?

For reference, I've been playing around with Mistral 7B v0.1 and v0.3 but did not like being limited by A100 availability on Google Colab, so I wanted to try 4-bit and 8-bit quantized models, but they are drastically slow. At first I thought…
0
votes
3 answers

Does 1-bit quantization (layers with boolean tensors) exist in machine learning?

Does 1-bit quantization exist in machine learning? PyTorch's docs on "Quantization" define it as "techniques for performing computations and storing tensors at lower bitwidths than floating point precision." torch.bool tensors exist in PyTorch, but…
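It does; a minimal sketch in the BinaryConnect/XNOR-Net style (illustrative code, not a PyTorch API): weights are reduced to signs {-1, +1} plus one floating-point scale. A torch.bool tensor could store the sign bits compactly, but plain ±1 floats keep the arithmetic readable here.

```python
import torch

def binarize(w: torch.Tensor):
    # per-tensor fp scale; mean |w| minimizes the L2 binarization error
    alpha = w.abs().mean()
    wb = torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))
    return wb, alpha

w = torch.randn(16, 16)
wb, alpha = binarize(w)
x = torch.randn(1, 16)
y = (x @ wb.t()) * alpha   # binary weights, full-precision activations
```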