Questions tagged [quantization]
5 questions
1
vote
2 answers
How do quantized models manage to be fast while still being quantized in memory?
As I understand it, modern CPUs and GPUs are highly optimized for the following calculations:
arithmetic on floating-point numbers (8, 16, 32, or 64 bits)
arithmetic on integers (8, 16, 32, or 64 bits)
So all arithmetic is optimized for bytes, not…
Jonas Sourlier
- 161
- 5
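As a rough illustration of one common scheme the question is asking about (absmax int8 with per-tensor scales; all names and shapes below are illustrative, not from the question): the weights live in memory as int8 plus a single float scale, the matmul runs in integer units with int32 accumulation, and one float rescale happens at the end.

```python
import numpy as np

# Hypothetical absmax int8 scheme: store int8 values + one float scale per
# tensor, do the matmul in integer units, rescale once at the end.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
x = rng.standard_normal(4).astype(np.float32)

def absmax_quant(t):
    scale = np.abs(t).max() / 127.0                      # one scale per tensor
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

w_q, w_s = absmax_quant(w)
x_q, x_s = absmax_quant(x)

# int8 x int8 products accumulated in int32; the bulk of the arithmetic
# stays in the fast integer units, with a single float multiply at the end.
y_int = w_q.astype(np.int32) @ x_q.astype(np.int32)
y = y_int.astype(np.float32) * (w_s * x_s)

print(np.abs(y - w @ x).max())  # small quantization error
```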
0
votes
0 answers
Activation quantization in the BitNet paper
I am having a closer look at the BitNet paper (arXiv:2310.11453v1).
To quantize the activations to $b$ bits, they use absmax quantization to map values into the range $[-Q_b, +Q_b]$.
So,
$$
\tilde{x} = \mathrm{Quant}(x) = \mathrm{Clip}\Bigl(x \times \frac{Q_b}{\gamma},\ -Q_b + \epsilon,\ \ldots
$$
ahron
- 265
- 2
- 7
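A minimal sketch of that absmax formula as I read it from the excerpt, with $\gamma$ as the tensor's absmax and $Q_b = 2^{b-1}$; the function name and defaults are mine, not the paper's:

```python
import torch

# Sketch of the paper's absmax activation quantization (my reading of
# arXiv:2310.11453); eps keeps values strictly inside the representable range.
def activation_quant(x: torch.Tensor, b: int = 8, eps: float = 1e-5) -> torch.Tensor:
    Qb = 2 ** (b - 1)                 # e.g. 128 for 8-bit
    gamma = x.abs().max()             # absmax over the whole tensor
    x_scaled = x * (Qb / gamma)
    return torch.clamp(x_scaled, -Qb + eps, Qb - eps)

x = torch.randn(2, 4)
print(activation_quant(x))
```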
0
votes
1 answer
What types of quantization will improve LLM inference latency and throughput?
Quantization is the mapping of values in a high-precision representation to a low-precision one. I observed that either the weights of the model, the activation values, or both can be quantized, depending on the technique. While quantization is…
Tom Lin
- 1
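For illustration, here is a minimal affine (zero-point) quantization sketch mapping a float range onto uint8; this is one of the weight-quantization techniques the question alludes to, and all names here are illustrative:

```python
import torch

# Illustrative asymmetric (affine) quantization: map the float range
# [t.min(), t.max()] onto the uint8 range [0, 255] via a scale and zero point.
def affine_quant(t: torch.Tensor):
    qmin, qmax = 0, 255
    scale = (t.max() - t.min()) / (qmax - qmin)
    zero_point = qmin - torch.round(t.min() / scale)
    q = torch.clamp(torch.round(t / scale) + zero_point, qmin, qmax).to(torch.uint8)
    return q, scale, zero_point

def affine_dequant(q, scale, zero_point):
    return (q.to(torch.float32) - zero_point) * scale

w = torch.randn(3, 3)
q, s, z = affine_quant(w)
print((affine_dequant(q, s, z) - w).abs().max())  # reconstruction error
```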
0
votes
0 answers
Is a quantized Mistral LLM slower than a non-quantized Mistral LLM?
For reference, I've been playing around with Mistral 7B v0.1 and v0.3, but I did not like being limited by A100 availability on Google Colab, so I wanted to try 4-bit and 8-bit quantized models. However, they are drastically slow.
At first I thought…
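One typical setup for this (assuming the Hugging Face transformers + bitsandbytes stack on a CUDA GPU, which the question doesn't specify): if weights spill to CPU for lack of VRAM, or dequantize-on-the-fly kernels dominate, the quantized model can indeed run slower than fp16.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Typical 4-bit load via bitsandbytes (needs `pip install bitsandbytes accelerate`).
model_id = "mistralai/Mistral-7B-v0.3"  # repo id assumed; substitute your checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # store in 4-bit, compute in fp16
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # watch for layers landing on CPU: a common slowdown cause
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```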
0
votes
3 answers
Does 1-bit quantization (layers with Boolean tensors) exist in machine learning?
Does 1-bit quantization exist in machine learning?
PyTorch's docs on "Quantization" define it as:
techniques for performing computations and storing tensors at lower bitwidths than floating point precision.
torch.bool tensors exist in PyTorch, but…
Geremia
- 555
- 1
- 5
- 12
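As far as I know, PyTorch's quantization API targets int8 rather than 1-bit, but sign-based binary-weight schemes do exist in the literature (e.g. BinaryConnect, XNOR-Net). A minimal sketch, with names of my own choosing:

```python
import torch

# Sketch of 1-bit ("binary") weights in the BinaryConnect/XNOR-Net style
# (not a PyTorch built-in): keep only the sign of each weight plus one
# float scale per tensor.
w = torch.randn(4, 4)

alpha = w.abs().mean()                                        # per-tensor scale
w_bin = torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))
# w_bin holds {-1, +1}; (w >= 0) could be stored as a torch.bool tensor.

x = torch.randn(4)
y_approx = (w_bin @ x) * alpha   # binary matmul, one float rescale
print(y_approx)
print(w @ x)                     # compare against full precision
```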