
As I understand it, modern CPUs and GPUs are highly optimized for the following calculations:

  • arithmetic on floating point numbers (8, 16, 32 or 64 bits)
  • arithmetic on integers (8, 16, 32 or 64 bits)

So all arithmetic is optimized for bytes, not individual bits. Moreover, memory locations (pointers, address space) are also understood as "byte addresses": you treat memory as if it were composed of individual bytes, not bits.

If we load a quantized model (e.g. an LLM) into memory (e.g. GPU memory), then the weights must remain quantized while they are in memory, because the whole point of quantization is to save memory so that we can run larger models in limited-memory scenarios.

But if they are quantized in memory, then a 5-bit quantized model will have a pretty strange memory layout: each number takes up 5 bits, so one byte is occupied by parts of different weights (e.g. AAAAABBB BBCCCCCD DDDDEEEE EFFFFFGG ...).

How do CPUs and GPUs handle such scenarios efficiently? When they load a number from memory, do they use pointers like "the byte address 0x91827364, but only the last 3 bits, plus the first 2 bits of the adjacent byte"? In other words, do they use "bit pointers", or is there something else going on? Or do they always load two bytes and then just use bit-shift operators?
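
For concreteness, here is a minimal sketch (in Python) of the "load two bytes and bit-shift" approach I am imagining; the LSB-first packing order and the function name are just assumptions for illustration:

    def unpack_5bit(packed: bytes, i: int) -> int:
        # Locate the 5-bit field for weight i inside the packed buffer.
        bit_offset = i * 5
        byte_index = bit_offset // 8
        shift = bit_offset % 8
        # Read up to two bytes so the field is covered even when it
        # straddles a byte boundary.
        window = packed[byte_index]
        if byte_index + 1 < len(packed):
            window |= packed[byte_index + 1] << 8
        return (window >> shift) & 0b11111  # keep only the 5 bits we want

    # Two 5-bit weights, 30 and 13, packed LSB-first into two bytes;
    # the second weight is split across the byte boundary.
    packed = bytes([0b10111110, 0b00000001])
    print(unpack_5bit(packed, 0), unpack_5bit(packed, 1))  # 30 13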

2 Answers


You are right. The technique commonly used today is weight packing. At the implementation level, a group of weights is packed into a specific number of bytes and unpacked when needed for tensor computation. For example, in a model that uses 2-bit quantization, four weights can be packed into a single byte, so a natural pack size in this case would be 4 (see the sketch below).
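
For illustration, a minimal sketch of that 2-bit packing in Python (the LSB-first ordering and the function names are my own assumptions, not a specific library's API):

    def pack_2bit(weights):
        # Pack four 2-bit weights (values 0..3) into each byte, LSB-first.
        packed = bytearray()
        for i in range(0, len(weights), 4):
            byte = 0
            for j, w in enumerate(weights[i:i + 4]):
                byte |= (w & 0b11) << (2 * j)  # place each weight in its 2-bit slot
            packed.append(byte)
        return bytes(packed)

    def unpack_2bit(packed, count):
        # Recover each weight by shifting its byte and masking out 2 bits.
        return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]

    weights = [3, 0, 2, 1, 1, 2]
    assert unpack_2bit(pack_2bit(weights), len(weights)) == weights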

Additionally, the pack size can be larger depending on implementation details closer to the hardware level. For example, we can set the pack size to align with the block size of the hardware for better efficiency.

You can find more details and references on this topic in "A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms". You might also find this Medium post useful for the implementation side of the technique.

Note that bit widths like five, which are not powers of two, can be challenging to handle (an age-old concern in binary computers!).

OmG

Quantized models improve speed by reducing precision (FP32 → INT8/INT4), lowering the memory footprint and enabling faster arithmetic. Hardware such as NVIDIA Tensor Cores processes lower-bit computations efficiently, reducing latency while accuracy is maintained through techniques like distillation and calibration. This allows real-time inference on GPUs with minimal performance loss.
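
As a rough illustration, a minimal sketch of symmetric per-tensor INT8 quantization in Python/NumPy (the scaling and clipping choices are assumptions for the example, not a description of any particular framework):

    import numpy as np

    def quantize_int8(x: np.ndarray):
        # Map the largest absolute value onto 127 and round everything else.
        scale = float(np.max(np.abs(x))) / 127.0
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8(w)
    print(np.max(np.abs(w - dequantize(q, s))))  # quantization error stays small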