As I understand it, modern CPUs and GPUs are highly optimized for the following calculations:
- arithmetic on floating-point numbers (8, 16, 32, or 64 bits)
- arithmetic on integers (8, 16, 32, or 64 bits)
So all arithmetic is optimized for whole bytes (and multiples of bytes), not individual bits. Moreover, memory locations (pointers, the address space) are also understood as byte addresses: you treat memory as if it were composed of individual bytes, not bits.
If we load a quantized model (e.g. an LLM) into memory (e.g. GPU memory), the quantized weights presumably stay quantized while they sit in memory, because the whole point of quantization is to save memory so that larger models can run in limited-memory scenarios.
But if they stay quantized in memory, then a 5-bit quantized model has a pretty strange memory layout: each weight occupies 5 bits, so a single byte contains parts of several different weights (e.g. AAAAABBB BBCCCCCD DDDDEEEE EFFFFFGG ...).
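To make that layout concrete, here is a minimal sketch in plain C of how such 5-bit values could be packed MSB-first into a byte buffer. The function name `pack_5bit` and the exact bit order are just my own assumptions for illustration, not how any real inference library stores its weights:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Illustrative sketch only: pack 5-bit values (0..31) into a byte buffer,
 * MSB-first, producing the AAAAABBB BBCCCCCD ... layout described above. */
void pack_5bit(const uint8_t *values, size_t count, uint8_t *out)
{
    memset(out, 0, (count * 5 + 7) / 8);           /* clear the target bytes  */
    for (size_t i = 0; i < count; i++) {
        size_t bit = i * 5;                        /* absolute bit offset     */
        /* Place the 5-bit value inside a 16-bit window whose high byte is
         * out[bit / 8]; shift so the value's MSB lands at position bit % 8. */
        uint16_t v = (uint16_t)((values[i] & 0x1F) << (11 - (int)(bit % 8)));
        out[bit / 8] |= (uint8_t)(v >> 8);         /* part in the first byte  */
        if (bit % 8 > 3)                           /* value straddles a byte  */
            out[bit / 8 + 1] |= (uint8_t)(v & 0xFF);
    }
}
```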
How do CPUs and GPUs handle such a layout efficiently? When they load a number from memory, do they use pointers that say something like "the byte at address 0x91827364, but only its last 3 bits, plus the first 2 bits of the adjacent byte"? In other words, is there some kind of "bit pointer", or is something else going on? Or do they simply load two bytes and then extract the value with bit-shift operators?
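And here is roughly what I imagine the "load two bytes and bit-shift" option from the last sentence would look like when reading a single weight back out. Again, `unpack_5bit` is just my own illustration under the same assumed bit order, not the actual kernel of any real runtime:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch only: recover the i-th 5-bit value by loading the two
 * bytes it may straddle and shifting/masking. Assumes the packed buffer has
 * at least one padding byte at the end, so the neighbouring load stays in
 * bounds even when the value does not actually cross a byte boundary. */
uint8_t unpack_5bit(const uint8_t *packed, size_t i)
{
    size_t bit  = i * 5;                           /* absolute bit offset     */
    size_t byte = bit / 8;
    int    off  = (int)(bit % 8);
    /* Build a 16-bit window from the byte containing the value's first bit
     * and its neighbour, then shift the 5 bits down and mask them out.      */
    uint16_t window = (uint16_t)((packed[byte] << 8) | packed[byte + 1]);
    return (uint8_t)((window >> (11 - off)) & 0x1F);
}
```

That shift-and-mask pattern is what I mean by "load two bytes and then use bit-shift operators" above.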