
Does 1-bit quantization machine learning exist?

PyTorch's docs on "Quantization" define it as:

techniques for performing computations and storing tensors at lower bitwidths than floating point precision.

torch.bool tensors exist in PyTorch, but what's the use for them if most layers use floats? I see the smallest bitwidth besides torch.bool is torch.quint4x2 (unsigned 4-bit integer), but only one kind of layer (the sparse nn.EmbeddingBag) supports it.

Geremia

3 Answers


1-bit quantization does exist, at least at the inference stage: a common approach is to constrain weights (and sometimes activations) to be -1 or +1. I'd recommend this survey paper for a good overview.

However, as @programjames mentioned, binarization during training is much more difficult (at least with traditional SGD), since SGD explores through "small and noisy steps", which isn't really possible at such low precision.

During inference, however, binary networks can have some nice properties, like being able to perform convolutions through XNOR and bit-counting operations.
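
For intuition, here is a minimal sketch (plain Python, my own illustration rather than anything from the survey) of how a dot product between ±1 vectors reduces to an XNOR plus a bit count:

```python
# Minimal sketch: {-1, +1} vectors packed into integer bitmasks (+1 -> bit 1, -1 -> bit 0).
def pack_bits(vec):
    bits = 0
    for i, v in enumerate(vec):
        if v == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, w_bits, n):
    """Dot product of two length-n {-1, +1} vectors via XNOR + popcount."""
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)  # 1 wherever the signs agree
    agreements = bin(xnor).count("1")           # popcount
    return 2 * agreements - n                   # agreements minus disagreements

a = [1, -1, 1, 1]
w = [1, 1, -1, 1]
assert binary_dot(pack_bits(a), pack_bits(w), len(a)) == sum(x * y for x, y in zip(a, w))
```

Real binary-network kernels do the same thing over packed 32/64-bit words, which is where the speed and memory savings come from.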

Alexander Wan

Yes, 1-bit LLMs are a thing (although, technically, they are 1.58-bit LLMs, because they use a ternary system with weights in {-1, 0, +1}). This paper explores practical implementations of 1.58-bit LLMs.

1-bit quantization of LLMs is explored in this 2023 paper about BitNet models, which use a BitLinear layer instead of the standard nn.Linear. This GitHub page has the full implementation.
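
To make the idea concrete, here is a rough PyTorch sketch of 1.58-bit (ternary) weight quantization in the spirit of BitLinear. The layer name, the abs-mean scaling, and the straight-through-estimator handling are my simplifications, not the official implementation (which also quantizes activations and adds normalization):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Linear):
    """Sketch of a 1.58-bit linear layer: weights quantized to {-1, 0, +1}."""

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)   # per-tensor abs-mean scale
        w_q = (w / scale).round().clamp(-1, 1)   # ternary values in {-1, 0, +1}
        w_eff = w_q * scale                      # dequantized weight used in the matmul
        # straight-through estimator: forward uses w_eff, gradients flow to full-precision w
        w_eff = w + (w_eff - w).detach()
        return F.linear(x, w_eff, self.bias)

layer = TernaryLinear(16, 8)
out = layer(torch.randn(4, 16))   # drop-in replacement for nn.Linear
```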

ahron

Survey papers are a good way to explore prior research in this area. In addition to the paper mentioned in Alexander's answer, some of the newest work, with source code and implementations, includes:

  1. "BTC-LLM: Efficient Sub-1-Bit LLM Quantization via Learnable Transformation and Binary Codebook", by Hao Gu, Lujun Li, Zheyu Wang, Bei Liu, Qiyuan Zhu, Sirui Han, Yike Guo, 24 May 2025

"Abstract
 
"Binary quantization represents the most extreme form of large language model (LLM) compression, reducing weights to ±1 for maximal memory and compu- tational efficiency. While recent sparsity-aware binarization methods achieve sub-1-bit compression by pruning redundant binary weights, they suffer from three critical challenges: performance deterioration, computational complexity from sparse mask management, and limited hardware compatibility. In this paper, we present BTC-LLM, a novel sub-1-bit LLM quantization framework that lever-ages adaptive weight transformation and binary pattern clustering to overcome these limitations, delivering both superior accuracy and efficiency. ".

  1. "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models", by Elias Frantar, Dan Alistarh, 25 Oct 2023, GitHub. (0.8 bit)

"Abstract
 
"Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts. For example, the SwitchTransformer-c2048 model has 1.6 trillion parameters, requiring 3.2TB of accelerator memory to run efficiently, which makes practical deployment challenging and expensive. Innthis paper, we present a solution to this memory problem, in form of a new compression and execution framework called QMoE. Specifically, QMoE consists of a scalable algorithm which accurately compresses trillion-parameter MoEs to less than 1 bit per parameter, in a custom format co-designed with bespoke GPU decoding kernels to facilitate efficient end-to-end compressed inference, with minor runtime overheads relative to uncompressed execution. Concretely, QMoE can compress the 1.6 trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x compression, 0.8 bits per parameter) at only minor accuracy loss, in less than a day on a single GPU. This enables, for the first time, the execution of a trillion-parameter model on affordable commodity hardware, like a single server with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead relative to ideal uncompressed inference.".

  1. "BitNet b1.58 2B4T Technical Report", by Shuming Ma, Hongyu Wang, Shaohan Huang, Xingxing Zhang, Ying Hu, Ting Song, Yan Xia, Furu Wei, 25 Apr 2025, GitHub (1.58-bit).

"Abstract
 
We introduce BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens, the model has been rigorously evaluated across benchmarks covering language understanding, mathematical reasoning, coding proficiency, and conversational ability. Our results demonstrate that BitNet b1.58 2B4T achieves performance on par with leading open-weight, full-precision LLMs of similar size, while offering significant advantages in computational efficiency, including substantially reduced memory footprint, energy consumption, and decoding latency. To facilitate further research and adoption, the model weights are released via Hugging Face along with open-source inference implementations for both GPU and CPU architectures.".
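
Since the BitNet b1.58 2B4T weights are published on Hugging Face, a minimal way to try them is via transformers; note that the exact repository id below is my assumption, so check the model card, and a recent transformers version is needed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repo id; verify it on the model card before use.
model_id = "microsoft/bitnet-b1.58-2B-4T"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("1-bit quantization of LLMs means", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```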

 

"Does 1-bit quantization machine learning exist?"

Yes, both at slightly over one bit and at sub-one bit per parameter.

Rob