
Quantization is the mapping of values from a high-precision representation to a low-precision one. I have observed that different techniques quantize either the model weights, the activation values, or both. While quantization is mainly used to reduce a model's memory usage, I have also seen claims online that some types of quantization can improve latency and/or throughput. It is these kinds of quantization techniques that I am looking for.
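
For concreteness, this is my mental model of the basic operation, as a minimal sketch of symmetric per-tensor int8 quantization in plain PyTorch (the function names and the single-scale scheme are just illustrative, not tied to any particular library):

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Map fp32/fp16 values onto the int8 range [-127, 127] with one scale factor.
    scale = w.abs().max() / 127.0
    w_q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return w_q, scale

def dequantize_int8(w_q: torch.Tensor, scale: torch.Tensor):
    # Recover an approximation of the original values.
    return w_q.to(torch.float16) * scale

w = torch.randn(4, 4)
w_q, scale = quantize_int8(w)
print(dequantize_int8(w_q, scale).float() - w)  # quantization error
```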

As an example implementation, according to the blog post "A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes", LLM.int8() quantization uses fp16 weights during inference, the original model weights being fp32. For matrix multiplication, outlier values are computed in fp16, while the non-outlier values are quantized to int8 before the multiplication and de-quantized back to fp16 afterwards. The post finds that "BLOOM-176B with LLM.int8() is about 15% to 23% slower than the fp16 version". I presume this is due to the additional operations, namely the extra quantization and de-quantization steps inside the matrix multiplications. However, I think these operations could be removed if I wanted to optimize for speed at the cost of accuracy.
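
For reference, this is roughly how I would load a model with that scheme through the bitsandbytes integration in Hugging Face Transformers. The model name is only a small stand-in example, and the outlier threshold value shown is the library's default as I understand it:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "bigscience/bloom-1b7"  # small stand-in for BLOOM-176B

# load_in_8bit enables LLM.int8(); llm_int8_threshold controls which activation
# magnitudes are treated as outliers and kept in fp16 (6.0 is the default).
quant_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    torch_dtype=torch.float16,
)

inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```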

Regarding throughput: I think that by reducing the memory usage of the activation values, you could fit a larger batch in memory and thereby increase the overall speed of batched inference, although this would not necessarily improve the throughput of any individual input.
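
A back-of-the-envelope calculation of the kind I have in mind, using purely illustrative model dimensions (roughly Llama-7B-like, not taken from my actual setup), would be:

```python
# How lower-precision KV-cache entries let more sequences fit in a fixed budget.
num_layers = 32
num_heads = 32
head_dim = 128
seq_len = 2048
kv_budget_gib = 40  # hypothetical memory reserved for the KV cache

def max_batch_size(bytes_per_value: int) -> int:
    # K and V per token: 2 * layers * heads * head_dim values.
    bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_value
    bytes_per_seq = bytes_per_token * seq_len
    return int(kv_budget_gib * 1024**3 // bytes_per_seq)

print("fp16 KV cache:", max_batch_size(2), "sequences")      # 2 bytes per value
print("fp8/int8 KV cache:", max_batch_size(1), "sequences")  # 1 byte per value
```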

As for latency, or perhaps also the throughput of individual inputs, my guess is that some hardware is optimized for fast low-precision arithmetic? I have found mentions that some NVIDIA GPUs provide faster inference using FP8, and that the Optimum-NVIDIA library also provides faster inference using FP8.
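
The usage I have seen for Optimum-NVIDIA looks roughly like the sketch below; I am treating the `use_fp8` flag and the model name as assumptions taken from that project's README, which may change between versions. (As far as I understand, FP8 Tensor Cores are available on Ada Lovelace GPUs like the RTX 4090 and on Hopper, but not on the Ampere-based A800, so this would only apply to part of my setup.)

```python
from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example model from the README

tokenizer = AutoTokenizer.from_pretrained(model_id)
# use_fp8=True is the flag advertised for FP8 inference; treat it as an assumption.
model = AutoModelForCausalLM.from_pretrained(model_id, use_fp8=True)

inputs = tokenizer("FP8 inference", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```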

Therefore, since I am currently using an NVIDIA A800 and an RTX 4090 for inference, with vLLM as the inference library, what quantization techniques and tools would you recommend for improving inference latency and throughput? (I am fine with changing the software, but the hardware would be hard to change.)

1 Answer

LLM inference benefits from Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). INT8, FP16, and 4-bit quantization significantly reduce memory usage while leveraging Tensor Cores and optimized libraries (TensorRT, FasterTransformer) to accelerate inference. Weight-only and mixed-precision quantization further balance performance and accuracy.
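
Since you are already on vLLM, a minimal sketch of serving a weight-only quantized checkpoint there would look like the following. The model name is just an example of a community AWQ quantization; the `quantization` argument is part of vLLM's `LLM` API:

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized checkpoint; vLLM also accepts other quantization
# backends via the same argument (e.g. GPTQ), depending on the checkpoint.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq", dtype="half")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["What does weight-only quantization speed up?"], params)
print(outputs[0].outputs[0].text)
```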