
I would like to estimate the GPU memory required to run a hypothetical LLM, taking into account all relevant factors, such as:

  • P: Model parameters (total or MoE active parameters)
  • Q: Quantization bits
  • C: Context length cap (as I understand it, the context can be capped to limit memory, somewhat like limiting the batch size)
  • ATT: Type of attention used (full attention, FlashAttention, ...)
  • Other

I understand that the usual formula found around the web,

Space = (P × 4 bytes) / (32 / Q) × overhead

describes part of the picture, but it doesn't capture all the details.
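To make the factors concrete, here is a minimal sketch in Python that extends the weight-only formula with a KV-cache term, which is where the context length C and the attention variant enter. All parameter names, defaults, and the example model configuration are illustrative assumptions, not standard values.

```python
# Rough VRAM estimate for inference: weights + KV cache + overhead.
# Names and defaults are illustrative assumptions, not a standard API.

def weight_bytes(P, Q):
    """Weights: P parameters stored at Q bits each.
    Equivalent to the usual (P * 4 bytes) / (32 / Q)."""
    return P * Q / 8

def kv_cache_bytes(C, n_layers, n_kv_heads, head_dim, kv_bits=16, batch=1):
    """KV cache: 2 tensors (K and V) per layer, per token, per KV head.
    This is where the context-length cap C matters. FlashAttention reduces
    activation memory for the attention scores but not this cache, while
    grouped-query/multi-query attention shrinks n_kv_heads."""
    return batch * C * n_layers * n_kv_heads * head_dim * 2 * kv_bits / 8

def vram_estimate(P, Q, C, n_layers, n_kv_heads, head_dim, overhead=1.2):
    """Total estimate with a ~20% fudge factor for activations,
    framework buffers, and the CUDA context."""
    return (weight_bytes(P, Q) + kv_cache_bytes(C, n_layers, n_kv_heads, head_dim)) * overhead

# Example: a hypothetical 7B model, 4-bit weights, 8k context,
# 32 layers, 8 KV heads of dim 128 (GQA-style) -- purely illustrative numbers.
gib = 1024 ** 3
print(f"{vram_estimate(7e9, 4, 8192, 32, 8, 128) / gib:.1f} GiB")  # ~5.1 GiB
```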

rikyeah

1 Answer

(V)RAM requirement for inference: Model size < RAM requirements < Model size * 1.2.

https://huggingface.co/spaces/hf-accelerate/model-memory-usage is helpful, e.g.:

(Screenshot: example memory estimate produced by the model-memory-usage tool.)
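As a quick sanity check of that rule of thumb, here is a small illustrative calculation; the 7B parameter count and the quantization levels are assumed examples:

```python
# Rule of thumb from the answer: model size < VRAM needed < model size * 1.2.
# The 7B / fp16 / int8 / int4 figures below are assumed examples.
P = 7e9
for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    size_gib = P * bits / 8 / 1024**3
    print(f"{name}: weights {size_gib:.1f} GiB, budget ~{size_gib * 1.2:.1f} GiB")
```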

Franck Dernoncourt