I would like to estimate the GPU memory (VRAM) required to run a hypothetical LLM, taking into account all the relevant factors, such as:
- P: Model parameters (total or MoE active parameters)
- Q: Quantization bits
- C: Context length cap (from what I understand, the context can be capped, acting as a sort of smaller "batch-size" limit)
- ATT: Type of attention used (full attention, FlashAttention, ...)
- Other
I understand that the usual formula found around the web,

Space = ((P × 4 Bytes) / (32 / Q)) × overhead,

describes part of the picture, but it does not get down to the full details.
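
To make that concrete, here is a minimal Python sketch of how I currently apply that rule of thumb (the function name and the 1.2 overhead factor are just my own assumptions); it only accounts for the weights, which is exactly why I feel it misses the rest of the picture (KV cache, attention type, context length, and so on):

```python
def weights_vram_gb(params_billion: float, quant_bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate for the model weights only.

    params_billion : P, number of parameters in billions
    quant_bits     : Q, bits per parameter after quantization
    overhead       : assumed multiplier for framework/runtime overhead (~1.2)
    """
    bytes_per_param = quant_bits / 8            # e.g. 4-bit -> 0.5 bytes/param
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return weight_bytes * overhead / 1e9        # result in GB (1 GB = 1e9 bytes)

# Example: a 70B-parameter model at 4-bit with 20% overhead -> ~42 GB
print(f"{weights_vram_gb(70, 4):.1f} GB")
```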
