Why is DeepSeek's inference on HuggingFace so slow (compared to Qwen, Llama, etc.)?
1 Answer
DeepSeek's models are trained mainly through RL for chain-of-thought (CoT) reasoning, so they are encouraged to produce much longer reasoning chains than comparable models. Even when the number of activated parameters is reduced through distillation, the cost of generating those long chains can remain high. As a result, DeepSeek models excel at specialized tasks like math and coding, but that specialization can come at the cost of general inference speed. In contrast, Qwen and Llama prioritize modularity and multilingual support, and are often paired with optimizations for faster response times.
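A minimal sketch of why long reasoning chains matter: decode latency grows roughly linearly with the number of generated tokens, so a model that "thinks out loud" for thousands of tokens is slower end to end even at identical per-token speed. The model name below is just a placeholder, not a claim about any specific DeepSeek checkpoint.

```python
# Minimal latency sketch (model name is a stand-in; substitute whatever you are benchmarking).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, small enough to run anywhere
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Solve step by step: what is 17 * 24?"
inputs = tok(prompt, return_tensors="pt")

# Wall-clock time scales with generated length: a long CoT answer
# costs far more than a short direct answer.
for max_new in (50, 500):
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=max_new, do_sample=False)
    print(f"{max_new} new tokens -> {time.perf_counter() - start:.2f} s")
```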
DeepSeek models also use a Mixture-of-Experts (MoE) architecture, which introduces computational overhead from dynamic parameter routing, requires substantial VRAM during inference (every expert's weights must stay resident even though only a few are active per token), and can complicate the quantization work that normally reduces inference latency. Architectures like Qwen and Llama, by contrast, have received extensive quantization and deployment engineering.
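Here is a toy top-k MoE layer to illustrate the point. It is not DeepSeek's actual implementation, just a sketch: only `k` experts run per token, but all experts must be loaded, and the per-token dispatch is harder to batch and quantize than a single dense matmul.

```python
# Toy top-k MoE routing sketch (illustrative only, not DeepSeek's architecture).
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        # All experts live in memory, even though only k are used per token.
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.router(x)                    # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # Dynamic routing: each token goes to its own top-k experts,
        # which adds gather/dispatch overhead a dense layer avoids.
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

moe = ToyMoE()
print(moe(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```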
Finally, DeepSeek models are engineered to handle very long context windows (up to 128K tokens in some cases), which is larger than many of the other LLMs you mention and adds memory and latency overhead of its own.
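A back-of-the-envelope sketch of the context-window cost: KV-cache memory grows linearly with context length, so supporting 128K tokens is much heavier than a typical 8K window. The layer, head, and precision numbers below are illustrative assumptions, not DeepSeek's exact configuration.

```python
# Rough KV-cache size estimate; all architectural numbers are assumed, not DeepSeek's.
def kv_cache_bytes(ctx_len, n_layers=60, n_kv_heads=8, head_dim=128, bytes_per=2):
    # 2 tensors (K and V) per layer, each ctx_len x n_kv_heads x head_dim, in fp16
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 1e9:.1f} GB per sequence")
```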