
Legacy LLMs have far more compute behind them than DeepSeek, yet the models are comparable. If DeepSeek's efficiency techniques were applied to the models with significantly more compute, would that make them significantly better, or would it have little effect?

Joe

1 Answer


According to this online source, the DeepSeek-R1 model employs a Mixture of Experts (MoE) architecture: although it has a massive 671 billion parameters in total, only 37 billion are activated per forward pass, making DeepSeek-R1 more resource-efficient than most similarly large models. The MoE approach lets the model scale without a proportional increase in computational cost.
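
To make the "only a fraction of parameters active" point concrete, here is a minimal sketch of top-k expert routing in PyTorch. The `TopKMoE` name, the layer sizes, and the 2-of-8 routing choice are illustrative assumptions, not DeepSeek-R1's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts layer: only k of n_experts
    feed-forward blocks run for each token, so the number of parameters
    used per forward pass is a small fraction of the total."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)   # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                            # x: (tokens, d_model)
        scores = self.gate(x)                        # (tokens, n_experts)
        topk_val, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_val, dim=-1)        # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out

# 8 experts but only 2 run per token -> roughly a quarter of the expert
# parameters are active for any given token.
moe = TopKMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)   # torch.Size([10, 64])
```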

Specific details regarding the size and composition of R1's training dataset have not been publicly disclosed. However, DeepSeek has emphasized its reliance on innovative pure reinforcement learning, namely GRPO with Chain-of-Thought (CoT) format rewards, combined with a supervised fine-tuning (SFT) cold start to fix readability and language-mixing issues, and rejection-sampling post-processing to optimize training efficiency.
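
As an illustration of the group-relative idea behind GRPO, the sketch below normalizes each sampled response's reward against the mean and standard deviation of its own group, which is what lets the method dispense with a separate value (critic) network. The function name, group size, and reward values are hypothetical:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages in the GRPO style: each of the G responses
    sampled for the same prompt is scored relative to its own group's
    mean and standard deviation."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: rewards for 4 responses to one prompt
# (e.g. 1.0 if the CoT answer is correct and well-formatted, else 0.0).
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))   # approximately [ 1. -1.  1. -1.]
```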

Therefore, in theory, applying these efficiency techniques to models with greater computational resources could further improve their accuracy and other performance metrics, but with diminishing returns: performance improves as a power law in scale, as described by neural scaling laws such as the Chinchilla scaling law.

In machine learning, a neural scaling law is an empirical scaling law that describes how neural network performance changes as key factors are scaled up or down. These factors typically include the number of parameters, training dataset size, and training cost.
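
For instance, under the compute-optimal fit of Hoffmann et al. (2022) (the "Chinchilla" law), loss falls as a power law in parameter count N and training-token count D, so each additional doubling of scale buys a smaller absolute gain. The sketch below uses approximately the coefficients reported in that paper; the exact figures and the example model sizes are illustrative:

```python
def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style scaling-law fit: predicted loss is an irreducible
    term E plus power-law terms in parameters N and training tokens D."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling both parameters and tokens of an already-large model
# improves the predicted loss only slightly -- diminishing returns.
print(chinchilla_loss(70e9, 1.4e12))    # ~1.94
print(chinchilla_loss(140e9, 2.8e12))   # ~1.89
```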

cinch