Over the past two weeks, I have received messages and comments asking me to do a deep dive into the training costs of DeepSeek. The topic garnered so much interest that markets reacted, discussions exploded, and some even claimed the AI bubble had burst.
Now that the dust has settled, let's take a step back and look at what the technical paper actually says.
🎯 Here's what we do know from the DeepSeek-V3 paper:
📌 Pre-training: 2,664K H800 GPU hours (~$5.328M)
📌 Context Length Extension: 119K H800 GPU hours (~$0.238M)
📌 Post-Training (SFT + RLHF): 5K H800 GPU hours (~$0.01M)
📌 Total Training Cost: 2.788M GPU hours (~$5.576M)
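If you want to sanity-check the arithmetic behind these figures, here is a minimal Python sketch, assuming the $2 per H800 GPU hour rental price the paper uses to convert GPU hours into dollars:

```python
# Back-of-envelope check of the DeepSeek-V3 training cost figures.
# Assumes the rental price the paper uses: $2 per H800 GPU hour.
H800_RENTAL_USD_PER_HOUR = 2.00

gpu_hours = {
    "Pre-training": 2_664_000,
    "Context length extension": 119_000,
    "Post-training (SFT + RLHF)": 5_000,
}

for stage, hours in gpu_hours.items():
    cost_musd = hours * H800_RENTAL_USD_PER_HOUR / 1e6
    print(f"{stage}: {hours:,} GPU hours ~= ${cost_musd:.3f}M")

total_hours = sum(gpu_hours.values())
total_musd = total_hours * H800_RENTAL_USD_PER_HOUR / 1e6
print(f"Total: {total_hours:,} GPU hours ~= ${total_musd:.3f}M")
```

Running it reproduces the numbers above: $5.328M, $0.238M, $0.010M, and $5.576M in total.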
The paper clearly states that these costs only cover the official training of DeepSeek-V3. It does not go deeper into the costs associated with prior research or ablation experiments on architectures, algorithms, or data. And that's a big gap, because these costs are tough to estimate: they depend on the number of experiments run, the dataset curation, and the model variations tested.
If you're building an LLM from scratch, without using pre-trained components, these costs can be massive.
1️⃣ Compute Needs: Training on raw, unoptimized data takes huge GPU hours (see the rough estimate sketched after this list).
2️⃣ Longer Development Cycles: Teams must experiment with different architectures, optimizers, and data processing techniques.
3️⃣ No Transfer Learning: Without leveraging existing models, everything must be learned from the ground up, requiring more data and iterations.
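To make point 1 concrete, here is a rough, purely illustrative sketch using the common training-compute approximation FLOPs ≈ 6 × N × D (N = parameters, D = training tokens). The model size, token count, per-GPU throughput, and experiment multiplier below are hypothetical assumptions, not DeepSeek figures:

```python
# Illustrative estimate of GPU hours for one full from-scratch training run,
# using the common approximation: training FLOPs ~= 6 * N * D.
# All numbers below are hypothetical assumptions, not DeepSeek figures.

N = 70e9                       # assumed parameter count (70B)
D = 14e12                      # assumed training tokens (14T)
train_flops = 6 * N * D        # ~5.9e24 FLOPs for one run

# Assumed sustained (not peak) throughput per GPU; heavily hardware- and
# efficiency-dependent in practice.
sustained_flops_per_gpu = 3e14   # ~300 TFLOPs/s effective

gpu_hours_per_run = train_flops / sustained_flops_per_gpu / 3600
print(f"One full run: ~{gpu_hours_per_run / 1e6:.1f}M GPU hours")

# From-scratch development rarely stops at one run: architecture, optimizer,
# and data ablations multiply the bill (multiplier is a pure assumption).
experiment_multiplier = 5
print(f"With experiments: ~{gpu_hours_per_run * experiment_multiplier / 1e6:.1f}M GPU hours")
```

Even with generous efficiency assumptions, a single run lands in the millions of GPU hours, and the hidden experimentation on top of it is exactly the part the paper does not cost out.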
That's why training a next-generation model, one that pushes us closer to AGI or ASI, demands huge budgets. If a paper doesn't provide details on these steps, it's tough to make an apples-to-apples comparison with other models.
As a technologist, I'd be the happiest person to see a leading model trained at such a low cost. But to truly assess this claim, we need more data. Without it, we're just blindly following a headline.
Read the entire article here: