Irrespective of the ongoing debate around DeepSeek's model release, whether from an ethical or an IP-infringement standpoint, it's worth understanding some of the key concepts discussed in the paper. I am sure we will hear more on this debate in the coming days, as the release is under extensive scrutiny from an IP-infringement perspective.
Today, let's focus on training strategies and optimizations: how DeepSeek-V3 makes training faster, cheaper, and more efficient.
1️⃣ FP8 Mixed Precision Training
DeepSeek-V3 uses FP8 (8-bit floating point) instead of higher-precision formats like BF16 for most computations. Why? Because smaller numbers need less memory and can be processed faster, as long as they are rounded carefully.
📝 Simple example: Think of FP8 like using shorthand when taking notes. You don't capture every word; you write faster and use less paper, but you still capture the key information.
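To make the rounding idea concrete, here is a minimal pure-Python sketch that simulates quantizing a float to the FP8 E4M3 format (1 sign bit, 4 exponent bits, 3 mantissa bits). This is illustrative only, not DeepSeek's actual kernel-level implementation, and it glosses over subnormals, NaN encodings, and the exact 448 overflow bound:

```python
import math

def quantize_fp8_e4m3(x):
    """Round a float to (approximately) the nearest FP8 E4M3 value.
    Illustrative sketch: real FP8 handles subnormals and overflow
    more carefully than the simple clamping used here."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    # abs(x) = m * 2**e with 0.5 <= m < 1
    m, e = math.frexp(abs(x))
    # Keep 4 significant bits of m (implicit leading 1 + 3 mantissa bits)
    m = round(m * 16) / 16
    # Clamp the exponent to roughly the E4M3 normal range
    e = max(-5, min(9, e))
    return sign * math.ldexp(m, e)

# pi survives in shorthand form: 3.14159 becomes 3.25 in FP8
print(quantize_fp8_e4m3(3.14159))
```

Note how little storage this costs: an FP8 value takes one byte versus two for BF16, which roughly halves activation memory and doubles throughput on hardware with native FP8 support.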
2️⃣ DualPipe Algorithm for Parallel Training
DeepSeek-V3 speeds up training by ensuring different stages overlap. Instead of waiting for one step to finish before starting the next, computation and communication happen at the same time, reducing delays.
📝 Simple example: Imagine a car assembly line where workers multitask and handle different parts at the same time. Instead of waiting for one person to finish everything before moving on, tasks flow smoothly without bottlenecks.
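The overlap principle (though not DualPipe's full bidirectional schedule) can be sketched with a background thread: while batch i+1 is being computed, batch i's results are being "communicated". All function names here are illustrative stand-ins:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def compute(micro_batch):
    time.sleep(0.01)      # stand-in for a forward/backward pass
    return micro_batch * 2

def communicate(result):
    time.sleep(0.01)      # stand-in for a cross-GPU transfer
    return result

def pipelined(batches):
    """Overlap communication of batch i with computation of batch i+1."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = None
        for b in batches:
            out = compute(b)                         # compute current batch
            if pending is not None:
                results.append(pending.result())     # collect previous send
            pending = comm.submit(communicate, out)  # send in background
        if pending is not None:
            results.append(pending.result())
    return results

print(pipelined([1, 2, 3]))  # each transfer hides behind the next compute
```

With n batches, the naive sequential version pays n Ă— (compute + communicate); the pipelined version pays roughly n Ă— compute plus one trailing communicate, because the transfers hide behind the compute of the next batch.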
3️⃣ Efficient Cross-Node Communication
When training large AI models, different parts of the model are often distributed across multiple computers. DeepSeek-V3 optimizes this communication by limiting unnecessary data transfers and using high-speed channels like InfiniBand and NVLink.
📝 Simple example: It's like setting up hotlines for messages between offices so that only the most important information is sent quickly, avoiding congestion.
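One way DeepSeek-V3 limits transfers is node-limited expert routing: each token may only be dispatched to experts on a small number of nodes (the paper caps it at 4). Below is a hypothetical, heavily simplified sketch of that idea; the function name, node-scoring rule, and parameters are my own illustration, not the paper's exact algorithm:

```python
def route_limited(scores, experts_per_node, max_nodes=2, top_k=4):
    """Pick top_k experts for a token, but only from the max_nodes
    best nodes, capping cross-node traffic. Illustrative sketch of
    node-limited routing, not DeepSeek-V3's exact rule."""
    # Group expert scores by the node that hosts them
    nodes = [scores[i:i + experts_per_node]
             for i in range(0, len(scores), experts_per_node)]
    # Rank nodes by their single best expert's score
    node_rank = sorted(range(len(nodes)),
                       key=lambda n: max(nodes[n]), reverse=True)
    allowed = set(node_rank[:max_nodes])
    # Only experts on allowed nodes are candidates
    candidates = [(s, i) for i, s in enumerate(scores)
                  if i // experts_per_node in allowed]
    return sorted(i for s, i in sorted(candidates, reverse=True)[:top_k])

# 8 experts on 4 nodes; the token talks to at most 2 nodes
scores = [0.9, 0.1, 0.8, 0.2, 0.05, 0.6, 0.7, 0.3]
print(route_limited(scores, experts_per_node=2, max_nodes=2, top_k=2))
```

The payoff: no matter how many experts a token would like, the number of nodes it can touch (and therefore the InfiniBand traffic it generates) is bounded by a constant.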
4️⃣ Memory-Saving Techniques
Training massive models requires huge amounts of memory. DeepSeek-V3 saves memory by recalculating values instead of storing them and offloading some data to CPU memory when not in active use.
📝 Simple example: Imagine solving a long math problem. Instead of writing down every intermediate step, you redo quick calculations when needed (e.g., you don't need a written step for 2*2; it's fast enough to redo on the fly), saving space while still getting the right answer.
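The recomputation trick is known as activation checkpointing. Here is a minimal sketch under simplified assumptions (layers as plain functions, no autograd): only every `segment`-th activation is kept, and anything in between is recomputed from the nearest saved checkpoint when asked for:

```python
def run_with_recompute(layers, x, segment=2):
    """Activation checkpointing sketch: save activations only at
    segment boundaries; recompute the rest on demand. Trades a
    little extra compute for much less stored state."""
    saved = {0: x}
    h = x
    for i, f in enumerate(layers, start=1):
        h = f(h)
        if i % segment == 0:          # keep every `segment`-th activation
            saved[i] = h

    def activation(i):
        """Recompute activation i from the nearest earlier checkpoint."""
        j = max(k for k in saved if k <= i)
        a = saved[j]
        for f in layers[j:i]:         # redo the cheap intermediate steps
            a = f(a)
        return a

    return h, activation

# 4 toy "layers", each adding 1; only activations 0, 2, 4 are stored
layers = [lambda v: v + 1 for _ in range(4)]
h, activation = run_with_recompute(layers, 0, segment=2)
print(h, activation(3))  # activation 3 is rebuilt from checkpoint 2
```

With a segment length of s over n layers, you store about n/s activations instead of n, at the cost of at most s-1 extra recomputation steps per lookup, which is exactly the "redo the quick calculation" trade-off from the analogy above.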
These are a few of the important optimization approaches that make DeepSeek-V3 faster and more efficient while keeping costs down.
In tomorrow's last post on this topic, I will highlight the post-training refinements that were done. Stay tuned!!!
Read the complete document below: