
DeepSeek V4 paper full version is out, FP4 QAT details and stability tricks [D]

Matthieu Morel
May 11, 2026 · 13 min read

DeepSeek V4's Full Paper Is Out: What the FP4 QAT Details and Stability Tricks Actually Mean

TL;DR

DeepSeek's V4 technical paper documents a working FP4 quantization-aware training pipeline for a frontier-scale mixture-of-experts model — a meaningful step beyond the FP8 approach the V3 paper introduced. The inference efficiency gains are real and the methodology is specific enough to be reproduced. Whether those stability tricks hold outside DeepSeek's custom hardware and training stack is the question the research community will be stress-testing over the next several months.

Key Takeaways

  • DeepSeek completes a quantization progression — FP16 → FP8 (V3) → FP4 (V4) — with the full paper detailing the QAT methodology, group-wise scaling, and phased training schedule, according to DeepSeek's technical report
  • FP4 weights reduce model memory footprint by approximately 2x relative to FP8 and 4x relative to FP16, enabling significantly higher inference throughput on fixed hardware, as DeepSeek's paper documents
  • Quantization-Aware Training at FP4 precision requires Straight-Through Estimators plus per-group learned scaling factors to avoid catastrophic gradient collapse — the paper specifies the exact group sizes used for attention versus feed-forward expert layers
  • NVIDIA's Blackwell architecture (B100/B200) ships with native FP4 tensor core support, meaning these weight formats now map to hardware-level compute acceleration rather than memory savings alone, per NVIDIA's Blackwell product documentation
  • Benchmark performance on coding tasks reportedly holds within 1% of the BF16 baseline; mathematical reasoning shows a slightly wider gap of 1–2% absolute — consistent with prior quantization literature showing math tasks are more precision-sensitive
  • Per-group quantization is the critical design decision: per-tensor FP4 degrades performance unacceptably for attention-heavy tasks; the paper's ablations show that granularity of scaling factors accounts for most of the recovered accuracy
  • Three stability interventions — mixed-precision BF16 master weights, a phased QAT transition schedule, and per-expert quantization ordering in MoE layers — are documented explicitly and represent engineering decisions not yet standardized in open-source tooling

What the DeepSeek V4 Paper Actually Covers

The title carries a signal: this is the full version, not the model release announcement. Full papers matter for replication — they include the ablations, the failure modes, and the engineering choices that didn't fit into a system card. For anyone trying to reproduce or adapt this work, the distinction between a teaser and a complete technical report is the difference between a description of the destination and a working set of directions.

The core contribution is a validated FP4 QAT recipe for a large MoE model. That sounds incremental if you haven't tried to do it. In practice, getting FP4 training stable — not just functional for a few hundred steps — across hundreds of billions of parameters and dynamic expert routing is the problem that stopped most teams from attempting it seriously before Blackwell hardware made the compute case compelling.

FP4: The Format Choice and What It Costs You

FP4 in this context is E2M1: one sign bit, two exponent bits, one mantissa bit. That gives 16 bit patterns and, because +0 and −0 coincide, 15 distinct representable values. For comparison, FP8 E4M3 — the format DeepSeek used for activations in V3 — has a dynamic range orders of magnitude wider. You're not just losing precision at the edges; you're representing the entire weight distribution with roughly 15 possible values per quantized element before scaling factors are applied.
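For intuition, here is a minimal Python sketch that enumerates what an E2M1 element can represent. It assumes the common E2M1 convention (exponent bias of 1, no inf or NaN encodings); the paper's exact encoding may differ in corner cases.

```python
# Enumerate the representable values of E2M1 (FP4): 1 sign, 2 exponent, 1 mantissa bit.
# Assumption: exponent bias of 1, no inf/NaN encodings (the common MX-style convention).

def e2m1_values():
    vals = set()
    for sign in (1.0, -1.0):
        for exp in range(4):          # 2 exponent bits
            for man in range(2):      # 1 mantissa bit
                if exp == 0:          # subnormal: 0.m * 2^(1 - bias)
                    mag = (man / 2.0) * 2.0 ** (1 - 1)
                else:                 # normal: 1.m * 2^(exp - bias)
                    mag = (1.0 + man / 2.0) * 2.0 ** (exp - 1)
                vals.add(sign * mag)
    return sorted(vals)

print(e2m1_values())
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```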

The case for accepting this cost is straightforward: memory bandwidth. A 671B-parameter MoE model at FP8 occupies roughly 671 GB. At FP4 that halves to around 335 GB. On modern inference hardware where you're moving weights through memory on every forward pass, that reduction directly translates to lower latency and higher tokens-per-second on a fixed server footprint. The computation itself is secondary; the bottleneck at this scale is almost always memory bandwidth.
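A back-of-the-envelope version of that arithmetic, counting weights only and ignoring scaling-factor overhead, KV cache, and activations, is sketched below.

```python
# Rough weight-memory arithmetic for a 671B-parameter model at different precisions.
# Illustrative only: ignores per-group scaling factors, KV cache, activations, and
# serving overhead, and uses decimal GB.

PARAMS = 671e9
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}

for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt}: {PARAMS * nbytes / 1e9:,.0f} GB of weights")
# bf16 ≈ 1,342 GB, fp8 ≈ 671 GB, fp4 ≈ 336 GB
```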

QAT vs. PTQ: Why This Distinction Matters

Post-Training Quantization takes a trained BF16 or FP16 model and applies quantization after the fact. Calibration data helps, but the model was never trained to work within FP4 constraints. The systematic error introduced by snapping continuous weights to 16-value grids compounds across layers and is particularly damaging for multi-step reasoning tasks.

Quantization-Aware Training inserts simulated quantization into the forward pass during training itself. The Straight-Through Estimator passes gradients through the quantization discontinuity as if it weren't there — a known approximation, but one that holds empirically at the precisions used here. The model progressively learns weight distributions that are still expressive after FP4 snapping. The SmoothQuant line of research established related ideas for redistributing quantization difficulty between activations and weights; DeepSeek V4 builds quantization into the training loop itself rather than applying it post-hoc.
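A minimal PyTorch sketch of the fake-quantize-plus-STE pattern is below. It is illustrative only, not DeepSeek's implementation, and it omits the per-group scaling discussed later; the E2M1 grid is the standard set of FP4 values.

```python
import torch

# The 15 distinct values an E2M1 element can represent (+0 and -0 coincide).
E2M1_GRID = torch.tensor([-6, -4, -3, -2, -1.5, -1, -0.5, 0,
                          0.5, 1, 1.5, 2, 3, 4, 6.])

def fake_quant_ste(w: torch.Tensor, grid: torch.Tensor = E2M1_GRID) -> torch.Tensor:
    """Forward: snap each weight to its nearest grid value.
    Backward: pass gradients through unchanged (the Straight-Through Estimator)."""
    w_q = grid[(w.unsqueeze(-1) - grid).abs().argmin(dim=-1)]
    return w + (w_q - w).detach()

w = torch.randn(8, requires_grad=True)
fake_quant_ste(w).sum().backward()
print(w.grad)  # all ones: the snapping step is invisible to the gradient
```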

The key practical implication: community checkpoints that appear on public model hubs within days of a model release are PTQ. They use GPTQ, AWQ, or GGUF recipes. These are not what the paper benchmarks. Comparing them to DeepSeek's reported QAT results is comparing different things.

The Stability Tricks That Make FP4 Training Viable

Three interventions in the paper deserve close attention from anyone planning to replicate or adapt this work:

Mixed-precision BF16 master weights. The training runs with FP4 weights for forward computation but maintains BF16 master copies for optimizer updates. This isn't conceptually new — mixed-precision FP16 training has used this pattern since 2018. What's non-trivial is re-tuning loss scaling and determining which layers hold what precision at which training phase for a 670B-parameter MoE.
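A rough illustration of that master-weight pattern follows. The layer size, optimizer, learning rate, and absence of scaling factors are all placeholder simplifications, not the paper's choices.

```python
import torch

E2M1_GRID = torch.tensor([-6, -4, -3, -2, -1.5, -1, -0.5, 0,
                          0.5, 1, 1.5, 2, 3, 4, 6.])

def fake_quant_ste(w):  # same snap-forward / identity-backward helper as above
    w_q = E2M1_GRID[(w.unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)]
    return w + (w_q - w).detach()

# BF16 master copy owned by the optimizer; the forward pass only ever sees its
# fake-quantized FP4 projection.
master_w = torch.randn(256, 256, dtype=torch.bfloat16, requires_grad=True)
opt = torch.optim.AdamW([master_w], lr=1e-4)

x = torch.randn(2, 256)
loss = (x @ fake_quant_ste(master_w.float())).pow(2).mean()
loss.backward()   # gradients land on the BF16 master copy via the STE
opt.step()        # the optimizer never updates FP4 values directly
```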

Per-group scaling. A single scale factor per weight tensor fails immediately at FP4 for large matrices. The V4 paper reports using group size 16 for attention weight matrices and group size 32 for feed-forward weights in the MoE expert layers. This granularity is what gives each group of weights a local scale factor that captures the distribution of that specific group rather than the entire tensor. The paper's ablations show the accuracy impact of increasing group size — the degradation curve is steep, and this is where most of the difference between "FP4 works" and "FP4 is broken" lives.
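A simplified sketch of what per-group scaling buys you is below. Absmax scaling onto an E2M1 grid is my simplifying assumption; the paper reports learned per-group scales, and the group sizes 16 and 32 are the ones it describes for attention and expert layers respectively.

```python
import torch

E2M1_GRID = torch.tensor([-6, -4, -3, -2, -1.5, -1, -0.5, 0,
                          0.5, 1, 1.5, 2, 3, 4, 6.])

def quantize_per_group(w: torch.Tensor, group_size: int = 16):
    """One scale per contiguous group of `group_size` weights (absmax scaling here,
    as a stand-in for the paper's learned scales). Assumes w.numel() is divisible
    by group_size."""
    groups = w.reshape(-1, group_size)
    # map each group's largest magnitude onto the grid's maximum (6.0 for E2M1)
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / E2M1_GRID.max()
    idx = ((groups / scale).unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    return (E2M1_GRID[idx] * scale).reshape(w.shape), scale

w = torch.randn(1024, 1024)
w_g16, _ = quantize_per_group(w, group_size=16)    # attention-style granularity
w_g512, _ = quantize_per_group(w, group_size=512)  # much coarser groups
print((w - w_g16).abs().mean(), (w - w_g512).abs().mean())  # coarser groups, larger error
```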

Phased QAT schedule. The model doesn't begin with FP4 weights. It starts in higher precision, transitions through a warmup phase, and reaches full FP4 forward passes after weight distributions have stabilized. Expert weights in the MoE feed-forward layers are quantized first — they represent the bulk of the parameter count but are individually less critical per token than shared attention layers. Attention weights are quantized later in training. Rushing this schedule causes instability in the routing logits; the MoE gate activations become noisy, and the training loss diverges rather than converging.
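The schedule itself can be expressed as a small lookup. The step thresholds below are invented for illustration; the paper's actual schedule is calibrated to its own run.

```python
# Illustrative phased QAT schedule: expert FFN weights enter fake-quantization
# first, attention weights later. Step boundaries are made-up placeholders.

PHASES = [
    (0,      set()),                         # warmup: full-precision forward
    (10_000, {"moe_experts"}),               # quantize expert FFN weights first
    (30_000, {"moe_experts", "attention"}),  # then attention weights
]

def quantized_groups(step: int) -> set:
    """Return which weight groups are fake-quantized at a given training step."""
    active = set()
    for start_step, groups in PHASES:
        if step >= start_step:
            active = groups
    return active

assert quantized_groups(5_000) == set()
assert quantized_groups(15_000) == {"moe_experts"}
assert quantized_groups(40_000) == {"moe_experts", "attention"}
```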

Benchmark Evidence: What the Paper Reports

The benchmarks follow a consistent pattern across the results tables. On coding tasks — HumanEval and SWE-Bench variants — FP4 QAT stays within approximately 1% absolute of the BF16 baseline. Mathematical reasoning benchmarks (MATH, AIME-style evaluations) show a slightly larger gap in the 1–2% absolute range. General knowledge and reasoning evaluations (MMLU, GPQA) show differences within measurement noise.

Here's the honest read: FP4 QAT recovers most of what naive PTQ would lose. It does not match BF16 exactly, and the math/reasoning gap is consistent with what the broader quantization literature has documented. Whether 1–2% on a MATH benchmark matters for your deployment depends entirely on your application. The paper's selected benchmarks are not your task distribution.

| Evaluation Category | FP4 QAT vs. BF16 Baseline | FP8 PTQ vs. BF16 Baseline | Key Sensitivity |
| --- | --- | --- | --- |
| Code generation (HumanEval) | ~1% gap | ~0.5% gap | Low |
| Mathematical reasoning (MATH) | ~1–2% gap | ~1% gap | Medium-high |
| General reasoning (MMLU) | Within noise | Within noise | Low |
| Long-context tasks | Untested publicly | Varies | Unknown |
| Structured output | Not benchmarked | Not benchmarked | Likely medium |

What This Changes for AI Engineers, ML Researchers, and Technical Evaluators

Checklist: Evaluating FP4 Models Before You Commit

  • Confirm your hardware generation. FP4 tensor core support exists on Blackwell (B100, B200) and newer. On Hopper (H100/H200) and Ampere, FP4 weights reduce memory but require runtime dequantization — no native compute acceleration.
  • Run your task distribution, not the paper's benchmarks. Pull the FP4 and FP8 checkpoints, run them on 200–500 representative examples from your use case, and measure the delta before assuming the paper's numbers apply.
  • Check the quantization config. Group size 16 and group size 128 produce meaningfully different accuracy at FP4. The model card and quantization config JSON should specify this. If they don't, treat the checkpoint as unverified.
  • Monitor expert utilization in MoE serving. FP4 quantization can subtly degrade routing logit quality, leading to expert load imbalance. Check utilization histograms for the first few days of production; expert collapse shows up as throughput degradation rather than accuracy failure.
  • Establish a PTQ baseline before investing in QAT replication. If PTQ at FP4 is within 1% of QAT for your specific task, you don't need access to DeepSeek's training infrastructure to get most of the benefit.
  • Model the bandwidth gain, not just the memory gain. FP4 wins compound with batch size. At batch size 1 (low-latency single-request serving), the gain is primarily memory. At batch sizes ≥ 8 on Blackwell, you start seeing the tensor core advantage in actual throughput numbers; a rough sizing sketch follows this list.
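Here is the kind of rough, bandwidth-bound sizing sketch the last item refers to. The active-parameter count and HBM bandwidth figures are placeholder assumptions, not V4 or Blackwell specifications.

```python
# Bandwidth-bound decode ceiling: assume every active parameter's weights are read
# once per generated token and memory bandwidth is the only bottleneck. Real serving
# adds KV cache traffic, activations, and kernel overheads.

ACTIVE_PARAMS = 37e9   # assumed active parameters per token for a large MoE
HBM_BW = 8e12          # assumed bytes/s of HBM bandwidth per GPU

for fmt, nbytes in {"fp8": 1.0, "fp4": 0.5}.items():
    bytes_per_token = ACTIVE_PARAMS * nbytes
    print(f"{fmt}: ~{HBM_BW / bytes_per_token:.0f} tokens/s per GPU (upper bound)")
# The 2x ratio between the two lines, not the absolute numbers, is the takeaway.
```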

For teams building inference infrastructure that feeds into agentic pipelines, the cost modeling here is directly relevant — the relationship between inference efficiency and viable agent call budgets is a concrete engineering constraint, not an abstract optimization.

When NOT to Use FP4 Quantization

Don't deploy FP4 on Hopper or Ampere hardware expecting compute gains. Without native FP4 tensor cores, you're doing FP4 storage with BF16 or FP8 computation, which adds dequantization overhead. Your latency will be comparable to FP8 or worse on those architectures.

Don't fine-tune an FP4 QAT checkpoint in full precision and re-deploy it as FP4. Full-precision fine-tuning shifts the weight distributions that QAT trained to be FP4-compatible. The resulting checkpoint needs re-quantization, and few teams have tooling for QAT re-quantization post-fine-tuning. Using standard GPTQ or AWQ PTQ on a fine-tuned model will produce different results from the paper's reported numbers.

Don't conflate the paper's QAT results with any publicly available community checkpoint. This point is worth repeating. The FP4 QAT results in the paper reflect a specific training run with specific scaling factors. Community quantizations are PTQ. The accuracy gap between them is real and application-dependent.

Don't skip per-task evaluation for applications where distribution tails matter. Legal reasoning, clinical summarization, and structured output generation are all disproportionately affected by low-precision rounding at the token-probability level. Aggregate benchmark numbers mask tail degradation. Test your 5th percentile worst-case outputs, not just your average.
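A tiny helper for that kind of tail check, assuming you already have per-example quality scores from your own evaluation harness:

```python
import numpy as np

def tail_report(scores: np.ndarray, name: str) -> None:
    """Print the mean and 5th-percentile score for one checkpoint's per-example results.
    A small mean gap between checkpoints can hide a much larger gap at the tail."""
    print(f"{name}: mean={scores.mean():.3f}  p5={np.percentile(scores, 5):.3f}")

# usage (hypothetical arrays from your own harness):
# tail_report(bf16_scores, "bf16"); tail_report(fp4_scores, "fp4-qat")
```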

Where This Is Heading

FP4 becomes the default serving precision for frontier models on new hardware within 18 months. The three obstacles — methodology, hardware, and ecosystem tooling — are now simultaneously resolving. DeepSeek's paper closes the methodology gap; Blackwell shipments close the hardware gap; the tooling ecosystem will follow because the incentives are clear.

The model card metadata problem gets worse before it gets better. With FP4, FP8, INT4, GPTQ, AWQ, and GGUF all in active use, the precision metadata on public model repositories is inconsistent. Engineers evaluating performance now need to track not just "which model" but "which quantization recipe, which group size, which calibration dataset" — information that is frequently missing, mislabeled, or misunderstood by the person publishing the checkpoint.

MoE and quantization co-design is an underexplored research area. The V4 paper documents per-expert quantization scheduling for one architecture. The deeper question — how expert specialization evolves under quantization-induced weight noise across training — isn't settled. We don't know whether FP4 QAT preserves or degrades expert diversity in ways that don't show up in standard benchmarks.

Smaller labs will close the replication gap faster than expected. A full technical disclosure means teams with Blackwell access can adapt this methodology without reverse-engineering it. Expect open-source FP4 QAT tooling to mature quickly over the next two quarters, with ablation results that either confirm or challenge DeepSeek's specific recipe.

Hardware-model co-design will tighten further. FP4 QAT in V4 was developed with Blackwell's compute primitives as a design constraint. The trajectory points toward model architectures being specified in terms of quantization compatibility from pre-training — not as a post-hoc deployment optimization but as a first-class training objective.

FAQ

Does FP4 QAT outperform FP8 PTQ in practice? On robust tasks like code generation and general QA, the delta is small. The QAT training lets the model adapt to FP4 constraints, so it partially closes the gap that PTQ would open. On math-heavy or precision-sensitive tasks, FP8 PTQ typically still outperforms FP4 QAT by a modest margin. The paper's coding benchmarks support parity; the math benchmarks show FP4 QAT trailing slightly.

If I download a community FP4 checkpoint, am I getting the QAT version the paper describes? Almost certainly not. Community quantizations released shortly after a model drop are PTQ, using GPTQ, AWQ, or GGUF Q4 recipes. The paper's performance numbers do not apply to those checkpoints. Check the model card for explicit confirmation that QAT was used and what group sizes were applied.

What hardware do I actually need to realize the full benefit? Full computational benefit requires Blackwell (B100, B200, or consumer Blackwell). On Hopper (H100/H200), FP4 weights reduce memory bandwidth requirements but don't run on native FP4 tensor cores — you get the memory savings without the compute acceleration.

How replicable is the stability methodology outside DeepSeek's infrastructure? The paper is a full technical disclosure with ablations, but the phased QAT schedule is calibrated to their specific architecture and hardware stack. Applying it to a different MoE configuration without running the group-size and scheduling ablations the paper describes is likely to produce instability, particularly during the expert weight quantization transition phase.

How does the V4 approach differ from what the DeepSeek-V3 paper documented? V3 used FP8 for activations and certain weight matrices during training, with BF16 master weights for optimizer stability. V4 extends the precision frontier: FP4 for the bulk of weight storage at inference time, with QAT to recover accuracy. V3's FP8 training infrastructure was a prerequisite — the V4 paper explicitly builds on it.

Does this apply to dense models, or only MoE architectures? The core FP4 QAT methodology — mixed-precision master weights, per-group scaling, phased schedule — transfers to dense transformers with modifications. The MoE-specific contribution is the per-expert ordering of quantization phases. Dense models lose that phasing trick and need alternative warm-up strategies.

What should practitioners do with this paper today? Read the group-size ablations and the phased schedule description carefully. If you're serving on Blackwell, benchmark the official FP4 checkpoint against your task distribution before deploying. If you're on Hopper or older, FP8 or AWQ INT4 remains your practical path — the V4 paper doesn't change that short-term. And treat any community FP4 checkpoint as PTQ until the model card explicitly states otherwise.

AI Systems & Technology Editor. I started writing code when I was 14 and never fully stopped, even after I began writing about it. Since 2015 I've dedicated myself to AI research and earned my PhD in Computer Science with a thesis on Optimization and Stability in Non-Convex Learning Systems. I've read more technical papers than you can imagine, played with hundreds of tools, and currently run a huge local setup where I have fun deploying and testing models.