BestAIFor.com

Nemotron 3 Super vs Qwen 3.5: When Speed and Accuracy Point in Opposite Directions

Matthieu Morel
March 19, 2026 · 10 min read

TL;DR: NVIDIA's Nemotron 3 Super and Alibaba's Qwen 3.5 both activate roughly 10 billion parameters per token. In production, Nemotron 3 Super runs about 3x faster. Qwen 3.5 beats it by 16 points on SWE-bench Verified. The gap has almost nothing to do with model size — and a lot to do with architecture choices and software ecosystem maturity.

Key Takeaways

  • Both models activate roughly 10B parameters per token. The throughput gap is architectural, not a size advantage.
  • Nemotron 3 Super's Mamba-2 SSM layers have O(n) sequence complexity vs standard Transformer attention's O(n²). At long output sequences, this compounds with Multi-Token Prediction and native NVFP4 precision to deliver real speed gains.
  • Qwen 3.5 scores 76.4% on SWE-bench Verified (397B flagship) vs Nemotron 3 Super's 60.47%. The accuracy gap matters for high-stakes single-shot tasks.
  • NVIDIA claims 7.5x throughput. Third-party API data puts the real-world gap at 3x–4x for standard workloads.
  • Qwen 3.5 is Apache 2.0. Nemotron 3 Super is not — it uses NVIDIA's own open model license with attribution requirements.
  • Nemotron 3 Super was designed for NVIDIA-native agentic pipelines, not general-purpose deployment. Qwen 3.5 was designed to run anywhere.


NVIDIA released Nemotron 3 Super on March 11, 2026 at GTC. Alibaba had released Qwen 3.5 in February. Both are open-weight, mixture-of-experts models that activate roughly 10 billion parameters per token. On a spec sheet, they occupy the same tier.

In production, one runs about 3x faster. The other repairs code 16 percentage points more accurately on SWE-bench Verified. They are not competing on the same axis — and picking the wrong one for your workload has real cost consequences.

Here's what the benchmarks actually say, why the throughput gap is smaller than NVIDIA's paper claims, and a concrete framework for choosing between them.

The Architecture Difference Driving the Speed Gap

Nemotron 3 Super uses a three-way hybrid architecture: Mamba-2 SSM layers, standard Transformer attention layers, and a "LatentMoE" expert routing system. It is the first production model to interleave all three paradigms in a single forward pass. The SSM layers handle most sequence processing with O(n) linear complexity — vs the O(n²) of standard Transformer attention. At 64,000 output tokens, a typical length for a coding agent generating a large file, this difference compounds fast.
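The complexity argument can be made concrete with a toy calculation. This sketch is illustrative only: it ignores constant factors, hidden dimensions, and kernel efficiency, and simply compares how the per-layer sequence-mixing cost scales for the two mechanisms.

```python
# Illustrative only: relative sequence-mixing cost per layer, ignoring
# constant factors, hidden dimensions, and real kernel behavior.
def attention_cost(n: int) -> int:
    return n * n  # standard attention: every token attends to every token

def ssm_cost(n: int) -> int:
    return n  # Mamba-2-style SSM scan: one recurrent step per token

for n in (8_000, 64_000):
    ratio = attention_cost(n) / ssm_cost(n)
    print(f"n={n:>6}: attention/SSM cost ratio ~ {ratio:,.0f}x")
```

At the article's 64,000-token output length, the quadratic term dominates, which is the "compounds fast" effect described above.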

Two factors amplify the speed advantage further. Multi-Token Prediction (MTP) — a form of built-in speculative decoding — generates multiple tokens per forward pass, adding roughly 50% to raw generation speed. NVIDIA also trained the model in NVFP4 precision natively. On Blackwell B200 hardware, NVFP4 runs 4x faster than FP8 on H100. Stack all three advantages and you get NVIDIA's 7.5x benchmark figure. It's real — under those specific conditions.
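The MTP contribution can be sketched with a simple expected-value model. The numbers here are hypothetical, not NVIDIA's measured figures: assume each forward pass drafts some extra tokens beyond the one it would normally emit, and a fraction of those drafts are accepted.

```python
# Hypothetical model of multi-token prediction throughput gain.
# k_draft: extra tokens drafted per forward pass (assumed)
# accept:  fraction of drafted tokens kept (assumed)
def mtp_speedup(k_draft: int, accept: float) -> float:
    # expected tokens emitted per pass vs. one-token-at-a-time decoding
    return 1 + k_draft * accept

# One drafted token accepted half the time reproduces a ~50% gain,
# in line with the figure cited above.
print(mtp_speedup(1, 0.5))  # 1.5
```

The point of the sketch: MTP's gain depends on the acceptance rate, which varies by workload, so it stacks multiplicatively but not uniformly with the SSM and precision advantages.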

Qwen 3.5's architecture is also theoretically efficient. It interleaves Gated DeltaNet linear attention layers with standard Transformer attention in a 3:1 ratio, plus 256 routed MoE experts per layer. Gated DeltaNet is linear-time in theory — but as of early 2026, it has no ONNX operator support. Deployments running outside native PyTorch on CUDA decompose each recurrent step into 15–20 primitive operations instead of a single fused kernel. In those configurations, throughput on the DeltaNet layers degrades by 10–50x. This is a software ecosystem gap, not a fundamental architectural weakness — but it's real today, and it's the primary structural reason Nemotron 3 Super is faster in mainstream deployments.

Benchmark Results: What the Numbers Actually Say

Here's a direct comparison across the benchmarks most relevant to production deployment decisions.

Benchmark | Nemotron 3 Super (120B) | Qwen3.5-122B | Qwen3.5-397B
SWE-bench Verified | 60.47% | ~72–74% (not independently confirmed) | 76.4%
GPQA (with tools) | 82.70% | — | 88.4%
MMLU-Pro | 83.73% | — | 86%+
HLE (Humanity's Last Exam) | 18.26% | — | 25.30%
RULER at 1M context | 91.75% | — | —
PinchBench (agentic orchestration) | 85.6% (top open model) | — | —
Throughput (real-world API) | ~458–484 tok/s | ~152 tok/s | —

The SWE-bench gap is the most operationally significant number here. A 16-point difference on coding repair tasks translates to a meaningfully higher per-step failure rate. If you're running high-volume agentic pipelines where retries are cheap and fast, that failure rate becomes a cost model question. If you're running single-shot high-stakes evaluation — production code review, security triage — the accuracy gap matters more than the throughput advantage.

Where Nemotron 3 Super leads clearly: long-context performance and agentic orchestration. A 91.75% RULER score at 1 million tokens and an 85.6% PinchBench score represent the strongest results in the open-weight category for multi-step agentic workflows as of March 2026. The 1 million token native context window, built on Mamba-2's linear complexity, is a genuine hardware-efficient advantage for workflows that need it.

The HLE gap is worth flagging separately. Nemotron 3 Super scores 18.26% vs Qwen3.5-397B's 25.30% on Humanity's Last Exam, which tests general scientific breadth across domains. Denser architectures with broader training coverage tend to perform better here — consistent with the pattern across most general reasoning benchmarks.

The Real-World Throughput Gap

NVIDIA's technical report claims 7.5x higher throughput than Qwen3.5-122B. The test configuration: 8,000 input tokens and 64,000 output tokens, on NVIDIA hardware running NVFP4 precision. Under those conditions — long outputs, NVIDIA Blackwell stack, native precision format — the advantage is real.

Third-party data from Artificial Analysis puts the production API gap at roughly 3x: Nemotron 3 Super delivers 458–484 tokens per second; Qwen3.5-122B delivers around 152 tokens per second on Alibaba's API. Still a significant margin — but the 7.5x figure represents peak optimized conditions. Most workloads with shorter output sequences will see something closer to 3x.

One deployment constraint worth noting on the Nemotron side: Mamba-2 SSM kernels also have limited third-party framework support in early 2026. ONNX operators for SSM layers don't exist yet. If your inference stack runs outside NVIDIA NIM containers or native PyTorch on CUDA, you lose the SSM kernel advantage — and you're left with the same software ecosystem friction that affects Qwen's DeltaNet layers. Both models have an 8x H100-80GB self-hosting floor. Neither is accessible on small GPU setups.

Licensing: The Detail Most Teams Skip

Qwen 3.5 is Apache 2.0 across all model sizes. You can modify it, deploy it commercially, build derivative models, and redistribute without attribution requirements.

Nemotron 3 Super uses the NVIDIA Nemotron Open Model License. It's commercially usable and royalty-free, but it carries attribution requirements and NVIDIA-specific safeguard clauses. Calling it "open-source" without qualification is inaccurate. For enterprise teams with legal review processes — particularly those building products that include model redistribution or derivative model training — this difference is non-trivial. Verify with your legal team before building it into a production product.

How to Choose: A Decision Framework for Builders

The core question is whether you're optimizing for throughput in a high-volume agentic pipeline or for accuracy on individual high-stakes tasks. Here's a practical checklist.

Use Nemotron 3 Super when:

  • ☐ You're running on NVIDIA H100 or B200 infrastructure natively
  • ☐ Your workload generates long output sequences — 20,000+ tokens per agent step
  • ☐ You're building multi-agent pipelines where cost per token is the primary constraint
  • ☐ 1 million token native context is a hard requirement
  • ☐ SWE-bench accuracy in the 60% range is sufficient for your task
  • ☐ NVIDIA Open Model License is acceptable to your legal team

Use Qwen 3.5 when:

  • ☐ Best available accuracy on coding and reasoning is the priority (SWE-bench 76.4%)
  • ☐ Your hardware is non-NVIDIA, or you need infrastructure flexibility
  • ☐ Apache 2.0 licensing is required for redistribution or derivative model training
  • ☐ Multimodal inputs are needed — Qwen 3.5 supports image inputs natively; Nemotron 3 Super does not
  • ☐ Multilingual coverage matters — Qwen 3.5 supports 201 languages
  • ☐ You want thinking and non-thinking mode available in a single model

One nuance worth modeling before committing: in a pipeline where each agent step has a failure mode and you're running 1,000 steps per hour, Nemotron 3 Super's lower SWE-bench score means more retries. If each retry costs roughly the same as an additional inference step, the throughput advantage partially erodes. Calculate cost-per-successful-outcome for your actual workload, not just tokens-per-second.
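That retry calculation is easy to sketch. The model below uses assumed prices, not measured figures: treat each agent step as an independent trial with per-step success rate p, retried until it succeeds, so expected attempts per success is 1/p.

```python
# Back-of-envelope cost model with assumed numbers (not measured prices):
# expected cost per successful step = cost per attempt / success probability.
def cost_per_success(cost_per_attempt: float, p_success: float) -> float:
    return cost_per_attempt / p_success

# Suppose (hypothetically) the ~3x throughput edge makes a Nemotron attempt
# 3x cheaper ($0.01 vs $0.03), while per-step success tracks SWE-bench
# Verified (60.47% vs 76.4%):
nemotron = cost_per_success(0.01, 0.6047)
qwen = cost_per_success(0.03, 0.764)
print(f"Nemotron: ${nemotron:.4f}/success, Qwen: ${qwen:.4f}/success")
```

Under these particular assumptions the throughput edge survives the higher retry rate, but with different prices or retry overheads the conclusion can flip — which is exactly why the per-workload calculation matters.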

When You Should NOT Use These Models

Nemotron 3 Super is a poor fit when:

  • Your inference stack runs outside native PyTorch on CUDA or NVIDIA NIM containers. The SSM speed advantage disappears, and you're left with an 8x H100 hardware requirement and none of the throughput benefits.
  • Your use case requires broad general scientific reasoning. An 18.26% HLE score represents real domain coverage gaps outside the model's agentic training focus.
  • Your legal team requires Apache 2.0 for redistribution or derivative model development. The NVIDIA license doesn't cover this.

Qwen 3.5 is a poor fit when:

  • You need maximum throughput on NVIDIA hardware for long-output agentic tasks. Even on native CUDA, Nemotron 3 Super runs ~3x faster in real-world API conditions.
  • You're running non-CUDA deployments and need predictable inference throughput. Gated DeltaNet's kernel gap degrades throughput significantly in those environments until ONNX support catches up.
  • Your accuracy requirement is specifically tied to the 397B tier. The 76.4% SWE-bench figure belongs to the 397B variant — the 122B model likely sits around 72–74%, which is still strong but a different cost and hardware tier.

FAQ

What is the minimum hardware to self-host either model?

Both require a minimum of 8x H100-80GB to self-host. Nemotron 3 Super also supports NVIDIA NIM containers with FP8 weights on H100. Neither runs on small GPU setups — these are enterprise-class deployments.

Does Nemotron 3 Super support image or multimodal inputs?

No. Nemotron 3 Super is text-only as of March 2026. Qwen 3.5 supports image inputs natively across multiple model sizes in the family, including the 122B variant.

Is NVIDIA's 7.5x throughput claim accurate?

It's accurate under specific conditions: 64,000+ output tokens on Blackwell hardware under NVFP4 precision. Third-party API data puts the production gap at 3x–4x for typical workloads with standard output lengths. Significant, but not 7.5x in standard conditions.

Can Qwen 3.5 handle 1 million token context like Nemotron 3 Super?

Qwen 3.5 has a 256K native context window, extensible to roughly 1M. Nemotron 3 Super's 1M context is native and uses Mamba-2's linear complexity, making it more compute-efficient at very long contexts.

Which model is better for software engineering agents?

Qwen 3.5 scores higher on SWE-bench Verified — 76.4% vs 60.47%. For accuracy-first single-shot workflows, Qwen 3.5 is the stronger choice. For high-volume agents where inference cost per step is the constraint, Nemotron 3 Super may reduce overall cost depending on your retry tolerance and hardware stack.

Conclusion: Next Steps

Neither model is the clear winner. NVIDIA built Nemotron 3 Super to demonstrate what their hardware stack enables at scale — the throughput advantage is real on their infrastructure, and it's operationally meaningful for high-volume agentic workflows. Alibaba built Qwen 3.5 to run everywhere, with the strongest SWE-bench score at its parameter class, true Apache 2.0 licensing, and hardware-agnostic architecture.

The practical decision: if you're on NVIDIA-native infrastructure and running throughput-critical agentic pipelines where per-step accuracy in the 60% range is acceptable, Nemotron 3 Super will likely lower your inference cost. In most other cases, Qwen 3.5 is the stronger default. NVIDIA's technical blog covers the full Mamba-2 and MTP architecture design for teams that want to go deeper before committing.

Before committing to either, benchmark both on a sample of your actual workload. Measure cost-per-successful-outcome — not tokens-per-second. That number will tell you which model actually fits your pipeline.

Tags:
AI News
AI Systems & Technology Editor. I started writing code when I was 14 and never fully stopped, even after I began writing about it. I've been dedicated to AI research since 2015, and earned my PhD in Computer Science with a thesis on Optimization and Stability in Non-Convex Learning Systems. I've read more technical papers than you can imagine, played with hundreds of tools, and currently run a large local setup where I have fun deploying and testing models.
