BestAIFor.com

GLM-5 Benchmarks: What the Open-Source 744B Model Scores on SWE-bench and BrowseComp

Matthieu Morel
March 25, 2026 · 9 min read

TL;DR: Z.ai’s GLM-5 is a 744B parameter open-weights model (40B active via MoE) that scores 77.8% on SWE-bench Verified — the highest of any open-weights model — and 62.0 on BrowseComp, nearly doubling Claude Opus 4.5’s 37.0. The benchmark positioning is strong. The deployment math is more complicated.

Key Takeaways

  • GLM-5 scores 77.8% on SWE-bench Verified — the highest open-weights result to date on software engineering task completion.
  • BrowseComp score of 62.0 vs Claude Opus 4.5’s 37.0 — a 25-point gap on multi-step web research, which is harder to attribute to benchmark gaming.
  • 744B total parameters, 40B active via sparse MoE — inference cost is closer to a 40B dense model, but hosting requires ~1.5TB VRAM.
  • AIME 2026 I: 92.7 vs Claude Opus 4.5’s 93.3 — essentially tied on frontier math reasoning.
  • First open-weights model to exceed 50 on the Artificial Analysis Intelligence Index v4.0.
  • Failure modes: long-context degradation patterns uncharacterized at production scale; expert utilization distribution not publicly profiled.


Z.ai released GLM-5 in March 2026 with benchmark results that shift the open vs. closed-source comparison in a concrete direction. On SWE-bench Verified, it scores 77.8% — the highest number posted by any open-weights model on software engineering task completion. On BrowseComp, a multi-step web research benchmark, it scores 62.0. Claude Opus 4.5 scores 37.0 on the same benchmark.

Those numbers need context before they mean anything actionable. GLM-5 is a Mixture-of-Experts model. Total parameters: 744B. Active parameters per forward pass: approximately 40B. Running GLM-5 requires infrastructure to host 744B parameters, but the inference compute per step is closer to a 40B dense model. That distinction matters for anyone evaluating deployment costs.

This post works through the benchmark results, the architecture choices behind them, and the parts of the evaluation story that the launch didn’t lead with.

Architecture: 744B Parameters, 40B Active

GLM-5 uses a sparse Mixture-of-Experts architecture. Of the 744B total parameters, only approximately 40B are activated per token during inference. The model was pretrained on 28.5T tokens — roughly double Llama 3 70B’s 15T token training set.

The MoE design has a direct engineering consequence: you need the VRAM and storage to host all 744B parameters (approximately 1.5TB in bfloat16), but each forward pass only touches 40B of them. At full precision that exceeds the 640GB a single 8x H100 node provides, so bfloat16 hosting means sharding across multiple nodes; a ~4-bit quantized build fits on one 8x H100 node with careful sharding. Either way, this is not a single-GPU deployment.
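The back-of-envelope VRAM math can be sketched directly. The figures below assume 80GB H100s and count weights only, ignoring KV cache and activation memory, which add real overhead on top:

```python
import math

def vram_needed_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB; 1B params at 1 byte/param is ~1 GB.
    Ignores KV cache and activation memory."""
    return total_params_b * bytes_per_param

def gpus_required(total_params_b: float, bytes_per_param: float,
                  gpu_vram_gb: float = 80.0) -> int:
    """Minimum GPU count to hold the weights alone, sharded evenly."""
    return math.ceil(vram_needed_gb(total_params_b, bytes_per_param) / gpu_vram_gb)

# GLM-5 at bfloat16 (2 bytes/param): ~1488 GB of weights
print(vram_needed_gb(744, 2.0))   # 1488.0
print(gpus_required(744, 2.0))    # 19 H100s -> three 8-GPU nodes in practice
# ~4-bit quantization (~0.5 bytes/param): ~372 GB, one 8x H100 node
print(gpus_required(744, 0.5))    # 5
```

The same arithmetic explains the MoE selling point: compute per token scales with the 40B active parameters, but memory scales with all 744B.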

Z.ai reports pretraining stability across the full 28.5T token run. What “stability” means operationally — whether they observed loss spikes, gradient norm excursions, or checkpoint failures — is not detailed in the public release documentation. That is a gap in the transparency record worth noting.

MoE routing at 744B/40B scale introduces load imbalance risk. If certain experts are consistently over-selected during fine-tuning or inference, throughput degrades and the active-parameter advantage erodes. No public characterization of GLM-5’s expert utilization distribution has appeared yet.
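No public routing trace for GLM-5 exists yet, but this is the kind of profiling that would answer the question. The sketch below assumes a made-up trace format (a list of per-token expert selections); it is an illustration of the analysis, not Z.ai's instrumentation:

```python
import math
from collections import Counter

def expert_load_stats(routing_trace, num_experts):
    """Summarize routing balance from per-token expert selections.
    Returns (normalized entropy: 1.0 = perfectly uniform,
             max-to-mean load ratio: 1.0 = balanced, >>1 = hot experts)."""
    counts = Counter(e for token_experts in routing_trace for e in token_experts)
    total = sum(counts.values())
    probs = [counts.get(e, 0) / total for e in range(num_experts)]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_ratio = max(probs) * num_experts   # max load / mean load
    return entropy / math.log(num_experts), max_ratio

# Toy trace: 4 experts, top-2 routing, expert 0 over-selected on every token
trace = [[0, 1], [0, 2], [0, 3], [0, 1]]
norm_entropy, hot = expert_load_stats(trace, num_experts=4)
# -> normalized entropy 0.875, hottest expert at 2.0x the mean load
```

A hot expert carrying 2x the mean load halves effective throughput on the GPUs hosting it, which is exactly the erosion of the active-parameter advantage described above.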

SWE-bench Verified at 77.8%

SWE-bench Verified tests an agent’s ability to resolve real GitHub issues from open-source repositories. The benchmark has 500 verified human-evaluated instances. A score of 77.8% means the model resolved 389 of 500 issues, as evaluated by the benchmark harness.

Context on where that number sits: GPT-4o scored 48.9% on SWE-bench Verified when the benchmark launched. The previous open-weights high-water mark was in the high 50s to low 60s. GLM-5’s 77.8% is the highest single-model open-weights result on record for this benchmark.

The benchmark caveat that always applies to SWE-bench: performance varies significantly by repository. Models score higher on popular repos (Django, Flask, scikit-learn) with extensive training representation and lower on niche or newer codebases. A 77.8% aggregate masks per-repo variance that matters for production deployment decisions. Test GLM-5 on your specific codebase before treating the aggregate number as a signal.
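A per-repo breakdown is easy to compute from an evaluation run. SWE-bench instance IDs encode the repository (e.g. `django__django-12345`); the sketch below assumes you have extracted a list of (instance_id, resolved) pairs from your harness's report:

```python
from collections import defaultdict

def per_repo_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Group SWE-bench results by repo prefix and compute resolve rates.
    Instance IDs look like 'django__django-12345'."""
    buckets: dict[str, list[bool]] = defaultdict(list)
    for instance_id, resolved in results:
        repo = instance_id.rsplit("-", 1)[0]   # strip the trailing issue number
        buckets[repo].append(resolved)
    return {repo: sum(vals) / len(vals) for repo, vals in buckets.items()}

results = [("django__django-101", True), ("django__django-102", True),
           ("sympy__sympy-201", False), ("sympy__sympy-202", True)]
print(per_repo_rates(results))  # {'django__django': 1.0, 'sympy__sympy': 0.5}
```

If the per-repo spread is wide, the aggregate score tells you little about your own codebase.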

BrowseComp: The More Informative Result

BrowseComp is a multi-step web research benchmark developed by OpenAI. It requires a model to answer questions by navigating multiple web sources and synthesizing information across them — closer to what agentic retrieval tasks actually involve than single-hop QA.

GLM-5 scores 62.0. Claude Opus 4.5 scores 37.0. That 25-point gap is the most striking result in the GLM-5 release for anyone building retrieval or research agents.

What makes this harder to dismiss as benchmark gaming: BrowseComp questions are designed to resist lookup. They require multi-hop inference across sources that individually don’t contain the answer. A 62.0 vs. 37.0 gap at this benchmark design suggests a structural difference in multi-step reasoning, not just marginal improvement.

The question the BrowseComp number doesn’t answer: how does GLM-5 perform when retrieval context is synthetic or the sources are adversarial? Production web research pipelines encounter both. The benchmark score is on curated evaluation tasks. Real-world BrowseComp-style performance under distribution shift is uncharacterized.

Intelligence Index and Math Reasoning

The Artificial Analysis Intelligence Index v4.0 aggregates performance across coding, math, reasoning, and instruction following into a single composite score. GLM-5 is the first open-weights model to exceed 50 on this index, up 8 points from GLM-4.7.

On AIME 2026 I (math competition problems), GLM-5 scores 92.7. Claude Opus 4.5 scores 93.3. That is a 0.6-point difference on a benchmark where the previous open-source frontier was in the 60s. Frontier math reasoning parity between an open-weights model and a top closed model represents a different competitive position than the gap that existed 18 months ago.

One number not in the launch material: GLM-5’s performance on instruction following at extended context. The model supports a 128K context window. How reliably it retrieves and reasons over content distributed across that window — as opposed to the first and last few thousand tokens — has not been characterized in detail by third parties.
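A third-party check does not have to wait for a formal report; a minimal needle-in-a-haystack probe takes a few lines. In the sketch below, `generate` is a placeholder for whatever inference client you use, and the words-per-token ratio is a rough assumption:

```python
def build_probe(context_tokens: int, needle: str, depth: float) -> str:
    """Bury a unique fact at a relative depth (0.0 = start, 1.0 = end)
    inside filler text, then ask the model to retrieve it."""
    words = int(context_tokens * 0.75)      # rough words-per-token estimate
    pos = int(words * depth)
    body = "lorem " * pos + needle + " " + "lorem " * (words - pos)
    return body + "\nQuestion: what is the secret code mentioned above?"

def sweep(generate, needle="The secret code is 4471.", window=128_000):
    """Probe several depths; a reliable model answers at every one.
    `generate` is a placeholder: prompt string in, answer string out."""
    results = {}
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        answer = generate(build_probe(window, needle, depth))
        results[depth] = "4471" in answer
    return results
```

Failures clustered at middle depths (the "lost in the middle" pattern) would be the warning sign for retrieval-heavy applications.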

Inference Cost Reality Check

Open-source does not mean cheap to run at scale. GLM-5 at full precision (bfloat16) requires approximately 1.5TB of GPU VRAM, sharded across multiple 8-GPU nodes. A single p5.48xlarge (8x H100 SXM, 640GB VRAM) runs approximately $98/hour on AWS on-demand and holds only a quantized build; at that rate, intermittent evaluation is tractable. A production API serving thousands of requests per day requires the math to work at your actual volume.
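The volume math is simple enough to sketch. The throughput figures below are hypothetical placeholders, not measured GLM-5 numbers; plug in your own benchmarked tokens per second:

```python
def cost_per_million_tokens(hourly_rate_usd: float,
                            tokens_per_second: float) -> float:
    """Self-hosted cost per 1M generated tokens at a given
    aggregate cluster throughput."""
    return hourly_rate_usd / (tokens_per_second * 3600) * 1_000_000

# Hypothetical throughputs for a $98/hr node -- benchmark your own deployment:
print(round(cost_per_million_tokens(98.0, 500), 2))    # 54.44 $/1M tok at 500 tok/s
print(round(cost_per_million_tokens(98.0, 2000), 2))   # 13.61 $/1M tok batched at 2000 tok/s
```

Batched serving throughput dominates the result, which is why the comparison against closed API pricing only makes sense at your actual request volume and batch sizes.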

Quantized GGUF versions (Q4_K_M) reduce the memory footprint substantially, with roughly 4–6% quality degradation on most benchmarks. For local inference research on a multi-GPU workstation, quantized GLM-5 is viable. This works well for most evaluation scenarios, though production deployments at scale need a careful cost-per-token comparison against closed API pricing before committing.

| Model | SWE-bench Verified | BrowseComp | AIME 2026 I | Deployment type |
|---|---|---|---|---|
| GLM-5 (744B MoE) | 77.8% | 62.0 | 92.7 | Self-hosted, multi-GPU |
| Claude Opus 4.5 | ~75% range | 37.0 | 93.3 | API (closed) |
| GPT-4o | 48.9% | n/a | n/a | API (closed) |
| Previous open-weights SOTA | ~60s | ~20s | ~60s | Self-hosted |

Is Your Infrastructure Ready for GLM-5?

  • ☐ You have multi-node GPU capacity for full-precision hosting (~1.5TB VRAM in bfloat16), or one 8x H100 node (640GB) for a quantized build
  • ☐ Your use case is SWE-bench-style coding or multi-step research — not general chat
  • ☐ You’ve verified your specific repository or domain is represented in the test distribution
  • ☐ You have a plan for quantized evaluation before committing to full-precision deployment
  • ☐ Long-context reliability is not a hard requirement for your application
  • ☐ You’ve read the current license terms on the HuggingFace repository before shipping commercially
  • ☐ You have a cost-per-token comparison against closed API pricing at your expected volume

When You Should NOT Use GLM-5

If your infrastructure cannot handle a multi-GPU deployment, don’t start with GLM-5 for evaluation. Trying to run a 744B MoE model on insufficient hardware produces degraded results that don’t reflect the benchmark numbers — and leads to inaccurate capability assessments that cost time and budget.

It’s also not the right choice if long-context reliability has to be characterized before deployment. The 128K context window is supported, but GLM-5’s behavior across the full window hasn’t been tested publicly beyond first-party evaluations. For applications where consistent retrieval across 100K+ token contexts is critical, wait for third-party long-context evaluations before deploying.

If you need a well-documented production model with characterized failure modes, the closed-source options remain better documented for enterprise use. GLM-5’s SWE-bench score is the highest open-weights result on record, but “highest open-weights score” and “best choice for your production coding agent” are not the same claim.

FAQ

Is GLM-5 fully open-source?

GLM-5 is open weights — the model weights are publicly available on HuggingFace. The training code and full data composition are not published. “Open weights” and “open source” are different designations; GLM-5 is the former. Verify licensing terms before commercial deployment.

Can GLM-5 run on a single machine?

At full precision (bfloat16), GLM-5 requires approximately 1.5TB of GPU VRAM, beyond any single consumer or prosumer setup. With Q4 quantization the weights still occupy roughly 400GB, so it can run on a single high-end server with that much VRAM spread across multiple GPUs. Single-machine deployment requires server hardware, not a workstation.

How does GLM-5’s BrowseComp score compare to GPT-5.x models?

Z.ai’s release compares GLM-5 against Claude Opus 4.5 (37.0). OpenAI has not published BrowseComp scores for GPT-5.4 as of March 2026, so a direct three-way comparison is not possible from available data.

What license covers GLM-5 for commercial use?

Z.ai has not published comprehensive commercial license terms for all use cases. Before deploying GLM-5 in a commercial product at scale, check the current license on the official HuggingFace repository for GLM-5 — terms may have been updated since release.

Conclusion: Next Steps

GLM-5’s benchmark results are the clearest evidence to date that open-weights models can reach closed-model performance on structured coding and multi-step research tasks. The SWE-bench Verified score of 77.8% and BrowseComp score of 62.0 are not incremental improvements over prior open-source baselines.

What remains uncharacterized: long-context reliability under distribution shift, expert utilization patterns under production load, and per-repository performance on codebases outside the SWE-bench training distribution. Those gaps will take months of third-party evaluation to fill in.

If you’re evaluating GLM-5, start with a quantized version on the tasks closest to your actual use case. Run the SWE-bench evaluation on your own repository — not the aggregate — before making deployment decisions. And verify the current license terms before shipping anything commercial. Test the long-context retrieval path at your target window size before deploying anything that depends on it.

> AI Systems & Technology Editor. I started writing code when I was 14 and never fully stopped, even after I began writing about it. I have been dedicated to AI research since 2015 and earned my PhD in Computer Science with a thesis on Optimization and Stability in Non-Convex Learning Systems. I've read more technical papers than you can imagine, played with hundreds of tools, and currently have a large local setup where I have fun deploying and testing models.