BestAIFor.com

ARC-AGI-3 Benchmark Results: Every Frontier Model Scores Below 1%

Matthieu Morel
April 3, 2026 · 10 min read

TL;DR: The ARC Prize Foundation released ARC-AGI-3 on March 26, 2026 — an interactive, fully agentic benchmark for general AI reasoning. Every frontier model scored below 1%: Gemini 3.1 Pro Preview reached 0.37%, GPT-5.4 reached 0.26%, Claude Opus 4.6 reached 0.25%, and Grok-4.20 scored 0.00%. Humans with no domain training score 100%. The gap is structural, pointing at a specific capability — goal-directed exploration under novel task structures — that current scaling hasn’t addressed.

Key Takeaways

  • ARC-AGI-3 requires active, multi-step exploration — models must form hypotheses, act on environments, and verify iteratively
  • Best frontier model score: 0.37% (Gemini 3.1 Pro Preview); GPT-5.4: 0.26%; Claude Opus 4.6: 0.25%; Grok-4.20: 0.00%
  • Human baseline with no domain training: 100% — the gap is not about knowledge retrieval
  • ARC-AGI-2 was saturated within 18 months; ARC-AGI-3 uses procedural generation to resist saturation
  • $2M prize remains unclaimed; the benchmark is publicly accessible
  • These scores don’t indicate the models are broken — they indicate a specific gap in open-ended exploration that doesn’t respond to scale alone


The gap between humans and frontier models on ARC-AGI-3 is 99.63 percentage points. That’s the result of running every major frontier model against the ARC Prize Foundation’s March 26, 2026 release. Gemini 3.1 Pro Preview scored highest at 0.37%. Grok-4.20 scored 0.00%. GPT-5.4 and Claude Opus 4.6 both landed in the 0.25–0.26% range. Untrained humans score 100%.

This is the first iteration of ARC-AGI designed to be fully interactive and agentic. Prior versions used static input-output tasks. ARC-AGI-3 presents tasks that require models to explore an environment, form and test hypotheses, and verify conclusions across multiple sequential steps. No single-shot reasoning. No pattern matching against a training distribution. The model has to figure out what the task even is before it can begin solving it.

The low scores are not a surprise to anyone who’s been tracking the benchmark’s design intent. They are, however, a clear signal about where current model capabilities stop — and where they don’t respond to more compute or more parameters.

What ARC-AGI-3 Tests — And Why Prior Benchmarks Got Saturated

ARC-AGI-2, released in 2025, was solved at a competitive level by frontier models within roughly 18 months. The benchmark used visual analogy tasks with fixed inputs — once a model learned the underlying pattern types, scaling helped close the gap. GPT-5.4 and Gemini 3.0 were both scoring above 80% on ARC-AGI-2 by late 2025.

ARC-AGI-3 makes saturation structurally harder. Tasks are procedurally generated, meaning the specific task instances a model encounters during evaluation have never existed before. There’s no static test set to overfit on. The benchmark is interactive: each task unfolds across multiple turns, where the model must observe consequences of its actions and update its approach. This is closer to how humans solve novel problems than any benchmark released to date at this scale.

The evaluation protocol requires models to operate as agents — not as single-call inference endpoints. That distinction eliminates the largest class of benchmark gaming: brute-forcing answers through parallel inference over a fixed task set. The $2M prize structure reinforces this — you can’t engineer a solution by training on the test set when the test set doesn’t exist until evaluation time.
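To make the agent-versus-single-call distinction concrete, here is a minimal sketch of an interactive, procedurally generated evaluation loop. Everything in it is a toy stand-in: the class and function names are invented for illustration and do not come from the ARC Prize Foundation's actual harness.

```python
import random

class ProceduralTask:
    """Toy stand-in for a procedurally generated task: the agent must
    discover a hidden value through probing, not a single inference call."""
    def __init__(self, seed):
        # A fresh instance per seed; no static test set exists to memorize.
        self.target = random.Random(seed).randint(0, 99)

    def step(self, action):
        """Return feedback the agent can act on; no one call reveals the answer."""
        if action == self.target:
            return "correct"
        return "lower" if action > self.target else "higher"

def make_agent():
    """An agent that revises its hypothesis range using observed feedback."""
    state = {"lo": 0, "hi": 99, "last": None}
    def act(feedback):
        if feedback == "higher":
            state["lo"] = state["last"] + 1
        elif feedback == "lower":
            state["hi"] = state["last"] - 1
        guess = (state["lo"] + state["hi"]) // 2
        state["last"] = guess
        return guess
    return act

def evaluate(agent_factory, n_tasks=100, max_turns=20):
    """Agentic protocol: sequential actions, observed consequences, fresh tasks."""
    solved = 0
    for seed in range(n_tasks):
        task, agent, feedback = ProceduralTask(seed), agent_factory(), None
        for _ in range(max_turns):
            feedback = task.step(agent(feedback))
            if feedback == "correct":
                solved += 1
                break
    return solved / n_tasks
```

An agent that updates its hypothesis from feedback solves every seeded instance here; an agent that commits to one guess and never revises does not, which is the distinction the interactive protocol is built around.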

Score Breakdown: What 0.37% Means in Practice

To put 0.37% in context: across the full ARC-AGI-3 task set, Gemini 3.1 Pro Preview correctly completed approximately 3–4 tasks out of every 1,000 attempted. GPT-5.4 at 0.26% completed roughly 2–3. Claude Opus 4.6 at 0.25% is statistically indistinguishable from GPT-5.4. Grok-4.20 at 0.00% returned no correct completions in the evaluation window.

The performance distribution is essentially flat across models. A 0.12 percentage point spread separates the best result from the third-best. That compresses what looks like a meaningful gap into noise at this scale. None of these models is demonstrably better at this task class than the others — they’re all failing in the same way.
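The per-1,000 figures and the spread follow directly from the published percentages; a quick arithmetic check, using only the scores reported in this article:

```python
# Scores as reported for ARC-AGI-3 (percent correct).
scores = {
    "Gemini 3.1 Pro Preview": 0.37,
    "GPT-5.4": 0.26,
    "Claude Opus 4.6": 0.25,
    "Grok-4.20": 0.00,
}

# Expected correct completions per 1,000 attempted tasks.
per_thousand = {model: pct / 100 * 1000 for model, pct in scores.items()}

# Spread between the best and third-best result, in percentage points.
ranked = sorted(scores.values(), reverse=True)
spread = ranked[0] - ranked[2]
```

0.37% works out to 3.7 expected completions per 1,000 attempts, and the best-to-third-best spread is 0.12 percentage points, which is why the differences between models read as noise at this scale.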

What’s failing: the exploration-verification loop. Models that score well on static benchmarks use strong pattern recognition to match inputs to outputs. ARC-AGI-3 tasks start with no recoverable pattern — the model has to construct the pattern through active probing. Most frontier models attempt single-shot reasoning on the first turn and fail to recover when that fails. The iterative recovery behavior humans apply automatically isn’t present in current architectures.
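A runnable toy makes the exploration-verification loop concrete. The hidden rule below is one of a few candidates, and the solver has to probe the environment and eliminate hypotheses that the observations contradict, rather than answer in one shot. All names are illustrative, not part of any real evaluation harness.

```python
# Candidate hypotheses about the environment's hidden rule.
CANDIDATE_RULES = {
    "double":   lambda x: 2 * x,
    "square":   lambda x: x * x,
    "plus_ten": lambda x: x + 10,
}

def solve_by_exploration(query, probes=(3, 5, 7)):
    """query(x) is the environment. Keep only hypotheses consistent with
    every observation so far; stop once one hypothesis survives."""
    live = dict(CANDIDATE_RULES)                  # all hypotheses start live
    for x in probes:                              # act on the environment
        y = query(x)                              # observe the consequence
        live = {n: f for n, f in live.items() if f(x) == y}   # verify
        if len(live) == 1:                        # uniquely confirmed
            return next(iter(live))
    return None                                   # probes didn't disambiguate

def single_shot(query):
    """The failure mode: commit to the first hypothesis and never revise."""
    return next(iter(CANDIDATE_RULES))            # always answers "double"
```

The single-shot solver gets lucky only when the first hypothesis happens to be right; the exploring solver constructs the pattern from feedback, which is the recovery behavior the benchmark isolates.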

Why Current Scaling Doesn’t Close This Gap

The standard expectation in LLM development is that more compute and more parameters improve benchmark scores. That held across MMLU, GSM8K, HumanEval, MATH, and most benchmarks that drove scaling law papers. It held for ARC-AGI-2 — GPT-4 class models progressed steadily as size increased.

ARC-AGI-3 breaks that pattern. The capability it tests — forming hypotheses about novel task structure, acting on them, and revising under feedback — doesn’t appear to be stored in weights the way factual recall or formal reasoning is. It requires something closer to runtime metacognition: monitoring whether your current approach is working and shifting strategies mid-task. Pre-training on more text or code doesn’t directly train this loop.

This is consistent with the finding that humans with no relevant domain knowledge score 100%. A person who’s never done visual analogy tasks will still approach a novel ARC-AGI-3 task by exploring, guessing, checking, and adjusting. That general-purpose exploration behavior doesn’t depend on training data about specific task types. Current frontier models don’t appear to have an equivalent default behavior.

ARC-AGI-3 vs. Active Evaluation Benchmarks

| Benchmark | Task Type | Best Frontier Score | Human Baseline | Saturation Risk |
| --- | --- | --- | --- | --- |
| ARC-AGI-3 | Interactive agentic, procedurally generated | 0.37% (Gemini 3.1 Pro Preview) | 100% | Low — no static test set |
| ARC-AGI-2 | Static visual analogy | ~85% (GPT-5.4, late 2025) | ~97% | Already near-saturated |
| SWE-bench Verified | Real GitHub issue resolution | ~72% (Claude Opus 4.6) | Not directly comparable | Moderate — grows with new issues |
| GPQA Diamond | Expert-level Q&A, static | ~89% (Gemini 3.1 Pro Preview) | ~65% (domain experts) | High — fixed question set |
| BrowseComp | Multi-step web research | ~51% (GPT-5.4) | ~78% | Moderate |

ARC-AGI-3 is the only benchmark in this table where the human-model gap runs in the wrong direction by this magnitude. On GPQA Diamond, frontier models exceed the domain expert human baseline. On ARC-AGI-3, the human baseline is roughly 270 times the best model score. These benchmarks aren’t measuring the same thing, and conflating them distorts model selection decisions.

What These Results Should Change About Benchmark Interpretation

High scores on static benchmarks no longer imply general reasoning competence. A model scoring 89% on GPQA Diamond has demonstrated strong pattern matching on a fixed question distribution — not the ability to explore novel task structures. ARC-AGI-3 makes that distinction concrete with numbers.

Agentic performance evaluation needs to be interactive. Running an LLM through a static benchmark in single-call mode doesn’t capture how the model performs in workflows where it takes sequential actions and recovers from errors. ARC-AGI-3’s interactive design is a closer proxy for production agentic behavior. This works well for most evaluation contexts, though teams building narrow task-specific agents may find the benchmark less predictive of their specific use case than a domain-matched evaluation.

The benchmark is public. The ARC Prize Foundation provides evaluation access and the full scoring protocol. If you’re building systems that require genuine novel problem-solving under uncertainty, running your model against ARC-AGI-3 gives you a number. That’s more useful than extrapolating from MMLU scores.

When You Should NOT Use ARC-AGI-3 Scores for Model Selection

ARC-AGI-3 measures a specific capability: goal-directed exploration under novel, procedurally generated task structures. It does not measure performance on the tasks most production applications run. If you’re selecting a model for code generation, document summarization, structured data extraction, or instruction following, ARC-AGI-3 scores are not the relevant signal. A model scoring 0.25% on ARC-AGI-3 can still be the correct choice for your deployment.

Don’t use ARC-AGI-3 as a general-purpose model ranking. The 0.12 percentage point spread between GPT-5.4 and Gemini 3.1 Pro Preview is not evidence of a meaningful performance difference for any practical deployment task. Use task-matched evaluations for task-specific selection.

Also avoid treating the 100% human baseline as evidence that humans are generally more capable than frontier models. Humans outperform current models on open-ended exploration tasks. Models outperform humans on GPQA Diamond, formal math, and most code generation benchmarks. The comparison is task-specific.

Decision Checklist: Should ARC-AGI-3 Change How You Evaluate Models?

  • ☐ You are building agentic systems where the model needs to explore, act, and revise behavior mid-task
  • ☐ You rely on static benchmarks (MMLU, GPQA, HumanEval) as proxies for general reasoning — consider adding interactive evaluation
  • ☐ You track benchmark scores to inform base model selection for agent workflows
  • ☐ You want a benchmark that can’t be saturated by training on the test set
  • ☐ You are evaluating whether your model can handle genuinely novel task structures without prior exposure

FAQ

Why did Grok-4.20 score 0.00%?

The ARC Prize Foundation hasn’t released per-model diagnostic breakdowns. A 0.00% score doesn’t mean the model failed every task — it means no completions were verified correct within the evaluation protocol. Whether this reflects a systematic capability gap or a configuration issue for that specific model is unclear from the public data.

Can you fine-tune a model specifically for ARC-AGI-3?

Procedural generation makes this structurally hard. Task instances generated during evaluation have never existed before, so you can’t build a training set from the benchmark. You can train on the task generation logic, but that requires reverse-engineering the procedural system, which the ARC Prize Foundation hasn’t fully published.

Is ARC-AGI-3 a valid test for production AI systems?

As a ceiling test for open-ended exploration capability, yes. As a proxy for your specific use case, likely not. ARC-AGI-3 is best interpreted as a stress test for a specific capability cluster — useful for agentic pipeline design, less useful for selecting a model for narrow document workflows.

What’s the $2M prize structure?

The full $2M prize goes to any model or system achieving human-level performance — matching the 100% human baseline — under the evaluation protocol. Partial prizes have been offered in prior ARC iterations for hitting specific score thresholds. Prize terms are published at arcprize.org.

How does ARC-AGI-3 differ from the original ARC Challenge released in 2019?

The 2019 ARC Challenge used static, human-drawn visual analogy tasks. ARC-AGI-3 adds procedural task generation (no static test set) and interactive multi-turn evaluation. Both changes target the same failure mode: models that score high by learning the test distribution rather than developing general problem-solving behavior.

Conclusion: Next Steps

ARC-AGI-3 establishes a clear empirical floor for what “general AI reasoning” currently means: the best frontier models solve roughly 3–4 tasks per 1,000 on a benchmark that untrained humans complete at 100%. That gap is real, large, and not closing through current scaling approaches.

The benchmark is public. Before concluding that your model or agent framework handles novel problem-solving, run it against the ARC-AGI-3 task set and get a number. The specific edge case to test before relying on any agentic system for high-stakes tasks: strategy recovery. What happens when the model’s first approach fails and it needs to shift to a different hypothesis mid-task? That’s where ARC-AGI-3 failures concentrate, and it’s the behavior worth stress-testing before production deployment.
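One way to operationalize that stress test is to measure whether an agent actually changes course after its first approach fails. The sketch below assumes a string-valued "attempt" interface; `recovery_rate`, `agent_fn`, and the feedback convention are all invented for illustration, not a real agent API.

```python
def recovery_rate(agent_fn, tasks):
    """agent_fn(task, feedback) -> attempt. feedback is None on turn one,
    then the string "wrong". Measure how often turn two differs from turn one."""
    recovered = 0
    for task in tasks:
        first = agent_fn(task, None)
        second = agent_fn(task, "wrong")     # first approach has failed
        if second != first:                  # did the agent shift strategy?
            recovered += 1
    return recovered / len(tasks)

# A stubborn agent repeats itself; a revising agent changes course.
stubborn = lambda task, feedback: "plan_a"
revising = lambda task, feedback: "plan_a" if feedback is None else "plan_b"
```

An agent with a recovery rate near zero is exhibiting exactly the single-shot failure mode where ARC-AGI-3 losses concentrate, regardless of how it scores on static benchmarks.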

Matthieu Morel, AI Systems & Technology Editor
I started writing code when I was 14 and never fully stopped, even after I began writing about it. Since 2015 I've been dedicated to AI research, and I earned my PhD in Computer Science with a thesis on Optimization and Stability in Non-Convex Learning Systems. I've read more technical papers than you can imagine, played with hundreds of tools, and currently run a huge local setup where I have fun deploying and testing models.
