ARC-AGI-3: Every Frontier Model Scores Below 1% While Untrained Humans Hit 100%
The ARC Prize Foundation's new interactive, agentic benchmark exposes a capability gap in goal-directed exploration. This guide breaks down the scores, what's failing, and how to use the results.

TL;DR: The ARC Prize Foundation released ARC-AGI-3 on March 26, 2026 — an interactive, fully agentic benchmark for general AI reasoning. Every frontier model scored below 1%: Gemini 3.1 Pro Preview reached 0.37%, GPT-5.4 reached 0.26%, Claude Opus 4.6 reached 0.25%, and Grok-4.20 scored 0.00%. Humans with no domain training score 100%. The gap is structural, pointing at a specific capability — goal-directed exploration under novel task structures — that current scaling hasn’t addressed.
Key Takeaways
The gap between humans and frontier models on ARC-AGI-3 is 99.63 percentage points. That’s the result of running every major frontier model against the ARC Prize Foundation’s March 26, 2026 release. Gemini 3.1 Pro Preview scored highest at 0.37%. Grok-4.20 scored 0.00%. GPT-5.4 and Claude Opus 4.6 both landed in the 0.25–0.26% range. Untrained humans score 100%.
This is the first iteration of ARC-AGI designed to be fully interactive and agentic. Prior versions used static input-output tasks. ARC-AGI-3 presents tasks that require models to explore an environment, form and test hypotheses, and verify conclusions across multiple sequential steps. No single-shot reasoning. No pattern matching against a training distribution. The model has to figure out what the task even is before it can begin solving it.
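To make the multi-turn structure concrete, here is a toy sketch of what "figure out what the task even is" looks like as code. This is not the real ARC-AGI-3 API (which isn't shown here); it's an assumed, minimal interactive environment where the rule is hidden and the agent only learns it by acting and observing feedback.

```python
from dataclasses import dataclass

@dataclass
class ToyGridTask:
    """Stand-in for an interactive task: the rule is hidden, and the
    agent only learns it by acting and observing feedback."""
    hidden_target: int
    steps: int = 0

    def act(self, guess: int) -> str:
        self.steps += 1
        if guess == self.hidden_target:
            return "solved"
        # Directional feedback forces the agent to probe and update.
        return "higher" if guess < self.hidden_target else "lower"

def run_agent(task: ToyGridTask, lo: int = 0, hi: int = 100,
              max_turns: int = 20) -> bool:
    """Minimal observe-act-update loop: binary search over feedback."""
    for _ in range(max_turns):
        guess = (lo + hi) // 2
        feedback = task.act(guess)
        if feedback == "solved":
            return True
        if feedback == "higher":
            lo = guess + 1
        else:
            hi = guess - 1
    return False
```

The point is not the search strategy; it's that success requires reacting to feedback across turns, which a single-call inference endpoint structurally cannot do.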
The low scores are not a surprise to anyone who’s been tracking the benchmark’s design intent. They are, however, a clear signal about where current model capabilities stop — and where they don’t respond to more compute or more parameters.
ARC-AGI-2, released in 2025, was solved at a competitive level by frontier models within roughly 18 months. The benchmark used visual analogy tasks with fixed inputs — once a model learned the underlying pattern types, scaling helped close the gap. GPT-5.4 and Gemini 3.0 were both scoring above 80% on ARC-AGI-2 by late 2025.
ARC-AGI-3 makes saturation structurally harder. Tasks are procedurally generated, meaning the specific task instances a model encounters during evaluation have never existed before. There’s no static test set to overfit on. The benchmark is interactive: each task unfolds across multiple turns, where the model must observe consequences of its actions and update its approach. This is closer to how humans solve novel problems than any benchmark released to date at this scale.
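A toy illustration of why procedural generation blocks overfitting. The real generator is unpublished, so the rule below is a placeholder; the structural point is that every seed yields a fresh, deterministic instance, leaving no fixed test set to memorize.

```python
import random

def generate_task(seed: int) -> dict:
    """Toy procedural task generator: each seed yields a fresh task
    instance, so there is no static test set to overfit on."""
    rng = random.Random(seed)
    grid = [[rng.randint(0, 9) for _ in range(4)] for _ in range(4)]
    # Placeholder hidden rule; real generators compose many such rules.
    target = [[(cell + 1) % 10 for cell in row] for row in grid]
    return {"input": grid, "target": target}

# Each seed is reproducible, but new seeds mean new, never-seen tasks.
task = generate_task(7)
```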
The evaluation protocol requires models to operate as agents — not as single-call inference endpoints. That distinction eliminates the largest class of benchmark gaming: brute-forcing answers through parallel inference over a fixed task set. The $2M prize structure reinforces this — you can’t engineer a solution by training on the test set when the test set doesn’t exist until evaluation time.
To put 0.37% in context: across the full ARC-AGI-3 task set, Gemini 3.1 Pro Preview correctly completed approximately 3–4 tasks out of every 1,000 attempted. GPT-5.4 at 0.26% completed roughly 2–3. Claude Opus 4.6 at 0.25% is statistically indistinguishable from GPT-5.4. Grok-4.20 at 0.00% returned no correct completions in the evaluation window.
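The per-thousand figures follow directly from the percentages; a quick sanity check:

```python
def completions_per_thousand(score_pct: float) -> float:
    """Expected correct completions per 1,000 attempted tasks
    for a benchmark score given in percent."""
    return score_pct * 10  # score_pct / 100 * 1000

# 0.37% -> 3.7 per 1,000 (roughly 3-4); 0.26% -> 2.6 (roughly 2-3).
gemini = completions_per_thousand(0.37)
gpt = completions_per_thousand(0.26)
```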
The performance distribution is essentially flat across models. A 0.12 percentage point spread separates the best result from the third-best. That compresses what looks like a meaningful gap into noise at this scale. None of these models is demonstrably better at this task class than the others — they’re all failing in the same way.
What’s failing: the exploration-verification loop. Models that score well on static benchmarks use strong pattern recognition to match inputs to outputs. ARC-AGI-3 tasks start with no recoverable pattern — the model has to construct the pattern through active probing. Most frontier models attempt single-shot reasoning on the first turn and fail to recover when that fails. The iterative recovery behavior humans apply automatically isn’t present in current architectures.
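The exploration-verification loop described above can be sketched as hypothesis elimination. This is a simplified model, not how any of these systems work internally: probe the environment, compare each hypothesis's prediction to the observation, and discard contradicted hypotheses rather than committing to the first guess.

```python
def solve_by_elimination(observe, hypotheses, max_probes=10):
    """Minimal hypothesis-test-revise loop: keep only the hypotheses
    consistent with every observation so far.

    `observe(x)` returns the environment's true output for probe x;
    each hypothesis is a candidate function x -> output.
    """
    candidates = list(hypotheses)
    for probe in range(max_probes):
        truth = observe(probe)
        # Revise: drop any hypothesis contradicted by this observation.
        candidates = [h for h in candidates if h(probe) == truth]
        if len(candidates) == 1:
            return candidates[0]
    return None

# Hidden rule: double the input. The loop recovers it by probing.
rule = solve_by_elimination(
    observe=lambda x: 2 * x,
    hypotheses=[lambda x: x + 1, lambda x: 2 * x, lambda x: x * x],
)
```

Single-shot reasoning is the degenerate case of this loop with `max_probes=1`: if the first guess is wrong, there is no recovery.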
The standard expectation in LLM development is that more compute and more parameters improve benchmark scores. That held across MMLU, GSM8K, HumanEval, MATH, and most benchmarks that drove scaling law papers. It held for ARC-AGI-2 — GPT-4 class models progressed steadily as size increased.
ARC-AGI-3 breaks that pattern. The capability it tests — forming hypotheses about novel task structure, acting on them, and revising under feedback — doesn’t appear to be stored in weights the way factual recall or formal reasoning is. It requires something closer to runtime metacognition: monitoring whether your current approach is working and shifting strategies mid-task. Pre-training on more text or code doesn’t directly train this loop.
This is consistent with the finding that humans with no relevant domain knowledge score 100%. A person who’s never done visual analogy tasks will still approach a novel ARC-AGI-3 task by exploring, guessing, checking, and adjusting. That general-purpose exploration behavior doesn’t depend on training data about specific task types. Current frontier models don’t appear to have an equivalent default behavior.
| Benchmark | Task Type | Best Frontier Score | Human Baseline | Saturation Risk |
|---|---|---|---|---|
| ARC-AGI-3 | Interactive agentic, procedurally generated | 0.37% (Gemini 3.1 Pro Preview) | 100% | Low — no static test set |
| ARC-AGI-2 | Static visual analogy | ~85% (GPT-5.4, late 2025) | ~97% | Already near-saturated |
| SWE-bench Verified | Real GitHub issue resolution | ~72% (Claude Opus 4.6) | Not directly comparable | Moderate — grows with new issues |
| GPQA Diamond | Expert-level Q&A, static | ~89% (Gemini 3.1 Pro Preview) | ~65% (domain experts) | High — fixed question set |
| BrowseComp | Multi-step web research | ~51% (GPT-5.4) | ~78% | Moderate |
ARC-AGI-3 is the only benchmark in this table where the human-model gap runs in the wrong direction by this magnitude. On GPQA Diamond, frontier models exceed the domain expert human baseline. On ARC-AGI-3, the human baseline is roughly 270 times the best model score. These aren’t measuring the same thing, and conflating them distorts how you evaluate model selection.
High scores on static benchmarks no longer imply general reasoning competence. A model scoring 89% on GPQA Diamond has demonstrated strong pattern matching on a fixed question distribution — not the ability to explore novel task structures. ARC-AGI-3 makes that distinction concrete with numbers.
Agentic performance evaluation needs to be interactive. Running an LLM through a static benchmark in single-call mode doesn’t capture how the model performs in workflows where it takes sequential actions and recovers from errors. ARC-AGI-3’s interactive design is a closer proxy for production agentic behavior. This works well for most evaluation contexts, though teams building narrow task-specific agents may find the benchmark less predictive of their specific use case than a domain-matched evaluation.
The benchmark is public. The ARC Prize Foundation provides evaluation access and the full scoring protocol. If you’re building systems that require genuine novel problem-solving under uncertainty, running your model against ARC-AGI-3 gives you a number. That’s more useful than extrapolating from MMLU scores.
ARC-AGI-3 measures a specific capability: goal-directed exploration under novel, procedurally generated task structures. It does not measure performance on the tasks most production applications run. If you’re selecting a model for code generation, document summarization, structured data extraction, or instruction following, ARC-AGI-3 scores are not the relevant signal. A model scoring 0.25% on ARC-AGI-3 can still be the correct choice for your deployment.
Don’t use ARC-AGI-3 as a general-purpose model ranking. The 0.12 percentage point spread separating Gemini 3.1 Pro Preview, GPT-5.4, and Claude Opus 4.6 is not evidence of a meaningful performance difference for any practical deployment task. Use task-matched evaluations for task-specific selection.
Also avoid treating the 100% human baseline as evidence that humans are generally more capable than frontier models. Humans outperform current models on open-ended exploration tasks. Models outperform humans on GPQA Diamond, formal math, and most code generation benchmarks. The comparison is task-specific.
The ARC Prize Foundation hasn’t released per-model diagnostic breakdowns. A 0.00% score doesn’t necessarily indicate zero underlying capability: it means no completions were verified correct within the evaluation protocol. Whether this reflects a systematic capability gap or a configuration issue for that specific model is unclear from the public data.
Training against the benchmark is structurally hard because of procedural generation. Task instances generated during evaluation have never existed before, so you can’t build a training set from them. You can train on the task generation logic, but that requires reverse-engineering the procedural system, which the ARC Prize Foundation hasn’t fully published.
As a ceiling test for open-ended exploration capability, ARC-AGI-3 is worth running. As a proxy for your specific use case, likely not. It is best interpreted as a stress test for a specific capability cluster: useful for agentic pipeline design, less useful for selecting a model for narrow document workflows.
The full $2M prize goes to any model or system achieving human-level performance — matching the 100% human baseline — under the evaluation protocol. Partial prizes have been offered in prior ARC iterations for hitting specific score thresholds. Prize terms are published at arcprize.org.
The 2019 ARC Challenge used static, human-drawn visual analogy tasks. ARC-AGI-3 adds procedural task generation (no static test set) and interactive multi-turn evaluation. Both changes target the same failure mode: models that score high by learning the test distribution rather than developing general problem-solving behavior.
ARC-AGI-3 establishes a clear empirical floor for what “general AI reasoning” currently means: the best frontier models solve roughly 3–4 tasks per 1,000 on a benchmark that untrained humans complete at 100%. That gap is real, large, and not closing through current scaling approaches.
The benchmark is public. Before concluding that your model or agent framework handles novel problem-solving, run it against the ARC-AGI-3 task set and get a number. The specific edge case to test before relying on any agentic system for high-stakes tasks: strategy recovery. What happens when the model’s first approach fails and it needs to shift to a different hypothesis mid-task? That’s where ARC-AGI-3 failures concentrate, and it’s the behavior worth stress-testing before production deployment.
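A minimal harness for that strategy-recovery edge case might look like the sketch below. The interface is hypothetical: your agent is modeled as a function from the interaction history to its next action, the environment forces the first approach to fail, and the harness checks whether the agent shifts strategy rather than repeating itself.

```python
def stress_test_recovery(agent_step, max_turns=5):
    """Force the agent's first approach to fail, then check whether it
    switches to a different strategy instead of repeating itself.

    `agent_step(history)` maps the list of (action, feedback) pairs
    seen so far to the next action string.
    """
    history = []
    for turn in range(max_turns):
        action = agent_step(history)
        if turn == 0:
            feedback = "failed"          # force the first approach to fail
        elif action != history[0][0]:
            return True                  # agent shifted to a new hypothesis
        else:
            feedback = "failed"          # agent is stuck repeating itself
        history.append((action, feedback))
    return False

# An agent that cycles strategies recovers; one that repeats does not.
adaptive = lambda h: ["strategy-A", "strategy-B"][len(h) % 2]
stubborn = lambda h: "strategy-A"
```

The same shape scales up: replace the action strings with real tool calls and the forced failure with an injected error, and you have a smoke test for mid-task recovery before production deployment.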