BestAIFor.com

New Benchmark Insight Shows High Token Usage by Claude Sonnet 4.6 in AI Intelligence Index

Matthieu Morel
February 20, 2026 · 11 min read

TL;DR: Claude Sonnet 4.6 scores 51 on the Artificial Analysis Intelligence Index, placing it second overall, tied with GPT-5.2 and two points below Opus 4.6. It generated 74M output tokens to get there: roughly 3x Sonnet 4.5 and 28% more than Opus 4.6. At $15 per million output tokens, that gap has direct cost implications that the ranking alone doesn't surface.

Key Takeaways

  • Claude Sonnet 4.6 scores 51 on the AAII, an 8-point gain over Sonnet 4.5 (43) and essentially tied with GPT-5.2
  • It used 74M output tokens during the Intelligence Index evaluation; Sonnet 4.5 used 25M; Opus 4.6 used 58M
  • HLE (Humanity's Last Exam) alone accounted for 47M of Sonnet 4.6's tokens, 64% of its total output
  • The verbosity translated to approximately $2,088 in evaluation costs versus $733 for Sonnet 4.5, on identical per-token pricing
  • Sonnet 4.6 leads Opus 4.6 on agentic sub-benchmarks, though the margin falls within the 95% confidence interval
  • Intelligence Index scores don't normalize for token consumption. Reading the token data alongside the score is essential for production decisions.

AI Model Benchmarking: What Claude Sonnet 4.6's Token Surge Reveals About the Intelligence Index

AI model benchmarking rarely tells you everything you need to know before a deployment decision. Claude Sonnet 4.6 is a useful case study. It scores 51 on the Artificial Analysis Intelligence Index (AAII), placing it second overall, two points behind Opus 4.6 and essentially tied with GPT-5.2. What the ranking doesn't surface by default: Sonnet 4.6 generated 74M output tokens to achieve that score. Sonnet 4.5 used 25M. At $15 per million output tokens, that difference compounds quickly in any production workload.

This post examines what the AAII data actually shows for Sonnet 4.6: what drives the intelligence gain, why the token count is unusually high, and how to read these results before committing to a model. The goal isn't to rank models. It's to give developers the right frame for interpreting benchmark output.

What Is the Artificial Analysis Intelligence Index?

The AAII is a composite benchmark published by Artificial Analysis. It aggregates model performance across multiple evaluation domains: Humanity's Last Exam (HLE), coding tasks, mathematical reasoning, and agentic assessments. The final score is normalized across these domains rather than pulled from any single dataset, which reduces the risk of models optimizing for one narrow evaluation and looking artificially strong.

Scores run from 0 to 100. As of February 2026, the top cluster is compressed: Claude Opus 4.6 at 53, Claude Sonnet 4.6 at 51, GPT-5.2 at approximately 51. Below that cluster, scores drop sharply. The median for non-reasoning models sits around 19. Adaptive reasoning modes have compressed the apparent intelligence gap between frontier models at different price points, though the compression looks more complete on the index than it does in practice once you account for token consumption.

One structural characteristic of the index: it does not normalize scores by token count. A model that burns 74M output tokens and one that burns 25M tokens receive identical credit if they produce the same score. That's a defensible design choice for pure capability measurement. It's a gap to fill manually when making cost-aware decisions.
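One way to fill that gap manually is to read each score alongside its evaluation token spend. The sketch below uses the figures from this article; the points-per-dollar heuristic is illustrative, not part of the AAII methodology, and it only restates numbers quoted here.

```python
# Cost-aware reading of two AAII scores (illustrative heuristic,
# not part of the AAII methodology). Both Sonnet models bill output
# at the same $15/M rate, per the article.
PRICE_PER_M_OUTPUT = 15.0  # USD per million output tokens

models = {
    "Claude Sonnet 4.6": {"score": 51, "output_tokens_m": 74},
    "Claude Sonnet 4.5": {"score": 43, "output_tokens_m": 25},
}

# Output-token cost of the evaluation run for each model
eval_output_cost = {
    name: m["output_tokens_m"] * PRICE_PER_M_OUTPUT for name, m in models.items()
}

for name, m in models.items():
    cost = eval_output_cost[name]
    print(f"{name}: score {m['score']}, ~${cost:,.0f} in output tokens, "
          f"{m['score'] / cost:.3f} points per eval dollar")
```

Note that the output-token figure alone (~$1,110 for Sonnet 4.6) sits below the reported ~$2,088 total evaluation cost, which also includes input tokens.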

How Claude Sonnet 4.6 Ranks and at What Cost

Sonnet 4.6 launched on February 17, 2026. Its 51-point AAII score represents an 8-point improvement over Sonnet 4.5 (43 points), the largest single-generation jump in the Sonnet line since the index launched.

The mechanism is adaptive reasoning: an extended thinking mode that allocates additional compute at inference time before producing a final answer. This is what drives the intelligence gain. It is also the primary driver of the elevated token output. The model thinks longer before it responds, and that thinking is billed as output tokens.

Pricing is $3/M input tokens and $15/M output tokens, identical to Sonnet 4.5. That pricing parity makes the token consumption gap the critical variable in any cost comparison. Two models at the same sticker price but different output volumes are not the same cost. This is the framing most benchmark summaries skip.

The Output Token Gap: 74M Versus 25M

Sonnet 4.6 generated 74M output tokens across the full AAII evaluation suite. HLE alone accounted for 47M of those tokens, 64% of total output. Extended thinking traces scale with task difficulty, and HLE is among the hardest available evaluations. That concentration in one sub-benchmark is not incidental: the model allocates more reasoning compute where the problems are harder.

The comparison figures matter. Sonnet 4.5 used 25M output tokens on the same evaluation. Opus 4.6 in Adaptive Reasoning mode used 58M. Sonnet 4.6 generated roughly 3x the token output of Sonnet 4.5 and 28% more than Opus 4.6, while scoring 2 points below Opus overall.

Output speed compounds the picture. Sonnet 4.6 generates at 57.2 tokens per second on Anthropic's API. The median for comparable reasoning models is 71.5 t/s. A model producing more tokens at lower speed takes substantially longer per response than its index position implies. For batch workloads, this is a throughput constraint. For interactive applications, it affects perceived latency directly.

This profile suits reasoning-heavy tasks where quality matters more than speed; it's a poor fit for real-time applications that need fast incremental output.
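The latency arithmetic is simple enough to sketch. The decode speeds below are the figures quoted above; the 4,000-token response length is a hypothetical chosen only to make the math concrete.

```python
# Back-of-envelope streaming latency: output tokens / decode speed.
# The 4,000-token response is a hypothetical; the speeds are the
# quoted figures (57.2 t/s for Sonnet 4.6, 71.5 t/s median).
def response_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate wall-clock time to stream a full response."""
    return output_tokens / tokens_per_second

slow = response_seconds(4_000, 57.2)  # slower, verbose reasoning model
fast = response_seconds(4_000, 71.5)  # median-speed reasoning model
print(f"{slow:.0f}s vs {fast:.0f}s for the same response length")
```

And since the verbose model typically emits more tokens per response, the real gap is wider than the speed ratio alone suggests.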

What the Token Gap Means for Production Costs

Artificial Analysis calculated the total evaluation cost for Sonnet 4.6 at approximately $2,088. For Sonnet 4.5, the equivalent run cost around $733. That's a 2.8x cost ratio on nominally identical per-token pricing.

That ratio won't replicate for all workloads. Benchmark task distributions are weighted toward the hardest available evaluations, where extended thinking generates the longest reasoning traces. Simpler production tasks will produce shorter traces and lower token counts per request.

The direction, however, is reliable: for reasoning-heavy tasks, expect output token counts substantially above Sonnet 4.5 and above what sticker pricing implies. Before committing to Sonnet 4.6 at volume, run it on a representative sample of your actual inputs. Measure output token counts per task. Calculate cost-per-correct-output at your expected request volume. That number, not the AAII ranking, should drive the budget decision.
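That measurement can be a few lines of bookkeeping on top of your existing evaluation harness. In this sketch, the grading and API call are left to you; the per-token prices are the article's $3/M input and $15/M output, and the sample data is hypothetical.

```python
# Cost-per-correct-output on a representative task sample (sketch).
# You supply the graded results; prices follow the article's
# $3/M input, $15/M output for the Sonnet line.
PRICE_IN, PRICE_OUT = 3.0, 15.0  # USD per million tokens

def cost_per_correct(results: list[dict]) -> float:
    """results: [{'input_tokens': int, 'output_tokens': int, 'correct': bool}, ...]"""
    total_cost = sum(
        r["input_tokens"] / 1e6 * PRICE_IN + r["output_tokens"] / 1e6 * PRICE_OUT
        for r in results
    )
    correct = sum(r["correct"] for r in results)
    if correct == 0:
        return float("inf")  # no correct outputs: cost per success is unbounded
    return total_cost / correct

# Hypothetical sample of three graded responses:
sample = [
    {"input_tokens": 1_200, "output_tokens": 6_500, "correct": True},
    {"input_tokens": 1_100, "output_tokens": 8_200, "correct": False},
    {"input_tokens": 1_300, "output_tokens": 5_900, "correct": True},
]
print(f"${cost_per_correct(sample):.4f} per correct output")
```

Run the same sample through both candidate models and compare the two resulting figures directly.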

Agentic Performance: Where Sonnet 4.6 Leads

The most operationally significant result in the Sonnet 4.6 data isn't the overall AAII score: it's the agentic sub-benchmark results.

On GDPval-AA, a real-world work evaluation, Sonnet 4.6 scores an ELO of 1633 and leads Opus 4.6. On TerminalBench, a terminal task completion benchmark, Sonnet 4.6 scores 53% versus Opus 4.6's 46%. Sonnet 4.6 outperforms the higher-scoring Opus 4.6 on both agentic evaluations, though the GDPval-AA margin falls within the 95% confidence interval, which means the difference may not be statistically decisive at that benchmark scale.

That caveat matters for interpretation. The agentic lead is consistent across two independent evaluations, which provides more confidence than a single result. But treating it as a definitive performance gap rather than a directional signal would be overreading the data.

For developers building coding agents, multi-tool orchestration pipelines, or long-form document analysis workflows, this agentic direction is the most practically relevant signal in the AAII release. Whether it holds on your specific pipeline depends on task type, context length, and how your agent structures tool calls.

AI Model Benchmarking Comparison: Key Metrics

| Model | AAII Score | Output Tokens (Eval) | Eval Cost | Output Speed |
|---|---|---|---|---|
| Claude Opus 4.6 | 53 | 58M | — | — |
| Claude Sonnet 4.6 | 51 | 74M | ~$2,088 | 57.2 t/s |
| GPT-5.2 | ~51 | — | — | — |
| Claude Sonnet 4.5 | 43 | 25M | ~$733 | — |

Source: Artificial Analysis Intelligence Index, February 2026. [Confirmed for Sonnet 4.6; Estimated for Opus 4.6 eval cost; GPT-5.2 score approximate.] Token counts and costs reflect benchmark evaluation conditions, not typical production workloads. Output per request scales with task complexity and reasoning trace length.

Selection Guide: Which Model Fits Which Workload

Use this checklist when deciding between models for a specific workload.

Choose Claude Sonnet 4.6 if:

  • Your workload is agentic: coding agents, multi-step orchestration, terminal task execution
  • Tasks require extended reasoning: mathematics, complex planning, difficult classification problems
  • You've confirmed the higher output token count is acceptable at your request volume and budget
  • Response latency is not a hard constraint (batch processing or asynchronous workflows)

Choose Claude Opus 4.6 if:

  • You need the highest AAII score (53 vs. 51) for non-agentic reasoning tasks
  • Your use case doesn't match the agentic categories where Sonnet 4.6 leads

Choose Claude Sonnet 4.5 or a non-reasoning model if:

  • Volume is high and cost per output token matters more than per-output quality gains
  • Latency is a hard constraint for interactive, user-facing applications
  • Tasks don't require deep reasoning chains that benefit from extended thinking

Best-for summary:

| Use case | Recommended model |
|---|---|
| Complex agentic workflows | Claude Sonnet 4.6 |
| Maximum AAII score, non-agentic tasks | Claude Opus 4.6 |
| High-volume, latency-sensitive applications | Claude Sonnet 4.5 or non-reasoning model |
| Balanced reasoning quality vs. cost | Evaluate Sonnet 4.6 on your own task sample first |

When You Should NOT Use Intelligence Index Rankings to Choose a Model

The AAII gives useful directional signal. It is not a deployment specification.

Scenario 1: Token count mismatches your workload. Benchmark token usage reflects a specific task distribution, weighted toward the hardest available evaluations. If your production tasks are simpler than HLE-level reasoning, Sonnet 4.6's 3x token footprint will not replicate at that ratio. If your tasks are comparably complex, it might exceed it. Run a cost estimate against a representative sample of your real inputs before committing.

Scenario 2: Your primary constraint is latency. Sonnet 4.6 generates output at 57.2 t/s, below the median for comparable reasoning models (71.5 t/s). Combined with elevated per-response token counts, each output takes longer than an equivalent Sonnet 4.5 response. For user-facing assistants with real-time latency requirements, the AAII ranking has no relevance to that constraint.

Scenario 3: You need predictable per-request cost. Extended thinking modes produce variable output lengths across identical prompts as reasoning traces adapt to perceived task difficulty. For systems with tight budget controls or token quotas per request, models with more consistent output length may be easier to manage operationally, even at a lower intelligence score.
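If you do deploy a model with variable trace lengths, size per-request quotas to a high percentile of observed output tokens rather than the mean, or long traces will be truncated. A minimal sketch, using hypothetical token counts:

```python
# Sizing a per-request token quota under variable reasoning-trace length.
# Budget to a high percentile, not the mean; counts are hypothetical.
import math

def percentile(values: list[int], pct: float) -> int:
    """Nearest-rank percentile: simple and conservative for quota sizing."""
    ranked = sorted(values)
    k = max(0, math.ceil(pct / 100 * len(ranked)) - 1)
    return ranked[k]

observed_output_tokens = [1_800, 2_400, 2_100, 9_500, 2_300, 2_000, 8_700, 2_200]

mean = sum(observed_output_tokens) / len(observed_output_tokens)
p95 = percentile(observed_output_tokens, 95)
# A quota set at the mean would truncate the two long reasoning traces.
print(f"mean {mean:.0f} tokens, p95 {p95} tokens")
```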

FAQ

What is the Artificial Analysis Intelligence Index? A composite benchmark that aggregates model performance across HLE, coding, math, and agentic evaluations, normalized to a 0–100 scale. Published by Artificial Analysis, with rankings updated as new models are evaluated.

Why did Claude Sonnet 4.6 use more output tokens than Opus 4.6? Sonnet 4.6 generated 74M tokens versus Opus 4.6's 58M, a 28% difference on the same evaluation suite. HLE alone accounted for 47M of Sonnet 4.6's total. Why Sonnet generates longer reasoning traces than Opus on identical tasks is not publicly documented by Anthropic.

Does Sonnet 4.6's higher token count mean it performs better than Opus 4.6? Not on overall AAII score. Opus 4.6 scores 2 points higher (53 vs. 51). Sonnet 4.6 does lead on GDPval-AA ELO (1633) and TerminalBench (53% vs. 46%), though the GDPval-AA margin falls within the 95% confidence interval.

What will Claude Sonnet 4.6 actually cost in production? At $15/M output tokens, costs scale directly with output length. The benchmark evaluation ran to approximately $2,088, but that reflects the AAII task distribution. Your costs depend on how often your specific workload triggers extended reasoning chains and how long those chains run.

Is Sonnet 4.6 worth the cost increase over Sonnet 4.5? For reasoning-intensive and agentic workloads, the 8-point AAII gain is real and measurable. For high-volume or latency-sensitive applications, Sonnet 4.5 will typically be more cost-effective. The right test is cost-per-correct-output on your actual task distribution, not the benchmark ratio.

Where can I find the full AAII data and methodology? Scores, token counts, speeds, and per-model cost data are published at artificialanalysis.ai. Rankings update as new models are evaluated and added to the index.

Conclusion: Next Steps

Claude Sonnet 4.6 delivers a measurable intelligence gain in AI model benchmarking: an 8-point AAII improvement over Sonnet 4.5 and a consistent agentic lead over the higher-scoring Opus 4.6. The 74M output token footprint is the number that changes the cost calculation. Same sticker price as Sonnet 4.5, roughly 3x the output token consumption: that ratio must enter any honest cost-benefit analysis before deployment.

For intermediate developers evaluating models for production workloads, run Sonnet 4.6 against a representative sample of your own inputs. Measure output token counts per task. Calculate cost-per-correct-output at your expected volume, and compare that figure against Sonnet 4.5. The index gives you a starting point. Your own data gives you the answer.

Matthieu Morel, AI Systems & Technology Editor
I started writing code when I was 14 and never fully stopped, even after I began writing about it. Since 2015 I've been dedicated to AI research, and I earned my PhD in Computer Science with a thesis on Optimization and Stability in Non-Convex Learning Systems. I've read more technical papers than you can imagine, played with hundreds of tools, and currently have a large local setup where I have fun deploying and testing models.