AIML

From Pilot to Production

February 2026 — Chandra Pendyala

Abstract: Do autonomous agents work? Yes — that question is settled. A better question is how well, and under what conditions. The most important question is how to create enterprise value today, not after AGI. This paper answers all three. We quantify where agents succeed and where they structurally fail; map the engineering conditions that determine which; derive eleven architectural patterns from production experience; and ground the analysis in a practitioner account of what large-scale enterprise deployment actually looks like. The conclusion: the gap between capability and production reliability is an architecture problem, and it is solvable now.

The Wrong Question

The first question most enterprises ask about agentic AI is whether it works. The demos are impressive. The vendor claims are aggressive. The board is asking. So: does it work?

It works. That question is settled. Frontier models now score 93.8% on PhD-level chemistry and biology questions, substantially exceeding the expert human baseline of 69.7%.[5][10] On abstract fluid reasoning — tasks specifically designed to resist memorisation — Gemini 3 Deep Think reached 84.6%, above the human average of 60%.[4][5] These are not demo results. They are independently verified benchmarks on tasks that require genuine reasoning. The capability is real.

The question ‘does it work?’ generates a confident yes and tells you almost nothing useful. It is the wrong question.

A Better Question: How Well?

A more useful question is how well agents perform — not on curated benchmarks but on the enterprise tasks that actually need automating.

Enterprise Task Performance

TheAgentCompany (CMU, December 2024) evaluated frontier models on 175 professional enterprise tasks in a realistic simulated company environment: project management, HR workflows, financial analysis, software engineering.[12] The ARC-AGI-2 benchmark is the relevant comparison point: abstract reasoning on tasks designed to resist memorisation, which sets the capability ceiling against which enterprise performance should be read.[4]

Table 1: Abstract Reasoning Benchmarks (ARC-AGI-2)

System / Result             Score    Cost/Task    Context
Gemini 3 Deep Think         84.6%    ~$30+        Feb 2026
Claude Opus 4.6 Thinking    68.8%    ~$60         Feb 2026
GPT-5.2 Thinking            52.9%    est. $20+    Dec 2025
o3 (high compute)           75.7%    $20,000      ARC Prize
Human Average               ~60%     $5           Chollet

Sources: ARC Prize Foundation [4]; Google DeepMind [5]; Chollet et al. [6]

Enterprise task performance tells a different story. Table 2 shows the results across the 12 frontier models evaluated on those 175 workflows: not abstract reasoning, but the actual tasks that need automating.

Table 2: Enterprise Task Performance (TheAgentCompany)

System / Result                                   Score    Cost/Task    Context
Gemini 2.5 Pro                                    30.3%    >$4          TheAgentCompany
Gemini 2.5 Pro (with partial credit)              39.3%    >$4          TheAgentCompany
Average across 12 models                          16%      >$4          TheAgentCompany
Average across 12 models (with partial credit)    25%      >$4          TheAgentCompany

Source: Xu et al. (CMU), arXiv:2412.14161 [12]

The Planning Constraint

Kambhampati et al. (ICML 2024, spotlight) established the empirical baseline: in autonomous scenarios, approximately 12% of plans generated by GPT-4 are executable without errors.[11] When task names are changed to arbitrary labels — the ‘Mystery Blocksworld’ test — performance collapses to near zero, while standard AI planners are unaffected. Models are performing pattern retrieval from training data, not principled reasoning. The autoregressive substrate does not implement explicit symbolic reasoning; reasoning-like behaviour must be externally scaffolded.

Three findings from the same research define the engineering constraint:

  • Pattern retrieval, not principled reasoning. Performance is a function of training coverage. Architecture has to compensate for this structurally.
  • Self-verification does not compound. LLM verification performance is no better than generation performance on the same tasks. The design implication: verification requires an external deterministic layer.
  • System 1 substrate. The autoregressive substrate is fast, pattern-matching, intuitive — a giant pseudo System 1 operating in a domain that requires System 2: slow, deliberate, verifiable logic. The architectural response is to provide that scaffolding explicitly.

The Cost Dimension

The cost picture defines the economic envelope. The human baseline on ARC-AGI-2 is $5/task.[7] o3 at high compute reached $20,000/task.[6] The AI-to-human cost ratio ranges from 4× to 4,000× depending on configuration. High-reliability agentic reasoning is economically rational when human labour cost substantially exceeds AI inference cost at acceptable accuracy, volume amortises infrastructure, and narrow scope and deterministic verification contain error costs. Outside that envelope, the economics break before the capability does.
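
A minimal sketch of that envelope arithmetic, assuming Python and purely illustrative numbers (the $5 human baseline is from Table 1; the volume, error rate, error cost, and infrastructure figures are hypothetical):

```python
# Back-of-envelope check: is a high-reliability agentic configuration
# economically rational for a given task class? Illustrative numbers only.

def monthly_value(tasks_per_month: int,
                  human_cost_per_task: float,
                  ai_cost_per_task: float,
                  error_rate: float,
                  cost_per_error: float,
                  fixed_infra_per_month: float) -> float:
    """Net monthly saving of the agentic configuration over the human baseline."""
    gross_saving = tasks_per_month * (human_cost_per_task - ai_cost_per_task)
    error_cost = tasks_per_month * error_rate * cost_per_error
    return gross_saving - error_cost - fixed_infra_per_month

# Hypothetical scenario: 20,000 tasks/month, $5 human baseline, $0.40 per task
# at a reliability-hardened configuration, 1% residual error rate at $50 per
# error, $25k/month infrastructure.
print(monthly_value(20_000, 5.00, 0.40, 0.01, 50.0, 25_000))  # 57000.0
```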

The reliability mechanisms that produce better outputs — Tree of Thoughts search, planner-verifier loops, refinement cycles — multiply token consumption. Tree of Thoughts requires 5–100× more tokens than Chain-of-Thought.[13] Topology selection is a precision instrument, not a quality dial.

‘How well?’ gets an honest answer: very well on knowledge retrieval and structured reasoning in bounded domains; unreliably on multi-step autonomous execution; expensively at high-reliability configurations. Better than ‘does it work?’ — but still not the question that drives architecture.

The Right Question: Under What Conditions?

The question that actually drives deployment decisions: under what conditions do agents create reliable value, and under what conditions do they fail? This is an engineering question, not a capability question. The answer is mappable — and mapping it precisely is what separates the deployments that deliver from the 40% that get cancelled.[1]

Condition 1: Data Architecture

Agents fail not because models are weak but because retrieval pipelines return incoherent or contradictory inputs. The symptom is confident, fluent, wrong outputs — indistinguishable from correct outputs without external verification. Industry surveys estimate 70–85% of AI project failures trace to data architecture problems, not model limitations.[18] Only 12% of organisations report data of sufficient quality and accessibility for AI deployment.[16]

Data quality problems are invisible at the model level. The diagnostic is always upstream. Agents create value when retrieval is clean; they fabricate confidently when it isn’t.

Condition 2: Problem Class Match

Agrawal, Gans, and Goldfarb’s economic framework establishes the boundary:[15] probabilistic reasoning adds value when exceptions genuinely outnumber clean cases, input is irreducibly unstructured, or goals require genuine negotiation. When judgment is already encodable — when someone has already written the adjudication manual, when legal maintains the decision tree, when the process exists and is just not yet encoded — deterministic implementation dominates on speed, cost, auditability, and reliability.

Gartner identifies the mismatch directly: agents are non-deterministic by nature while enterprise platforms are deterministic — “the primary reason projects get canceled.”[1] Deploying probabilistic systems against fully specified deterministic problems is the single most common architectural mistake.

There is a second equally important class: tasks where the output is a human-consumed artefact and the consumer is a domain expert — legal analysis, medical differential diagnosis, strategic options synthesis. Here the expert is the verifier. A flawed or partial LLM output is still value-adding because it compresses the time to a good human judgement. The failure mode and the correct architecture are both different. Conflating the two classes produces failures in both directions.

Condition 3: Task Horizon and Context Fidelity

Transformer architectures prioritise recent context. In long-horizon workflows, agents progressively underweight constraints established early in the sequence — instructions at step 1 of a 25-step workflow are structurally underweighted by step 20. This is not a prompt engineering problem; it is a consequence of attention mechanics. A single monolithic agent accumulating context across a complex task will structurally drift from its initial constraints. Agents create value in bounded, well-scoped tasks; they drift in long-horizon ones unless architecture explicitly contains the drift.

Condition 4: Integration Surface

Modern agentic systems require real-time, event-driven APIs. Legacy enterprise systems — IBM z14, AS/400, SAP ECC — operate on batch cycles with no event listeners or execution endpoints. Agent workflows without transactional integrity leave systems in corrupted intermediate states when steps fail mid-sequence. The saga pattern from distributed systems solves this: each action has a compensating transaction, rollback is automatic, state consistency is guaranteed. This is a solved problem that needs to be deliberately applied.

Condition 5: Scope, Governance, and Economics

Autonomous task completion rates drop sharply as scope increases — consistency rates on repeated runs are substantially lower than headline benchmark scores.[22] Agents acting across the enterprise technology stack without governed identity, access control, and behavioural audit trails represent an unbounded risk surface in regulated industries. High-reliability agentic reasoning is economically rational only when human labour cost substantially exceeds AI inference cost at acceptable accuracy, volume is sufficient to amortise infrastructure, and scope and verification can contain error costs.

The deployment gap is not a capability story. The gap between benchmark performance and production reliability is structural — a consequence of how the technology works and what enterprise environments require. It is bridgeable by architecture, not by waiting for better models.

The Most Important Question: Value Today

Mapping the conditions is the analysis. The question that earns the engagement: given these conditions, how do we build systems that create measurable enterprise value now — not after AGI, not after the next model release, but with what exists today?

The answer is architectural. The deployments that deliver share a consistent structural signature: the LLM is used as a generative encoder or idea generator, not as an autonomous planner or executor. Deterministic structures carry the reliability burden. Scope is narrow and expands only as demonstrated reliability justifies it.

What the Research Establishes

Kambhampati et al.’s LLM-Modulo framework pairs LLMs with external sound verifiers: the LLM proposes candidate plans, the verifier evaluates against formal correctness criteria, back-prompting guides the next proposal.[11] Applied to travel planning, this improved performance 6× over LLM-only baselines. The insight: verification is often easier than generation. An LLM that cannot reliably generate a correct plan on the first attempt may generate correct plans within a small number of attempts when given deterministic feedback.
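
As a sketch of the control flow (not the authors' reference implementation), the loop looks like this; propose_plan and verify are hypothetical callables standing in for the LLM client and the external sound verifier:

```python
from typing import Callable

def llm_modulo_loop(task: str,
                    propose_plan: Callable[[str], str],
                    verify: Callable[[str], list[str]],
                    max_attempts: int = 5) -> str | None:
    """LLM proposes; a deterministic verifier returns a list of violations;
    violations are fed back as a back-prompt. Only a verified plan is returned."""
    prompt = task
    for _ in range(max_attempts):
        candidate = propose_plan(prompt)   # probabilistic generation
        violations = verify(candidate)     # deterministic check: planner,
                                           # constraint solver, test suite, ...
        if not violations:
            return candidate               # verified, safe to hand downstream
        # Back-prompt: restate the task plus the exact violations found.
        prompt = f"{task}\n\nYour last plan failed these checks:\n" + \
                 "\n".join(f"- {v}" for v in violations)
    return None                            # give up and escalate to a human
```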

The Autonomous Trustworthy Agents architecture decouples language understanding from logical execution: Phase 1, the LLM translates informal specifications to First-Order Logic; Phase 2, a symbolic reasoning engine executes deterministically. Result: deterministic execution semantics given a validated formal encoding, reportedly outperforming larger reasoning-only models by more than 10 percentage points. LLMFP extends this further: the LLM formulates planning problems as mathematical optimisation code; a deterministic solver executes. Applied to complex logistics, 83.7–86.8% optimal rate.

Infrastructure standards — MCP for tool interoperability, A2A for inter-agent communication — reduce integration friction and enable pattern portability. They are necessary conditions for deployment. The eleven architectural patterns that follow are the sufficient conditions for reliability.

Critical design constraint: The LLM-Modulo and ATA approaches only work when the verifier is genuinely deterministic — a formal planner, constraint solver, unit test suite, or schema validator. Replacing deterministic verifiers with LLM critique agents because they are easier to implement reproduces the failure mode at higher cost. LLM verification performance is empirically no better than LLM generation performance on the same tasks.

Eleven Architectural Patterns for Enterprise Value Today

Each pattern below is an engineering decision with a specific good design and a named anti-pattern — the shortcut that looks equivalent but isn’t.

Pattern 1  Scope-Bounded Agent
Engineering decision: How do you make an agent’s behaviour governable, testable, and auditable?
Good design: Define the agent’s complete action vocabulary as an enumerated set at deployment time. Define authorised input types, authorised state mutations, and escalation paths for out-of-scope inputs. Verify exhaustively that no execution path produces actions outside the defined vocabulary.
Anti-pattern: Giving an agent ‘access to the CRM’ without specifying which objects, operations, and time ranges are in scope.
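
A minimal sketch of the enumerated vocabulary in Python; the CRM operations shown are illustrative, not a prescribed set:

```python
from enum import Enum

class CrmAction(Enum):
    """Complete action vocabulary, fixed at deployment time."""
    READ_ACCOUNT = "read_account"
    UPDATE_CONTACT_EMAIL = "update_contact_email"
    CREATE_FOLLOWUP_TASK = "create_followup_task"
    ESCALATE_TO_HUMAN = "escalate_to_human"   # the only valid response to anything else

def validate_action(requested: str) -> CrmAction:
    """Reject anything outside the enumerated vocabulary before execution."""
    try:
        return CrmAction(requested)
    except ValueError:
        return CrmAction.ESCALATE_TO_HUMAN
```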
Pattern 2  Deterministic Envelope
Engineering decision: How do you ensure that probabilistic reasoning produces deterministic system behaviour?
Good design: A control layer validates all inputs before the agent processes them and all outputs before they produce effects, using deterministic logic for both. The input envelope validates schema and authorised scope. The output envelope validates against expected schema, range, and policy constraints, and emits a structured audit log before allowing downstream effects. Neither envelope uses LLM judgment.
Key principle: “Shift the reliability burden from the probabilistic LLM to deterministic system design.”[21]
Anti-pattern: Using the LLM itself to check its own output for policy compliance.
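
A minimal sketch of the two envelopes, assuming a hypothetical refund workflow; the intent whitelist and the policy limit are illustrative:

```python
import json
import logging

log = logging.getLogger("agent.audit")

ALLOWED_INTENTS = {"refund_status", "order_lookup"}   # authorised scope
MAX_REFUND_EUR = 500.0                                # illustrative policy constraint

def input_envelope(request: dict) -> dict:
    """Deterministic pre-check: schema and authorised scope, no LLM judgment."""
    if request.get("intent") not in ALLOWED_INTENTS:
        raise ValueError("out-of-scope intent")
    return request

def output_envelope(response: dict) -> dict:
    """Deterministic post-check: schema, range, and policy before any effect."""
    if set(response) != {"action", "amount_eur"}:
        raise ValueError("unexpected output schema")
    if response["amount_eur"] > MAX_REFUND_EUR:
        raise ValueError("policy violation: refund above limit")
    log.info("audit %s", json.dumps(response))   # structured audit log before effects
    return response
```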
Pattern 3  Confidence-Stratified Routing
Engineering decision: How do you route tasks to the right level of autonomy without over-automating complex decisions or under-automating simple ones?
Good design: Define routing tiers based on task type, output verifiability, and reversibility of effects. Tier-1 (verifiable, reversible): full autonomy. Tier-2 (partially verifiable, mixed reversibility): human-in-the-loop confirmation. Tier-3 (unverifiable, irreversible): human-only. Instrument routing decisions for ongoing calibration.
Extension: A Router Agent assesses query complexity and selects between a large reasoning model for planning and a smaller, faster model for tool execution.
Anti-pattern: Using a single confidence threshold without stratifying by consequence severity.
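
A minimal sketch of consequence-stratified routing; the tier logic mirrors the description above and is deliberately deterministic:

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    AUTONOMOUS = 1       # verifiable and reversible
    HUMAN_CONFIRM = 2    # partially verifiable or mixed reversibility
    HUMAN_ONLY = 3       # unverifiable or irreversible

@dataclass
class Task:
    verifiable: bool
    reversible: bool

def route(task: Task) -> Tier:
    """Stratify by consequence, not by a single confidence threshold."""
    if task.verifiable and task.reversible:
        return Tier.AUTONOMOUS
    if not task.verifiable and not task.reversible:
        return Tier.HUMAN_ONLY
    return Tier.HUMAN_CONFIRM
```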
Pattern 4  Transactional State Management (SagaLLM Pattern)
Engineering decision: How do you guarantee state consistency across multi-step agent workflows that may fail at any step?
Good design: Maintain an append-only event log for all agent-initiated state changes. Define compensating transactions for each action type before deployment. If any step in a saga fails, execute compensating transactions in reverse order to restore global consistency. Test rollback procedures prior to production deployment.
Anti-pattern: Deploying agents with write access to production systems without a tested compensating transaction for every action type.
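
A minimal sketch of the saga discipline; the workflow steps are illustrative stand-ins, with the third step failing to show the rollback order:

```python
from typing import Callable

class Saga:
    """Run steps in order; on failure, run compensations in reverse order."""
    def __init__(self) -> None:
        self._completed: list[Callable[[], None]] = []   # compensations, in completion order

    def run_step(self, action: Callable[[], None],
                 compensate: Callable[[], None]) -> None:
        action()                          # may raise
        self._completed.append(compensate)

    def rollback(self) -> None:
        for compensate in reversed(self._completed):
            compensate()                  # restore global consistency

if __name__ == "__main__":
    saga = Saga()
    try:
        saga.run_step(lambda: print("reserve inventory"), lambda: print("release inventory"))
        saga.run_step(lambda: print("charge customer"),   lambda: print("refund customer"))
        saga.run_step(lambda: 1 / 0,                      lambda: print("cancel shipment"))
    except ZeroDivisionError:
        saga.rollback()   # prints: refund customer, release inventory
```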
Pattern 5  Verifier-Gated Execution (LLM-Modulo Pattern)
Engineering decision: How do you convert LLM generative capability into reliable executable outputs?
Good design: Separate generation from execution. The LLM proposes; a domain-specific verifier evaluates using deterministic logic — formal planning validators, constraint checkers, unit tests, schema validators. Only verified proposals proceed. Back-prompting from the verifier guides the LLM’s next proposal.
Extension: Where the domain has mathematical structure, replace the LLM planner with LLM-as-encoder + deterministic solver (LLMFP pattern).
Anti-pattern: Using the LLM to verify its own outputs — empirically no better than unverified generation.
Pattern 6  Neuro-Symbolic Knowledge Decoupling (ATA Pattern)
Engineering decision: How do you extract maximum value from LLM language understanding while delivering deterministic execution guarantees?
Good design: Phase 1 — LLM translates informal specifications to formal representations (First-Order Logic, constraint programs). Phase 2 — symbolic engine executes deterministically. The LLM never executes; it encodes. The symbolic engine never interprets language; it executes formal logic.
Result: Deterministic execution semantics given a validated formal encoding. Formal correctness guarantees conditional on encoding soundness and fully specified inputs. Elimination of context-narrowing drift.
Anti-pattern: Treating LLM output as executable logic rather than as a formal encoding requiring independent validation.
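
A minimal sketch of the decoupling, with a JSON constraint encoding standing in for First-Order Logic and a trivial deterministic checker standing in for the symbolic engine; llm_encode is a hypothetical placeholder for the Phase 1 translation call:

```python
import json

def llm_encode(informal_spec: str) -> str:
    """Hypothetical Phase 1: the LLM translates an informal rule into a formal encoding.
    Here the output it would be prompted to produce is hard-coded for illustration."""
    return json.dumps({"all_of": [{"field": "amount", "op": "<=", "value": 500},
                                  {"field": "region", "op": "in", "value": ["EU", "UK"]}]})

def symbolic_execute(encoding: str, record: dict) -> bool:
    """Phase 2: deterministic execution of the validated encoding. No language, no LLM."""
    ops = {"<=": lambda a, b: a <= b, "in": lambda a, b: a in b}
    spec = json.loads(encoding)
    return all(ops[c["op"]](record[c["field"]], c["value"]) for c in spec["all_of"])

encoding = llm_encode("Refunds up to 500 euros may be auto-approved for EU/UK customers.")
# The encoding is independently validated before it is trusted, then executed:
print(symbolic_execute(encoding, {"amount": 120, "region": "EU"}))   # True
print(symbolic_execute(encoding, {"amount": 900, "region": "EU"}))   # False
```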
Pattern 7  Data as the Binding Constraint
Engineering decision: How do you sequence data and agent development so they reinforce rather than race each other?
What we found: Data and agent development co-evolve. The agent is the best diagnostic tool for data quality problems — deploy it narrow and early, watch it fail, and it surfaces the exact gaps, inconsistencies, and coverage holes faster than any upfront audit. The data work must be treated as the primary workload, not the secondary one.
Good design: Define measurable data quality thresholds for each data source. Deploy the agent in narrow, low-stakes scope early — not to prove capability, but to surface data failures. Feed those failures back into the data architecture. Gate production deployment on demonstrated retrieval reliability at acceptable accuracy on real traffic, not on synthetic test sets.
Anti-pattern: Treating data quality as a prerequisite gate you pass once, or as a parallel workstream that will be ready when the agent is. Either assumption puts you among the 95% of pilots that deliver no measurable return.[3]
Pattern 8  Agent Identity Principal
Engineering decision: How do you make agent behaviour auditable, attributable, and controllable at enterprise scale?
Good design: Assign each agent instance a cryptographically unique identity. Define access scopes using existing RBAC/ABAC infrastructure. Log all agent actions with agent identity, timestamp, input summary, and output summary. Provision with minimum required access scope; require explicit approval for expansion.
Anti-pattern: Provisioning agents with a shared service account with broad read/write access.
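
A minimal sketch of an agent identity principal with least-privilege scopes and an attributable audit record; scope names are illustrative:

```python
import json
import time
import uuid
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AgentIdentity:
    """One identity per agent instance, with an explicit minimum scope."""
    agent_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    scopes: frozenset = frozenset({"orders:read"})   # least privilege by default

def audit(agent: AgentIdentity, action: str, scope: str,
          input_summary: str, output_summary: str) -> str:
    """Every action is attributable: identity, timestamp, scope check, summaries."""
    if scope not in agent.scopes:
        raise PermissionError(f"{agent.agent_id} lacks scope {scope}")
    return json.dumps({"agent_id": agent.agent_id, "ts": time.time(),
                       "action": action, "scope": scope,
                       "input": input_summary, "output": output_summary})
```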
Pattern 9  Narrow-First Expansion
Engineering decision: How do you expand agent scope reliably without accumulating hidden risk that surfaces as production failures?
Good design: Begin with the narrowest viable task scope. Define reliability thresholds — consistency rate, error rate, escalation rate — that must be demonstrated over a defined operational period before scope expansion. Treat expansion as a product release with review, testing, and sign-off.
Anti-pattern: Granting full autonomy at deployment and reactively restricting scope after production failures.
Pattern 10  Context-Bounded Multi-Agent Coordination (MACI Pattern)
Engineering decision: How do you maintain constraint fidelity across long-horizon multi-step workflows where a single agent accumulating context will structurally drift?
Good design: Decompose complex tasks across a meta-planner and a pool of specialised sub-agents, each operating with a restricted context window computed for the sub-task at hand — sized to the minimum context required to hold the relevant constraints without drift, determined empirically per problem domain. There is no universal constant; the bound is an engineering parameter you derive, not inherit. The meta-planner holds global task state in a deterministic structured object (JSON or equivalent) — not in LLM memory. Each sub-agent receives a deterministically constructed payload: a typed, code-generated view of that state, never an LLM-generated summary. Sub-agents write structured results back to the state store; the meta-planner reads updated state to determine next dispatch.
Critical constraint: Allowing the meta-planner to summarise global state in natural language before dispatch reintroduces lossy probabilistic compression into the coordination layer. The same constraint dropped from a long context will be dropped from an LLM summary. Deterministic state construction is what makes context restriction actually work.
Anti-pattern: LLM-generated natural language summaries as sub-agent dispatch payloads. Preserves the failure mode the pattern was designed to eliminate.
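
A minimal sketch of deterministic state transfer: the meta-planner's state is a typed object, and the sub-agent payload is a code-generated view of it, never a natural-language summary. Field names are illustrative:

```python
import json
from dataclasses import dataclass

@dataclass
class GlobalState:
    """Meta-planner's single source of truth: a typed object, not LLM memory."""
    order_id: str
    hard_constraints: list
    completed_steps: list
    results: dict

def build_dispatch_payload(state: GlobalState, sub_task: str) -> str:
    """Deterministically constructed view of state for one sub-agent.
    Never an LLM-generated summary; constraints are passed verbatim."""
    view = {
        "sub_task": sub_task,
        "order_id": state.order_id,
        "hard_constraints": state.hard_constraints,
        "prior_results": {k: state.results[k] for k in state.completed_steps},
    }
    return json.dumps(view)

state = GlobalState("ORD-1042",
                    ["ship only to the verified billing country", "no partial refunds"],
                    ["validate_order"], {"validate_order": {"status": "ok"}})
print(build_dispatch_payload(state, "arrange_shipping"))
```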
Pattern 11  Layered Memory as Attention Bypass (General Attention Optimisation in Long Contexts)
Engineering decision: How do you maintain factual continuity and constraint fidelity across sessions when attention mechanics structurally underweight early sequence information?
Good design: Do not extend context. Bypass attention for all but the immediate working set. Implement three disjoint memory layers with a background consolidation process. The LLM attends to nothing except working memory; everything else is storage.
Working memory holds current session state, tool outputs, and the immediate plan — fully attended, kept minimal by design. Episodic memory holds past sessions, retrieved by identifier; full transcripts are never reloaded, only flagged segments surface when indexed identifiers signal relevance. Semantic memory holds project facts, user preferences, and architectural decisions in topic-split files, pulled only when the index flags them as relevant to the current task. Consolidation runs as a background sub-agent during idle time — merge, prune, upgrade — triggered by elapsed time or session count. It is not an online process.
Three constraints are non-negotiable. Memory is a hint, not ground truth: always reverify against the live environment before acting on retrieved state. The model cannot write to any memory layer without passing through a deterministic verification step. Consolidation is asynchronous and cannot guarantee consistency within an active session.
Critical constraint: Keep attended context small enough that attention actually works. The rest is storage, not context.
Extension: MCP tool descriptions beyond 10% of the window are moved to a search tool and discovered on demand rather than preloaded (Tool Deferral); large tool results are written to file paths rather than returned raw (File Reference Offloading); meta-planner state is passed as typed JSON, never as LLM-generated summaries (Deterministic State Transfer); 95% of traffic is routed to a fast retrieval model with minimal context, with the frontier LLM reserved for ambiguous cases (Outcome-Based Routing).
Anti-pattern: Treating long context as a memory solution. Loading full session transcripts or raw tool outputs into context. Maintaining a single monolithic memory file. Each scales the buffer where signal decays. None solves the attention problem.
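
A minimal sketch of the three layers and the offline consolidation step; the threshold and the retrieval keys are illustrative stand-ins for whatever index the deployment maintains:

```python
from dataclasses import dataclass, field

@dataclass
class LayeredMemory:
    """Only working memory is ever placed in the attended context.
    Episodic and semantic layers are storage, retrieved by index on demand."""
    working: list = field(default_factory=list)    # fully attended, kept minimal
    episodic: dict = field(default_factory=dict)   # session_id -> flagged segments
    semantic: dict = field(default_factory=dict)   # topic -> facts / preferences / decisions

    def attended_context(self, relevant_topics: list) -> str:
        # Pull only index-flagged semantic entries; never reload full transcripts.
        pulled = [self.semantic[t] for t in relevant_topics if t in self.semantic]
        return "\n".join(self.working + pulled)

    def consolidate(self) -> None:
        """Background step (idle time, not online): merge and prune working memory
        into storage. Retrieved state is a hint; reverify before acting on it."""
        if len(self.working) > 8:   # illustrative trigger threshold
            self.episodic[f"session-{len(self.episodic)}"] = " | ".join(self.working)
            self.working.clear()
```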

What This Looks Like in Practice

The following section documents what we built while deploying enterprise reasoning systems under production conditions — deadlines, budget pressure, sceptical stakeholders, and engineers who needed something that actually worked. The academic literature is cited where it confirmed what we already found in the field.

The Threshold Question: Where Reasoning Earns Its Place

The capability is real. The excitement is justified. The job of the architect is to put that capability exactly where it delivers — and build everything around it so the delivery actually lands in production.

The clearest analogy is ML-based image intelligence. Before it crossed a threshold, practitioners wrote geometric rules — edge detection thresholds, colour space conditions, shape templates. The threshold wasn’t crossed because ML became fashionable. It was crossed when the variation space genuinely exceeded what rules could cover, and ML became the superior tool for that specific class of problem. Knowing when the threshold is crossed — and deploying the right tool at the right moment — is the craft.

The same threshold question applies to reasoning in enterprise workflows. The variation space in accounts payable, claims adjudication, and HR workflows is large but enumerable — the adjudication manual exists, the decision tree exists, the domain experts know how to do these tasks. What we found, repeatedly: the most valuable first step was identifying exactly where reasoning was the superior tool, deploying it precisely there, and building deterministic structure around it so it delivered reliably. Get that right, and the system works. Deploy reasoning everywhere because it’s exciting, and you join the 40%.

What We Built

The systems we delivered combine the natural language intelligence of frontier models with the reliability guarantees of deterministic engineering. More precise, more layered, and considerably harder to build than vendor demos suggest. Also what works in production at enterprise scale.

We start at inference time, not training time. The first question is whether intent classification to an enumerated list can handle the variation space. We prototype the intent classifier, run it against real traffic samples, and see how far it gets. In most cases further than expected — because the enumerated intent list forces the discipline of specifying what the system is supposed to do, surfacing requirements that had been implicit and unlocking domain expertise sitting in people’s heads. When the inference-time classifier works, we build the RL training dataset from its demonstrated correct behaviour, then distil to a student model for cost. The result: a system routing the vast majority of production traffic through a fast, cheap, highly reliable retrieval model, reserving frontier reasoning for the cases that genuinely require it.

We build the three-tier confidence stack. Clean intent match to the student retrieval model. Ambiguous but in-domain to the full LLM. Out-of-distribution to human escalation with a support ticket. This is not a fallback — it is the architecture. It means the system handles easy cases cheaply and quickly, hard cases intelligently, and unknown cases safely. The out-of-distribution detector is a separately maintained component with its own thresholds and test suite — because frontier models will confidently misclassify out-of-distribution inputs, and the system needs an independent layer that catches this before it produces downstream effects.
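
A minimal sketch of the three-tier stack; the thresholds are illustrative and, as noted above, the out-of-distribution score comes from a separately maintained detector rather than the model's own confidence:

```python
def route_request(intent: str, intent_conf: float, ood_score: float) -> str:
    """Three-tier confidence stack. Thresholds are illustrative and tuned per deployment.
    The OOD detector is an independent component; it runs first and can veto the rest."""
    if ood_score > 0.5:                        # separately maintained OOD threshold
        return "human_escalation"              # open a support ticket
    if intent and intent_conf >= 0.9:
        return "student_retrieval_model"       # clean intent match: cheap and fast
    return "frontier_llm"                      # ambiguous but in-domain
```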

We make execution agents fully deterministic. Each agent wraps a specific, bounded operation — calls an API, writes a record, triggers a workflow step — and returns a structured result. Auditable, testable, replaceable. All coordination logic lives in the orchestrator: what to call, in what order, with what inputs, what to do on failure. The coordination logic is designed, reviewed, and tested — it does not emerge from agent reasoning at runtime.

The orchestrator lives outside the MCP universe. This is a trust boundary, not a convenience decision. The orchestrator is a deterministic state machine receiving structured outputs from agents through a typed interface. Reasoning cannot propagate upward from the execution layer into the coordination layer — which means a single agent behaving unexpectedly cannot corrupt the workflow.
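
A minimal sketch of that trust boundary: execution agents return a typed result, and the orchestrator is ordinary reviewed code that consumes it. Names and the workflow shape are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class AgentResult:
    """Typed interface between execution agents and the orchestrator."""
    step: str
    ok: bool
    payload: dict

def run_workflow(steps: list, ctx: dict) -> list:
    """Orchestrator: a reviewed, tested state machine. Coordination logic lives here,
    not in agent reasoning; a misbehaving agent cannot rewrite the plan.
    `steps` is a list of (name, agent) pairs, each agent a Callable[[dict], AgentResult]."""
    results: list = []
    for name, agent in steps:              # order is fixed at design time
        result: AgentResult = agent(ctx)
        results.append(result)
        if not result.ok:                  # failure handling is explicit, not emergent
            break
        ctx = {**ctx, name: result.payload}   # structured result, not free text
    return results
```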

The Verification Trap

When deterministic verification takes real engineering effort — schema validators, constraint checkers, formal correctness criteria — the tempting shortcut is to use another LLM to review the first LLM’s output. A critique agent. A judge model. A self-reflection loop. Faster to build, easier to demonstrate, looks like rigour.

In our deployments, multi-LLM critique loops increased token cost without materially improving verified correctness rates. The reason is structural: a reasoning model reviewing a reasoning model’s output shares the same goal and the same failure modes. Under extended reasoning budget, the reviewer finds ways to ratify the output rather than reject it — precisely what Kambhampati et al. established empirically.[11]

The correct response is deterministic structure. Schema validators, constraint checkers, unit tests, formal correctness criteria. If the domain is too complex to write a deterministic verifier, that’s diagnostic: the specification work hasn’t been done yet. We do that work.

The architectural distinction that matters is not ‘deterministic versus probabilistic’ but ‘human-consumed artefact versus system-executed action.’ The verification trap only exists in the latter class.

When the Human Is the Verifier

The execution architecture above is built for workflows where the output is a system-executed action and correctness is formally definable. It is one of two distinct deployment architectures we use.

The second class: tasks where the output is a human-consumed artefact and the consumer is a domain expert. Legal analysis, medical differential diagnosis, strategic options synthesis, complex contract review. Here the expert is the verifier. The LLM’s job is to compress the time to a good human judgement — producing a richer, faster starting point than the expert could generate alone. A flawed or partial output is still value-adding because a trained expert will catch and correct it, and arrives at their judgement faster.

In this class, multi-pass LLM review and critique agents are appropriate tools — not as an autonomous reliability mechanism but as a way of surfacing alternative framings, stress-testing assumptions, and giving the expert more angles to work from. A senior lawyer reviewing a multi-pass LLM analysis of a complex contract is operating at a higher level than one reviewing a single-pass output. That’s genuine value delivered.

The craft is knowing which class you’re in and building accordingly.

What Delivers

The architecture we build confines frontier reasoning to exactly the places where it earns its place: the input boundary, where natural language is mapped to structured intent, and the domains where genuine variation complexity or irreducible judgment requires it. Everything else is deterministic, auditable, and cost-predictable.

What this delivers: systems that handle enterprise-scale traffic reliably, expand scope as demonstrated performance justifies it, give compliance and audit teams what they need, and improve as the underlying models improve — because the architecture is designed for capability to flow in, not to be locked out.

The teams that built production image intelligence didn’t fight the threshold — they understood it, deployed precisely at it, and built systems that scaled as capability grew. The clients who hired them got systems that worked. The capability is real, the excitement is justified, and the architecture is what turns excitement into delivery.

References

[1]  Verma, A. (Gartner). “Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027.” Gartner Newsroom, June 2025.  https://www.gartner.com/en/newsroom/press-releases/2025-06-2…

[2]  S&P Global Market Intelligence. “Voice of the Enterprise: AI & Machine Learning, Use Cases 2025.” S&P Global MI, May 2025.  https://www.spglobal.com/market-intelligence/en/news-insight…

[3]  Challapally, A., Pease, C., Raskar, R., & Chari, P. (MIT NANDA). “The GenAI Divide: State of AI in Business 2025.” MIT Project NANDA, July 2025.  https://www.artificialintelligence-news.com/wp-content/uploa…

[4]  ARC Prize Foundation. “ARC Prize 2025 Results and Analysis.” arcprize.org, 2025.  https://arcprize.org/blog/arc-prize-2025-results-analysis

[5]  Google DeepMind. “Gemini 3 Deep Think.” Google DeepMind Blog, February 2026.  https://blog.google/innovation-and-ai/models-and-research/ge…

[6]  Chollet, F. et al. “ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems.” arXiv:2505.11831, 2025.  https://arxiv.org/abs/2505.11831

[7]  Chollet, F. “OpenAI o3 Breakthrough High Score on ARC-AGI-Pub.” ARC Prize Foundation, December 2024.  https://arcprize.org/blog/oai-o3-pub-breakthrough

[8]  Anonymous. “OpenAI’s o3 Is Not AGI.” arXiv:2501.07458, 2025.  https://arxiv.org/pdf/2501.07458

[9]  OpenAI. “Introducing GPT-5.2.” openai.com, December 2025.  https://openai.com/index/introducing-gpt-5-2/

[10]  Rein, D. et al. “GPQA: A Graduate-Level Google-Proof Q&A Benchmark.” arXiv:2311.12022, 2023.  https://arxiv.org/abs/2311.12022

[11]  Kambhampati, S. et al. “LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks.” ICML 2024. arXiv:2402.01817, 2024.  https://arxiv.org/abs/2402.01817

[12]  Xu, F. et al. (CMU). “TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks.” arXiv:2412.14161, December 2024.  https://arxiv.org/abs/2412.14161

[13]  Yao, S. et al. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” NeurIPS 2023. arXiv:2305.10601, 2023.  https://arxiv.org/abs/2305.10601

[14]  ARC Prize Foundation. “ARC-AGI Leaderboard: Performance vs. Cost Per Task.” arcprize.org, 2025.  https://arcprize.org/leaderboard

[15]  Agrawal, A., Gans, J., & Goldfarb, A. “Exploring the Impact of AI: Prediction versus Judgment.” Tech Policy Institute, 2018.  https://techpolicyinstitute.org/wp-content/uploads/2018/02/G…

[16]  Precisely / Drexel LeBow. “2025 Outlook: Data Integrity Trends and Insights.” Precisely, September 2024.  https://www.precisely.com/press-release/new-global-research-…

[17]  Precisely / Drexel LeBow. “2026 State of Data Integrity and AI Readiness.” Precisely/Drexel LeBow, January 2026.  https://www.lebow.drexel.edu/sites/default/files/2026-01/leb…

[18]  Unstructured.io. “The Rise of the Agentic Enterprise.” Unstructured.io, 2025.  https://unstructured.io/blog/the-rise-of-the-agentic-enterpr…

[19]  OpenAI. “GPT-5 System Card.” OpenAI, 2025.  https://cdn.openai.com/gpt-5-system-card.pdf

[20]  Google Cloud. “AI Grew Up and Got a Job: Lessons from 2025 on Agents and Trust.” Google Cloud Blog, 2025.  https://cloud.google.com/transform/ai-grew-up-and-got-a-job-…

[21]  METR. “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.” arXiv:2507.09089, July 2025.  https://arxiv.org/abs/2507.09089

[22]  Kapoor, S. et al. “AI Agents That Matter.” TMLR. arXiv:2407.01502, 2024.  https://arxiv.org/abs/2407.01502

All statistics sourced from primary or peer-reviewed sources. Claims supported only by industry surveys are noted as such. Working paper — February 2026.