OpenAI's Token Data Says the Agent Bottleneck Moved From Capability to Deployment

OpenAI's internal Codex usage grew 56x in Research and 32x in Customer Support since November 2025, while Engineering grew 27x. The departments that scaled fastest weren't the ones best suited to automation. They were the ones that solved deployment first.

Pinch · Jun 27, 2026 · Verified · 4 sources

The fastest-growing agent workloads inside OpenAI are not coding. That ordering tells you where the real constraint sits, and it isn't the model.

The consensus story about AI agents is a capability story. Bigger models, longer context, better reasoning, and adoption follows. It is a tidy narrative because it makes the model the protagonist and everyone else a passenger waiting for the next checkpoint.

OpenAI's own internal data complicates it. Per a report from OpenAI Economic Research, median internal Codex output tokens grew 56x in Research, 32x in Customer Support, 27x in Engineering, and 13x in Legal since November 2025. Read the ordering carefully. The department that scaled agent usage hardest was not Engineering, the one whose work most resembles what these tools were built to do. It was Research, followed by Customer Support. Two functions where the work is messier, the success criteria are softer, and the path to automation is less obvious.

That ordering is the story. If capability were the binding constraint, coding-adjacent functions would lead, because that is where the models are strongest and the harness is most mature. Instead the lead changes hands. The departments pulling ahead are the ones that figured out how to wire an agent into their actual workflow and trust the output enough to depend on it.

That is not a capability problem. It is a deployment-and-trust problem. And it relocates the scarce resource in the agent economy from the lab to the org chart.

The growth ordering contradicts the capability story

Start with what the numbers actually say. OpenAI reports that through August 2025, the average internal worker spent less than 10% of their tokens on Codex. Over the following six months, usage deepened across departments, with Research posting the largest jump.

If you believe agent adoption is gated by model capability, you would predict a specific shape: the functions whose work most closely resembles the training target (code) should adopt fastest, because that is where the tool clears the reliability bar first. Engineering should lead. Everything else trails as the models slowly get good enough to handle fuzzier tasks.

The data shows the opposite ordering. Research grew 56x. Customer Support grew 32x. Engineering grew 27x. The two functions that scaled hardest are the ones where the work is least like writing a unit test and where, on a pure capability read, the tool should have been least ready.

There is a tempting dismissal here: Engineering started from a higher base, so its multiple looks smaller. Possibly. But that cuts against the capability thesis rather than rescuing it. If Engineering was already saturated with agent usage by late 2025, then the model was already good enough for the easy case a long time ago, and the interesting growth, the new ground being taken, is happening everywhere else. Either way the frontier of adoption is not the coding frontier.
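The base-effect objection is easy to see with a little arithmetic. The sketch below uses the reported multiples, but the starting volumes in `baseline_tokens` are invented purely for illustration, since the report as cited here gives multiples and not absolute token counts.

```python
# Reported growth multiples (from the article). The baselines are HYPOTHETICAL,
# chosen only to illustrate the base effect; the source gives no absolute counts.
reported_multiple = {"Research": 56, "Customer Support": 32, "Engineering": 27, "Legal": 13}
baseline_tokens = {"Research": 1_000, "Customer Support": 1_000, "Engineering": 10_000, "Legal": 1_000}


def absolute_gain(department: str) -> int:
    """Tokens added on top of the baseline after applying the multiple."""
    base = baseline_tokens[department]
    return base * reported_multiple[department] - base


gains = {dept: absolute_gain(dept) for dept in reported_multiple}
for dept, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{dept}: {reported_multiple[dept]}x multiple, +{gain:,} tokens")
```

Under these made-up baselines, a 27x multiple on a ten-times-larger base adds more absolute usage than a 56x multiple on a small one. The multiple alone therefore says where growth is fastest, not where usage is largest.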

The cleaner explanation is that capability stopped being the binding constraint some time ago, and what we are watching now is each department independently discovering how to deploy what already works.

The value was never in the model. It was in the harness.

This is where the Harness Hypothesis earns its keep: the value in AI isn't in the model, it's in the harness that connects the model to the world. A model that can draft a legal memo is worthless to a legal team until something feeds it the right documents, constrains its output to the right format, routes it past a reviewer, and files the result where the next person expects to find it.
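The four harness duties named above (feed documents, constrain format, route past a reviewer, file the result) can be sketched as a thin pipeline around a model call. This is a minimal illustration and not any team's actual system: `Draft`, `run_harness`, and every stage callable are hypothetical names, and the stages are stubbed with lambdas so the sketch runs without a real model or document store.

```python
from dataclasses import dataclass
from typing import Callable, List

# All names here (Draft, run_harness, the stage callables) are hypothetical.


@dataclass
class Draft:
    text: str
    approved: bool = False


def run_harness(
    question: str,
    retrieve: Callable[[str], List[str]],
    model: Callable[[str, List[str]], str],
    is_well_formed: Callable[[str], bool],
    review: Callable[[str], bool],
    file_result: Callable[[str], None],
) -> Draft:
    """Feed documents, check the format, route past a reviewer, then file."""
    documents = retrieve(question)              # feed the right documents
    text = model(question, documents)           # the model call itself
    if not is_well_formed(text):                # constrain the output format
        return Draft(text)
    draft = Draft(text, approved=review(text))  # human review checkpoint
    if draft.approved:
        file_result(draft.text)                 # file where the next person looks
    return draft


# Stub stages stand in for a real document store, model, and reviewer.
filed: List[str] = []
result = run_harness(
    "Summarize clause 4",
    retrieve=lambda q: ["Clause 4: payment is due within 30 days."],
    model=lambda q, docs: "MEMO: " + docs[0],
    is_well_formed=lambda text: text.startswith("MEMO:"),
    review=lambda text: "30 days" in text,
    file_result=filed.append,
)
print(result.approved, filed)
```

The model is one argument out of five. Everything else in the signature is harness, which is the point the paragraph above is making.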

Research and Customer Support are interesting precisely because their harnesses are harder to build than Engineering's. A coding agent inherits a mature harness for free: the repository, the test suite, the CI pipeline, the pull-request review. Those are trust-and-verification structures that existed long before agents did. Drop a model into that environment and the scaffolding that catches its mistakes is already there.

Research has no equivalent. Customer Support has fragments of one (ticketing systems, knowledge bases) but nothing that validates an answer the way a test suite validates a function. For those functions to scale agent usage 32x to 56x, somebody had to build the missing harness: the retrieval layer, the review checkpoint, the place where a human decides the output is good enough to ship.

The fact that they did, and did it faster than Engineering extended its existing lead, tells you the marginal effort moved. It moved out of the model and into the connective tissue around it. The departments winning are the ones investing in harness, not the ones waiting for a smarter model.

This is the part the capability narrative cannot see, because it treats everything around the model as a solved detail. It is not a detail. It is the work.

Where agent components sit on the evolution axis

A Wardley map clarifies why this is happening now and not eighteen months ago. Map the agent value chain from genesis to commodity and the components have been evolving at very different speeds.

The model has moved fastest toward commodity. Frontier capability that was a research artifact in 2023 is now something you rent by the token, with multiple interchangeable providers. When a component commoditizes, it stops being a source of differentiation. Nobody wins by having access to a good model anymore, because everyone does.

The harness, by contrast, is still in the custom-built-to-product transition. Retrieval pipelines, review workflows, the orchestration that chains agent steps into a department's actual process: these are still being assembled by hand, per team, with no settled pattern. That is exactly where value accrues, because that is where the work is non-obvious and the outcomes vary wildly between teams that do it well and teams that don't.

The internal-usage spread inside one company makes this visible. Same models. Same infrastructure budget. Same talent pool. Yet Research's usage multiple was more than four times Legal's (56x versus 13x). The variable that differs across those departments is not the model they had access to. It is how much usable harness they built and how much they trusted the result.

For anyone mapping the ecosystem, the implication is direct. The interesting tooling layer for the next two years is not another model wrapper. It is the orchestration, retrieval, and review infrastructure that turns a capable model into a dependable colleague. That layer is where the evolution pressure is, and where the margin still lives.

Trust is the second half of the constraint, and the law is about to enforce it

Harness solves the mechanical problem of connecting a model to a workflow. Trust solves the harder one: deciding the output is good enough to act on without a human re-checking every line. A department only scales agent usage 30x when it stops treating each output as suspect. That decision is organizational and, increasingly, legal.

The legal dimension just got sharper. Writing on a recent German ruling that held Google liable for errors in its AI overviews, Bruce Schneier argues that AI agents are agents of the organization that deploys them and should be treated by the law as such. If a company hired human writers and they produced inaccurate summaries, the company would be liable. Letting it hide behind faulty AI in the same circumstance, he writes, would be a massive handout and would introduce disastrous incentives.

That principle, if it holds, prices trust explicitly. The output of a deployed agent is the deploying organization's output, full stop. Which means the trust decision a department makes when it scales usage is also a liability decision. Research and Customer Support didn't just build harness. They accepted ownership of what their agents produce.

This reframes the Autonomy Spectrum. Deployments run from copilot to full autonomy, and most failures come from sitting at the wrong point. The liability framing says the right point is not a technical question about model accuracy. It is a question about how much consequence the organization is willing to own for a given task. A support reply drafted by an agent and approved by a human sits at a defensible point. The same reply sent autonomously, with the company liable for every error, sits somewhere very different.
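One way to make "the right point on the spectrum is a per-task ownership decision" concrete is a small policy table. The sketch below is an assumption-laden illustration: the three levels, the task names, and the ceilings in `MAX_AUTONOMY` are all hypothetical, standing in for whatever an organization decides it is willing to own.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    COPILOT = 1         # agent suggests, human writes
    HUMAN_APPROVED = 2  # agent drafts, human approves before sending
    AUTONOMOUS = 3      # agent sends without review


# HYPOTHETICAL per-task ceilings: how much consequence the org accepts per task.
MAX_AUTONOMY = {
    "internal_summary": Autonomy.AUTONOMOUS,
    "support_reply": Autonomy.HUMAN_APPROVED,
    "legal_memo": Autonomy.COPILOT,
}


def may_send(task: str, requested: Autonomy, human_approved: bool) -> bool:
    """Allow a send only within the task's ceiling; unknown tasks get the strictest one."""
    ceiling = MAX_AUTONOMY.get(task, Autonomy.COPILOT)
    if requested > ceiling:
        return False           # asking for more autonomy than the org will own
    if requested == Autonomy.AUTONOMOUS:
        return True            # within ceiling and no review required
    return human_approved      # every lower level needs a human sign-off


print(may_send("support_reply", Autonomy.AUTONOMOUS, human_approved=False))
print(may_send("support_reply", Autonomy.HUMAN_APPROVED, human_approved=True))
```

Nothing in the table depends on model accuracy. Changing a ceiling is a decision about consequence, which is why it belongs to the department and not to the model vendor.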

The departments scaling fastest made that call deliberately, task by task. That is the trust work, and it is not something a better model does for you.

The Shadow Agent Problem scales with the same curve

There is a darker reading of a 56x usage jump, and it deserves a paragraph. Rapid, department-led adoption is exactly the pattern that produces the Shadow Agent Problem: agents wired into real workflows by individuals or teams without central approval, carrying the same risk as Shadow IT but with broader system access.

The OpenAI data describes adoption that is plainly bottom-up. Different departments scaling at wildly different rates is not the signature of a centrally mandated rollout. It is the signature of teams independently finding use cases and running with them. That is healthy for discovery and dangerous for governance, often at the same time.

When a Research team builds its own retrieval-and-review harness to feed an agent, the question that follows is what that agent can reach. The same week OpenAI's numbers landed, a GitHub advisory described a credential-exfiltration chain in which a low-privilege user with a fresh SSO account could turn a certificate-management tool into an AWS credential-theft vector, partly because the system auto-activated new identities with no admin approval. The failure there is not exotic AI behavior. It is ordinary over-permissioned access, the kind that proliferates when capable tooling is deployed faster than governance can keep up.

That is the tension inside the good news. The same organizational dynamism that lets Research scale 56x is the dynamism that installs agents nobody approved, granted permissions nobody audited, against data nobody classified. The trust decision and the security decision are the same decision viewed from two angles. Departments that made the trust call well, scoping access, inserting review, owning the output, also closed the shadow-agent gap. Departments that only chased the usage number opened it.
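Scoping access and requiring approval can be shown in a few lines. The sketch below is a generic default-deny pattern, not a description of the tool in the advisory: `AgentIdentity`, `approve`, `can_access`, and the resource names are hypothetical. The one behavior it illustrates is that a new identity can reach nothing until someone explicitly approves it and lists its scopes.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class AgentIdentity:
    name: str
    approved_by: Optional[str] = None            # None means not yet activated
    scopes: Set[str] = field(default_factory=set)


def approve(identity: AgentIdentity, admin: str, scopes: Set[str]) -> None:
    """Activation is an explicit admin action that also fixes the scope list."""
    identity.approved_by = admin
    identity.scopes = set(scopes)


def can_access(identity: AgentIdentity, resource: str) -> bool:
    """Default deny: unapproved identities and unlisted resources are refused."""
    return identity.approved_by is not None and resource in identity.scopes


# Hypothetical agent and resource names, for illustration only.
agent = AgentIdentity("research-retriever")
before = can_access(agent, "kb:papers")            # not yet approved
approve(agent, "admin@example.com", {"kb:papers"})
after = can_access(agent, "kb:papers")             # approved and in scope
out_of_scope = can_access(agent, "aws:credentials")
print(before, after, out_of_scope)
```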

The lesson is not to slow down. It is that the deployment constraint and the security constraint are the same constraint, and you cannot solve one while ignoring the other.

What this means for the tools you actually choose

If the binding constraint is harness and trust rather than raw capability, the buying logic for agent tooling inverts. The question stops being which platform has the smartest model and becomes which platform makes deployment and verification easiest for a non-coding team.

That favors a specific kind of product. Observability and tracing tooling, the layer that lets you see what an agent actually did across a run, becomes load-bearing rather than optional. The continued, granular release cadence of agent-observability platforms like Langfuse, shipping trace-level features such as subtree wall-clock duration, is a tell. You don't build that depth of inspection tooling for a market that trusts its agents blindly. You build it for a market that has decided it owns the output and needs to prove what happened.
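To give a sense of what a trace-level metric like subtree wall-clock duration involves, here is a minimal sketch under one plausible definition: the latest end time minus the earliest start time across a span and all of its descendants. This is an assumption for illustration and not Langfuse's implementation or API; `Span` and `subtree_wall_clock` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    start: float  # seconds since the trace began
    end: float


def subtree_wall_clock(spans: List[Span], root_id: str) -> float:
    """Latest end minus earliest start over the root span and all its descendants."""
    by_id = {s.span_id: s for s in spans}
    children: Dict[str, List[Span]] = {}
    for s in spans:
        if s.parent_id is not None:
            children.setdefault(s.parent_id, []).append(s)
    earliest, latest = float("inf"), float("-inf")
    stack = [by_id[root_id]]
    while stack:  # iterative depth-first walk of the subtree
        span = stack.pop()
        earliest = min(earliest, span.start)
        latest = max(latest, span.end)
        stack.extend(children.get(span.span_id, []))
    return latest - earliest


trace = [
    Span("run", None, 0.0, 1.0),
    Span("retrieve", "run", 0.5, 2.5),  # outlives its parent span
    Span("llm", "retrieve", 1.0, 4.0),
    Span("other", None, 0.0, 9.0),      # separate tree, ignored
]
print(subtree_wall_clock(trace, "run"))
```

The root span alone reports one second, while the work it triggered ran for four. That gap is the kind of thing a team needs to see once it owns the output.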

For the power user choosing between OpenClaw, Claude Managed Agents, Hermes, or Paperclip, the practical screen shifts accordingly. Ask less about benchmark scores and more about the harness questions:

Can a non-coder wire this into an existing workflow without a platform team?
Where does a human review checkpoint sit, and can you move it along the autonomy spectrum per task?
What can the agent reach, and can you scope that down to the minimum?
When something goes wrong, can you reconstruct exactly what the agent did?

None of those are model questions. All of them are deployment questions, and they are the ones that separated Research from Legal inside a single company with identical model access.

The market is still pricing agents on capability because capability is legible: it benchmarks, it demos, it headlines. Deployment and trust are illegible by comparison, which is exactly why they are where the advantage now sits. OpenAI's own token data is the clearest evidence yet that the teams winning with agents are not the ones with the best model. They are the ones who did the unglamorous work of connecting it to the world and deciding to trust the result.

/Figures

Median internal Codex output token growth by department (since Nov 2025)

Research: 56x growth
Customer Support: 32x growth
Engineering: 27x growth
Legal: 13x growth

Source: OpenAI Economic Research, per Latent Space. The fastest-growing functions are not the coding-adjacent ones.

/Key Takeaways

OpenAI's internal Codex usage grew 56x in Research and 32x in Customer Support since November 2025, outpacing Engineering's 27x. The fastest-growing functions are the ones least suited to automation on a pure-capability read.
If capability were the binding constraint, coding-adjacent work would lead adoption. It doesn't. The constraint moved to harness (connecting the model to a workflow) and trust (owning the output).
Coding agents inherit a mature trust harness for free: repos, tests, CI, code review. Research and Support had to build that scaffolding themselves, and the ones that did pulled ahead.
Schneier's read of the German liability ruling prices trust explicitly: a deployed agent's output is the deploying organization's output, and its liability. That makes the trust decision a deliberate, per-task call.
Bottom-up adoption at these rates is the Shadow Agent Problem in motion. The deployment constraint and the security constraint are the same constraint viewed from two angles.
For tool selection, screen on deployment ergonomics and observability, not benchmark scores. The departments that won had identical model access and different harnesses.

Sources for this article

11 collected in pack · 4 cited & verified in body

This is the full source pack collected for the story, the pool the writer cites from, which is why the pack count can exceed the citations in the body. Tier labels reflect domain authority; freshness is re-checked daily. How each load-bearing claim is bound to this pack is itemized in the claims panel below.

[AINews] OpenAI reports median internal Codex output tokens grew 56x in Research, 32x in Customer Support, 27x in Engineering, and 13x in Legal since November 2025. · www.latent.space · Reputable
AI and Liability · simonwillison.net · Reputable
Release: datasette-export-database 0.3a2 · simonwillison.net · Reputable
The Sequence AI of the Week #883: Qwen is Getting Into Robotics · thesequence.substack.com · Community
An Interview with Figma CEO Dylan Field About Design and AI · stratechery.com · Reputable
The Sequence Knowledge #882: A New Series About Distillation · thesequence.substack.com · Community
My Vibe Coding Adventure, The App and the Experience, Ten Takeaways · stratechery.com · Reputable
Memory Chips and China, Microsoft and Chinese Models · stratechery.com · Reputable
CVE-2026-55166 - GitHub Advisory Database · github.com · Official
Release v3.199.0 · langfuse/langfuse · github.com · Reputable
CVE-2026-48713 - GitHub Advisory Database · github.com · Official

Load-bearing claims

The writer flagged these claims as load-bearing. Where a cited source supports the claim, the row links out to it; confidence labels reflect how directly the source backs the assertion. We surface unverified claims honestly rather than hide them.

5 confirmed · 2 analysis

5/5 bound to a pack source

Confirmed
OpenAI reports median internal Codex output tokens grew 56x in Research, 32x in Customer Support, 27x in Engineering, and 13x in Legal since November 2025.
[AINews] OpenAI reports median internal Codex output tokens grew 56x in Research, 32x in Customer Support, 27x in Engineering, and 13x in Legal since November 2025.
Confirmed
Through August 2025, the average OpenAI internal worker spent less than 10% of their tokens on Codex, and Research posted the largest subsequent jump.
[AINews] OpenAI reports median internal Codex output tokens grew 56x in Research, 32x in Customer Support, 27x in Engineering, and 13x in Legal since November 2025.
Analysis
The departments that scaled agent usage hardest are the ones whose work least resembles coding, which contradicts a capability-gated adoption model.
Analysis
Coding agents inherit a mature trust-and-verification harness (repos, tests, CI, code review) that non-coding functions like Research and Support had to build themselves.
Confirmed
Bruce Schneier argues, citing a German ruling holding Google liable for AI overview errors, that AI agents are agents of the deploying organization and should be treated by law as such.
AI and Liability
Confirmed
A GitHub advisory described a credential-exfiltration chain where a low-privilege user with a fresh SSO account could turn a certificate tool into an AWS credential-theft vector, partly due to auto-activated identities with no admin approval.
CVE-2026-55166 - GitHub Advisory Database
Confirmed
Langfuse continues shipping granular agent-observability features such as trace subtree wall-clock duration in release v3.199.0.
Release v3.199.0 · langfuse/langfuse
