The fastest-growing agent workloads inside OpenAI are not coding. That ordering tells you where the real constraint sits, and it isn't the model.
The consensus story about AI agents is a capability story. Bigger models, longer context, better reasoning, and adoption follows. It is a tidy narrative because it makes the model the protagonist and everyone else a passenger waiting for the next checkpoint.
OpenAI's own internal data complicates it. Per a report from OpenAI Economic Research, median internal Codex output tokens grew 56x in Research, 32x in Customer Support, 27x in Engineering, and 13x in Legal since November 2025. Read the ordering carefully. The department that scaled agent usage hardest was not Engineering, the one whose work most resembles what these tools were built to do. It was Research, followed by Customer Support. Two functions where the work is messier, the success criteria are softer, and the path to automation is less obvious.
That ordering is the story. If capability were the binding constraint, coding-adjacent functions would lead, because that is where the models are strongest and the harness is most mature. Instead the lead changes hands. The departments pulling ahead are the ones that figured out how to wire an agent into their actual workflow and trust the output enough to depend on it.
That is not a capability problem. It is a deployment-and-trust problem. And it relocates the scarce resource in the agent economy from the lab to the org chart.
The growth ordering contradicts the capability story
Start with what the numbers actually say. OpenAI reports that through August 2025, the average internal worker spent less than 10% of their tokens on Codex. Over the following six months, usage deepened across departments, with Research posting the largest jump.
If you believe agent adoption is gated by model capability, you would predict a specific shape: the functions whose work most closely resembles the training target (code) should adopt fastest, because that is where the tool clears the reliability bar first. Engineering should lead. Everything else trails as the models slowly get good enough to handle fuzzier tasks.
The data shows the opposite ordering. Research grew 56x. Customer Support grew 32x. Engineering grew 27x. The two functions that scaled hardest are the ones where the work is least like writing a unit test and where, on a pure capability read, the tool should have been least ready.
There is a tempting dismissal here: Engineering started from a higher base, so its multiple looks smaller. Possibly. But that cuts against the capability thesis rather than rescuing it. If Engineering was already saturated with agent usage by late 2025, then the model was already good enough for the easy case a long time ago, and the interesting growth, the new ground being taken, is happening everywhere else. Either way the frontier of adoption is not the coding frontier.
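The base-effect objection is easy to make concrete. The sketch below uses the four multiples reported above, but the baselines are invented purely for illustration (the report as quoted here gives multiples, not starting levels). It shows how a department can rank third by multiple and first by absolute tokens added, which is why the multiple is the better signal of where new ground is being taken.

```python
# Growth multiples as quoted from the OpenAI Economic Research report.
REPORTED_MULTIPLES = {
    "Research": 56,
    "Customer Support": 32,
    "Engineering": 27,
    "Legal": 13,
}

# HYPOTHETICAL starting levels in arbitrary units. These are NOT OpenAI figures;
# they only encode the assumption that Engineering began from a higher base.
ASSUMED_BASELINE = {
    "Research": 10,
    "Customer Support": 10,
    "Engineering": 100,
    "Legal": 10,
}


def absolute_gain(department: str) -> int:
    """Units added over the period: baseline * (multiple - 1)."""
    return ASSUMED_BASELINE[department] * (REPORTED_MULTIPLES[department] - 1)


gains = {dept: absolute_gain(dept) for dept in REPORTED_MULTIPLES}

# Rank departments two ways: by growth multiple and by absolute units added.
by_multiple = sorted(REPORTED_MULTIPLES, key=REPORTED_MULTIPLES.get, reverse=True)
by_absolute = sorted(gains, key=gains.get, reverse=True)

print("By multiple:", by_multiple)
print("By absolute gain:", by_absolute)
```

Under these assumed baselines Engineering adds the most raw volume while Research still posts the steepest climb. Both can be true at once, and neither rescues the claim that adoption tracks the coding frontier.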
The cleaner explanation is that capability stopped being the binding constraint some time ago, and what we are watching now is each department independently discovering how to deploy what already works.
The value was never in the model. It was in the harness.
This is where the Harness Hypothesis earns its keep: the value in AI isn't in the model, it's in the harness that connects the model to the world. A model that can draft a legal memo is worthless to a legal team until something feeds it the right documents, constrains its output to the right format, routes it past a reviewer, and files the result where the next person expects to find it.
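A minimal sketch of those four harness steps, in Python, makes the point that the model call is one line among many. Everything here is hypothetical: `retrieve`, `stub_model`, `conforms`, and `run_harness` are illustrative names, the retrieval is a toy keyword match, and the "model" is a stub standing in for whatever provider a team actually rents.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Draft:
    text: str
    sources: list[str]
    approved: bool = False


def retrieve(query: str, corpus: dict[str, str]) -> list[str]:
    """Toy retrieval: ids of documents sharing at least one word with the query."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, body in corpus.items()
            if terms & set(body.lower().split())]


def stub_model(query: str, documents: list[str]) -> str:
    """Stand-in for a real model call (hypothetical)."""
    return f"MEMO: {query} (based on {len(documents)} documents)"


def conforms(text: str) -> bool:
    """Format constraint: output must start with the expected header."""
    return text.startswith("MEMO:")


def run_harness(
    query: str,
    corpus: dict[str, str],
    model: Callable[[str, list[str]], str],
    reviewer: Callable[[Draft], bool],
    filing_cabinet: list[Draft],
) -> Draft:
    doc_ids = retrieve(query, corpus)                   # 1. feed it the right documents
    text = model(query, [corpus[d] for d in doc_ids])   # the model itself: one call
    if not conforms(text):                              # 2. constrain the output format
        raise ValueError("model output violates the required format")
    draft = Draft(text=text, sources=doc_ids)
    draft.approved = reviewer(draft)                    # 3. route it past a reviewer
    if draft.approved:
        filing_cabinet.append(draft)                    # 4. file it where it is expected
    return draft


corpus = {
    "nda-2024": "standard nda termination clause",
    "lease-7": "office lease renewal terms",
}
filed: list[Draft] = []
# Hypothetical reviewer rule: approve only drafts that cite at least one source.
draft = run_harness("termination clause", corpus, stub_model,
                    lambda d: bool(d.sources), filed)
print(draft.approved, draft.sources)
```

Swap the stub for any capable model and the rest of the function is unchanged, which is the hypothesis in miniature: the differentiating work sits in steps one through four.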
Research and Customer Support are interesting precisely because their harnesses are harder to build than Engineering's. A coding agent inherits a mature harness for free: the repository, the test suite, the CI pipeline, the pull-request review. Those are trust-and-verification structures that existed long before agents did. Drop a model into that environment and the scaffolding that catches its mistakes is already there.
Research has no equivalent. Customer Support has fragments of one (ticketing systems, knowledge bases) but nothing that validates an answer the way a test suite validates a function. For those functions to scale agent usage 56x and 32x respectively, somebody had to build the missing harness: the retrieval layer, the review checkpoint, the place where a human decides the output is good enough to ship.
The fact that they did, and did it faster than Engineering extended its existing lead, tells you the marginal effort moved. It moved out of the model and into the connective tissue around it. The departments winning are the ones investing in harness, not the ones waiting for a smarter model.
This is the part the capability narrative cannot see, because it treats everything around the model as a solved detail. It is not a detail. It is the work.
Where agent components sit on the evolution axis
A Wardley map clarifies why this is happening now and not eighteen months ago. Map the agent value chain from genesis to commodity and the components have been evolving at very different speeds.
The model has moved fastest toward commodity. Frontier capability that was a research artifact in 2023 is now something you rent by the token, with multiple interchangeable providers. When a component commoditizes, it stops being a source of differentiation. Nobody wins by having access to a good model anymore, because everyone does.
The harness, by contrast, is still in the custom-built-to-product transition. Retrieval pipelines, review workflows, the orchestration that chains agent steps into a department's actual process: these are still being assembled by hand, per team, with no settled pattern. That is exactly where value accrues, because that is where the work is non-obvious and the outcomes vary wildly between teams that do it well and teams that don't.
The internal-usage spread inside one company makes this visible. Same models. Same infrastructure budget. Same talent pool. Yet Research's growth multiple was more than four times Legal's (56x against 13x). The variable that differs across those departments is not the model they had access to. It is how much usable harness they built and how much they trusted the result.
For anyone mapping the ecosystem, the implication is direct. The interesting tooling layer for the next two years is not another model wrapper. It is the orchestration, retrieval, and review infrastructure that turns a capable model into a dependable colleague. That layer is where the evolution pressure is, and where the margin still lives.
Trust is the second half of the constraint, and the law is about to enforce it
Harness solves the mechanical problem of connecting a model to a workflow. Trust solves the harder one: deciding the output is good enough to act on without a human re-checking every line. A department scales agent usage 30x only when it stops treating each output as suspect. That decision is organizational and, increasingly, legal.
The legal dimension just got sharper. Writing on a recent German ruling that held Google liable for errors in its AI overviews, Bruce Schneier argues that AI agents are agents of the organization that deploys them and should be treated by the law as such. If a company hired human writers and they produced inaccurate summaries, the company would be liable. Letting it hide behind faulty AI in the same circumstance, he writes, would be a massive handout and would introduce disastrous incentives.
That principle, if it holds, prices trust explicitly. The output of a deployed agent is the deploying organization's output, full stop. Which means the trust decision a department makes when it scales usage is also a liability decision. Research and Customer Support didn't just build harness. They accepted ownership of what their agents produce.
This reframes the Autonomy Spectrum. Deployments run from copilot to full autonomy, and most failures come from sitting at the wrong point. The liability framing says the right point is not a technical question about model accuracy. It is a question about how much consequence the organization is willing to own for a given task. A support reply drafted by an agent and approved by a human sits at a defensible point. The same reply sent autonomously, with the company liable for every error, sits somewhere very different.
The departments scaling fastest made that call deliberately, task by task. That is the trust work, and it is not something a better model does for you.
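One way to picture that per-task call is as a policy that caps autonomy by consequence rather than by model accuracy. The sketch below is an assumption-laden illustration: the three autonomy levels, the 1-to-5 consequence scale, and the thresholds are all hypothetical choices, not anything drawn from OpenAI's report or Schneier's essay.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    COPILOT = 1            # human does the work, agent suggests
    DRAFT_AND_APPROVE = 2  # agent drafts, human approves before anything ships
    AUTONOMOUS = 3         # agent acts with no human in the loop


def max_autonomy(consequence: int) -> Autonomy:
    """Highest autonomy the organization accepts for a task.

    `consequence` is a HYPOTHETICAL 1-5 rating of how much liability the
    organization owns if the output is wrong (1 = trivial, 5 = severe).
    """
    if not 1 <= consequence <= 5:
        raise ValueError("consequence must be between 1 and 5")
    if consequence >= 4:
        return Autonomy.COPILOT
    if consequence >= 2:
        return Autonomy.DRAFT_AND_APPROVE
    return Autonomy.AUTONOMOUS


def is_defensible(requested: Autonomy, consequence: int) -> bool:
    """A deployment is defensible if it does not exceed the cap for its task."""
    return requested <= max_autonomy(consequence)


# A support reply, rated 3 here purely for illustration.
print(is_defensible(Autonomy.DRAFT_AND_APPROVE, consequence=3))  # drafted, then approved
print(is_defensible(Autonomy.AUTONOMOUS, consequence=3))         # sent with no review
```

Note what the function does not take as input: any measure of how good the model is. Under this framing the ceiling is set by what the organization is willing to own, and a better model only helps once that ceiling has been raised on purpose.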
The Shadow Agent Problem scales with the same curve
There is a darker reading of a 56x usage jump, and it deserves a paragraph. Rapid, department-led adoption is exactly the pattern that produces the Shadow Agent Problem: agents wired into real workflows by individuals or teams without central approval, carrying the same risk as Shadow IT but with broader system access.
The OpenAI data describes adoption that is plainly bottom-up. Different departments scaling at wildly different rates is not the signature of a centrally mandated rollout. It is the signature of teams independently finding use cases and running with them. That is healthy for discovery and dangerous for governance, often at the same time.
When a Research team builds its own retrieval-and-review harness to feed an agent, the question that follows is what that agent can reach. The same week OpenAI's numbers landed, a GitHub advisory described a credential-exfiltration chain in which a low-privilege user with a fresh SSO account could turn a certificate-management tool into an AWS credential-theft vector, partly because the system auto-activated new identities with no admin approval. The failure there is not exotic AI behavior. It is ordinary over-permissioned access, the kind that proliferates when capable tooling is deployed faster than governance can keep up.
That is the tension inside the good news. The same organizational dynamism that lets Research scale 56x is the dynamism that installs agents nobody approved, with permissions nobody audited, against data nobody classified. The trust decision and the security decision are the same decision viewed from two angles. Departments that made the trust call well (scoping access, inserting review, owning the output) also closed the shadow-agent gap. Departments that only chased the usage number opened it.
The lesson is not to slow down. It is that the deployment constraint and the security constraint are the same constraint, and you cannot solve one while ignoring the other.
What this means for the tools you actually choose
If the binding constraint is harness and trust rather than raw capability, the buying logic for agent tooling inverts. The question stops being which platform has the smartest model and becomes which platform makes deployment and verification easiest for a non-coding team.
That favors a specific kind of product. Observability and tracing tooling, the layer that lets you see what an agent actually did across a run, becomes load-bearing rather than optional. The continued, granular release cadence of agent-observability platforms like Langfuse, shipping trace-level features such as subtree wall-clock duration, is a tell. You don't build that depth of inspection tooling for a market that trusts its agents blindly. You build it for a market that has decided it owns the output and needs to prove what happened.
For the power user choosing between OpenClaw, Claude Managed Agents, Hermes, or Paperclip, the practical screen shifts accordingly. Ask less about benchmark scores and more about the harness questions:
- Can a non-coder wire this into an existing workflow without a platform team?
- Where does a human review checkpoint sit, and can you move it along the autonomy spectrum per task?
- What can the agent reach, and can you scope that down to the minimum?
- When something goes wrong, can you reconstruct exactly what the agent did?
None of those are model questions. All of them are deployment questions, and they are the ones that separated Research from Legal inside a single company with identical model access.
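The last two questions on that list, scoping what the agent can reach and reconstructing what it did, can be sketched in a few lines. This is a hypothetical illustration, not the API of any platform named above: `ScopedAgent`, its allowlist, and the tool names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ScopedAgent:
    """Hypothetical wrapper: an explicit tool allowlist plus an append-only audit log."""
    name: str
    allowed_tools: frozenset[str]
    audit_log: list[dict] = field(default_factory=list)

    def call(self, tool: str, argument: str) -> bool:
        """Record every attempted call; report whether it is within scope."""
        allowed = tool in self.allowed_tools
        self.audit_log.append(
            {"agent": self.name, "tool": tool, "argument": argument, "allowed": allowed}
        )
        return allowed


# Scope the agent down to the minimum it needs for drafting support replies.
agent = ScopedAgent("support-drafter", frozenset({"search_kb", "draft_reply"}))

agent.call("search_kb", "refund policy")   # within scope
agent.call("read_credentials", "prod")     # outside scope: denied, but still logged

# Reconstruct what happened: every attempt, including the denied one, is on record.
denied = [entry["tool"] for entry in agent.audit_log if not entry["allowed"]]
print(denied)
```

The design choice worth copying is that denial and logging happen in the same place. A platform that can answer "what can it reach" but not "what did it try" has only done half of the job.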
The market is still pricing agents on capability because capability is legible: it benchmarks, it demos, it headlines. Deployment and trust are illegible by comparison, which is exactly why they are where the advantage now sits. OpenAI's own token data is the clearest evidence yet that the teams winning with agents are not the ones with the best model. They are the ones who did the unglamorous work of connecting it to the world and deciding to trust the result.
Key Takeaways
- OpenAI's internal Codex usage grew 56x in Research and 32x in Customer Support since November 2025, outpacing Engineering's 27x. The fastest-growing functions are the ones least suited to automation on a pure-capability read.
- If capability were the binding constraint, coding-adjacent work would lead adoption. It doesn't. The constraint moved to harness (connecting the model to a workflow) and trust (owning the output).
- Coding agents inherit a mature trust harness for free: repos, tests, CI, code review. Research and Support had to build that scaffolding themselves, and the ones that did pulled ahead.
- Schneier's read of the German liability ruling prices trust explicitly: a deployed agent's output is the deploying organization's output, and its liability. That makes the trust decision a deliberate, per-task call.
- Bottom-up adoption at these rates is the Shadow Agent Problem in motion. The deployment constraint and the security constraint are the same constraint viewed from two angles.
- For tool selection, screen on deployment ergonomics and observability, not benchmark scores. The departments that won had identical model access and different harnesses.


