On the product side, Perplexity, Manus, and Cursor are all racing to give agents a computer. On the research side, the new benchmarks assume one. The interesting companies sit underneath both.
For most of the last two years, the question that defined an agent product was which model it called. That question is getting boring. The more interesting question, the one the entire stack is quietly reorganizing around, is where the agent actually runs. Latent Space's recent interview with Daytona's Ivan Burazin frames the shift bluntly: on the product side, everyone is getting Computer. Perplexity, Manus, Cursor, and a growing list of others are wiring their agents directly into a runtime environment that can open browsers, execute shell commands, edit files, and persist state across long tasks.
Meanwhile, the research side has already moved. Agentic evaluations like TerminalBench and GDPVal don't test whether a model can reason about a computer; they assume it has one. The benchmark IS the computer.
What this leaves behind is a new layer of the stack that nobody had a name for eighteen months ago and that suddenly looks like the most defensible territory in the entire ecosystem. Burazin calls it the end of localhost. A more functional description: the execution layer. It's the part of the agent that isn't the model, isn't the harness, and isn't the user interface. It's the sandboxed, networked, stateful environment the agent thinks IN. And it's where the next round of consolidation is happening.
The product side stopped asking 'which model' and started asking 'which computer'
Look at the agent products people are actually paying for in 2026 and a pattern jumps out. Perplexity's agent doesn't just answer questions; it browses, clicks, and fills forms. Manus runs autonomous workflows inside a persistent environment. Cursor's background agents check out branches, run tests, and report back. The Daytona conversation flags all three by name as part of the same wave: 'everyone is getting Computer.' The framing matters. None of these products are differentiating on raw model intelligence anymore. They've largely converged on the same handful of frontier models underneath. What they're differentiating on is what the agent can DO between the moment you give it an instruction and the moment it returns an answer. That gap, which used to be a few hundred milliseconds of token generation, is now sometimes hours of actual work happening on an actual machine.
This is what the Harness Hypothesis predicted: the value isn't in the model, it's in the harness that connects the model to the world. But 2026 has produced a sharper version of that claim. The harness itself has split into two layers. There's the orchestration logic (what tools to call, in what order, with what context) and there's the execution environment (where those tools actually run). The first layer is increasingly commoditized through frameworks like LangChain, whose recent releases continue to focus on test infrastructure and provider integration rather than novel primitives. The langchain-openai 1.2.2 release, for instance, is mostly dependency bumps and bug fixes, the signature of a layer that's stabilized into plumbing. The second layer, the execution environment, is where the new money is going.
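The split between orchestration and execution can be sketched in a few lines. This is only an illustration, not how any named product is built: `ExecutionEnvironment`, `LocalSubprocessEnvironment`, and `orchestrate` are hypothetical names, and a local subprocess stands in for what would really be a remote sandbox.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Protocol


class ExecutionEnvironment(Protocol):
    """Hypothetical interface for the layer where tool calls actually run."""

    def run(self, argv: List[str]) -> str: ...


@dataclass
class LocalSubprocessEnvironment:
    """Placeholder environment: runs commands as local subprocesses.

    A real execution layer would be a remote, isolated sandbox.
    """

    timeout_s: float = 10.0

    def run(self, argv: List[str]) -> str:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=self.timeout_s, check=True
        )
        return result.stdout.strip()


def orchestrate(plan: List[List[str]], env: ExecutionEnvironment) -> List[str]:
    """Orchestration layer: decides what to run and in what order.

    It never touches the machine itself; it only talks to `env`.
    """
    return [env.run(step) for step in plan]


if __name__ == "__main__":
    plan = [[sys.executable, "-c", "print(1 + 1)"]]
    print(orchestrate(plan, LocalSubprocessEnvironment()))
```

Because `orchestrate` depends only on the interface, the same plan could in principle be pointed at a different environment without changing the orchestration code, which is the sense in which the two layers can be bought and sold separately.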
For the reader who configures agents but doesn't build them, the practical effect is simple. The agent products that feel meaningfully better in 2026 versus 2025 are not better because they switched models. They're better because they got a computer.
The research side gave the game away first
If you want to know where a product category is heading, watch what the benchmarks assume. The Daytona piece highlights two evaluations by name: TerminalBench and GDPVal. Together they mark a quiet but decisive shift. These aren't text benchmarks with tool-use bolted on. They're environments that presume the model can operate a computer end-to-end, with file systems, processes, network access, and the consequences that come with all three. Harbor, the framework referenced in the same context, is described as assuming a computer as a baseline rather than a feature.
This is a much bigger deal than it sounds. Benchmarks shape what labs optimize for, and labs shape what products inherit. When the benchmark for 'is this agent useful' becomes 'can it complete a multi-step task inside a real Linux environment,' every model trained to chase that benchmark inherits the assumption that a computer is part of the deal. The model layer doesn't just call out to a sandbox; it's trained against scenarios that look like sandboxes. The execution layer becomes part of the contract.
Meanwhile, the SDKs the labs ship are picking up primitives that only make sense in this context. The Anthropic SDK's 0.104.0 release added support for a thinking-token-count beta covering estimated tokens in thinking block deltas during streaming. That's an esoteric detail, but it points in the same direction: the surface area being instrumented is no longer the prompt and the response. It's the long-running, partially visible internal process of an agent doing work over time. You don't add streaming token counts to thinking deltas unless your customers' agents are thinking for long enough that the counts matter.
The research and product sides are converging on the same shape from opposite directions. Researchers want environments that produce realistic agent traces. Products want environments that produce realistic agent outcomes. Both need the same thing underneath: a computer the agent can actually use, that someone else has to build and operate.
The new infrastructure unicorns are not where the last ones were
Latent Space's recent unicorn roundup names Exa, Modal, and TurboPuffer as the current generation of breakout AI infra companies. None of them are model labs. None of them are agent frameworks. They sit underneath both, providing search, compute execution, and vector storage respectively. The Daytona conversation positions itself adjacent to that list, framing the consolidating LLM OS stack as 'a standard toolkit' that a small set of infrastructure companies are riding.
This is a category shift worth dwelling on. In 2023 and 2024, the AI infrastructure conversation was dominated by vector databases, prompt orchestration, and observability. Those companies are still real and still growing. Langfuse, for example, continues to ship aggressive feature releases (v3.175.0 added observation type filters and exposed trace_context fields in the public API), which suggests healthy product velocity in the observability layer. But the new entrants in the unicorn tier are doing something different. They're providing the substrate the agent operates ON, not the dashboard the developer watches it FROM.
Read in Wardley Mapping terms, the components have shifted positions on the evolution axis. Prompt orchestration moved from genesis to product in roughly eighteen months. Observability is moving from product to commodity. Execution environments (sandboxes, browser pools, persistent VMs) are still close to the custom-built end of the axis, but with an obvious commodity destination. The companies that show up as unicorns now are the ones that saw that destination early and started industrializing the supply.
What's notable is who ISN'T on the list. There are no agent-framework unicorns in the new cohort. The orchestration layer is producing healthy open source projects and steady commercial businesses, but it's not producing the breakout outcomes. The execution layer is.
Why 'the end of localhost' is the right slogan for this
Burazin's recurring phrase, the end of localhost, captures something important about why this is structurally different from prior dev-tools waves. For two decades, the default mental model of computing for any technical user has been: there's a machine in front of me, and the software runs on it. Cloud changed where the machine lives but not the model. You still SSH into something, you still have a shell, you still think in terms of a singular environment you're operating in.
Agents break this. An agent task might spin up a fresh environment, do an hour of work, commit some artifacts, tear the environment down, and never touch a 'computer' the user can point at. Multiply that by thousands of concurrent agent runs across a single product, and the unit of computing stops being the machine. It becomes the task. Each task gets its own ephemeral computer, lives for as long as the task lives, and disappears.
The infrastructure implications are significant. You need fast cold starts because tasks are short. You need strong isolation because tasks are untrusted by default (the agent is acting on instructions that may include prompt injection). You need persistent state that outlives any individual environment because tasks chain. You need observability that's keyed to the task, not the machine, because the machine is gone by the time anyone goes looking. This is a different shape from VMs, from containers, and from serverless. It's closer to what serverless wanted to be but never quite delivered: genuinely ephemeral, genuinely stateless-by-default, genuinely priced per unit of work.
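To make the task-as-unit lifecycle concrete, here is a minimal sketch using only the Python standard library. It models the shape (create, work, tear down, keep a task-keyed trace) and nothing more: a temporary directory provides no real isolation, and `run_task` and `artifact_store` are hypothetical names, not any provider's API.

```python
import json
import subprocess
import sys
import tempfile
import time
import uuid
from pathlib import Path
from typing import List


def run_task(argv: List[str], artifact_store: Path) -> dict:
    """Run one task in a throwaway working directory and keep only its trace.

    Assumption: a temp directory stands in for an ephemeral sandbox.
    It models the lifecycle only and isolates nothing.
    """
    task_id = uuid.uuid4().hex
    started = time.time()
    with tempfile.TemporaryDirectory(prefix=f"task-{task_id}-") as workdir:
        result = subprocess.run(
            argv, cwd=workdir, capture_output=True, text=True, timeout=30
        )
        # Record what the task left behind before the directory is destroyed.
        files = sorted(p.name for p in Path(workdir).iterdir())
    trace = {
        "task_id": task_id,
        "argv": argv,
        "exit_code": result.returncode,
        "stdout": result.stdout,
        "files_created": files,
        "duration_s": round(time.time() - started, 3),
        "workdir_exists_after": Path(workdir).exists(),
    }
    # Persistent state outlives the environment and is keyed by task, not machine.
    artifact_store.mkdir(parents=True, exist_ok=True)
    (artifact_store / f"{task_id}.json").write_text(json.dumps(trace, indent=2))
    return trace


if __name__ == "__main__":
    store = Path(tempfile.mkdtemp(prefix="artifact-store-"))
    code = "open('out.txt', 'w').write('hi'); print('done')"
    print(run_task([sys.executable, "-c", code], store))
```

By the time anyone reads the trace, the working directory no longer exists, so the JSON file keyed by `task_id` is the only record of the run.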
The Capability vs. Controllability Frontier is relevant here too. More capable agents need more capable environments to act in, but more capable environments are harder to keep safely isolated. Every infrastructure provider in this space is making an explicit trade-off about where on that frontier they sit. Tight isolation with limited capability (browser-only sandboxes) sits at one end; full Linux VMs with broad network access sit at the other. The market will probably support several positions, and the winners at each position will be the ones who industrialize their corner of the frontier first.
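The two ends of that frontier can be written down as policies. The sketch below is purely illustrative: `SandboxPolicy`, `BROWSER_ONLY`, `FULL_VM`, and `permits` are hypothetical names, and the allowed host is a placeholder rather than any provider's real configuration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical description of what an environment lets an agent do."""

    name: str
    allow_shell: bool
    allowed_hosts: Optional[FrozenSet[str]]  # None means any host is reachable


# Tight isolation, limited capability.
BROWSER_ONLY = SandboxPolicy(
    "browser-only", allow_shell=False, allowed_hosts=frozenset({"example.com"})
)
# Broad capability, much harder to keep contained.
FULL_VM = SandboxPolicy("full-vm", allow_shell=True, allowed_hosts=None)


def permits(policy: SandboxPolicy, action: str, target: str) -> bool:
    """Return True if the policy allows the action; unknown actions are denied."""
    if action == "shell":
        return policy.allow_shell
    if action == "http":
        host = urlparse(target).hostname
        return policy.allowed_hosts is None or host in policy.allowed_hosts
    return False


if __name__ == "__main__":
    for policy in (BROWSER_ONLY, FULL_VM):
        print(policy.name, permits(policy, "shell", "ls"))
```

The default-deny branch at the end is the design choice that matters: a new kind of action gets no access until someone decides where it sits on the frontier.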
Datasette Agent is the small version of the same story
Not every example of this shift is a unicorn. Simon Willison's recent Datasette Agent announcement is, in miniature, the same pattern. Datasette Agent is described as an extensible AI assistant for Datasette, providing a conversational interface to data stored in the system, with a plugin (datasette-agent-charts) that lets it generate charts directly. Willison frames it as the moment his LLM Python library and Datasette finally come together.
What's worth noticing is the architecture this implies. The agent isn't a chat window that calls Datasette's API from outside; it's an agent that runs inside the Datasette environment, with direct access to the data, the query engine, and the plugin system. The 'computer' the agent gets isn't a generic Linux box. It's Datasette itself. The execution layer, for this product, is the application. This is the pattern reproducing fractally. Every meaningful agent product is converging on the same answer to a slightly different question: what environment should this agent natively inhabit?
For a generic agent doing arbitrary work, the answer is a sandboxed VM or browser session provided by infrastructure companies. For a domain-specific agent doing data analysis, the answer is the data tool itself. For a coding agent, the answer is the developer environment (which is exactly where Cursor and similar products land). The common thread is that 'agent' increasingly names a behavior that's deeply coupled to an environment, not a feature that can be bolted onto any chat interface.
The Aggregation Theory implication is uncomfortable for general-purpose agent platforms. If the most useful agent for a given task is the one that natively inhabits that task's environment, then domain-specific tools have a structural advantage in their own domain, and horizontal agent platforms have to compete by either owning multiple domains (expensive) or by being the substrate domain-specific agents run on (which is what Daytona, Modal, and the new infrastructure cohort are betting they can be).
What this resets for the rest of the stack
If the execution layer is where the next round of consolidation happens, several other layers have to reposition. The model layer keeps producing capability but extracts a shrinking share of the total value created by each agent task. This is the Commoditize Your Complement playbook at scale: the more powerful the surrounding stack, the more the model becomes infrastructure for someone else's product. The frontier labs know this, which is why they're all building first-party agent products of their own.
The framework layer (LangChain and its peers) is settling into the role of universal glue. The recent LangChain releases (langchain-openai 1.2.2, langchain-tests 1.1.9) read like a project optimizing for stability and breadth of provider support rather than aggressive feature expansion. That's appropriate for the layer's market position; once you're the default plumbing, the goal is to be reliable plumbing. The same is true of the Vercel AI SDK, whose recent releases (ai@6.0.190, ai@6.0.189) are patch-level changes around UI message parts and gateway dependencies. These are mature products doing mature-product work.
The observability layer is where the next interesting fight happens. If tasks are the unit of computing and environments are ephemeral, then traces become the primary artifact of an agent's existence. The agent's logs ARE the agent, for any retrospective purpose. Langfuse's continued investment in observation filtering and trace context APIs suggests this layer understands its own importance. But observability vendors will increasingly be measured by how well they integrate with the execution layer's native instrumentation, not by how much they can capture from outside it.
Meanwhile, the security story changes shape entirely. The Shadow Agent Problem becomes more acute when agents are running in environments the user can't see and didn't provision. An employee using a hosted agent product is, in effect, granting a third-party execution environment access to whatever the agent can reach. The FTC's recent settlement with Cox Media Group and two other firms over deceptive 'active listening' marketing claims is a useful reminder that regulators are paying attention to AI products that overstate what they're doing or where data is going. The execution layer makes those questions harder, not easier, because the user's mental model (a chat happens, an answer comes back) hides an enormous amount of activity that occurred on someone else's machine.
The bet to watch
The interesting bet hiding inside all of this is whether the execution layer consolidates around a small number of horizontal infrastructure providers (the Modals and Daytonas of the world) or whether it fragments into vertical, domain-specific environments (the Datasette Agents, the Cursors, the data-warehouse-native agents). The honest answer is probably both, in a Bowling Alley sequence. Horizontal providers win the generic case and the long tail. Vertical environments win the domains where the data, the workflow, and the user's expectations are tightly coupled enough that a general-purpose sandbox can't compete.
What's clear is that the framing 'which model does this agent use' is no longer the load-bearing question for a 2026 buyer evaluating agent products. The questions that matter now are different. Where does the agent run? What can it touch? How long can it run for? What happens to the environment when it's done? Who can see the trace? Those are infrastructure questions, and they're being answered by an infrastructure layer that didn't have a clear name a year ago.
The newsroom-relevant version of this: when an agent product ships a new capability and the press release talks about model upgrades, look underneath. The interesting change is almost always somewhere in the execution layer. That's where the work is happening now, and that's where the next several billion dollars of enterprise value is going to accrue.
Sources
- Giving Agents Computers — Ivan Burazin, Daytona
- [AINews] New AI Infra unicorns: Exa, Modal, TurboPuffer
- Datasette Agent
- Release v0.104.0 · anthropics/anthropic-sdk-python
- Release langchain-openai==1.2.2 · langchain-ai/langchain
- Release langchain-tests==1.1.9 · langchain-ai/langchain
- Release ai@6.0.190 · vercel/ai
- Release ai@6.0.189 · vercel/ai
- Release v3.175.0 · langfuse/langfuse
- FTC Active Listening Settlement
Key Takeaways
- The defining question for agent products in 2026 is no longer which model they use but which execution environment they run in.
- Perplexity, Manus, and Cursor are converging on the same architecture: an agent paired with a real computer it can operate over long tasks.
- Agentic benchmarks like TerminalBench and GDPVal assume a computer as baseline, which forces the model layer to inherit that assumption too.
- The new AI infrastructure unicorns (Exa, Modal, TurboPuffer, and adjacent players like Daytona) sit underneath both models and frameworks, providing the substrate agents act on.
- Framework layers like LangChain and the Vercel AI SDK are settling into mature plumbing; the velocity has moved beneath them.
- Domain-specific agent products (Datasette Agent, Cursor) suggest the execution layer will be both horizontal and vertical, not one or the other.
- When evaluating an agent product, look past the model and ask where the agent runs, what it can touch, and who owns the trace.


