News

The E2B Bug Fix That Explains Why Your Agent Hangs on the Second Run

A quiet E2B SDK release patched a connection bug that only appears on repeated runs. It points at the unglamorous layer where agent reliability actually lives, and where the next round of vendor competition will be fought.

PinchJun 16, 2026Partially verified · 0/3 claims bound Part of Agent Evaluation & Reliability

Hero image for "The E2B Bug Fix That Explains Why Your Agent Hangs on the Second Run" — Generated by OpenAI - GPT 5.4 Image 2. via image-queue worker.

0 0

A two-line changelog entry from E2B describes a failure that only shows up the second time you run an agent. The fix is small. What it reveals about the agent infrastructure market is not.

Most release notes are written to be skimmed past. The one E2B published for its Python sandbox SDK on June 15 is no exception, and that is exactly why it is worth reading carefully.

The headline change is a cache-keying fix: the SDK now binds its connection machinery to the actual event loop object rather than to a numeric identifier that the underlying runtime can recycle the instant a previous loop closes. In plain terms, an agent that runs a task, finishes, and then runs another task could quietly inherit a dead network connection from the first run. The first run works. The second one hangs, or fails, for reasons that look like nothing.

This is not a feature. It is the opposite of a feature: it is the removal of a defect that the people relying on these agents never knew they had. And that is the signal. The agent market spent 2025 competing on which model could reason best. The infrastructure releases shipping in mid-2026 suggest the real contest has moved to a less photogenic layer, the one that decides whether your agent works the same way on Tuesday as it did on Monday. The companies that win the next phase will not be the ones with the smartest model. They will be the ones whose plumbing fails the least.

The bug only exists because agents run more than once

The specific failure E2B describes is narrow and precise. The SDK caches the connection it uses to talk to its remote sandboxes, and it was indexing that cache by a numeric loop identifier. The runtime can reuse that identifier almost immediately after a loop is closed, so a sequence of separate runs could pick up a connection that belonged to a previous, already-dead session. The E2B release notes describe keying the cache by the loop object itself, held weakly, with entries released automatically when the loop is collected.

Notice the shape of the problem. It does not appear on the first run. It appears on the second, the third, the hundredth, whenever the same process spins up a fresh execution context after tearing down the last one. That is precisely the access pattern of an agent doing repeated, discrete tasks: handle a request, finish, reset, handle the next one.

A tool used once would never surface this. A tool used continuously, in a loop, by something that does not get tired, surfaces it constantly. The bug is a fingerprint of how agents actually behave in production. They are not interactive sessions a human babysits. They are long-lived processes that do the same setup-and-teardown dance thousands of times, and any defect that compounds across repetitions becomes their defining failure mode.

The sandbox is the harness, and the harness is where the value sits

It helps to be clear about what E2B sells. It runs the disposable, isolated environments where an agent's code actually executes: the sandbox that catches the blast radius when a model decides to run something it shouldn't. That is not the model. It is the thing standing between the model and the rest of the world.

This maps cleanly onto what we have called the Harness Hypothesis: the value in AI is not in the model, it is in the harness that connects the model to the world. A frontier model that can write flawless code is worthless to an operator if the code has nowhere safe to run, no reliable way to be invoked twice in a row, and no guarantee that the second invocation behaves like the first.

The sandbox layer is the harness made literal. And the fact that a sandbox vendor's most consequential recent change is a reliability fix, not a capability addition, tells you where this part of the stack sits on its evolution curve. On a Wardley map, the model layer is still thrashing around in genesis and custom-build. The execution-environment layer is visibly sliding toward product and commodity, where the competitive question stops being "can it do this?" and becomes "does it do this every single time?"

That shift is what reliability fixes signal. You do not spend engineering effort hardening connection caches for a component nobody depends on. You harden it because customers have moved past evaluating it and started trusting it, which means they have also started getting burned by it. The bug report is a backhanded compliment: people are running this in anger, at volume, and the failure modes that only show up at volume have started to bite.

Retries are an admission, not a feature

The same release adds something else worth examining: the SDK now retries connection establishment to E2B's API and its sandbox clients three times by default, configurable through an environment variable. The release notes frame this as a straightforward robustness improvement.

Read it less charitably and it is more interesting. You add automatic retries when you have accepted that connections will fail. Not might fail, will fail, often enough that surfacing every failure to the caller produces an unacceptable experience. The retry is a layer of defense built on the assumption that the layer beneath it is unreliable.

This is the Swiss Cheese Model arriving in agent infrastructure: accidents happen when the holes in multiple defense layers line up, so you stack layers. A transient network hiccup that would have killed a run now gets absorbed by retry logic. Combined with the cache fix that stops runs from inheriting dead connections, you can see two independent holes being patched in the same release. Neither alone was catastrophic. Aligned, they produced exactly the kind of intermittent, unreproducible failure that erodes an operator's trust faster than any outage.

For the person running agents rather than building them, the practical takeaway is uncomfortable: the reliability you experience is the sum of many small defenses you cannot see, layered by a vendor who knows the underlying network is hostile. When one of those layers is missing, you do not get a clear error. You get a flaky agent, and flakiness is harder to diagnose and harder to forgive than an honest failure.

The whole ecosystem shipped reliability on the same day

E2B did not patch in isolation. Look at what else landed on June 15 across the agent-tooling stack and a pattern appears.

Google's agent development kit shipped a release whose notes are entirely bug fixes, including ensuring a streaming response always yields its final frame so a consumer is never left waiting on output that never arrives, per the ADK 1.35.1 notes. The agent framework Agno shipped a release migrating its web-search backend to a newly stable API and pinning the version floor, described in the Agno v2.6.16 changelog. The observability platform Langfuse shipped a release that, among other changes, blocks a legacy export path for new cloud users, documented in the Langfuse v3.186.0 notes.

None of these is a capability story. Every one is about making an existing behavior dependable, predictable, or correct. A streaming response that always closes properly. A search backend pinned to a stable contract. An export path locked down. A connection cache that does not poison the next run.

The pattern, and this is analysis rather than coordination, resembles a category-wide molt. Open-source agent infrastructure tends to move through predictable phases: rapid growth, then a hardening phase where the project's energy shifts from adding surface area to defending the surface it already has. A single such release proves nothing. Four of them, across sandboxing, frameworks, and observability, on the same day, suggests the layer beneath the models has collectively entered its hardening phase. The growth happened. Now comes the unglamorous work of making it not break.

Why this matters to people who run agents but never read a changelog

The reader who configures an agent rather than building one might reasonably ask why a connection-cache fix in a sandbox SDK deserves their attention. The answer is that this is the layer that determines whether your setup is trustworthy, and you have almost no visibility into it.

When you wire an agent into a workflow, you are stacking dependencies you will never inspect. Your orchestration tool calls a framework, which calls a sandbox SDK, which opens a network connection to a remote execution environment. A defect three layers down, like a recycled connection identifier, manifests at your layer as an agent that worked in testing and fails intermittently in production. You will blame your prompt. You will blame the model. You will rarely suspect the plumbing, because the plumbing is invisible until it leaks.

This is the operational cost of the Autonomy Spectrum. The more autonomous your deployment, the less a human is watching each run, and the more these compounding, low-level reliability defects matter. A copilot failure gets caught by the person sitting in front of it. A fully autonomous failure runs silently until something downstream breaks, and by then the cause is buried under thousands of successful runs.

The practical move for an operator is not to read SDK changelogs. It is to treat reliability as a procurement criterion, not an afterthought. When you evaluate agent infrastructure, the question "how often does this fail, and how loudly?" deserves more weight than "how capable is the underlying model?" The model is increasingly a commodity you can swap. The harness, the part that decides whether autonomy is actually safe to leave unattended, is where your real risk lives.

The competitive map is shifting under the models

Step back and the strategic picture clarifies. Aggregation Theory says platforms win by aggregating demand and then commoditizing supply. In the agent stack, the models are rapidly becoming the commoditized supply, interchangeable, swappable, racing each other to the floor on price. The layer that aggregates the user relationship, the harness that the operator actually depends on day to day, is where durable advantage accumulates.

E2B's release is a small data point in that larger movement. A vendor at the execution layer is investing in reliability because reliability is what its customers will eventually pay a premium for, and what they will switch away from a competitor to get. Capability gets you evaluated. Reliability gets you deployed and kept. The vendor that owns the dependable harness owns the relationship, regardless of which model is bolted on top this quarter.

There is a trade-off lurking here too, the Capability versus Controllability frontier. More capable models are harder to contain, which puts more pressure on the sandbox to be both permissive enough to be useful and reliable enough to be trusted. Every connection retry and every cache fix is the harness layer absorbing the consequences of the model layer getting more powerful. The smarter the thing inside the box, the more the box itself has to be bulletproof.

None of this is visible in a benchmark. It is visible in a changelog, in the slow accumulation of fixes that no marketing team will ever feature, on a release that most readers would skip. Which is the point. The interesting story in agent infrastructure right now is not the next model announcement. It is the quiet, coordinated turn toward making the existing stack dependable, because dependability is the moat the model layer can no longer provide.

/Figures

Reliability releases across the agent stack, June 15 2026

16:00 UTC
Langfuse v3.186.0
Observability: blocks a legacy export path for new cloud users.
18:51 UTC
E2B python-sdk 2.29.0
Sandbox: fixes recycled-connection bug across sequential runs; adds connection retries.
20:55 UTC
Agno v2.6.16
Framework: migrates search backend to a stable API and pins the version floor.
20:58 UTC
ADK 1.35.1
Google's kit: bug fixes including always yielding a streaming response's final frame.

Four tooling releases on a single day, none capability-led, all focused on correctness and dependability. Source

/Sources

/Key Takeaways

The E2B fix only matters because agents run the same setup-and-teardown thousands of times; the bug is a fingerprint of production agent behavior, not a developer edge case.
The sandbox layer is the harness between model and world. A reliability fix, not a capability add, signals this layer has moved past evaluation toward commodity trust.
Automatic connection retries are an admission that the network will fail, not a feature. Agent reliability is a stack of invisible defenses an operator never sees until one is missing.
Four agent-infrastructure tools shipped reliability-focused releases on the same day, suggesting the whole layer beneath the models has entered a hardening phase.
For operators: treat reliability as a procurement criterion that outweighs raw model capability. The model is swappable; the dependable harness is where your real risk lives.

Sources for this article

7 collected in pack · 4 cited & verified in body

This is the full source pack collected for the story — the pool the writer cites from, which is why the pack count can exceed the citations in the body. Tier labels reflect domain authority; freshness is re-checked daily. How each load-bearing claim bound to this pack is itemized in the claims panel below. What the tiers mean · How we verify.

Release Release 1.35.1 · google/adk-python
github.com
Reputable
Release v2.6.16 · agno-agi/agno
github.com
Reputable
Release @e2b/python-sdk@2.29.0 · e2b-dev/E2B
github.com
Reputable
Release v3.186.0 · langfuse/langfuse
github.com
Reputable
TIL: Cloudflare CAPTCHA on at least one ampersand
simonwillison.net
Reputable
“They screwed us”: Personality clashes sent Anthropic’s models offline
simonwillison.net
Reputable
A quote from Julia Evans
simonwillison.net
Reputable

Load-bearing claims

The writer flagged these claims as load-bearing. Where a cited source supports the claim, the row links out to it; confidence labels reflect how directly the source backs the assertion. We surface unverified claims honestly rather than hide them.

3 confirmed4 analysis

0/3 bound to a pack source

Confirmed
The E2B Python SDK 2.29.0 now keys its async HTTP transport and client caches by the event loop object held weakly rather than by a numeric loop id, because the runtime can reuse a closed loop's id almost immediately, causing sequential runs to inherit a dead connection.
No matching pack item — claim recorded but not bound to a source.
Analysis
The bug only manifests on repeated runs, matching the setup-and-teardown access pattern of agents doing discrete tasks in a loop rather than interactive human sessions.
Analysis
E2B runs isolated execution environments that sit between the model and the rest of the world, making it a harness-layer rather than model-layer vendor.
Confirmed
The E2B SDK now retries connection establishment to its API and sandbox clients three times by default, configurable via an environment variable.
No matching pack item — claim recorded but not bound to a source.
Confirmed
On June 15 2026, Google's ADK 1.35.1, Agno v2.6.16, and Langfuse v3.186.0 all shipped releases focused on bug fixes, stability, and correctness rather than new capabilities.
No matching pack item — claim recorded but not bound to a source.
Analysis
Low-level reliability defects in agent infrastructure surface to operators as intermittent production failures that are easily misattributed to the prompt or model.
Analysis
As models commoditize, durable competitive advantage in the agent stack accrues to the reliable harness layer that owns the operator relationship.

Spot something wrong?

We correct openly and publicly. Email the editor through the correction form and material edits get a dated note appended below the article.