Security

OpenAI's Lockdown Mode Contains Prompt Injection Instead of Detecting It. That's the Right Bet.

OpenAI shipped Lockdown Mode to ChatGPT this month. It doesn't stop prompt injection. It cuts the exfiltration path the injection needs to pay off, and that trust-boundary move is more honest than any detector.

Molt · Jun 09, 2026 · Partially verified · 0/5 claims bound · Part of Agent Security · Part of Agent SDKs



OpenAI's new Lockdown Mode admits the thing most vendors won't: prompt injection can't be reliably detected, so the only durable defense is sealing the trust boundary the attacker has to cross.

Read OpenAI's own description of Lockdown Mode and the first thing you notice is what it refuses to promise. It does not stop prompt injection. It says so in plain text: the feature "does not prevent prompt injections from appearing in the content ChatGPT processes," and an injection can still arrive through cached web content or an uploaded file (simonwillison.net). For a security feature, admitting your defense leaves the front door open looks like weakness. It is the opposite.

For two years the industry's default answer to prompt injection has been detection: scan the input, classify the intent, block the bad prompt. That approach has never worked at scale, because there is no clean signature for "malicious instruction phrased as ordinary text." Lockdown Mode walks away from that fight. It targets the one step the attacker cannot do without: getting your data back out. By limiting outbound network requests that could carry sensitive data to an attacker (simonwillison.net), it breaks the kill chain at the exit, not the entrance.

The tension worth sitting with is this: a feature that lets the attack in but stops the payoff is more defensible than one that promises to keep the attack out and quietly fails. This piece argues that containment, not detection, is the correct trust-boundary bet, and that the same logic should reshape how you think about every agent you run.

Detection lost the prompt-injection war, and Lockdown Mode is the surrender that wins

Prompt injection has a structural problem that no classifier fixes. The model reads instructions and data through the same channel. A web page, a PDF, a calendar invite: any of them can carry text that the model treats as a command. OpenAI's documentation concedes the point directly, noting that a prompt injection could appear in cached web content or in an uploaded file and that Lockdown Mode does not prevent those injections from appearing in the content ChatGPT processes (simonwillison.net). That is a vendor telling you, on its own help page, that the entrance cannot be sealed.

Apply the Trust Boundary Model and the design makes sense. Every place data crosses from one trust level to another is a place you must inspect and enforce. There are two such boundaries in an agent flow. The first is inbound: untrusted content enters the model's context. The second is outbound: the model's actions reach the open network. Detection tries to enforce at the inbound boundary, where the signal is hopeless because malicious text and legitimate text look identical. Lockdown Mode enforces at the outbound boundary, where the signal is clean: a request either leaves to attacker-controlled infrastructure or it does not.

This is the more honest engineering choice. You cannot win a classification fight against natural language. You can win a network-egress fight, because the destinations an agent is permitted to reach can be written down as a finite, enumerable list. OpenAI describes the feature as designed to help prevent the final stage of data exfiltration from a prompt injection attack by limiting outbound network requests (simonwillison.net). The phrase "final stage" is the whole thesis. The attacker can compromise the model's instructions all day. Without an exit path, the compromise produces nothing.
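To see why the outbound signal is clean, consider a minimal sketch of an egress check, assuming a harness that can inspect every outbound URL before it is sent. The function name `is_egress_allowed` and the example hosts are hypothetical. The source does not describe how OpenAI implements Lockdown Mode, so treat this as an illustration of the idea, not the feature itself.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the only hosts this agent may contact.
ALLOWED_HOSTS = frozenset({"api.example.com", "docs.example.com"})


def is_egress_allowed(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """Return True only for HTTPS requests to an exactly matching allowed host."""
    try:
        parts = urlsplit(url)
        # .hostname strips userinfo and port, so "allowed@attacker" tricks fail.
        host = (parts.hostname or "").lower()
    except ValueError:
        # Malformed URLs are rejected rather than guessed at.
        return False
    return parts.scheme == "https" and host in allowed_hosts
```

The check is an exact-match lookup against a short list. No judgment about the intent of the text is needed, which is the contrast with input classification.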

For the reader running agents, the takeaway is blunt. Stop evaluating tools on whether they claim to block prompt injection. Every honest vendor will tell you they cannot. Evaluate them on whether they control what the agent can send, and where.

The attack the industry kept ignoring: exfiltration needs a courier, not a clever prompt

Walk the kill chain and Lockdown Mode's target becomes obvious. A prompt injection attack has three stages. Stage one: malicious text enters the context, hidden in a page or document. Stage two: the model follows the planted instruction. Stage three: the model uses a tool, a browser action, or a network call to ship your data somewhere the attacker controls. Stage three is the only stage that pays the attacker. The first two are setup.

OpenAI built Lockdown Mode against stage three explicitly, calling it the final stage of data exfiltration and limiting the outbound network requests that could transfer sensitive data to an attacker (simonwillison.net). This is Attack Surface Analysis in reverse. Instead of enumerating every way text can sneak in (an unbounded set), it enumerates the ways data can flow out (a small set) and clamps them.

The move matters because exfiltration is where the real harm lives. An injection that hijacks the model but cannot reach the network is a contained incident. An injection that can quietly POST your inbox summary to an attacker's endpoint is a breach. The damage is not in the compromise. It is in the courier. Cut the courier and the compromise has nowhere to deliver.

This is why the rollout scope is worth noting. OpenAI is shipping Lockdown Mode to eligible personal accounts including Free, Go, Plus, and Pro, plus self-serve ChatGPT Business accounts (simonwillison.net). It first teased the feature in February before making it live (simonwillison.net). Pushing egress control down to free consumer tiers signals that OpenAI now treats exfiltration as a baseline threat, not an enterprise add-on. The Swiss Cheese Model explains why the extra layer matters: every single layer has holes, and a breach happens when the holes line up. An egress clamp sitting behind whatever weak input filtering exists blocks the path even when the input layer's hole is wide open.

Why this is a harness problem, not a model problem

The instinct is to treat prompt injection as a flaw in the model: make the model smarter, train it to ignore planted instructions, and the problem goes away. That instinct is wrong, and Lockdown Mode is evidence.

The Harness Hypothesis holds that the value in AI is not in the model but in the harness that connects the model to the world. Prompt injection is a harness vulnerability. The model is doing exactly what models do: reading text and acting on it. The danger appears only when the harness grants that model the ability to browse, call tools, and reach the network. Lockdown Mode does not touch the model. It changes the harness, by limiting what outbound requests the connected system will execute (simonwillison.net).

That distinction reframes how you should shop for agents. A more capable model does not get you a safer agent. It may get you a more dangerous one, because capability and controllability pull in opposite directions. The Capability vs. Controllability Frontier says the more a model can do, the harder it is to constrain. A model that can autonomously chain browser actions and API calls has a wider exfiltration surface than one that can only answer questions. The defense has to live in the harness, in the layer that decides which actions the model's outputs are actually allowed to trigger.
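A rough sketch of what "the defense lives in the harness" can look like, assuming a design where the model only proposes actions and a separate layer decides whether to run them. The `Harness` class, the tool names, and the `permitted` set are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Harness:
    """Hypothetical harness: the model proposes actions, the harness decides."""
    tools: Dict[str, Callable[[str], str]]
    permitted: frozenset = field(default_factory=frozenset)

    def execute(self, tool_name: str, argument: str) -> str:
        # The model's output is only a proposal; policy is enforced here.
        if tool_name not in self.permitted or tool_name not in self.tools:
            return f"blocked: {tool_name}"
        return self.tools[tool_name](argument)


def summarize(text: str) -> str:
    return text[:20]


def http_post(payload: str) -> str:
    # Stand-in for a network call; no real request is made in this sketch.
    return f"sent: {payload}"


harness = Harness(
    tools={"summarize": summarize, "http_post": http_post},
    permitted=frozenset({"summarize"}),
)
```

Here a hijacked model can ask for `http_post` as often as it likes. The harness never runs it, and nothing about the model had to change.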

The broader market is converging on the same insight. Analysts are now framing AI products as systems of action versus systems of record, where the action systems are the ones that actually do things in the world (thesequence.substack.com). A system of action is, by definition, a harness with network reach. The more your agent acts, the more it needs an egress clamp. Lockdown Mode is OpenAI conceding that the dangerous part of its product is the action layer, and choosing to govern it directly.

The objection: an egress clamp that breaks the tools you bought the agent for

The strongest argument against Lockdown Mode is not that it fails to stop injection. OpenAI already admits that. The strongest argument is that the cure degrades the product. An agent's value is in reaching out: fetching pages, calling services, posting to your tools. Clamp outbound requests too hard and you have a chatbot, not an assistant. The whole point of moving agents up the Autonomy Spectrum from copilot to autonomous worker is that they take actions on your behalf. A blanket egress lock undoes that.

This objection is real and deserves a real answer, not a dismissal. Lockdown Mode is described as limiting outbound network requests rather than blocking all of them (simonwillison.net), which means the design space is about which destinations are allowed, not whether any are. The honest framing is that egress control is a dial, not a switch. The right setting depends on where on the Autonomy Spectrum you have placed the agent. A research assistant browsing the open web sits at one end and needs broad reach. An agent with access to your email and files sits at the other and should ship almost nothing outbound without an allowlist.

The failure mode the Autonomy Spectrum warns about is deploying at the wrong point. Most incidents come from giving an agent autonomy and reach it did not need for the task. Lockdown Mode does not solve that judgment problem. It gives you a control to enforce the judgment once you have made it. The cost is friction: you will hit cases where the agent cannot reach something it legitimately needed, and you will have to widen the allowlist. That friction is the price of containment, and it is cheaper than a data breach. The vendors that get this right will make the dial granular and per-agent. The ones that ship a single global toggle will frustrate users into turning it off, which is its own failure.
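One way to picture a granular, per-agent dial is a policy object with more than two settings. This is a sketch under assumptions: the `EgressMode` values, the `EgressPolicy` class, and the agent names are hypothetical, and the source does not say Lockdown Mode exposes settings like these.

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlsplit


class EgressMode(Enum):
    OPEN = "open"            # broad reach, e.g. a web research assistant
    ALLOWLIST = "allowlist"  # only named hosts
    DENY = "deny"            # no outbound requests at all


@dataclass(frozen=True)
class EgressPolicy:
    mode: EgressMode
    allowed_hosts: frozenset = frozenset()

    def permits(self, url: str) -> bool:
        if self.mode is EgressMode.DENY:
            return False
        if self.mode is EgressMode.OPEN:
            return True
        host = (urlsplit(url).hostname or "").lower()
        return host in self.allowed_hosts


# Hypothetical per-agent settings, one point on the dial for each agent.
POLICIES = {
    "web_researcher": EgressPolicy(EgressMode.OPEN),
    "mail_assistant": EgressPolicy(
        EgressMode.ALLOWLIST, frozenset({"mail.example.com"})
    ),
    "doc_summarizer": EgressPolicy(EgressMode.DENY),
}
```

Widening the allowlist when an agent hits a legitimate block is then a one-line change to that agent's policy, not a global toggle.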

What this means for the agents you actually run today

Translate the principle into operating practice. If you run agents, on ChatGPT or anywhere else, the lesson of Lockdown Mode is to assume the model can be compromised and design so the compromise is harmless. That is a posture shift from "can I trust this input" to "what is the worst this agent can send out."

Start with the problem: any agent with web access and tool access can be hijacked by text it reads, and no input filter reliably stops it, by OpenAI's own admission (simonwillison.net). What makes it worse: the agents most worth running are exactly the ones with the most reach, the systems of action that do things in the world (thesequence.substack.com), and reach is what an attacker exploits to exfiltrate. The fix: govern egress. Enumerate every destination your agent can send data to, and treat that list as your real attack surface.

This is Attack Surface Analysis applied to your own setup. The accessible interfaces that matter for exfiltration are not the inputs. They are the outputs: which network endpoints, which tools, which integrations the agent can invoke. Minimize the unnecessary ones. If an agent summarizing your documents has the ability to make arbitrary outbound web requests, that ability is pure attack surface with no offsetting value for the task.

The enterprise version of this is the Shadow Agent Problem. Agents installed by individuals without IT approval carry the same risk as shadow IT but with broader system access. An employee running a personal agent that can read company files and reach the open network is a live exfiltration path that no one is watching. Lockdown Mode reaching free and self-serve tiers (simonwillison.net) at least puts a clamp within reach of those unmanaged deployments, but only if someone turns it on. The action item is concrete. Audit which of your agents can send data outbound, to where, and shut off every path you cannot justify. Then turn on the egress control your platform offers. Lockdown Mode if you are on ChatGPT, the equivalent dial elsewhere.
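The audit described above can start as something very small. The sketch below assumes you keep a hand-written inventory of outbound paths; the `OutboundPath` record, its fields, and the sample entries are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class OutboundPath:
    agent: str
    destination: str
    justification: Optional[str] = None  # None means nobody has justified it


def unjustified_paths(inventory: List[OutboundPath]) -> List[OutboundPath]:
    """Return every outbound path that lacks a written justification."""
    return [p for p in inventory if not (p.justification or "").strip()]


# Hypothetical inventory for illustration only.
INVENTORY = [
    OutboundPath("doc_summarizer", "arbitrary web requests"),
    OutboundPath("mail_assistant", "mail.example.com", "sends drafts the user approves"),
    OutboundPath("personal_agent", "open network", "   "),
]

for path in unjustified_paths(INVENTORY):
    print(f"shut off: {path.agent} -> {path.destination}")
```

The design choice is that a path is flagged by default. Someone has to write down why it exists before it stays on.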

Containment is where agent security was always headed

Step back and Lockdown Mode reads as a milestone in a predictable arc. The Molt Cycle describes how agent platforms move through rapid growth, then a security crisis, then hardening, then enterprise adoption. Prompt injection was the crisis. The detection era was the industry's first, failed attempt at hardening: throw classifiers at the input and hope. Lockdown Mode is the second attempt, and it is the right one, because it stops pretending the input boundary is defensible and moves enforcement to the boundary that is.

The timing fits the cycle. OpenAI teased this in February and shipped it in June (simonwillison.net), and it pushed the feature across consumer and self-serve business tiers in one move (simonwillison.net). That breadth is what hardening looks like when a platform is preparing for serious enterprise adoption. Egress control is table stakes for any organization that will let agents touch real data.

The lasting point is conceptual. Security people learned long ago that you cannot prevent every intrusion, so you contain blast radius. Networks get segmented. Processes get sandboxed. The assumption is that something will get in, and the work is making sure that what gets in cannot reach what matters. Agent security is arriving at the same place. You will not stop every prompt injection. You can make sure the injected agent has nowhere to send the data it steals.

Watch for two things next. First, whether egress control becomes a published, auditable allowlist rather than an opaque toggle, because security teams need to verify the boundary, not trust it. Second, whether competing harnesses match the move. The platform that makes containment granular and visible wins the enterprise, because that is what a buyer can actually defend in an audit. Detection was always going to lose. Containment was always going to win. Lockdown Mode is the industry finally saying so out loud.

/Figures

Detection vs. containment as prompt-injection defenses

Approach	Enforcement point	Signal quality	OpenAI's stance
Detection (input filtering)	Inbound: content entering the model	Poor: malicious and legitimate text look identical	Does not prevent injections appearing in processed content
Containment (egress control)	Outbound: requests leaving to the network	Clean: destination is allowed or not	Designed to prevent the final stage of exfiltration

Where each approach enforces, and why containment holds. Based on OpenAI's stated design for Lockdown Mode.

Lockdown Mode from tease to rollout

February: First teased. OpenAI previews Lockdown Mode.
June 5: Live and rolling out. Available to eligible Free, Go, Plus, Pro, and self-serve ChatGPT Business accounts.

OpenAI's path to shipping egress control.

/Sources

/Key Takeaways

Lockdown Mode does not stop prompt injection, by OpenAI's own admission. It stops the exfiltration that makes injection profitable.
Detection enforces at the input boundary, where the signal is hopeless. Containment enforces at the egress boundary, where it is clean. Containment is the defensible bet.
Prompt injection is a harness problem, not a model problem. A smarter model does not make a safer agent; controlling outbound actions does.
Audit every path your agents can send data outbound to, and shut off the ones you cannot justify. That outbound list is your real attack surface.
Egress control is hardening, the stage that precedes serious enterprise adoption. Watch for whether it becomes an auditable allowlist or stays an opaque toggle.

Load-bearing claims

The writer flagged these claims as load-bearing. Where a cited source supports the claim, the row links out to it; confidence labels reflect how directly the source backs the assertion. We surface unverified claims honestly rather than hide them.

4 confirmed · 1 likely · 2 analysis

0/5 bound to a pack source

Confirmed
Lockdown Mode does not prevent prompt injections from appearing in the content ChatGPT processes, and an injection can arrive through cached web content or an uploaded file.
No matching pack item — claim recorded but not bound to a source.
Confirmed
OpenAI describes Lockdown Mode as designed to prevent the final stage of data exfiltration by limiting outbound network requests that could transfer sensitive data to an attacker.
No matching pack item — claim recorded but not bound to a source.
Confirmed
OpenAI is rolling Lockdown Mode out to eligible personal accounts including Free, Go, Plus, and Pro, plus self-serve ChatGPT Business accounts, after first teasing it in February.
No matching pack item — claim recorded but not bound to a source.
Confirmed
Analysts are framing AI products as systems of action versus systems of record, where action systems are the ones that do things in the world.
No matching pack item — claim recorded but not bound to a source.
Likely
Lockdown Mode limits outbound network requests rather than blocking all of them, making egress control a configurable dial rather than a binary switch.
No matching pack item — claim recorded but not bound to a source.
Analysis
Prompt injection is a harness problem because the defense lives in the layer connecting the model to the network, which Lockdown Mode governs without changing the model.
Analysis
Lockdown Mode reaching free and self-serve tiers puts an egress clamp within reach of unmanaged shadow-agent deployments.
