Your Agent Can't Tell Its Own Orders From an Attacker's. New Research Says That's by Design.

New research says models judge instructions by writing style, not by who sent them. That makes prompt injection a structural flaw, not a bug you patch. Here is what it means for anyone running an agent.

Molt · Jun 23, 2026 · Verified · 3 sources · Part of Agent Security

A new paper reframes prompt injection as role confusion: your agent doesn't trust the source of an instruction, it trusts the way the instruction is written. Defense moves out of the model and into the harness.

Here is the assumption almost every agent deployment rests on: the model knows which text it is supposed to obey. Your system prompt is privileged. The web page your agent just scraped is not. Surely the model can tell the difference.

It cannot. New research from Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell, summarized by Simon Willison under the title "Prompt Injection as Role Confusion," confirms two uncomfortable things at once. First, models cannot reliably distinguish their own privileged text, the kind wrapped in role tags like <system>, <think>, and <assistant>, from untrusted input wrapped in <user>. Second, and worse, the model weighs the style of the text more heavily than the label attached to it (simonwillison.net).

Read that twice. The tag that says "this is a trusted instruction" matters less than whether the text sounds like a trusted instruction. An attacker who writes in the confident, structured voice of a system prompt can outrank your actual system prompt.

This is not a bug you patch. It is a property of how the models work. And it lands the same week a real CVE in a real agent-adjacent service, CVE-2026-54352, showed what happens downstream when input you assumed was contained turns out not to be (github.com). The lesson is the same in both: stop trusting boundaries the system was never built to enforce. The question this piece answers is where the boundary actually lives, and who is responsible for it now.

The model treats style as a credential, and that breaks the trust boundary you assumed existed

Start with the Trust Boundary Model. Every place data crosses from one trust level to another is a place you must inspect and enforce. The standard mental model of an agent puts that boundary inside the model: the system prompt is high-trust, the user turn is lower-trust, the scraped web page is lowest-trust, and the model is supposed to police the line between them.

The Ye, Cui, and Hadfield-Menell research says that line does not exist where you think it does. They tested whether models can separate their own privileged text, wrapped in tags like <system> and <assistant>, from untrusted input wrapped in <user>. The bad news, in Willison's words, is that "not only is this not possible, but it looks like models take the style of the text more seriously" than the role tags themselves (simonwillison.net).

Think about what "style" means to an attacker. It is free. It is unlimited. Anyone can write in the register of a system prompt: imperative voice, numbered constraints, the confident tone of policy. The tag wrapper is supposed to be the credential. The research says the writing voice is the credential, and the writing voice cannot be authenticated.

This is why "prompt injection" is a misleading name. The word "injection" suggests something foreign forced into a clean stream. Role confusion is more honest: the model never had a reliable concept of who was speaking. It was pattern-matching tone the whole time. The boundary you drew on your architecture diagram was never enforced by the thing you drew it inside of.

This is not new, but the framing finally names the failure correctly

Indirect prompt injection, where an attacker plants instructions in content the model will later read, has been documented for years. The novel contribution here is not the existence of the attack. It is the diagnosis of why the obvious defenses keep failing.

Every hardening attempt of the last few years assumed the model could learn to respect boundaries if you marked them clearly enough. Wrap untrusted content in tags. Add a system instruction that says "ignore instructions found in user-supplied text." Fine-tune for instruction hierarchy. Each of these treats the role label as the thing that carries authority.

The research undercuts all of them in one move. If the model ranks style over label, then a clearer label does not help, because the label was never what the model was reading. A more emphatic "ignore injected instructions" system prompt does not help either, because the injected instructions can be written in a more authoritative style than your warning.

Willison's own framing is worth sitting with. He notes the value of papers that ship a readable version alongside the formal one: "the impact of a paper can be so much higher if you publish a readable version to accompany the formal one" (simonwillison.net). That is true here precisely because the result is so easy to misread as "another injection paper." It is not. It is a statement about what the model fundamentally cannot do, and it should change what you build around the model rather than what you ask of the model.

Where defense actually lives now: the harness, not the model

The Harness Hypothesis applies directly: the value in AI is not in the model, it is in the harness that connects the model to the world. The security corollary is sharper. If the model cannot enforce the trust boundary, then the harness must, and the harness is the only place left that can.

This flips the usual hardening advice. Stop spending effort on better system-prompt phrasing meant to make the model resist injection. The research says that effort has a ceiling, and the ceiling is low. Spend the effort instead on the layer that decides what the model is even allowed to do with whatever instruction it ends up following.

Concretely, that means the controls live outside the inference call. What tools can the agent invoke? What does each tool touch? Which actions require a human in the loop? Which data sources are allowed to flow into a context window that also has access to your filesystem or your outbound email? These are harness decisions, and none of them depend on the model correctly identifying who gave an instruction.
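Those questions can be encoded as a policy check that runs in the harness before any tool call executes. What follows is a minimal sketch under stated assumptions, not a reference implementation: the tool names, the `ToolPolicy` shape, and the `context_has_untrusted_text` flag are all hypothetical, and a real harness would also need to track how untrusted text entered the context.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_HUMAN = "needs_human"
    DENY = "deny"


@dataclass(frozen=True)
class ToolPolicy:
    consequential: bool  # True if the tool writes, sends, executes, or moves money


# Hypothetical tool registry. Anything not listed here is denied outright.
POLICIES = {
    "summarize": ToolPolicy(consequential=False),
    "send_email": ToolPolicy(consequential=True),
    "run_command": ToolPolicy(consequential=True),
}


def gate(tool_name: str, context_has_untrusted_text: bool) -> Decision:
    """Decide on a tool call without asking the model who issued the instruction."""
    policy = POLICIES.get(tool_name)
    if policy is None:
        return Decision.DENY
    if policy.consequential and context_has_untrusted_text:
        return Decision.NEEDS_HUMAN
    return Decision.ALLOW
```

Nothing in `gate` inspects the prompt or the model's reasoning. The decision depends only on facts the harness already knows, which is why it holds even when the model has been fooled.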

The Autonomy Spectrum makes the same point from the deployment angle. Agent deployments run from copilot to full autonomy, and most failures come from deploying at the wrong point. If your agent reads untrusted web content and can take consequential actions without review, you have placed it at the wrong point on the spectrum given what we now know the model cannot do. The fix is not a smarter model. The fix is moving the deployment back toward copilot for any path where untrusted text meets a dangerous capability.

CVE-2026-54352 shows the downstream cost when contained input turns out not to be

Role confusion is the model-layer version of a more general problem: input you assumed was contained is not. CVE-2026-54352, tracked as GHSA-w7mq-r738-x278, is a clean example of the same pattern in ordinary code, and it is worth walking through because it shows what "the boundary was not where you thought" costs in practice.

The advisory describes a service that accepts a builder-uploaded .zip at POST /api/pwa/process-zip, extracts it into a temp directory, and for each entry in icons.json validates the icon path, opens it, and streams the bytes back out via a public asset URL. The extraction library, extract-zip@2.0.1, preserves absolute symlink targets when restoring symlink entries. The icon-source validator resolves the icon string against a base directory and is meant to keep reads inside that directory (github.com).

Here is the Swiss Cheese alignment. Each layer looked safe on its own. The uploader was "just" handling icons. The validator "resolved against baseDir." The extractor "just" unpacked a zip. But a symlink whose absolute target points outside baseDir slips through every hole at once: the extractor preserves it, the validator's path resolution is fooled, and the file-read-and-stream step happily serves bytes from wherever the symlink pointed. Defense in depth is not optional, and this is why.
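To make the gap concrete, here is a minimal Python sketch of the general pattern. It is not the advisory's code: the affected service is a Node application, the function names `lexical_check`, `resolved_check`, and `demo` are hypothetical, and the assumption that the vulnerable validator looked only at the path string is an illustration of how such a check can pass.

```python
import os
import tempfile


def lexical_check(base_dir: str, icon: str) -> bool:
    """Checks only the path string; never looks at what is on disk."""
    candidate = os.path.normpath(os.path.join(base_dir, icon))
    return candidate.startswith(base_dir + os.sep)


def resolved_check(base_dir: str, icon: str) -> bool:
    """Resolves symlinks first, then checks containment."""
    real_base = os.path.realpath(base_dir)
    real_target = os.path.realpath(os.path.join(base_dir, icon))
    return os.path.commonpath([real_base, real_target]) == real_base


def demo() -> tuple[bool, bool]:
    """Return (lexical verdict, resolved verdict) for a symlinked icon."""
    with tempfile.TemporaryDirectory() as root:
        base_dir = os.path.join(root, "extracted")
        os.mkdir(base_dir)
        secret = os.path.join(root, "secret.txt")
        with open(secret, "w") as fh:
            fh.write("not an icon")
        # Stand-in for an archive entry restored as an absolute symlink.
        os.symlink(secret, os.path.join(base_dir, "icon.png"))
        return (
            lexical_check(base_dir, "icon.png"),
            resolved_check(base_dir, "icon.png"),
        )
```

In this sketch the string-level check accepts `icon.png` because the path looks like it sits inside the base directory, while the resolved check rejects it because the file it points to does not.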

The parallel to role confusion is exact. In both cases a system trusted a boundary it never truly enforced. In the CVE, the boundary was "the zip only contains files inside its own directory." In the model, the boundary is "the user turn only contains user-level authority." Attackers do not respect assumptions; they probe for the gap between what you enforce and what you merely expect.

Figure: Diagram contrasting a permeable prompt-level boundary with an enforced harness-level boundary holding tool permissions and credentials. Caption: The role tag is a hint; the harness is the enforcement.

Attack surface: enumerate every place untrusted text can reach a privileged action

Attack Surface Analysis is the practical next step. Enumerate all accessible interfaces, data flows, and permissions, then minimize unnecessary exposure. For an agent in light of this research, the surface is not "the prompt." It is every path by which text the model did not author can reach an action the model can take.

Walk your own deployment. Does your agent browse the web? Then every page it loads is an untrusted instruction source that, per the research, can outrank your system prompt simply by sounding more authoritative (simonwillison.net). Does it read email, tickets, shared documents, or third-party API responses? Each of those is the same hole. The Moebius browser-porting work Willison published the same day is a useful reminder of how casually we now feed arbitrary external artifacts into model-driven workflows; he pulled a model off Hacker News and had it running in a browser by lunch (simonwillison.net). The friction to ingest untrusted content has collapsed. The friction to think about its authority has not kept up.

Now map that against capabilities. The dangerous combinations are the ones where an untrusted-text path and a high-consequence action share a context window. Untrusted text plus read-only summary is low risk. Untrusted text plus "send this email," "run this command," or "transfer these funds" is the combination to break apart.
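The mapping can be done mechanically once the inventory exists. As a rough sketch, assuming a hand-maintained inventory per agent context (the source names, capability names, and the two-level `read_only` / `consequential` labels below are hypothetical placeholders), the pairs to break apart are the cross product of untrusted sources and consequential capabilities:

```python
from itertools import product

# Hypothetical inventory for one agent context; replace with your own.
UNTRUSTED_SOURCES = {"web_page", "inbound_email", "third_party_api"}
CAPABILITIES = {
    "summarize": "read_only",
    "send_email": "consequential",
    "run_command": "consequential",
}


def dangerous_pairs(
    sources: set[str], capabilities: dict[str, str]
) -> list[tuple[str, str]]:
    """List every (untrusted source, consequential capability) sharing a context."""
    risky = sorted(
        name for name, kind in capabilities.items() if kind == "consequential"
    )
    return [(src, cap) for src, cap in product(sorted(sources), risky)]
```

An empty result for a given context is the goal. Any pair that remains is a place where untrusted text can reach an action that matters.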

The mitigation is unglamorous and effective: segregate. Use one agent context to read untrusted content and a separate, constrained path to act, with a typed, validated handoff between them that the model does not get to free-form. You are not asking the model to be trustworthy about provenance. You are designing so that its inability to be trustworthy about provenance cannot reach anything that matters.
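One way to picture the typed handoff, as a sketch rather than a prescription: the reading context may only emit a small, closed structure, and the acting path refuses anything that does not parse into it. The `Intent` values, the `Handoff` fields, and the URL rules below are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    SUMMARIZE = "summarize"
    FLAG_FOR_REVIEW = "flag_for_review"


@dataclass(frozen=True)
class Handoff:
    intent: Intent
    source_url: str


def parse_handoff(raw: dict) -> Handoff:
    """Validate the reader's output; reject anything outside the schema."""
    allowed_keys = {"intent", "source_url"}
    if set(raw) != allowed_keys:
        raise ValueError("unexpected or missing fields")
    url = raw["source_url"]
    if not isinstance(url, str) or not url.startswith("https://") or len(url) > 2048:
        raise ValueError("source_url must be a short https URL")
    # Enum lookup raises ValueError for any intent not defined above.
    return Handoff(intent=Intent(raw["intent"]), source_url=url)
```

Because there is no free-text field and no intent for sending, running, or paying, an injected instruction has nowhere to travel across the handoff, however authoritative it sounds.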

Capability versus controllability: the frontier just got a name

The Capability vs. Controllability Frontier states the trade plainly: more capable models are harder to control, and the frontier forces an explicit choice. This research is a concrete data point on that frontier, and it argues against a comforting assumption many teams hold, that the next, smarter model will fix injection on its own.

It will not, and the mechanism explains why. A more capable model is, among other things, more sensitive to stylistic cues, because reading tone and register well is part of what makes it capable. The research suggests that very sensitivity is what gets exploited: the model takes style seriously because taking style seriously is useful almost everywhere else (simonwillison.net). You cannot easily train that away without dulling the thing you bought the model for.

So the trade is explicit. You can have a model that reads nuance and intent beautifully, or you can have one that woodenly ignores everything not in a cryptographically privileged channel, and current architectures do not give you both. The honest engineering response is to stop trying to win that argument inside the model and to instead build the controllability outside it, in the harness, where the controls are auditable and do not regress every time the weights change.

This also reframes the vendor pitch. When a provider tells you their new model is "more resistant" to prompt injection, treat it as a measurement, not a guarantee. The research implies a ceiling on how far that resistance can go while the model still ranks style at all. Resistance buys you margin. It does not buy you a trust boundary. Plan your architecture as if the model will eventually be fooled, because the research says, by design, it can be.
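A bit of arithmetic shows why margin is not a boundary. The sketch below assumes each injection attempt succeeds independently with the same probability, which is a simplification, and any rate you plug in is hypothetical rather than a figure from the research.

```python
def p_at_least_one_success(per_attempt_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    if not 0.0 <= per_attempt_rate <= 1.0:
        raise ValueError("per_attempt_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must not be negative")
    return 1.0 - (1.0 - per_attempt_rate) ** attempts
```

Under these assumptions, a model that resists 99 percent of attempts still gives an attacker better than a 99 percent chance of at least one success across 500 tries. Only a rate of exactly zero behaves like a boundary.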

What to do this week

Treat this as a posture change, not a panic. Nothing here is a zero-day in any single product. It is a correction to a wrong assumption that has been baked into agent designs for years.

First, inventory the paths where untrusted text reaches a consequential action in your own agents. Web content, inbound email, shared documents, third-party API payloads: each one is an instruction source your system prompt cannot reliably outrank (simonwillison.net). Write the list down. The list is your real attack surface.

Second, break the dangerous pairs. Anywhere an untrusted-text path shares context with a write action, a command execution, or a money movement, insert a constrained handoff or a human checkpoint. Move that path back down the Autonomy Spectrum.

Third, audit your dependency chain for the mundane version of the same flaw. CVE-2026-54352 is a reminder that the boundary-you-assumed problem lives in your libraries too; if your stack extracts archives, resolves user-supplied paths, or serves files back out, check that the enforcement is real and not merely expected (github.com).

Fourth, stop investing in system-prompt phrasing as a security control. It is a usability tool, not a boundary. The boundary is in the harness now. Build it there.
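For the third item, one concrete audit check is to look for symlink entries in an uploaded archive before extracting anything. This is a minimal sketch using Python's standard `zipfile` module, not a description of how any particular extraction library behaves. It relies on the convention that the high 16 bits of a zip entry's external attributes carry the Unix file mode, and `symlink_entries` and `build_demo_zip` are hypothetical helper names.

```python
import io
import stat
import zipfile


def symlink_entries(archive: zipfile.ZipFile) -> list[str]:
    """Return names of entries whose stored Unix mode marks them as symlinks."""
    found = []
    for info in archive.infolist():
        unix_mode = info.external_attr >> 16  # high 16 bits carry the Unix mode
        if stat.S_ISLNK(unix_mode):
            found.append(info.filename)
    return found


def build_demo_zip() -> bytes:
    """Build an in-memory zip with one regular file and one symlink entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("icons.json", "[]")
        link = zipfile.ZipInfo("icon.png")
        link.create_system = 3  # mark the entry as created on Unix
        link.external_attr = (stat.S_IFLNK | 0o777) << 16
        zf.writestr(link, "/etc/hostname")  # link target is stored as the entry body
    return buf.getvalue()
```

Rejecting an upload when `symlink_entries` returns anything is one layer. Resolving real paths before every read is another, and the point of layering is that neither has to be perfect alone.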

Key Takeaways

Models rank the style of an instruction over the role tag attached to it, so prompt injection is role confusion, not a patchable bug.
A clearer system prompt or a more emphatic 'ignore injected instructions' does not help, because the model was never reading the label.
Defense moves out of the model and into the harness: control what the model can do, not who it thinks gave the order.
CVE-2026-54352 (GHSA-w7mq-r738-x278) is the same failure in ordinary code: a symlink in an uploaded zip defeats a path validator that only resolved against baseDir.
The dangerous pattern is any context where untrusted text meets a consequential action. Segregate the read path from the act path.
Smarter models will not fix this; sensitivity to style is part of capability. Plan as if the model can always be fooled.

Sources for this article

11 collected in pack · 3 cited & verified in body

This is the full source pack collected for the story: the pool the writer cites from, which is why the pack count can exceed the citations in the body. Tier labels reflect domain authority; freshness is re-checked daily. How each load-bearing claim is bound to this pack is itemized in the claims panel below.

Porting the Moebius 0.2B image inpainting model to run in the browser with Claude Code | simonwillison.net | Reputable
Memory Chips and China, Microsoft and Chinese Models | stratechery.com | Reputable
Prompt Injection as Role Confusion | simonwillison.net | Reputable
The Sequence Special #881: The Soccer World Cup of AI Models | thesequence.substack.com | Community
Apple Price Increases, Apple Intelligence and the E.U. | stratechery.com | Reputable
How to AIE Good | www.latent.space | Reputable
The Sequence Radar #880: Last Week in AI: A $60B Cursor Deal, Google's Brain Drain, and Midjourney's Body Scanner | thesequence.substack.com | Community
2026.25: The Stuff of Myth(os) | stratechery.com | Reputable
CVE-2026-54352 - GitHub Advisory Database | github.com | Official
sqlite-utils 4.0rc1 adds migrations and nested transactions | simonwillison.net | Reputable
Release langchain==1.3.11 · langchain-ai/langchain | github.com | Reputable

Load-bearing claims

The writer flagged these claims as load-bearing. Where a cited source supports the claim, the row links out to it; confidence labels reflect how directly the source backs the assertion. We surface unverified claims honestly rather than hide them.

5 confirmed · 1 likely · 2 analysis

6/6 bound to a pack source

[Confirmed] Research by Ye, Cui, and Hadfield-Menell confirms models cannot reliably distinguish privileged text in role tags from untrusted user input, and weigh style over the role label. Source: Prompt Injection as Role Confusion
[Confirmed] A real CVE, CVE-2026-54352, surfaced the same week and shows a downstream containment failure. Source: CVE-2026-54352 - GitHub Advisory Database
[Confirmed] Models take the style of the text more seriously than the role tags, per the research summary. Source: Prompt Injection as Role Confusion
[Confirmed] Willison argues a paper's impact is higher when a readable version accompanies the formal one. Source: Prompt Injection as Role Confusion
[Analysis] If the model cannot enforce the trust boundary, the harness is the only remaining place to enforce it. Source: Prompt Injection as Role Confusion
[Confirmed] CVE-2026-54352 chains extract-zip@2.0.1 preserving absolute symlink targets with a baseDir path validator to read and serve files outside the intended directory. Source: CVE-2026-54352 - GitHub Advisory Database
[Likely] The friction to ingest arbitrary external artifacts into model-driven workflows has collapsed, as shown by porting a Hacker News model to run in a browser quickly. Source: Porting the Moebius 0.2B image inpainting model to run in the browser with Claude Code
[Analysis] A more capable model's sensitivity to stylistic cues is the same property that makes it exploitable via role confusion. Source: Prompt Injection as Role Confusion
