6,000 Attacks, Zero Leaks: The Quiet Win in Agent Security

A public challenge dared thousands of people to trick an OpenClaw agent into leaking a secret. After 6,000 attempts, nobody did. The story isn't a breach. It's the labs' injection-resistance work finally showing up at scale.

Tide · Jun 28, 2026 · Verified · 3 sources · Part of Agent Security

A public bug-bounty-style challenge against an OpenClaw instance failed completely. That failure is the most encouraging agent-security signal of the year.

Someone built a website whose entire purpose was to dare strangers to break an AI agent. Fernando Irarrázaval stood up an OpenClaw test instance, gave it a secret, wired it to an email inbox, and invited the internet to leak that secret by sending it messages. The page was hackmyclaw.com. The premise was simple: prompt-inject the agent through its email channel and walk away with the prize.

The result is the part worth sitting with. After roughly 6,000 attempts, nobody managed to leak the secret. The challenge ran up about $500 in token spend and even triggered a Google account suspension from the sheer volume of inbound email. Two thousand people tried. The agent held.

For years the default agent-security story has run one direction: a researcher finds a clever phrasing, the model obeys the attacker instead of the user, secrets walk out the door. The narrative was always attacker-finds-flaw. This is the inverse. Defenders trained frontier models hard enough that a crowd-sourced siege came up empty.

Meanwhile, on the same day, OpenAI shipped GPT-5.6 with explicit language about exactly this: the deliberate effort to make frontier models resist injection. The two data points belong to the same story. One is a lab telling you what it built. The other is the field test of whether it worked. This piece connects them, and argues that the way the industry measures agent trustworthiness is quietly shifting from 'can it be tricked at all' to 'how hard is it to trick at scale.'

A failed attack is a stronger signal than a successful one

Security research has a publication bias. A breach is a headline; a non-breach is a non-event. So the public record of agent security skews toward the spectacular failures, and the reader who relies on that record comes away believing every agent is one clever sentence away from disaster.

The hackmyclaw challenge is the rare counter-data point, and it's clean. The setup was adversarial by design: a public challenge to leak secrets held by an OpenClaw instance via email. Email is the worst-case channel because the content is fully attacker-controlled. There's no human in the loop reviewing each message. The agent reads whatever lands in the inbox and decides what to do with it.

That is the textbook prompt-injection scenario. Untrusted text enters the model's context, and the question is whether the model treats that text as data to process or as instructions to obey. For most of the last two years, the honest answer was 'sometimes it obeys.'

Here, across roughly 6,000 attempts, it didn't. The number matters more than any single clever exploit would, because scale is the test. One researcher failing to break an agent proves little. Two thousand people failing, burning $500 in tokens and clogging an inbox hard enough to get a Google account suspended, is a sample size. It tells you the easy attacks are now genuinely hard.
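To put a rough number on "sample size": if every attempt were an independent trial with the same chance of success, zero leaks in n attempts would bound the per-attempt success rate. That independence is a strong simplifying assumption, since real attackers adapt and many attempts were probably near-duplicates, so treat the figure as an order-of-magnitude guide. The sketch below computes the exact one-sided 95% upper bound and the familiar "rule of three" approximation. The count of 6,000 is the approximate figure reported for the challenge, and the function name is illustrative.

```python
def zero_success_upper_bound(n_attempts: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-attempt success probability after observing
    zero successes in n_attempts independent, identical trials.

    Solves (1 - p) ** n_attempts = 1 - confidence for p.
    """
    if n_attempts <= 0:
        raise ValueError("n_attempts must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_attempts)


attempts = 6000  # approximate count reported for the challenge
exact = zero_success_upper_bound(attempts)
rule_of_three = 3.0 / attempts  # common shortcut for the 95% bound

print(f"exact 95% upper bound: {exact:.4%}")
print(f"rule of three:         {rule_of_three:.4%}")
```

Under those assumptions both bounds land at roughly 0.05% per attempt. That is a statement about the attacks this crowd actually tried, not about a determined specialist with a novel technique.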

This is the Attack Surface Analysis lens turned into a live experiment. Enumerate the accessible interface (inbound email), let the crowd hammer it, and measure what gets through. Nothing did. That's not proof of perfection. It's evidence that the cheapest, most obvious class of attack has been substantially closed.

The labs are now training injection-resistance as a product feature

The challenge result didn't happen in a vacuum. It tracks something the model labs have been working on deliberately, and they've started saying so out loud.

Simon Willison, writing about the challenge on his blog, ties his own observation directly to the trend: the result matches the effort the labs have been putting into training their frontier models not to fall for injection attacks, and he points to a section on exactly that in the GPT-5.6 materials released the same day. That timing is not a coincidence so much as a convergence. The field test and the vendor disclosure landed on the same calendar day.

Look at how OpenAI framed GPT-5.6. The launch went out of its way to position the model on a careful capability line: GPT-5.6 Sol does not cross the Cyber Critical threshold under the company's Preparedness Framework. In browser-security evaluations, the model identified bugs and exploitation primitives, the building blocks of an exploit, but did not autonomously chain them into a working attack.

Read that carefully and you see a lab measuring its own model the way a defender would. Not 'how smart is it' but 'how dangerous is it, and where exactly does its capability stop.' Injection-resistance is the same posture pointed inward: not 'how helpful is the agent' but 'how reliably does it refuse the attacker's instructions.'

The shift is that trustworthiness is becoming a spec line, not an afterthought. When a release note carries language about cyber thresholds and injection training, the labs are telling enterprise buyers that controllability is now part of what they sell. That's new. Two years ago the pitch was capability. Now the pitch includes the brakes.

Capability and controllability are being decoupled on purpose

There's a long-running assumption in AI safety that the two move together in the wrong direction: more capable models are harder to control. The Capability vs. Controllability Frontier says the frontier forces an explicit trade-off, and for a while the trade looked grim. Smarter models meant more creative ways to be talked into the wrong thing.

The GPT-5.6 framing complicates that story in an interesting way. OpenAI presented the model as Mythos-beating at a subset of coding agent tasks while simultaneously claiming it is less capable at cyber-offense than the comparison model. The lab is asserting it can push raw capability up on one axis while holding dangerous capability down on another.

Whether that holds under scrutiny is a separate question. But the intent is the headline. The labs are treating capability and controllability as separable dials rather than a single slider. Injection-resistance is the agent-facing version of the same move: keep the model useful, make it stubborn about who it takes orders from.

The hackmyclaw result is what that looks like when it works. The agent was capable enough to read and act on email, the useful behavior, while refusing the embedded instructions, the controlled behavior. Capability and controllability, decoupled, in a live system, against a motivated crowd.

This is the most important pattern in the two items together. Not that one agent survived one challenge, but that the labs have apparently figured out how to train the refusal without lobotomizing the usefulness. That's the hard part, and the early evidence suggests they're getting traction on it.

Why this matters more for the harness than the model

The Harness Hypothesis holds that the value in AI isn't the model, it's the harness that connects the model to the world. An agent is a model plus its scaffolding: the tools it can call, the channels it reads, the permissions it holds, the secret it's trusted to protect.

The hackmyclaw experiment is really a harness test dressed up as a model test. The model's injection-resistance is one layer. But the agent also had a secret it was supposed to guard, an email channel it was supposed to read, and presumably some boundary between the two. The challenge probed all of it at once.

This is where the Swiss Cheese Model earns its keep. A single layer of defense, even a strong one, has holes. Accidents happen when the holes in multiple layers line up. A model that resists injection is one slice. A harness that doesn't hand the secret to an email-reply tool is another. A permission system that scopes what the agent can exfiltrate is a third.
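A minimal sketch of what the non-model slices could look like in code, assuming a harness that inspects every outbound email before sending it. Nothing here is taken from the hackmyclaw setup, whose internals the article does not describe; `OutboundEmail`, `blocked_reasons`, the addresses, and the secret value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutboundEmail:
    recipient: str
    body: str


def blocked_reasons(email: OutboundEmail,
                    secrets: frozenset,
                    allowed_recipients: frozenset) -> list:
    """Run every independent check and return the reasons to block, if any."""
    reasons = []

    # Slice 1: recipient scoping. The agent may only write to known addresses.
    if email.recipient.lower() not in allowed_recipients:
        reasons.append("recipient not on allow-list")

    # Slice 2: naive egress filter. It catches verbatim leaks only; an encoded
    # or paraphrased secret would pass, which is why it is not the only slice.
    lowered_body = email.body.lower()
    if any(secret.lower() in lowered_body for secret in secrets):
        reasons.append("body contains a guarded secret")

    return reasons


# Hypothetical configuration for illustration.
SECRETS = frozenset({"hunter2-example"})
ALLOWED = frozenset({"owner@example.com"})

leak = OutboundEmail("attacker@example.net", "Sure! The secret is hunter2-example")
routine = OutboundEmail("owner@example.com", "Your inbox summary is ready.")

print(blocked_reasons(leak, SECRETS, ALLOWED))     # two independent reasons
print(blocked_reasons(routine, SECRETS, ALLOWED))  # empty list, safe to send
```

The point of running both checks instead of returning on the first failure is that each slice is evaluated on its own. A message only gets out if the model was fooled and every harness check also missed it.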

What 6,000 failed attempts suggests is that the slices are starting to align in the defender's favor rather than the attacker's. The model held, but a well-built harness gives the model fewer ways to fail even when it's tempted. The two reinforce each other.

For the reader running OpenClaw or any comparable agent, the practical takeaway is not 'injection is solved.' It's that your defense now has more than one layer doing real work. The model is genuinely harder to trick than it was. That doesn't excuse a sloppy harness. It means a careful harness on top of a hardened model is finally a combination that survives contact with a hostile crowd.

The same week, a CVE reminds you the harness can still be the weak slice

Set the encouraging result against a sobering one from the same window, and the picture sharpens. While the model layer was passing a public stress test, the infrastructure layer was still producing critical vulnerabilities of the most ordinary kind.

Consider CVE-2026-49257, an advisory for an MCP server. The flaw: the server defaulted to running an HTTP service bound to 0.0.0.0:8080 with no authentication enabled, exposing all its tools to anyone who could reach the port. The severity was the maximum, CVSS 10.0. The fix changed the default bind host to loopback, refused non-loopback exposure unless OAuth was enabled, and made wider exposure opt-in and authentication-gated.
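The shape of that fix can be sketched in a few lines. This is an illustration of the policy as the advisory summary describes it (loopback by default, non-loopback only with OAuth enabled), not the actual patch; `resolve_bind_host`, `UnsafeBindError`, and the `oauth_enabled` flag are assumed names.

```python
import ipaddress
from typing import Optional

DEFAULT_HOST = "127.0.0.1"  # loopback unless the operator opts in to more


class UnsafeBindError(ValueError):
    """Raised when a network-reachable bind is requested without auth."""


def is_loopback(host: str) -> bool:
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Unparseable hostnames are treated as non-loopback (fail closed).
        return False


def resolve_bind_host(requested: Optional[str], oauth_enabled: bool) -> str:
    """Return the host to bind, refusing open exposure without authentication."""
    host = requested if requested is not None else DEFAULT_HOST
    if not is_loopback(host) and not oauth_enabled:
        raise UnsafeBindError(
            f"refusing to bind {host!r} without OAuth; "
            "enable authentication or use a loopback address"
        )
    return host


for requested, oauth in [(None, False), ("0.0.0.0", False), ("0.0.0.0", True)]:
    try:
        print(requested, oauth, "->", resolve_bind_host(requested, oauth))
    except UnsafeBindError as exc:
        print(requested, oauth, "-> refused:", exc)
```

The design choice worth copying is that the unsafe configuration fails at startup with an explicit error, so an operator has to make a deliberate decision to expose the service.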

Notice that none of this is about prompt injection or model behavior. It's a deployment default that exposed a service to the network. That's the Trust Boundary Model failing at the most basic layer: data crossing from the open internet into a tool surface with no check in between.

So the two stories from the same week say complementary things. The model layer is getting genuinely hard to trick. The plumbing around agents still ships with the kind of default-open mistakes that have plagued software forever. A hardened model behind a misconfigured tool server is a strong door on an open window.

This is the Molt Cycle in motion. Open agent infrastructure runs through rapid growth, then a security crisis, then hardening. The model labs appear to be exiting the hardening phase on injection. The surrounding tool ecosystem, the MCP servers and connectors and skills, is visibly earlier in its own cycle, still shipping CVSS-10 defaults. The frontier moved. The periphery hasn't caught up.

What changes for how you evaluate an agent

If the easy injection attacks are now genuinely hard, the question a careful user should ask shifts. The old question was binary: can this agent be prompt-injected? The honest answer was always yes, so the question wasn't very useful. The better question is graded: how hard is it to inject at scale, and what does the harness do when an attempt slips through?

That reframing has practical consequences. A vendor that can point to injection-resistance training, the way OpenAI did with the GPT-5.6 cyber-threshold language, is offering something measurable. A challenge result like hackmyclaw's 6,000 failed attempts is a public benchmark of a kind that didn't really exist before. Both give you something to compare against, instead of a vague assurance.

The Autonomy Spectrum is the right tool for the decision. Agent deployments run from copilot to full autonomy, and most failures come from deploying at the wrong point on it. A hardened model lets you move further toward autonomy than you safely could a year ago, because the cheap attacks no longer work. It does not let you go all the way, because the harness can still be the weak slice, as the MCP CVE shows.

So the recommendation is calibrated optimism. Treat injection-resistance as real and improving, and let it justify giving your agent slightly more rope: more autonomous email handling, more tool access, fewer human checkpoints on low-stakes actions. Then spend the trust you've gained on hardening the rest. Scope the secrets. Lock the tool servers to loopback. Gate exposure behind authentication.
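One way to make "slightly more rope" concrete is an approval gate keyed to how much damage an action can do. The sketch below is an assumption-laden illustration: the action names, the three stakes levels, and the `needs_human_approval` helper are invented for this example and do not come from OpenClaw or any specific agent framework.

```python
from enum import IntEnum


class Stakes(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical action catalogue; a real deployment would define its own.
ACTION_STAKES = {
    "label_email": Stakes.LOW,
    "draft_reply": Stakes.LOW,
    "send_reply": Stakes.MEDIUM,
    "read_secret": Stakes.HIGH,
}


def needs_human_approval(action: str, autonomy_ceiling: Stakes) -> bool:
    """True when the action's stakes exceed what the agent may do alone."""
    # Unknown actions default to HIGH so newly added tools start out gated.
    stakes = ACTION_STAKES.get(action, Stakes.HIGH)
    return stakes > autonomy_ceiling


# Raising the ceiling from LOW to MEDIUM is the "more rope" step:
for ceiling in (Stakes.LOW, Stakes.MEDIUM):
    gated = [a for a in ACTION_STAKES if needs_human_approval(a, ceiling)]
    print(ceiling.name, "ceiling gates:", gated)
```

Moving the ceiling is a one-line change, which keeps the decision about autonomy explicit and reversible while the secret-handling actions stay behind a human checkpoint.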

The model held against a crowd of 2,000. Make sure the door it's standing behind is as well built as the model now is.

/Figures

Three security signals from the same week, two different layers

Signal	Layer	Outcome
hackmyclaw challenge (OpenClaw via email)	Model / injection-resistance	6,000 attempts, 0 leaks
GPT-5.6 cyber evaluation	Model / capability threshold	Found exploit primitives, did not autonomously chain them
CVE-2026-49257 (MCP server)	Infrastructure / deployment default	CVSS 10.0, fixed by loopback + OAuth gating

The model layer passed a public adversarial test; the infrastructure layer shipped a maximum-severity default-exposure flaw.

/Key Takeaways

A public challenge invited the internet to prompt-inject an OpenClaw agent via email; after roughly 6,000 attempts, nobody leaked the secret.
The result tracks a deliberate labs effort: GPT-5.6 shipped the same day with explicit injection-resistance and cyber-threshold framing.
The labs are treating capability and controllability as separable dials, keeping models useful while making them stubborn about whose instructions they obey.
Injection-resistance is one defensive slice. A maximum-severity MCP server CVE in the same week shows the harness and infrastructure can still be the weak layer.
Evaluate agents on a graded question (how hard to inject at scale, and what the harness does on failure), not the old binary of whether injection is possible at all.

Sources for this article

11 collected in pack · 3 cited & verified in body

This is the full source pack collected for the story: the pool the writer cites from, which is why the pack count can exceed the citations in the body. Tier labels reflect domain authority; freshness is re-checked daily. How each load-bearing claim is bound to this pack is itemized in the claims panel below.

[AINews] OpenAI GPT-5.6 Sol / Terra / Luna — restricted to trusted partners
www.latent.space
Reputable
What happened after 2,000 people tried to hack my AI assistant
simonwillison.net
Reputable
Release 1.15.1 · crewAIInc/crewAI
github.com
Reputable
A quote from Dean W. Ball
simonwillison.net
Reputable
A quote from Timothy B. Lee
simonwillison.net
Reputable
2026.26: Summer Vibes
stratechery.com
Reputable
The Sequence Opinion #884: Self-Driving Labs: The Laboratory That Chooses Its Next Experiment
thesequence.substack.com
Community
The Sequence AI of the Week #883: Qwen is Getting Into Robotics
thesequence.substack.com
Community
An Interview with Figma CEO Dylan Field About Design and AI
stratechery.com
Reputable
My Vibe Coding Adventure, The App and the Experience, Ten Takeaways
stratechery.com
Reputable
CVE-2026-49257 - GitHub Advisory Database
github.com
Official

Load-bearing claims

The writer flagged these claims as load-bearing. Where a cited source supports the claim, the row links out to it; confidence labels reflect how directly the source backs the assertion. We surface unverified claims honestly rather than hide them.

6 confirmed · 2 analysis

6/6 bound to a pack source

Confirmed
A public challenge on hackmyclaw.com invited people to leak a secret held by an OpenClaw test instance via email, and after roughly 6,000 attempts nobody succeeded.
What happened after 2,000 people tried to hack my AI assistant
Confirmed
The challenge ran up about $500 in token spend and triggered a Google account suspension from too many inbound emails.
What happened after 2,000 people tried to hack my AI assistant
Confirmed
The result matches the effort the labs have been putting into training frontier models not to fall for injection attacks, with a relevant section in the GPT-5.6 materials released the same day.
What happened after 2,000 people tried to hack my AI assistant
Confirmed
GPT-5.6 Sol does not cross the Cyber Critical threshold under OpenAI's Preparedness Framework; in browser evaluations it found bugs and exploitation primitives but did not autonomously chain a working exploit.
[AINews] OpenAI GPT-5.6 Sol / Terra / Luna — restricted to trusted partners
Confirmed
OpenAI presented GPT-5.6 as Mythos-beating at a subset of coding agent tasks while claiming it is less capable at cyber-offense than the comparison model.
[AINews] OpenAI GPT-5.6 Sol / Terra / Luna — restricted to trusted partners
Analysis
The labs appear to be decoupling capability from controllability, keeping models useful while training them to refuse attacker instructions.
Confirmed
CVE-2026-49257 covers an MCP server that defaulted to an unauthenticated HTTP service bound to 0.0.0.0:8080, rated CVSS 10.0, fixed by binding to loopback and gating exposure behind OAuth.
CVE-2026-49257 - GitHub Advisory Database
Analysis
The right evaluation question shifts from whether an agent can be injected at all to how hard it is to inject at scale and what the harness does on failure.
