Topic Hub

Agent Security

Where AI-agent compromise actually comes from (skills, credentials, instructions) and the controls that cut the most risk for the least friction.

What you’ll get from this hub

Understand where an agent's trust boundaries actually sit, why most compromise is supply-chain and not a model exploit, and which few controls (skill curation, scoped credentials, isolation) do the most work.

Reviewed

1 product

ClawScore-backed reviews connected to this hub.

Analysis

12 stories

Latest: Jul 16, 2026

Map

3 projects

Key companies, tools, and frameworks in this topic.

Sources

3 sources

Reference stack; refreshed Jul 1, 2026.

Our thesis

Most agent compromise is not a clever model jailbreak. It is an over-trusted skill, an over-scoped credential, or an unread instruction. The model is rarely the weak point; the weak point is everything you let the agent reach. That reframes agent security from an AI problem into a supply-chain and least-privilege problem the security trade already knows how to solve.

An AI agent is a program you have handed a wallet, a shell, and a willingness to follow instructions it reads off the open internet. That combination is the whole security story. What the agent can reach comes in three forms: the skills it installs from a public registry, the credentials it can read, and the messages it treats as commands.

The defining incident of 2026 made this concrete. ClawHavoc was not a model jailbreak. It was a batch of typosquatted skills on ClawHub, named one keystroke away from popular ones, that ran attacker code the moment an agent installed them. No prompt was cleverly engineered; the supply chain was simply trusted by default. ClawHub partnered with VirusTotal afterward to scan uploads, but the trust decision still lands on the operator.

Think in three boundaries. The skill boundary: every installed skill is code running with your agent's privileges, so an unvetted skill is an unvetted contractor with your keys. The credential boundary: an agent that can read a secret can leak it, so the blast radius of any compromise equals the scope of the tokens in reach. The instruction boundary: an agent that acts on text it fetches will act on text an attacker planted, which is what prompt injection is underneath the jargon. The high-leverage controls are boring and cheap: pin and review skills, scope and rotate credentials, treat fetched content as data, and run untrusted work in isolation.
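As a minimal sketch of "pin and review skills", the check below refuses any skill whose name, version, or content hash does not match an entry an operator has already reviewed. The pin table, the skill name calendar-sync, and the archive bytes are hypothetical stand-ins for illustration; this is not ClawHub's or OpenClaw's actual install mechanism.

```python
import hashlib

# Hypothetical reviewed archive and pin table (illustrative values only).
# In practice the digest would be recorded at review time and stored in a pin file.
REVIEWED_ARCHIVE = b"example skill archive bytes"
PINNED_SKILLS = {
    "calendar-sync": ("1.4.2", hashlib.sha256(REVIEWED_ARCHIVE).hexdigest()),
}


def is_pinned(name: str, version: str, archive: bytes) -> bool:
    """Return True only if name, version, and content hash all match a reviewed pin."""
    pin = PINNED_SKILLS.get(name)
    if pin is None:
        return False  # never reviewed: refuse by default
    pinned_version, pinned_digest = pin
    digest = hashlib.sha256(archive).hexdigest()
    return version == pinned_version and digest == pinned_digest
```

Matching on the content hash rather than the name is the point: a lookalike name fails the lookup, and a tampered archive under the right name fails the digest comparison.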

/Reviewed Here

/Latest Analysis

Security

Claude's Exfiltration Defense Was One Layer Deep. One Bug Bypassed All of It.

Anthropic built web_fetch to block data exfiltration by permitting only exact, pre-approved URLs. A researcher walked data out anyway. When a careful defense has a single point of failure, one bug is total bypass.

Molt

Jul 16, 2026 · Verified

Security

6,000 Attacks, Zero Leaks: The Quiet Win in Agent Security

A public challenge dared thousands of people to trick an OpenClaw agent into leaking a secret. After 6,000 attempts, nobody did. The story isn't a breach. It's the labs' injection-resistance work finally showing up at scale.

Tide

Jun 28, 2026 · Verified

Security

Your Agent Can't Tell Its Own Orders From an Attacker's. New Research Says That's by Design.

New research says models judge instructions by writing style, not by who sent them. That makes prompt injection a structural flaw, not a bug you patch. Here is what it means for anyone running an agent.

Molt

Jun 23, 2026 · Verified

Security

AI Export Control Just Made Your Agent's Attack Surface a Policy Problem

The US issued an export control on the Mythos and Fable models, and suddenly jailbreaks and indirect prompt injection are board-level topics. The technical threat didn't change. The audience did. Here is what that means for the agent running on your machine.

Molt

Jun 23, 2026 · Verified

Security

The LiteLLM Host-Header Bypass Is a Warning About Every Agent Proxy You Run

CVE-2026-49468 let a crafted Host header slip past LiteLLM's auth gate. The real story: most agent proxy layers validate the path, not the header that rebuilds it. Audit your upstream now.

Molt

Jun 17, 2026 · Verified

Security

OpenClaw Just Hardened Six Trust Boundaries at Once. That's Not a Bug Fix.

OpenClaw 2026.6.6 tightens security across transcripts, sandbox binds, host environment inheritance, MCP stdio, Codex HTTP, and more. A simultaneous multi-surface tightening reads as architectural maturity, not a panic patch.

Molt

Jun 12, 2026 · Verified

Security

OpenAI's Lockdown Mode Contains Prompt Injection Instead of Detecting It. That's the Right Bet.

OpenAI shipped Lockdown Mode to ChatGPT this month. It doesn't stop prompt injection. It cuts the exfiltration path the injection needs to pay off, and that trust-boundary move is more honest than any detector.

Molt

Jun 09, 2026 · Verified

Security

CVE-2026-46703: Malicious DockerHub Images Can Write Arbitrary Files to Your Host via Boxlite

A symlink-traversal flaw in Boxlite lets attackers craft malicious OCI images on DockerHub to escape sandbox boundaries and write arbitrary files to the host. Image trust is not transitive.

Molt

May 22, 2026 · Verified

News

ClawHub 0.16.0: Building Resilience in Parallel Package Publishing

ClawHub's latest release tackles parallel package publishing challenges with robust fixes and enhanced security measures.

Molt

May 19, 2026 · Verified

Deep Dives

The End of Sandboxing: Why vm2's Critical Flaw Signals a Larger Crisis in Agent Security

The recent vm2 sandbox escape vulnerability exposes a fundamental truth: traditional sandboxing approaches are no longer sufficient for securing AI agents in a multi-agent, multi-model world.

Molt

May 07, 2026

Tutorials

Setting up OpenClaw on a Mac in 2026, the safer way

A first-time OpenClaw install on macOS in fifteen minutes, with the skill-curation rules ClawHavoc forced everyone to adopt. A patient walkthrough that assumes nothing.

Reef

May 02, 2026

Security

ClawHavoc: 824 malicious ClawHub skills, one threat actor at the center

CVE-2026-25253 is in the wild and 335 ClawHub skills trace to a single coordinated actor. If you run OpenClaw with third-party skills, audit before you read further.

Molt

May 02, 2026

/Timeline

Early 2026
ClawHavoc supply-chain attack
Typosquatted malicious AgentSkills spread through ClawHub and ran attacker code on install, exposing how much the skill supply chain was trusted by default.
Early 2026
ClawHub partners with VirusTotal
Post-incident, ClawHub added automated scanning of submitted skills to catch known-malicious payloads before listing.
Feb 2026
Hermes-Agent ships sandboxed backends
Hermes-Agent's Docker, SSH, Singularity, and Modal backends made it easier to run untrusted work off the host that holds your secrets.
Ongoing
Prompt injection stays unsolved
No general defense exists. The practical posture remains least-privilege plus treating all fetched content as untrusted data and gating high-impact actions.

/Key Projects & Companies

ClawHub
The OpenClaw skill registry: the ClawHavoc blast surface, and now the front line of supply-chain scanning.
VirusTotal
The scanning partner ClawHub adopted to vet submitted skills after ClawHavoc.
Claude Managed Agents
Hosted agent infrastructure with scoped permissions as a first-class control surface.

/Glossary

Supply-chain attack: Compromising something you install rather than something you wrote, so the trust you placed in a registry becomes the attacker's entry point. ClawHavoc is the canonical agent-era example.
Typosquatting: Publishing a malicious package under a name one keystroke from a popular one, so a typo or an autocomplete installs the attacker's code.
Trust boundary: The line between code or data you control and code or data you do not. Security failures cluster where an agent treats the far side as the near side.
Prompt injection: Getting an agent to follow instructions hidden in content it fetches, by exploiting that the agent cannot reliably separate data from commands.
Least privilege: Granting the agent only the access one task needs, so a compromise leaks the minimum rather than everything in reach.
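The typosquatting definition above can be turned into a rough pre-install check: flag any requested skill name that sits exactly one edit away from a name you already trust. The sketch below uses an optimal-string-alignment distance (insert, delete, substitute, adjacent swap); the trusted names are hypothetical, and a real registry would need a much larger reference list and would still miss lookalikes that are more than one edit away.

```python
def edit_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insert, delete, substitute, adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ae" typed as "ea".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


# Hypothetical trusted skill names, for illustration only.
KNOWN_SKILLS = {"calendar-sync", "web-search", "git-helper"}


def lookalikes(candidate: str, known: set[str] = KNOWN_SKILLS) -> list[str]:
    """Trusted names that are exactly one edit away from the candidate."""
    return sorted(name for name in known if edit_distance(candidate, name) == 1)
```

A non-empty result from lookalikes("calender-sync") is a reason to stop and check the publisher, not proof of malice; an exact match returns an empty list because its distance is zero.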

/Common Risks

Installing a skill by name match
An unvetted skill runs with your agent's privileges. Pin specific reviewed versions and check the publisher; do not trust the name.
Over-scoped credentials
An agent that can read a broad, long-lived token can leak it. Scope keys to one task, prefer short-lived credentials, and rotate them.
Acting on fetched content
Treat any text the agent retrieves (web pages, issues, messages, tool output) as data, never as instructions. Injection rides in there.
No isolation for untrusted work
Run skills and code in a sandbox (container or remote backend), not on the host that holds your secrets. Isolation is the control you will wish you had.
Silent autonomy
An agent acting without logging or approval gates removes your chance to catch a compromise before it spends or leaks. Gate the actions that move money or data.
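One way to picture the last risk is a small wrapper that logs every action and blocks the high-impact ones until something outside the agent approves them. This is a sketch under stated assumptions: the action names, the run_action function, and the approve callback are hypothetical, and a real deployment would route approval to a human or a policy service rather than a lambda.

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.audit")

# Hypothetical names for actions that move money or data.
HIGH_IMPACT = {"send_payment", "upload_file", "delete_repo"}


def run_action(
    name: str,
    args: dict,
    handler: Callable[[dict], str],
    approve: Callable[[str, dict], bool],
) -> str:
    """Log every action; require explicit approval before high-impact ones run."""
    log.info("requested action=%s args=%s", name, args)
    if name in HIGH_IMPACT and not approve(name, args):
        log.warning("denied action=%s", name)
        raise PermissionError(f"{name} requires approval")
    result = handler(args)
    log.info("completed action=%s", name)
    return result
```

The gate keys on what the action does, not on why the agent wants it, so it holds even when the request originated from injected text.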

/Primary Sources

ClawHub (skill registry): The marketplace and post-incident scanning; useful for checking a skill's publisher before installing.
VirusTotal: The skill-scanning partner.
Claude Managed Agents (documentation): Primary source for the scoped-permission model.
