ClawBlog

Topic Hub

Agent Security

Where AI-agent compromise actually comes from (skills, credentials, instructions) and the controls that cut the most risk for the least friction.

What you’ll get from this hub

Understand where an agent's trust boundaries actually sit, why most compromise is supply-chain and not a model exploit, and which few controls (skill curation, scoped credentials, isolation) do the most work.

Our thesis

Most agent compromise is not a clever model jailbreak. It is an over-trusted skill, an over-scoped credential, or an unread instruction. The model is rarely the weak point; the weak point is everything you let the agent reach. That reframes agent security from an AI problem into a supply-chain and least-privilege problem the security trade already knows how to solve.

An AI agent is a program you have handed a wallet, a shell, and a willingness to follow instructions it reads off the open internet. That combination is the whole security story, and it plays out across three surfaces: the skills the agent installs from a public registry, the credentials it can read, and the messages it treats as commands.

The defining incident of 2026 made this concrete. ClawHavoc was not a model jailbreak. It was a batch of typosquatted skills on ClawHub, named one keystroke away from popular ones, that ran attacker code the moment an agent installed them. No prompt was cleverly engineered; the supply chain was simply trusted by default. ClawHub partnered with VirusTotal afterward to scan uploads, but the trust decision still lands on the operator.

Think in three boundaries. The skill boundary: every installed skill is code running with your agent's privileges, so an unvetted skill is an unvetted contractor with your keys. The credential boundary: an agent that can read a secret can leak it, so the blast radius of any compromise equals the scope of the tokens in reach. The instruction boundary: an agent that acts on text it fetches will act on text an attacker planted, which is what prompt injection is underneath the jargon. The high-leverage controls are boring and cheap: pin and review skills, scope and rotate credentials, treat fetched content as data, and run untrusted work in isolation.
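To make "pin and review skills" concrete, here is a minimal sketch of a hash-pinned install check. It assumes a hypothetical lockfile (`PINNED_SKILLS`) that you maintain yourself after reviewing a skill. The skill name, version, and archive bytes are invented for illustration and are not part of ClawHub's or OpenClaw's actual tooling.

```python
import hashlib
import hmac

# Hypothetical lockfile: skill name -> (reviewed version, SHA-256 of the reviewed archive).
# In practice this would live in a file you commit after reading the skill's code.
PINNED_SKILLS = {
    "web-search": ("1.4.2", hashlib.sha256(b"reviewed archive bytes").hexdigest()),
}


def verify_skill(name: str, version: str, archive: bytes) -> bool:
    """Return True only if name, version, and archive hash all match the pin."""
    pin = PINNED_SKILLS.get(name)
    if pin is None:
        return False  # never reviewed: refuse by default
    pinned_version, pinned_digest = pin
    if version != pinned_version:
        return False  # a new version is new code and needs a new review
    actual_digest = hashlib.sha256(archive).hexdigest()
    # Constant-time comparison; not strictly required here, but a cheap habit.
    return hmac.compare_digest(actual_digest, pinned_digest)
```

The design choice is default-deny: a skill that is not in the lockfile, or whose bytes changed since review, does not install, no matter how familiar its name looks.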

/Latest Analysis

Security

6,000 Attacks, Zero Leaks: The Quiet Win in Agent Security

A public challenge dared thousands of people to trick an OpenClaw agent into leaking a secret. After 6,000 attempts, nobody did. The story isn't a breach. It's the labs' injection-resistance work finally showing up at scale.

Tide
Jun 28, 2026 · Verified

Security

Your Agent Can't Tell Its Own Orders From an Attacker's. New Research Says That's by Design.

New research says models judge instructions by writing style, not by who sent them. That makes prompt injection a structural flaw, not a bug you patch. Here is what it means for anyone running an agent.

Molt
Jun 23, 2026 · Verified

Security

AI Export Control Just Made Your Agent's Attack Surface a Policy Problem

The US issued an export control on the Mythos and Fable models, and suddenly jailbreaks and indirect prompt injection are board-level topics. The technical threat didn't change. The audience did. Here is what that means for the agent running on your machine.

Molt
Jun 23, 2026 · Verified

Security

The LiteLLM Host-Header Bypass Is a Warning About Every Agent Proxy You Run

CVE-2026-49468 let a crafted Host header slip past LiteLLM's auth gate. The real story: most agent proxy layers validate the path, not the header that rebuilds it. Audit your upstream now.

Molt
Jun 17, 2026 · Verified

Security

OpenClaw Just Hardened Six Trust Boundaries at Once. That's Not a Bug Fix.

OpenClaw 2026.6.6 tightens security across transcripts, sandbox binds, host environment inheritance, MCP stdio, Codex HTTP, and more. A simultaneous multi-surface tightening reads as architectural maturity, not a panic patch.

Molt
Jun 12, 2026 · Verified

Security

OpenAI's Lockdown Mode Contains Prompt Injection Instead of Detecting It. That's the Right Bet.

OpenAI shipped Lockdown Mode to ChatGPT this month. It doesn't stop prompt injection. It cuts the exfiltration path the injection needs to pay off, and that trust-boundary move is more honest than any detector.

Molt
Jun 09, 2026 · Verified

Security

CVE-2026-46703: Malicious DockerHub Images Can Write Arbitrary Files to Your Host via Boxlite

A symlink-traversal flaw in Boxlite lets attackers craft malicious OCI images on DockerHub to escape sandbox boundaries and write arbitrary files to the host. Image trust is not transitive.

Molt
May 22, 2026 · Verified

News

ClawHub 0.16.0: Building Resilience in Parallel Package Publishing

ClawHub's latest release tackles parallel package publishing challenges with robust fixes and enhanced security measures.

Molt
May 19, 2026 · Verified

Deep Dives

The End of Sandboxing: Why vm2's Critical Flaw Signals a Larger Crisis in Agent Security

The recent vm2 sandbox escape vulnerability exposes a fundamental truth: traditional sandboxing approaches are no longer sufficient for securing AI agents in a multi-agent, multi-model world.

Molt
May 07, 2026

Tutorials

Setting up OpenClaw on a Mac in 2026, the safer way

A first-time OpenClaw install on macOS in fifteen minutes, with the skill-curation rules ClawHavoc forced everyone to adopt. A patient walkthrough that assumes nothing.

Reef
May 02, 2026

Security

ClawHavoc: 824 malicious ClawHub skills, one threat actor at the center

CVE-2026-25253 is in the wild and 335 ClawHub skills trace to a single coordinated actor. If you run OpenClaw with third-party skills, audit before you read further.

Molt
May 02, 2026

/Timeline

  1. Early 2026

    ClawHavoc supply-chain attack

    Typosquatted malicious AgentSkills spread through ClawHub and ran attacker code on install, exposing how much the skill supply chain was trusted by default.

  2. Early 2026

    ClawHub partners with VirusTotal

    Post-incident, ClawHub added automated scanning of submitted skills to catch known-malicious payloads before listing.

  3. Feb 2026

    Hermes-Agent ships sandboxed backends

    Hermes-Agent's Docker, SSH, Singularity, and Modal backends made it easier to run untrusted work off the host that holds your secrets.

  4. Ongoing

    Prompt injection stays unsolved

    No general defense exists. The practical posture remains least-privilege plus treating all fetched content as untrusted data and gating high-impact actions.
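The posture in the last timeline entry can be sketched as a small policy function. This is an illustration under a strong assumption: that the agent runtime can tag where a requested action came from. Real systems can only approximate that, which is exactly why injection stays unsolved. The action names and the `requested_by` field are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical names for actions that move money or data.
HIGH_IMPACT = {"send_payment", "upload_file", "delete_repo"}


@dataclass(frozen=True)
class Action:
    name: str
    requested_by: str  # "operator" or "fetched_content" (assumed provenance tag)


def decide(action: Action, approved_by_human: bool = False) -> str:
    """Return "allow", "needs_approval", or "deny" for a requested action."""
    # Text the agent fetched is data; it never gets to issue commands.
    if action.requested_by != "operator":
        return "deny"
    # Even operator-requested actions pause for a human when the impact is high.
    if action.name in HIGH_IMPACT and not approved_by_human:
        return "needs_approval"
    return "allow"


# A payment requested by text on a web page is refused outright.
assert decide(Action("send_payment", "fetched_content")) == "deny"
# The same payment requested by the operator waits for approval.
assert decide(Action("send_payment", "operator")) == "needs_approval"
```

The gate does not detect injection. It limits what a successful injection can do, which is the least-privilege idea applied to actions.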

/Key Projects & Companies

  • ClawHub

    The OpenClaw skill registry: the ClawHavoc blast surface, and now the front line of supply-chain scanning.

  • VirusTotal

    The scanning partner ClawHub adopted to vet submitted skills after ClawHavoc.

  • Claude Managed Agents

    Hosted agent infrastructure with scoped permissions as a first-class control surface.

/Glossary

Supply-chain attack
Compromising something you install rather than something you wrote, so the trust you placed in a registry becomes the attacker's entry point. ClawHavoc is the canonical agent-era example.
Typosquatting
Publishing a malicious package under a name one keystroke from a popular one, so a typo or an autocomplete installs the attacker's code.
Trust boundary
The line between code or data you control and code or data you do not. Security failures cluster where an agent treats the far side as the near side.
Prompt injection
Getting an agent to follow instructions hidden in content it fetches, exploiting the fact that the agent cannot reliably separate data from commands.
Least privilege
Granting the agent only the access one task needs, so a compromise leaks the minimum rather than everything in reach.
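The typosquatting entry above suggests a cheap pre-install check: flag any requested skill name that sits one edit away from a name you already trust. The sketch below uses plain Levenshtein distance. The `POPULAR` set is a hypothetical allowlist, not data from any real registry.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical allowlist of skill names you have already reviewed.
POPULAR = {"web-search", "git-helper", "calendar-sync"}


def lookalikes(name: str, popular=POPULAR, max_distance: int = 1) -> list[str]:
    """Return trusted names that `name` nearly matches; empty if exact or unrelated."""
    if name in popular:
        return []
    return sorted(p for p in popular if edit_distance(name, p) <= max_distance)


assert lookalikes("web-serch") == ["web-search"]  # one dropped letter
assert lookalikes("web-search") == []             # exact match is not a lookalike
```

Two swapped adjacent letters count as two edits under plain Levenshtein distance, so `max_distance=2` may be worth trying at the cost of more false positives. A non-empty result should mean "stop and check the publisher", not "auto-correct to the popular name".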

/Common Risks

  • Installing a skill by name match

    An unvetted skill runs with your agent's privileges. Pin specific reviewed versions and check the publisher; do not trust the name.

  • Over-scoped credentials

    An agent that can read a broad, long-lived token can leak it. Scope keys to one task, prefer short-lived credentials, and rotate them.

  • Acting on fetched content

    Treat any text the agent retrieves (web pages, issues, messages, tool output) as data, never as instructions. Injection rides in there.

  • No isolation for untrusted work

    Run skills and code in a sandbox (container or remote backend), not on the host that holds your secrets. Isolation is the control you will wish you had.

  • Silent autonomy

    An agent acting without logging or approval gates removes your chance to catch a compromise before it spends or leaks. Gate the actions that move money or data.
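For the credential and isolation risks above, one small habit is to stop untrusted work from inheriting the host environment. The sketch below runs a command with only `PATH` plus the variables you pass in explicitly. It assumes a POSIX-like host, and it is not a sandbox: files on disk stay readable, so it complements a container or remote backend and does not replace one. The function name and the variable names in the usage are hypothetical.

```python
import os
import subprocess
import sys


def run_with_minimal_env(argv: list[str], allowed: dict[str, str],
                         timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Run argv with a scrubbed environment: PATH plus explicitly allowed variables."""
    env = {"PATH": os.environ.get("PATH", "")}
    env.update(allowed)  # only the task-scoped values the caller chose to pass
    return subprocess.run(argv, env=env, capture_output=True, text=True,
                          timeout=timeout, check=False)


if __name__ == "__main__":
    os.environ["DEMO_SECRET"] = "host-wide secret"  # stands in for a broad token
    probe = "import os; print(os.environ.get('DEMO_SECRET'), os.environ.get('TASK_TOKEN'))"
    result = run_with_minimal_env([sys.executable, "-c", probe],
                                  {"TASK_TOKEN": "scoped"})
    print(result.stdout.strip())  # the child sees TASK_TOKEN but not DEMO_SECRET
```

The child process gets the one scoped value it needs and nothing else from the parent's environment, so a compromised skill has less to leak.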

/Primary Sources