An AI agent is a program you have handed a wallet, a shell, and a willingness to follow instructions it reads off the open internet. That combination is the whole security story. The model is rarely the failure point. The failure point is everything you let the agent reach: the skills it installs from a public registry, the credentials it can read, the messages it treats as commands.
The defining incident of 2026 made this concrete. ClawHavoc was not a model jailbreak. It was a batch of typosquatted skills on ClawHub, named one keystroke away from popular ones, that ran attacker code the moment an agent installed them. No prompt was cleverly engineered; the supply chain was simply trusted by default. ClawHub partnered with VirusTotal afterward to scan uploads, but the trust decision still lands on the operator.
Think in three boundaries. The skill boundary: every installed skill is code running with your agent's privileges, so an unvetted skill is an unvetted contractor with your keys. The credential boundary: an agent that can read a secret can leak it, so the blast radius of any compromise equals the scope of the tokens in reach. The instruction boundary: an agent that acts on text it fetches will act on text an attacker planted, which is what prompt injection is underneath the jargon. The high-leverage controls are boring and cheap: pin and review skills, scope and rotate credentials, treat fetched content as data, and run untrusted work in isolation.
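As an added example that is not part of the original article, the following Python install gate shows what "pin and review skills" looks like in code: a skill is installed only when its name, version and digest match an operator-reviewed allowlist, and a name one keystroke away from an allowlisted skill (insertion, deletion, substitution or adjacent swap) is refused as a typosquat suspect. The allowlist, skill names and digests are constructed placeholders, not real ClawHub entries.

```python
from dataclasses import dataclass

# Constructed sample allowlist: skills the operator has reviewed and pinned.
# Digests are placeholders, not the real hashes of any real skill.
PINNED_SKILLS = {
    "web-fetch": ("1.4.2", "sha256:PLACEHOLDER_DIGEST_WEBFETCH"),
    "pdf-reader": ("0.9.0", "sha256:PLACEHOLDER_DIGEST_PDFREADER"),
}


def keystroke_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: an insert, delete, substitute or
    adjacent swap each cost 1, so 'one keystroke away' means a distance of 1."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


@dataclass
class Decision:
    allowed: bool
    reason: str


def vet_skill(name: str, version: str, digest: str) -> Decision:
    """Gate an install request. Only an exact pinned match is allowed; a name
    one keystroke from a pinned skill is flagged; anything else is unreviewed."""
    if name in PINNED_SKILLS:
        pinned_version, pinned_digest = PINNED_SKILLS[name]
        if (version, digest) == (pinned_version, pinned_digest):
            return Decision(True, f"pinned match {name}@{version}")
        return Decision(False, f"{name}@{version} does not match pin {pinned_version}")
    for known in PINNED_SKILLS:
        if keystroke_distance(name.lower(), known.lower()) <= 1:
            return Decision(False, f"{name!r} is one keystroke from pinned {known!r}: typosquat suspect")
    return Decision(False, f"{name!r} is not on the reviewed allowlist")
```

The digest check is what makes the pin meaningful: a registry can re-publish the same name and version with different contents, and only the hash catches that.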
Glossary
- Supply-chain attack: Compromising something you install rather than something you wrote, so the trust you placed in a registry becomes the attacker's entry point. ClawHavoc is the canonical agent-era example.
- Typosquatting: Publishing a malicious package under a name one keystroke from a popular one, so a typo or an autocomplete installs the attacker's code.
- Trust boundary: The line between code or data you control and code or data you do not. Security failures cluster where an agent treats the far side as the near side.
- Prompt injection: Getting an agent to follow instructions hidden in content it fetches, by exploiting that the agent cannot reliably separate data from commands.
- Least privilege: Granting the agent only the access one task needs, so a compromise leaks the minimum rather than everything in reach.