A computer-use agent operates a screen the way a person does: it looks at the pixels, moves the mouse, clicks, and types. A browser agent is the common special case, driving a web browser to navigate, fill forms, and read pages. The appeal is generality. Most software has no clean API, but almost all of it has a GUI, so an agent that can use a GUI can in principle use anything, with zero per-app integration.
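The look, decide, act cycle can be sketched in a few lines. This is a minimal illustration, not any vendor's API: `Action`, `FakeScreen`, `run_agent`, and `scripted_policy` are hypothetical names, the "screenshot" is a text stand-in for pixels, and the policy is a fixed script where a real agent would call a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Action:
    """One GUI action. Hypothetical shape, chosen for illustration."""
    kind: str                             # "click", "type", or "done"
    xy: Optional[Tuple[int, int]] = None  # screen coordinates for a click
    text: str = ""                        # text to type


class FakeScreen:
    """Stand-in for a real display: records the actions applied to it."""

    def __init__(self) -> None:
        self.log: List[Action] = []

    def screenshot(self) -> str:
        # A real agent would capture pixels; this returns a short description.
        return f"screen after {len(self.log)} actions"

    def apply(self, action: Action) -> None:
        self.log.append(action)


def run_agent(screen: FakeScreen,
              policy: Callable[[str], Action],
              max_steps: int = 20) -> int:
    """Observe, decide, act until the policy says done or steps run out.

    Returns the number of actions applied to the screen.
    """
    for step in range(max_steps):
        observation = screen.screenshot()   # look at the screen
        action = policy(observation)        # decide what to do next
        if action.kind == "done":
            return step
        screen.apply(action)                # move, click, or type
    return max_steps


def scripted_policy() -> Callable[[str], Action]:
    """Fixed script standing in for a model call (assumption for the demo)."""
    script = iter([
        Action("click", xy=(120, 48)),
        Action("type", text="hello"),
        Action("done"),
    ])
    return lambda observation: next(script)


screen = FakeScreen()
steps = run_agent(screen, scripted_policy())
```

Nothing in the loop refers to a particular application: the only interface is the screen in and input events out, which is where the generality comes from.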
That generality is also the catch. An agent with mouse, keyboard, and screen access has the broadest reach of any agent pattern: it can act in any open application, see whatever is on screen (including any secrets displayed there and the credentials it types), and take irreversible actions one click at a time. It also runs on the most brittle substrate in the stack. UIs move, layouts change, a modal appears at the wrong moment, and the agent that worked yesterday misclicks today. Reliability, which is already the hard problem for agents, is hardest here.
The security model follows from the reach. On-screen content is untrusted input, so a malicious page or document can carry a prompt injection straight into an agent that is reading the screen and able to act on what it reads. The defenses are the familiar ones turned up a notch: run the agent in an isolated environment (a dedicated VM or sandboxed browser, never your daily machine), scope what it can reach, gate the irreversible actions behind approval, and treat every screen it reads as hostile until proven otherwise. Anthropic and OpenAI have both shipped computer-using agents, which moved this from research demo to something operators actually have to threat-model.
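Two of those defenses, scoping reach and gating irreversible actions, can be sketched as a check that runs outside the model on every proposed action. This is a minimal sketch under stated assumptions: `ProposedAction`, `gate`, the host allowlist, and the set of irreversible action kinds are all hypothetical, and a real deployment would define its own policy and approval channel.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ProposedAction:
    """An action the agent wants to take. Hypothetical shape."""
    kind: str         # e.g. "navigate", "click", "type", "delete"
    target: str = ""  # URL for "navigate", element label otherwise


# Illustrative policy values (assumptions, not a recommended list).
ALLOWED_HOSTS: FrozenSet[str] = frozenset({"intranet.example.com"})
IRREVERSIBLE_KINDS: FrozenSet[str] = frozenset(
    {"submit_payment", "delete", "send_email"}
)


def gate(action: ProposedAction,
         approve: Callable[[ProposedAction], bool]) -> str:
    """Return "allow" or "deny" for one proposed action.

    `approve` stands in for a human approval step; it is only consulted
    for action kinds listed as irreversible.
    """
    if action.kind == "navigate":
        # Scope reach: only hosts on the allowlist may be visited.
        host = urlsplit(action.target).hostname or ""
        if host not in ALLOWED_HOSTS:
            return "deny"
    if action.kind in IRREVERSIBLE_KINDS:
        # Gate irreversible actions behind explicit approval.
        return "allow" if approve(action) else "deny"
    return "allow"


def deny_all(action: ProposedAction) -> bool:
    """Approver that rejects everything, useful as a safe default."""
    return False


decision = gate(ProposedAction("navigate", "https://evil.example.net/"), deny_all)
```

The point of the design is that the decision depends only on the action and a fixed policy, not on anything the agent read from the screen, so injected text can change what the agent proposes but not what the gate permits.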