A model is easy to score: ask a question, grade the answer. An agent is not, because the unit of work is a trajectory, not an answer. The agent plans, calls tools, reacts to what it gets back, and may reach a good outcome by a bad path or a bad outcome by a plausible-looking one. That gap is why the most common agent failure is not a dramatic crash but a quiet one: a system that demos beautifully and then misbehaves on the long tail of inputs nobody evaluated.
Reliability is the real constraint, and it is a different thing from capability. A more capable model raises the ceiling; reliability decides how often you actually reach it without supervision. Measuring it means reporting task-level success rates over repeated runs on realistic inputs, not a single impressive run, together with the cost of getting there. An agent that succeeds 70 percent of the time and needs a human to catch the other 30 percent is a very different product from one that succeeds 99 percent of the time, even if both demo identically.
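As a rough illustration of pairing success with cost, the sketch below summarizes a batch of agent runs. The `RunResult` record, its field names, and the `summarize` helper are hypothetical, invented for this example; the Wilson score interval is one common way to put error bars on a success rate, used here as an assumption rather than a prescribed method.

```python
import math
from dataclasses import dataclass


@dataclass
class RunResult:
    """One graded agent run (hypothetical record shape for this sketch)."""
    task_id: str
    success: bool
    cost_usd: float


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def summarize(runs: list[RunResult]) -> dict:
    """Report success rate with an interval, plus cost per run and per success."""
    trials = len(runs)
    successes = sum(r.success for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "success_rate": successes / trials,
        "ci95": wilson_interval(successes, trials),
        "cost_per_run": total_cost / trials,
        # Failed runs still cost money, so divide total spend by successes only.
        "cost_per_success": total_cost / successes if successes else math.inf,
    }


if __name__ == "__main__":
    # Illustrative data only: 7 successes out of 10 runs at $0.10 each.
    runs = [RunResult(f"task-{i}", i < 7, 0.10) for i in range(10)]
    print(summarize(runs))
```

With only ten runs the interval is wide, which is the point: a handful of good runs says little about the long tail, and the cost per success is always higher than the cost per run whenever some runs fail.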
The practical toolkit is reliability engineering applied to agents: per-step tracing so you can answer "why did it do that," evaluation sets that include the messy and adversarial cases, and metrics that pair success with cost (see the Agent Economics hub). Tracing connects directly to security and orchestration: an agent you cannot trace is an agent you cannot trust, debug, or bound. Evaluation is not a launch gate you pass once; it is the instrument panel you operate the agent by.
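Per-step tracing can be sketched in a few lines. The `Trace` and `Step` classes below are hypothetical names for this example, not any particular framework's API; the idea is only that every tool call goes through one wrapper that records what was called, with what arguments, what came back, and how long it took.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Step:
    """One recorded tool call in an agent trajectory (hypothetical shape)."""
    index: int
    tool: str
    args: dict
    output: Any = None
    error: Optional[str] = None
    duration_s: float = 0.0


@dataclass
class Trace:
    """Collects the steps of a single task so the path can be inspected later."""
    task_id: str
    steps: list = field(default_factory=list)

    def call(self, tool: str, fn: Callable[..., Any], **args: Any) -> Any:
        """Run fn(**args) and record the step, whether it succeeds or raises."""
        step = Step(index=len(self.steps), tool=tool, args=args)
        start = time.perf_counter()
        try:
            step.output = fn(**args)
            return step.output
        except Exception as exc:
            step.error = f"{type(exc).__name__}: {exc}"
            raise  # tracing observes failures; it does not swallow them
        finally:
            step.duration_s = time.perf_counter() - start
            self.steps.append(step)

    def to_json(self) -> str:
        """Serialize the trace; non-JSON values fall back to their str() form."""
        return json.dumps(asdict(self), default=str)


if __name__ == "__main__":
    trace = Trace(task_id="demo")
    trace.call("add", lambda a, b: a + b, a=1, b=2)
    try:
        trace.call("divide", lambda a, b: a / b, a=1, b=0)
    except ZeroDivisionError:
        pass
    print(trace.to_json())
```

Because failed steps are recorded before the exception propagates, the trace still answers "why did it do that" for runs that ended badly, which are the runs most worth reading.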