Phoenix's Custom Eval Functions Reveal What Every Agent Framework Quietly Admits: Fixed Rubrics Don't Work

Phoenix's new server-side eval functions aren't just a feature addition. They're a quiet confession that preset judge templates have been failing production teams for a while.

Here is the uncomfortable premise buried inside Phoenix v16.0.0: the eval tooling that most teams have been shipping with was never fit for purpose. Arize doesn't say that directly. The release notes do it for them.

"Most eval tooling hands you a fixed menu of judge templates," the Phoenix v16.0.0 announcement reads. "Real evaluation is rarely that tidy."

That's not a feature pitch. That's a diagnosis. The industry spent two years building agent evaluation frameworks anchored to preset rubrics: did the agent cite a source, did the response stay under a token budget, did the output match a known-good string. Those metrics were measurable. They were also, in many production contexts, almost useless. An agent that answers a customer support question correctly but in the wrong tone, or that synthesizes a legal summary accurately but omits a crucial caveat, doesn't fail any standard template check.

Phoenix's Code Evaluators let users write a custom scoring function directly in the product UI, in Python or TypeScript, with no local runtime and no deployment step. Phoenix runs it server-side and stamps every experiment run with the resulting labels and scores. The friction between "I need a bespoke evaluation criterion" and "I have a bespoke evaluation criterion running in production" collapses to the time it takes to write a function.

Meanwhile, the rest of the agent ecosystem is shipping background session persistence, hooks systems, and streaming improvements. All useful. None of them touches the question that actually determines whether an agent is trustworthy: how do you know it's doing the right thing? Phoenix is the one framework this week that tried to answer that. The answer reveals more about the ecosystem's problems than about the feature itself.

The preset rubric era is ending because production agents are too varied for it

Template-based evaluation made sense when agents were narrow. A retrieval agent that always answers questions about a fixed corpus, a classifier that routes support tickets, a summarizer with a well-defined domain: for these, a fixed rubric works because the output space is bounded. You can write a judge prompt that asks "did the response stay on topic" and that question has a stable meaning.

Production agents in 2026 don't look like that. They span multiple tools, compose partial outputs across several steps, and operate in domains where the correct answer depends on context that no static template captures. The Autonomy Spectrum that analysts use to map agent deployments is stretching toward its upper end, and the evaluation infrastructure has not kept pace. A copilot that drafts one email is easy to check. An agent that runs a multi-step research workflow, synthesizes findings across sources, and writes a client memo is not.

This is the problem Phoenix is acknowledging. According to the v16.0.0 release notes, Code Evaluators support composite scoring, described as blending "sub-scores (LLM judgment + deterministic rules) into one weighted metric." They also support embedding-based evaluation using cosine similarity rather than string matching, and explicit LLM-as-judge patterns. That's not a feature menu. That's an admission that the evaluation space has three distinct regimes, and that teams regularly need all three at once, weighted according to their specific use case.

A fixed template addresses none of this complexity. A template can ask an LLM judge whether the output was helpful. It cannot weight helpfulness at 40%, factual accuracy at 40%, and tone-appropriateness at 20%, then apply that formula consistently across ten thousand experiment runs. A custom function can. That gap, between what template evals can measure and what production teams actually need to measure, is precisely what Code Evaluators close.
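
To make the 40/40/20 example concrete, here is a minimal Python sketch of a composite scorer. It is not Phoenix's evaluator API: every function name, the return shape, and the 0.75 pass threshold are assumptions made for illustration, and the "LLM judge" is a deterministic stand-in so the example runs without a model.

```python
# Hypothetical weights, taken from the 40/40/20 example in the text.
WEIGHTS = {"helpfulness": 0.4, "accuracy": 0.4, "tone": 0.2}


def judge_helpfulness(output: str) -> float:
    """Stand-in for an LLM judge; a real evaluator would call a model here.

    Returns 1.0 for answers of at least five words, otherwise 0.5.
    """
    return 1.0 if len(output.split()) >= 5 else 0.5


def check_accuracy(output: str, required_facts: list) -> float:
    """Deterministic rule: fraction of required facts present in the output."""
    if not required_facts:
        return 1.0
    lowered = output.lower()
    hits = sum(fact.lower() in lowered for fact in required_facts)
    return hits / len(required_facts)


def check_tone(output: str, banned_phrases: list) -> float:
    """Deterministic rule: 0.0 if any banned phrase appears, else 1.0."""
    lowered = output.lower()
    return 0.0 if any(p.lower() in lowered for p in banned_phrases) else 1.0


def composite_score(output: str, required_facts: list, banned_phrases: list) -> dict:
    """Blend the three sub-scores into one weighted metric plus a label."""
    sub_scores = {
        "helpfulness": judge_helpfulness(output),
        "accuracy": check_accuracy(output, required_facts),
        "tone": check_tone(output, banned_phrases),
    }
    score = sum(WEIGHTS[name] * value for name, value in sub_scores.items())
    label = "pass" if score >= 0.75 else "fail"  # threshold is an assumption
    return {"score": round(score, 3), "label": label, "sub_scores": sub_scores}


good = composite_score(
    "Your refund was issued today and should arrive within 5 business days.",
    required_facts=["refund", "5 business days"],
    banned_phrases=["as I already said"],
)
bad = composite_score(
    "As I already said, your refund was issued today.",
    required_facts=["refund", "5 business days"],
    banned_phrases=["as I already said"],
)
print(good["label"], good["score"])
print(bad["label"], bad["score"])
```

The point of the sketch is the shape rather than the specific rules: the weights live in one place, so the same formula applies identically to every run it scores.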

The pattern resembles a dynamic that appears repeatedly in developer tooling: a category starts with opinionated, easy-to-adopt presets; as real-world use cases multiply, the presets become bottlenecks; the tooling that survives is the tooling that learns to get out of the way. Phoenix is betting that agent evaluation has reached that inflection point.

Removing the deploy step is the actual innovation, not the scoring logic

The most consequential line in the Phoenix v16.0.0 announcement is not about composite scoring or embedding similarity. It is this: "no SDK, no local runtime, no deploy step."

Agent evaluation frameworks have historically demanded that teams own the full evaluation pipeline. You write your judge function locally, you test it in a notebook, you package it into something your eval harness can call, and then you deploy it alongside your observability stack. That pipeline has a name in engineering culture: it's called friction, and friction is where good intentions go to die. Teams that intend to run custom evals in production routinely end up running no custom evals in production, because the path from "I want to check this thing" to "this thing is being checked on every run" has too many steps.

Phoenix's Code Evaluators collapse that path. The user writes the function in the Phoenix UI. Phoenix runs it server-side. The results appear as annotations on every experiment run. There is no infrastructure to provision, no deployment pipeline to maintain, no local environment to replicate on the server. The gap between experimental and production evaluation disappears.

This is worth pausing on, because it's a different kind of product decision than it might appear. It would have been easier to ship a richer library of preset templates, add more judge personas, cover more domains. That's additive. What Phoenix shipped instead is subtractive: it removed the steps between intention and outcome. In tooling design, subtractive decisions are harder to make and more durable in their effects.

The broader pattern this resembles is what happened to deployment infrastructure over the past decade. The value didn't accrue to the teams that built better servers. It accrued to the platforms that made the deploy step invisible. If evaluation is becoming a first-class production concern (and the evidence suggests it is), then the platform that makes evaluation deployment invisible is positioning itself the same way those earlier infrastructure platforms did.

The Harness Hypothesis applies here: the value in AI systems increasingly lives not in the model itself but in the harness connecting the model to real-world assessment. Phoenix is building harness, not model.

Server-side execution changes what's possible for teams who don't run local infrastructure

There is a class of teams using agent frameworks today who are not running local Python environments against their production observability data. They are using hosted tools, often with minimal DevOps support, and they are making evaluation decisions based on whatever their platform provides by default. For these teams, the option to write a custom eval has been theoretically available for years. In practice, the option required infrastructure they didn't have.

Server-side execution of evaluation functions changes that calculus entirely. When Phoenix runs the function, the user's infrastructure doesn't matter. A team running agents through a hosted orchestration layer, with no local runtime, can still write and deploy a bespoke evaluation criterion. The evaluation capability scales with Phoenix's infrastructure, not the team's.

This matters for the ecosystem map in a specific way. Evaluation has historically been a premium capability, accessible to teams with engineering resources to build and maintain custom harnesses. Server-side eval execution is a democratizing move. It extends production-grade evaluation to teams that would otherwise be stuck with preset templates, not because they chose them, but because they lacked the infrastructure to use anything else.

The release notes are specific about what this enables beyond the basics. Composite scoring allows teams to blend "sub-scores (LLM judgment + deterministic rules) into one weighted metric." Embedding-based evaluation substitutes cosine similarity for string matching, which matters for any domain where semantically correct answers don't use identical phrasing. LLM-as-judge patterns are explicitly supported. These aren't experimental features. They're production patterns that teams have been cobbling together in notebooks and calling from ad-hoc scripts. Phoenix is formalizing them and making them server-executable.
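
The difference between string matching and cosine similarity can be shown in a few lines of Python. This is a sketch under stated assumptions: the three-dimensional vectors are hand-written toy values standing in for real embeddings, the 0.9 threshold is arbitrary, and none of the names come from Phoenix.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def exact_match(output: str, reference: str) -> bool:
    """The string-matching baseline: case-insensitive equality."""
    return output.strip().lower() == reference.strip().lower()


reference_text = "The meeting is moved to Friday."
output_text = "We rescheduled the meeting for Friday."

# Toy vectors standing in for embeddings of the two sentences (assumed values).
reference_vec = [0.9, 0.1, 0.4]
output_vec = [0.8, 0.2, 0.5]

string_ok = exact_match(output_text, reference_text)
similarity = cosine_similarity(reference_vec, output_vec)
semantic_ok = similarity >= 0.9  # threshold is an assumption

print(string_ok, semantic_ok)
```

With these toy values the string check rejects the paraphrase while the similarity check accepts it, which is the case the embedding-based pattern exists to handle.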

The downstream effect, if adoption follows, is a shift in what the baseline for "adequate evaluation" looks like across the industry. Right now the baseline is: run some preset checks, log the results. The new baseline, if Phoenix's bet lands, will be: run your own scoring logic, server-side, on every run, with composite metrics. That's a materially higher floor.

The rest of this week's ecosystem releases show how much remains unevaluated

It is worth placing Phoenix v16.0.0 against what else shipped in the same window, because the contrast is instructive.

Goose v1.35.0 shipped a hooks system for pre- and post-tool execution, including a PreToolUse denial hook and a /goal command for agent self-evaluation before finishing a task. Useful extensibility, and the /goal command in particular gestures toward the same problem space Phoenix is in. But Goose's self-evaluation is an agent introspecting on its own goal completion. Phoenix's evaluation is an external observer scoring the agent's output against human-defined criteria. These are different things solving adjacent problems.

Agno v2.6.9 added resolved approval records to post-hooks, so observability integrations can now read the full record of who approved what and when. That's audit trail work, not evaluation. Claude Code shipped two releases: v2.1.147 pinned background sessions so they persist through idle periods and survive updates, and v2.1.148 fixed a regression that caused the Bash tool to return an error code on every command. Infrastructure fixes. Important, but not evaluation.

LangGraph 1.2.1 added stream transformers with a before-built-ins option. LangChain's OpenAI integration shipped a patch with audio chat test fixes and model context size corrections. Neither touches evaluation.

The pattern across a single week of agent ecosystem releases: improvements to execution, persistence, extensibility, and infrastructure. One release, Phoenix's, improved the ability to know whether any of it is working correctly.

This is not a criticism of the other projects. Execution infrastructure has to be solid before evaluation matters. But it does suggest that the ecosystem's investment in evaluation capability is still disproportionately low relative to its investment in execution capability. Teams can now run highly persistent, extensible, observable agents at scale. Figuring out whether those agents are actually doing the right thing remains largely a manual, ad-hoc process for most of them.

Phoenix is not solving the hardest evaluation problem, and that gap matters

The Code Evaluators feature deserves credit for what it is. It also deserves scrutiny for what it isn't.

Custom scoring functions execute against experiment runs. That means Phoenix's evaluation model is fundamentally retrospective: you run an experiment, the evaluator scores the results, you learn what happened. This is useful. It is not the same as real-time, in-context evaluation that can interrupt an agent mid-task when its behavior drifts outside acceptable parameters.

The Capability vs. Controllability Frontier describes a real tension in agent deployment: as agents become more capable and take longer-horizon actions, the window in which evaluation can intervene shrinks. An agent that researches, synthesizes, and acts over a twenty-step workflow creates most of its evaluation-relevant behavior in the middle of that workflow, not at the end. Post-hoc scoring on the final output tells you whether the output was acceptable. It doesn't tell you where the workflow went wrong, and it certainly doesn't stop a bad workflow midway.
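
The distinction can be sketched in Python by replaying a recorded three-step workflow through both styles of check. Everything here is hypothetical: the step record format, the rule about unapproved sources, and the function names are illustrative assumptions, and a live guardrail would sit inside the agent loop rather than iterate over a finished trace.

```python
from typing import Callable, Optional, Tuple

# Hypothetical recorded workflow: each step is {"tool": ..., "output": ...}.
workflow = [
    {"tool": "search", "output": "3 sources found"},
    {"tool": "fetch", "output": "quote taken from unverified-blog"},
    {"tool": "write", "output": "Client memo drafted"},
]


def score_final_output(steps: list) -> bool:
    """Post-hoc check: inspects only the last step's output."""
    return bool(steps) and "memo" in steps[-1]["output"].lower()


def run_with_guardrail(
    steps: list, is_acceptable: Callable[[dict], bool]
) -> Tuple[list, Optional[int]]:
    """Step-level check: stop at the first step that violates the rule.

    Returns the steps that completed and the index of the violation, if any.
    """
    completed = []
    for index, step in enumerate(steps):
        if not is_acceptable(step):
            return completed, index
        completed.append(step)
    return completed, None


def no_unapproved_sources(step: dict) -> bool:
    """Illustrative rule: reject any step that drew on an unapproved source."""
    return "unverified-blog" not in step["output"]


final_ok = score_final_output(workflow)
completed, violation_at = run_with_guardrail(workflow, no_unapproved_sources)

print(final_ok, violation_at, len(completed))
```

The final-output check passes because the memo exists; the step-level check halts at the second step, before the memo is ever written.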

Phoenix's approach works well for the experiment-and-iterate cycle that characterizes agent development. It works less well for agents deployed at the upper end of the Autonomy Spectrum, where real-time behavioral guardrails matter more than retrospective scoring. The release notes don't claim otherwise. But that gap is where the next wave of evaluation tooling will have to go.

The Goose /goal command, which lets an agent self-evaluate before declaring a task finished, hints at a different evaluation architecture: one where the agent itself is a participant in the evaluation loop, not just the subject of it. These two approaches, external retrospective scoring and in-process self-assessment, are likely to converge in the next generation of agent evaluation frameworks. Phoenix is building strong infrastructure for the first. The ecosystem is beginning to experiment with the second.

Commoditize Your Complement explains why Arize is giving this away in the UI

Arize is an observability and evaluation platform. Its commercial product, Arize AX, competes in a space where paying customers want deep visibility into model and agent behavior. Phoenix, the open-source project, is the top-of-funnel: get teams dependent on Phoenix's instrumentation and evaluation primitives, then convert them to Arize AX when they need enterprise-grade features.

Code Evaluators in the Phoenix UI fit this strategy precisely. The more expressive Phoenix's evaluation capabilities become, the more teams instrument their agents with Phoenix. The more deeply instrumented a team's agent stack is, the higher the switching cost when they consider moving to a competing observability platform. Custom evaluation logic, especially logic that teams write and refine over months, is not easily portable. It embeds into workflows, gets referenced in retrospectives, and becomes part of how a team reasons about their agents' behavior.

This is the Commoditize Your Complement dynamic applied to evaluation. Arize's complement is the evaluation infrastructure that teams use before they're Arize customers. By making that infrastructure more powerful and free, Arize drives the cost of the complement toward zero, deepens dependency on it, and strengthens the case for the paid product. The no-SDK, no-deploy-step design is not just a UX choice. It minimizes the activation energy required to become dependent on Phoenix's evaluation infrastructure.

The Bowling Alley framing also applies. Phoenix started with LLM application teams doing simple accuracy checks. Code Evaluators extend its reach to teams running complex agent workflows with bespoke scoring needs. The next pin in the sequence is likely real-time evaluation and guardrail enforcement, which would bring Phoenix into competition with safety and compliance platforms that currently operate in a separate market segment. That's a longer game, but the infrastructure being built now points in that direction.

The feature that will spread: evaluation as a first-class UI primitive

The specific implementation Phoenix chose, authoring eval functions in the product UI rather than in a separately deployed codebase, will likely be copied. Not immediately, and not universally, but the direction is clear.

Evaluation has lived in notebooks, in CI pipelines, in ad-hoc scripts maintained by whoever cared enough to write them. It has been a development-time activity, not a product-time one. The decision to make evaluation function authorship a UI action, available without a deployment step, is a statement about where evaluation belongs in the workflow. It belongs in the product, visible to anyone who uses the product, not hidden in engineering infrastructure that only some team members can access.

This matters for the composition of teams that can engage with evaluation. If writing an eval requires a Python environment and deployment access, the people who engage with eval are engineers. If writing an eval requires opening a browser and typing a function into a UI, the people who engage with eval expand to include anyone who understands the domain well enough to describe what good looks like. Domain experts, not just engineers, can define scoring criteria. That's a meaningful shift in who participates in the evaluation loop.

The pattern resembles what happened to analytics a decade ago, when BI tools moved query authorship from SQL developers to product managers and analysts. The SQL was still running underneath. The interface changed who could write it. Phoenix's Code Evaluators do the same thing for evaluation logic: the function still runs, but the UI removes the barrier between having an evaluation criterion and having a deployed evaluation criterion.

As more agent platforms recognize that evaluation is a retention and trust driver (teams that can measure their agents are teams that continue to use them), the UI-native eval authorship pattern will spread. Phoenix shipped it first in the agent observability category. It will not be the last.

Key Takeaways

Phoenix v16.0.0's Code Evaluators let teams write custom scoring functions directly in the UI and run them server-side, with no deployment step. The friction between wanting a bespoke evaluation criterion and having one running in production is now close to zero.
The feature is less about the scoring logic itself and more about who gets to define evaluation criteria. UI-native eval authorship opens the evaluation loop to domain experts, not just engineers.
The ecosystem is heavily invested in execution infrastructure and underinvested in evaluation. One release this week directly addressed whether agents are doing the right thing. The rest improved how reliably agents do things.
Server-side eval execution is a strategic move: teams that embed custom scoring logic into Phoenix deepen their dependency on Arize's platform, raising switching costs and strengthening the commercial funnel.
The gap Phoenix doesn't close is real-time, in-process evaluation for long-horizon agent workflows. Post-hoc scoring works for experiment cycles. It is insufficient for agents operating at the upper end of the Autonomy Spectrum.