Building Evaluation Frameworks for Agent Systems

ReactorJet · 2026

Most teams building agent systems don't have an evaluation framework. They have a collection of spot checks: a few saved prompts they run manually, some eyeball comparisons between model versions, maybe a spreadsheet tracking which configuration "felt better" on a handful of cases. This works until it doesn't, which is usually around the time you need to make a decision that actually matters: which model to deploy, whether a prompt change helped or regressed, whether your retrieval pipeline is pulling the right context.

Evaluation is the highest-leverage investment in any agent project. Not because scoring is intrinsically interesting, but because without it you can't iterate with confidence. Every other decision downstream depends on your ability to measure whether things got better or worse.

The four dimensions

Agent output is multidimensional. A response can be correct but unsafe (accurate answer that leaks PII). It can be safe but wasteful (harmless non-answer that burned 40k tokens). It can be efficient but useless (fast wrong answer). Collapsing all of this into a single score obscures more than it reveals.

We evaluate across four orthogonal dimensions:

Useful. Does the output serve the task? Did it answer what was asked, complete what was requested, produce something the user can act on? This is the most intuitive dimension but also the one teams over-index on at the expense of the other three.

Safe. Does the output avoid harm? No data loss, no security holes, no broken state, no leaked credentials. Safety signals are binary in failure mode: a 99% safe agent that occasionally drops a database table is not a safe agent.

Efficient. Does it minimize waste? Token usage relative to task complexity, number of tool calls, redundant operations, error recovery overhead. Efficiency matters because it's a proxy for cost and latency at scale.

Valuable. Does it create real value for the specific context? An agent that produces a technically correct but contextually useless response (answering a question nobody asked, over-engineering a simple request) scores well on Useful but poorly on Valuable. This dimension captures the gap between "correct" and "worth the resources spent."

These four dimensions apply regardless of domain. A coding agent, a research agent, and a customer support agent all need to be useful, safe, efficient, and valuable. What changes is the set of signals within each dimension.

Signals, not scores

Within each dimension, evaluation is decomposed into signals: specific, observable, measurable things you can check. A signal like retrieval_precision (did the agent pull relevant context?) belongs to the Useful dimension. no_data_loss (did it avoid overwriting user data?) belongs to Safe. token_efficiency (tokens used relative to task complexity) belongs to Efficient.

Signals have three properties that matter:

Weight. Not all signals are equally important. no_data_loss should dominate the Safe dimension for a coding agent. retrieval_precision might matter more than output_format_compliance for a research agent. Weights encode domain priorities.

Scoring method. Some signals can be scored by heuristic (did the agent use fewer than N tool calls?). Some require an LLM judge (was the reasoning chain coherent?). Some benefit from both. The framework should support all three: heuristic-only, judge-only, and hybrid scoring where the heuristic provides a baseline and the judge refines it.

Domain specificity. A base set of signals applies everywhere, but each domain adds its own. Memory evaluation adds retrieval_precision, temporal_reasoning, selective_forgetting, and memory_poisoning_detection. Multi-agent coordination adds handoff_quality, parallelism_ratio, and consensus_convergence. The evaluation framework needs to be extensible without becoming a configuration nightmare.

Rubric scoring vs. LLM-as-judge

There are two broad approaches to automated agent evaluation, and you need both.

Rubric scoring is deterministic. You define expected outcomes, expected tool sequences, efficiency thresholds, and score against them. Did the agent produce the expected output? Did it follow the expected process? Did it stay within resource bounds? Rubric scoring is fast, reproducible, and covers the cases where you know what "good" looks like in advance.

The limitation is obvious: you can't write a rubric for everything. "Was the agent's explanation clear and helpful?" requires judgment. "Did the agent handle an ambiguous request gracefully?" requires understanding intent. These are real quality dimensions that matter in production.

LLM-as-judge fills this gap. You give a judge model the agent's trace (input, reasoning, tool calls, output) and a scoring rubric, and ask it to evaluate. The judge prompt for a signal like memory_informed_decisions might be: "Did the agent's decisions improve because of memory retrieval? Compare reasoning before and after memory reads."

The strengths and weaknesses are complementary. Rubric scoring is cheap and deterministic but can't assess subjective quality. Judge scoring handles nuance but is expensive, non-deterministic, and subject to its own biases. A production evaluation framework runs both and uses rubric scores as the foundation with judge scores layered on top for the signals that require them.

Comparing configurations

Evaluation gets interesting when you use it to compare configurations. A configuration is anything you can vary: model, temperature, system prompt, retrieval strategy, tool definitions, agent architecture. The question is always the same: is configuration A better than configuration B, and by how much?

Pairwise comparison with Bradley-Terry ranking handles this well. You run both configurations against the same task set, compare their scores dimension by dimension, and compute a ranking with confidence intervals. The confidence interval matters: a configuration that's 2% better on average with wide confidence intervals isn't meaningfully better. You need enough signal to distinguish real improvement from noise.

This is where most teams get stuck. They compare two prompts on five examples, eyeball the results, and ship whichever "looked better." That's not evaluation; that's hope. Proper comparison requires enough task coverage to be statistically meaningful, and the framework needs to tell you when your sample size isn't large enough to draw conclusions.

Specialized evaluation domains

Generic evaluation dimensions cover the common case, but some agent capabilities require purpose-built evaluation:

Memory. Agents that persist state across interactions need to be evaluated on retrieval precision, temporal reasoning (does it understand that newer information supersedes older?), selective forgetting (does it expire irrelevant context?), and poisoning resistance (does it reject injected false memories?). These signals don't exist in a generic framework. You have to build them.

Context management. Agents operating under context budget constraints need evaluation on allocation efficiency, compression quality (when summarizing to fit, do they preserve what matters?), and source prioritization. This is increasingly important as agents chain longer workflows and context windows fill up.

Multi-agent coordination. Systems where multiple agents collaborate need evaluation on handoff quality, parallelism efficiency, message overhead (how much coordination chatter relative to useful work?), and consensus convergence. Each domain has its own data model, signals, and scoring logic, but they all feed into the same four dimensions.

Making evaluation iterative

The real value of an evaluation framework isn't the scores. It's the iteration loop: change something, measure whether it helped, decide whether to keep the change. This only works if evaluation is cheap enough to run frequently and comprehensive enough to catch regressions.

Task coverage matters. If your evaluation suite only tests the happy path, you won't catch the regression that breaks edge cases. Coverage analysis should identify which capabilities are well-tested and which have gaps, then generate new tasks to fill them. Mutation-based task generation (take an existing task, increase difficulty, change domain, add constraints, introduce adversarial elements) can expand coverage systematically without requiring manual task authoring for every case.

Regression detection matters. When you run evaluation daily across evolving configurations, you need automated alerts when a dimension score drops. Not just "it went down" but "it went down more than normal variance, and here's which signals degraded." This turns evaluation from a one-off comparison into continuous quality monitoring.

What we ship

Our evaluation framework, agent-eval, implements all of this: four dimensions with extensible signal sets, rubric and LLM-as-judge scoring, Bradley-Terry configuration ranking, specialized modules for memory and coordination evaluation, mutation-based task generation, and regression detection. It's open source because evaluation methodology shouldn't be proprietary. The value is in how you apply it to your specific system, not in the scoring machinery itself.

If you're building agent systems and don't have evaluation infrastructure yet, that's the first thing to fix. Everything else gets easier once you can measure whether your changes are improvements.