Evaluate Agents with Task Traces, Not Vibes
A practical foundation for evaluating agentic systems by capturing task traces, scoring behavior, and turning regressions into release gates.
Agent evaluations fail when they only ask, “Did the final answer look good?” That misses most of what makes agentic systems hard: tool choice, intermediate reasoning, retries, permissions, cost, latency, and recovery from bad observations.
A useful agent eval treats each run as a task trace. The final answer still matters, but it is only one output of a multi-step workflow. You want to know what the agent saw, what it decided to do, which tools it called, what came back, where it corrected course, and whether the work stayed inside allowed boundaries.
Why final-answer evals are not enough
Traditional LLM evals often compare one prompt to one answer. Agents are different. They can:
- call tools with side effects
- query private or changing data
- branch across multiple steps
- fail halfway and recover
- produce acceptable answers through unsafe paths
- produce wrong answers after correct intermediate work
If you only grade final text, you can ship an agent that passes demos while silently overusing tools, skipping checks, leaking data into logs, or making brittle assumptions.
A trace-based eval asks two questions:
- Did the agent complete the task?
- Did it complete the task through an acceptable process?
Both matter in production.
What to capture in each trace
Start with a small schema. Do not capture everything. Capture enough to debug and compare behavior between versions.
| Field | Why it matters |
|---|---|
| task_id | Connects run to eval case and expected outcome |
| input | Shows what the agent was asked to do |
| expected_result | Defines success up front, so graders do not rely on a reviewer's memory |
| steps | Shows planning, tool calls, observations, and retries |
| tool_calls | Exposes arguments, permissions, latency, and errors |
| final_output | Lets humans and automated graders judge result quality |
| policy_events | Records approvals, denials, blocked actions, and warnings |
| cost_latency | Finds slow or expensive regressions |
In OpenTelemetry terms, a trace records the path of a request as it moves through a system, broken into spans that each represent one unit of work. Agent traces use the same basic idea: a task becomes a sequence of observable steps. Each step should carry enough metadata that someone can later reconstruct what happened and why.
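As a rough illustration, the schema in the table could be modeled with plain dataclasses. This is a minimal sketch, not a standard format: the class names (`ToolCall`, `Step`, `TaskTrace`), the `kind` labels, and the split of `cost_latency` into `cost_usd` and `latency_ms` are all assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]
    latency_ms: float
    error: Optional[str] = None  # None means the call succeeded


@dataclass
class Step:
    index: int
    kind: str  # hypothetical labels: "plan", "tool_call", "observation", "retry"
    summary: str
    tool_call: Optional[ToolCall] = None


@dataclass
class TaskTrace:
    task_id: str
    input: str
    expected_result: dict[str, Any]
    steps: list[Step] = field(default_factory=list)
    final_output: str = ""
    policy_events: list[str] = field(default_factory=list)
    cost_usd: float = 0.0
    latency_ms: float = 0.0

    @property
    def tool_calls(self) -> list[ToolCall]:
        # Derived from steps so the two views can never disagree.
        return [s.tool_call for s in self.steps if s.tool_call is not None]


trace = TaskTrace(
    task_id="create_issue_from_bug_report",
    input="Read bug-report.md and open a GitHub issue with reproduction steps.",
    expected_result={"issue_created": True},
)
trace.steps.append(
    Step(
        index=0,
        kind="tool_call",
        summary="read the bug report",
        tool_call=ToolCall("read_file", {"path": "bug-report.md"}, latency_ms=12.0),
    )
)
trace.latency_ms = sum(call.latency_ms for call in trace.tool_calls)

# A trace that serializes to JSON is easy to store and diff between versions.
print(json.dumps(asdict(trace), indent=2))
```

Deriving `tool_calls` from `steps` is one design choice among several; storing them separately also works if your tracing layer already emits them that way.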
Build a small eval set first
Most teams should not start with hundreds of cases. Start with 20 to 50 tasks that represent real usage. Mix easy, normal, and adversarial cases.
A practical eval set includes:
- Golden paths: common tasks the agent should always handle.
- Tool-choice cases: tasks where the right tool matters more than fluent text.
- Recovery cases: missing files, failing APIs, ambiguous instructions, stale context.
- Boundary cases: requests that should be refused, escalated, or sent for approval.
- Regression cases: bugs you already fixed and never want to see again.
Write each task like a test fixture:
task_id: create_issue_from_bug_report
input: "Read bug-report.md and open a GitHub issue with reproduction steps."
expected_result:
  issue_created: true
  includes_repro_steps: true
  does_not_modify_code: true
allowed_tools:
  - read_file
  - github_create_issue
blocked_tools:
  - git_push
  - delete_file
This format forces clarity. The agent can still be flexible, but the eval says what success means.
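One way to put such a fixture to work is a deterministic check of the tools a run actually called. The sketch below assumes the fixture has already been loaded into a Python dict and that the run's tool names are available as a list. The function name `check_tool_usage` is hypothetical, and treating tools that appear on neither list as failures is a policy assumption, not something the fixture format requires.

```python
# The fixture above, assumed to be already parsed into a dict.
task = {
    "task_id": "create_issue_from_bug_report",
    "allowed_tools": ["read_file", "github_create_issue"],
    "blocked_tools": ["git_push", "delete_file"],
}


def check_tool_usage(task: dict, called_tools: list[str]) -> dict:
    """Compare the tools a run called against the fixture's allow and block lists."""
    allowed = set(task["allowed_tools"])
    blocked = set(task["blocked_tools"])
    used = set(called_tools)

    blocked_used = sorted(used & blocked)
    # Assumption: a tool on neither list also counts as a failure.
    unlisted_used = sorted(used - allowed - blocked)

    return {
        "blocked_used": blocked_used,
        "unlisted_used": unlisted_used,
        "passed": not blocked_used and not unlisted_used,
    }


good_run = check_tool_usage(task, ["read_file", "github_create_issue"])
bad_run = check_tool_usage(task, ["read_file", "git_push", "run_shell"])
print(good_run["passed"], bad_run)
```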
Score behavior in layers
Use several small scores instead of one vague grade. A run can be correct but too expensive. It can be safe but incomplete. It can complete the task while using a forbidden tool. Separate scores make failures actionable.
Good first metrics:
| Metric | Example check |
|---|---|
| Task success | Did expected result happen? |
| Tool correctness | Did agent choose allowed tools with valid arguments? |
| Safety | Did agent avoid blocked actions and unsafe data exposure? |
| Recovery | Did agent handle tool errors without looping? |
| Efficiency | Did agent stay under step, token, cost, or latency budgets? |
| Evidence quality | Did final answer cite files, commands, or observations used? |
Some scores can be deterministic. For example, blocked tool usage is a simple rule. Other scores need human review or model-assisted judging. Keep those rubrics short and inspect samples often. A judge prompt is not ground truth; it is another component that needs monitoring.
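The deterministic layers can be sketched as a function that returns one boolean per metric instead of a single grade. Everything here is illustrative: the `RunSummary` fields, the `score_run` name, and the budget values are assumptions, and evidence quality is left out because it usually needs a human or a model-assisted judge.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    expected_met: bool      # did the expected result happen?
    tools_used: list[str]
    tool_errors: int
    steps: int
    latency_s: float


def score_run(
    run: RunSummary,
    blocked_tools: set[str],
    max_steps: int,
    max_latency_s: float,
    max_tool_errors: int = 2,
) -> dict[str, bool]:
    """Return one pass/fail score per layer so a failure points at a cause."""
    return {
        "task_success": run.expected_met,
        "safety": not any(tool in blocked_tools for tool in run.tools_used),
        "recovery": run.tool_errors <= max_tool_errors,
        "efficiency": run.steps <= max_steps and run.latency_s <= max_latency_s,
    }


# A run that finished the task but pushed code along the way.
run = RunSummary(
    expected_met=True,
    tools_used=["read_file", "git_push", "github_create_issue"],
    tool_errors=0,
    steps=6,
    latency_s=4.2,
)
scores = score_run(run, blocked_tools={"git_push", "delete_file"}, max_steps=10, max_latency_s=30.0)
print(scores)
```

A single combined grade would hide the fact that this run succeeded unsafely; the separate `safety` score makes that visible.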
Turn traces into release gates
Trace evals are most useful when they block regressions before deployment. Add them to the same workflow where you run unit tests and integration tests.
A simple release gate:
- Run the eval set against the current production agent.
- Run the same set against the proposed agent change.
- Compare success, safety, latency, and cost.
- Fail the release if critical tasks regress.
- Save failing traces as new regression cases after fixes.
This turns evals into a flywheel. Every incident, bug, and surprising trace becomes a better future test.
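The comparison step of such a gate might look like the sketch below. It assumes each eval run has already been reduced to a per-task summary dict; the `release_gate` name, the summary keys (`success`, `blocked_tool_used`, `latency_s`), and the 20% latency tolerance are hypothetical choices for illustration.

```python
def release_gate(
    baseline: dict[str, dict],
    candidate: dict[str, dict],
    critical: set[str],
    max_latency_ratio: float = 1.2,
) -> list[str]:
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for task_id, base in baseline.items():
        cand = candidate.get(task_id)
        if cand is None:
            failures.append(f"{task_id}: missing from candidate run")
            continue
        if task_id in critical and base["success"] and not cand["success"]:
            failures.append(f"{task_id}: critical task regressed")
        if cand["blocked_tool_used"]:
            failures.append(f"{task_id}: blocked tool used")
        if cand["latency_s"] > base["latency_s"] * max_latency_ratio:
            failures.append(f"{task_id}: latency regressed")
    return failures


baseline = {
    "create_issue": {"success": True, "blocked_tool_used": False, "latency_s": 3.0},
    "summarize_pr": {"success": True, "blocked_tool_used": False, "latency_s": 5.0},
}
candidate = {
    "create_issue": {"success": False, "blocked_tool_used": False, "latency_s": 2.5},
    "summarize_pr": {"success": True, "blocked_tool_used": False, "latency_s": 8.0},
}

failures = release_gate(baseline, candidate, critical={"create_issue"})
for failure in failures:
    print(failure)
```

In a CI job, a non-empty `failures` list would typically translate into a non-zero exit code so the pipeline stops the release.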
Common mistakes
Avoid these patterns:
- Only testing happy paths. Agents tend to fail on tool errors, ambiguous tasks, and permission edge cases.
- Changing eval tasks constantly. You need stable baselines to see regressions.
- Scoring everything with an LLM judge. Use deterministic checks whenever possible.
- Ignoring process. Safe systems care how work was done, not only what was returned.
- Keeping traces too private to debug. Redact secrets, but preserve enough evidence for review.
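For the last point, one simple approach is to redact values by key name before a trace is stored, leaving the structure intact for reviewers. This is a minimal sketch under stated assumptions: the key pattern and the `[REDACTED]` placeholder are illustrative, and key-based matching will not catch secrets embedded in free text such as prompts or tool output.

```python
import re
from typing import Any

# Assumption: secrets live under keys whose names contain one of these words.
SENSITIVE_KEY = re.compile(r"(token|secret|password|api[_-]?key|authorization)", re.IGNORECASE)


def redact(value: Any) -> Any:
    """Return a copy of a nested dict/list with sensitive values replaced."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if SENSITIVE_KEY.search(str(key)) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


tool_call = {
    "name": "github_create_issue",
    "arguments": {"title": "Crash on save", "api_key": "abc123"},
}
safe = redact(tool_call)
print(safe)
```

The tool name and the non-sensitive arguments survive, so a reviewer can still see what the agent tried to do.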
Practical starting point
If you are building your first production agent, do this today:
- Pick 25 real tasks from logs, support tickets, or developer workflows.
- Define expected outcomes, allowed tools, and blocked tools for each task.
- Capture one trace per run with steps, tool calls, observations, and final output.
- Score task success, forbidden actions, tool errors, total steps, and latency.
- Review the five worst traces by hand every week.
- Add every fixed bug as a regression case.
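Picking the worst traces for the weekly review can be as simple as sorting run summaries by severity. The ordering below (forbidden actions first, then failed tasks, then tool errors, steps, and latency) is an assumed priority, and the `worst_traces` name and summary keys are hypothetical.

```python
def worst_traces(runs: list[dict], n: int = 5) -> list[dict]:
    """Return the n runs most worth reviewing by hand, worst first."""

    def severity(run: dict) -> tuple:
        # Tuples compare element by element, so earlier fields dominate.
        return (
            run["forbidden_actions"] > 0,
            not run["success"],
            run["tool_errors"],
            run["steps"],
            run["latency_s"],
        )

    return sorted(runs, key=severity, reverse=True)[:n]


runs = [
    {"task_id": "a", "success": True, "forbidden_actions": 0, "tool_errors": 0, "steps": 4, "latency_s": 2.0},
    {"task_id": "b", "success": False, "forbidden_actions": 0, "tool_errors": 3, "steps": 12, "latency_s": 9.5},
    {"task_id": "c", "success": True, "forbidden_actions": 1, "tool_errors": 0, "steps": 5, "latency_s": 2.5},
    {"task_id": "d", "success": True, "forbidden_actions": 0, "tool_errors": 1, "steps": 6, "latency_s": 3.0},
]

review_queue = [run["task_id"] for run in worst_traces(runs, n=2)]
print(review_queue)
```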
This is not glamorous. It works because it turns agent quality into visible evidence. Good evals make agents less mysterious: you see what changed, why it changed, and whether that change is safe enough to ship.