Production · May 25, 2026 · 4 min read

Evaluate Agents with Task Traces, Not Vibes

A practical foundation for evaluating agentic systems by capturing task traces, scoring behavior, and turning regressions into release gates.

Tags: evals, agents, observability, production

Agent evaluations fail when they only ask, “Did the final answer look good?” That misses most of what makes agentic systems hard: tool choice, intermediate reasoning, retries, permissions, cost, latency, and recovery from bad observations.

A useful agent eval treats each run as a task trace. The final answer still matters, but it is only one output of a multi-step workflow. You want to know what the agent saw, what it decided to do, which tools it called, what came back, where it corrected course, and whether the work stayed inside allowed boundaries.

[Figure: Task-trace evaluation loop: task set, agent run, trace capture, scoring, and regression gate]

Why final-answer evals are not enough

Traditional LLM evals often compare one prompt to one answer. Agents are different. They can:

  • call tools with side effects
  • query private or changing data
  • branch across multiple steps
  • fail halfway and recover
  • produce acceptable answers through unsafe paths
  • produce wrong answers after correct intermediate work

If you only grade final text, you can ship an agent that passes demos while silently overusing tools, skipping checks, leaking data into logs, or making brittle assumptions.

A trace-based eval asks two questions:

  1. Did the agent complete the task?
  2. Did it complete the task through an acceptable process?

Both matter in production.

What to capture in each trace

Start with a small schema. Do not capture everything. Capture enough to debug and compare behavior between versions.

Field             Why it matters
task_id           Connects run to eval case and expected outcome
input             Shows what the agent was asked to do
expected_result   Defines success without relying on memory
steps             Shows planning, tool calls, observations, and retries
tool_calls        Exposes arguments, permissions, latency, and errors
final_output      Lets humans and automated graders judge result quality
policy_events     Records approvals, denials, blocked actions, and warnings
cost_latency      Finds slow or expensive regressions

In OpenTelemetry-style thinking, a trace is a record of work as it moves through a system. Agent traces use the same basic idea: a task becomes a sequence of observable steps. Each step should carry enough metadata that you can explain later what happened.
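One way to pin the schema down is a few plain dataclasses. This is a minimal sketch, not a standard format: the class names (`ToolCall`, `Step`, `TaskTrace`), the `kind` labels, and the split of cost_latency into `cost_usd` and `latency_ms` are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]
    latency_ms: float
    error: Optional[str] = None  # None means the call succeeded


@dataclass
class Step:
    kind: str  # hypothetical labels: "plan", "tool_call", "observation", "retry"
    summary: str
    tool_call: Optional[ToolCall] = None


@dataclass
class TaskTrace:
    task_id: str
    input: str
    expected_result: dict[str, Any]
    steps: list[Step] = field(default_factory=list)
    final_output: str = ""
    policy_events: list[str] = field(default_factory=list)
    cost_usd: float = 0.0     # assumed unit for the cost half of cost_latency
    latency_ms: float = 0.0   # assumed unit for the latency half

    @property
    def tool_calls(self) -> list[ToolCall]:
        # Derived from steps so the two views cannot drift apart.
        return [s.tool_call for s in self.steps if s.tool_call is not None]


trace = TaskTrace(
    task_id="create_issue_from_bug_report",
    input="Read bug-report.md and open a GitHub issue with reproduction steps.",
    expected_result={"issue_created": True},
)
trace.steps.append(
    Step(
        kind="tool_call",
        summary="read the bug report",
        tool_call=ToolCall("read_file", {"path": "bug-report.md"}, latency_ms=12.0),
    )
)

# Traces serialize to JSON, so they can be stored and diffed between versions.
print(json.dumps(asdict(trace), indent=2))
```

Deriving `tool_calls` from `steps` is a design choice in this sketch; storing both separately would also work, at the cost of keeping them consistent.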

Build a small eval set first

Most teams should not start with hundreds of cases. Start with 20 to 50 tasks that represent real usage. Mix easy, normal, and adversarial cases.

A practical eval set includes:

  • Golden paths: common tasks the agent should always handle.
  • Tool-choice cases: tasks where the right tool matters more than fluent text.
  • Recovery cases: missing files, failing APIs, ambiguous instructions, stale context.
  • Boundary cases: requests that should be refused, escalated, or sent for approval.
  • Regression cases: bugs you already fixed and never want to see again.

Write each task like a test fixture:

task_id: create_issue_from_bug_report
input: "Read bug-report.md and open a GitHub issue with reproduction steps."
expected_result:
  issue_created: true
  includes_repro_steps: true
  does_not_modify_code: true
allowed_tools:
  - read_file
  - github_create_issue
blocked_tools:
  - git_push
  - delete_file

This format forces clarity. The agent can still be flexible, but the eval says what success means.
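Fixtures like this are easy to lint before any agent runs. The sketch below assumes the fixture has already been loaded into a Python dict (for example by a YAML parser); `validate_task` and `REQUIRED_KEYS` are hypothetical names, and the checks are only a starting set.

```python
REQUIRED_KEYS = {"task_id", "input", "expected_result", "allowed_tools", "blocked_tools"}


def validate_task(task: dict) -> list[str]:
    """Return a list of problems with a task fixture; empty means it looks usable."""
    problems = []
    missing = REQUIRED_KEYS - task.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    # A tool listed as both allowed and blocked makes the eval ambiguous.
    overlap = set(task.get("allowed_tools", [])) & set(task.get("blocked_tools", []))
    if overlap:
        problems.append(f"tools both allowed and blocked: {sorted(overlap)}")
    # An empty expected_result means nothing defines success.
    if "expected_result" in task and not task["expected_result"]:
        problems.append("expected_result is empty")
    return problems


task = {
    "task_id": "create_issue_from_bug_report",
    "input": "Read bug-report.md and open a GitHub issue with reproduction steps.",
    "expected_result": {
        "issue_created": True,
        "includes_repro_steps": True,
        "does_not_modify_code": True,
    },
    "allowed_tools": ["read_file", "github_create_issue"],
    "blocked_tools": ["git_push", "delete_file"],
}

print(validate_task(task))
```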

Score behavior in layers

Use several small scores instead of one vague grade. A run can be correct but too expensive. It can be safe but incomplete. It can complete the task while using a forbidden tool. Separate scores make failures actionable.

Good first metrics:

Metric             Example check
Task success       Did the expected result happen?
Tool correctness   Did the agent choose allowed tools with valid arguments?
Safety             Did the agent avoid blocked actions and unsafe data exposure?
Recovery           Did the agent handle tool errors without looping?
Efficiency         Did the agent stay under step, token, cost, or latency budgets?
Evidence quality   Did the final answer cite the files, commands, or observations used?

Some scores can be deterministic. For example, blocked tool usage is a simple rule. Other scores need human review or model-assisted judging. Keep those rubrics short and inspect samples often. A judge prompt is not ground truth; it is another component that needs monitoring.
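The deterministic layers can be sketched as one function that returns a separate boolean per metric. Everything here is an assumption for illustration: the `run` dict shape (`outcome`, `step_count`, `tool_calls`), the budgets, and the looping heuristic (the same failing call repeated `max_repeats` times). Evidence quality is left out because it usually needs a human or model-assisted judge.

```python
import json
from collections import Counter


def score_run(run: dict, task: dict, max_steps: int = 20, max_repeats: int = 3) -> dict:
    """Score one run against its task with rule-based checks only."""
    calls = run["tool_calls"]
    names = [c["name"] for c in calls]
    allowed = set(task["allowed_tools"])
    blocked = set(task["blocked_tools"])

    # Count identical failing calls (same tool, same arguments) to detect loops.
    failed = Counter(
        (c["name"], json.dumps(c["arguments"], sort_keys=True))
        for c in calls
        if c.get("error")
    )

    return {
        "task_success": all(
            run["outcome"].get(key) == value
            for key, value in task["expected_result"].items()
        ),
        "tool_correctness": all(name in allowed for name in names),
        "safety": not any(name in blocked for name in names),
        "recovery": all(count < max_repeats for count in failed.values()),
        "efficiency": run["step_count"] <= max_steps,
    }


task = {
    "expected_result": {"issue_created": True},
    "allowed_tools": ["read_file", "github_create_issue"],
    "blocked_tools": ["git_push", "delete_file"],
}
run = {
    "outcome": {"issue_created": True},
    "step_count": 4,
    "tool_calls": [
        {"name": "read_file", "arguments": {"path": "bug-report.md"}},
        {"name": "github_create_issue", "arguments": {"title": "Crash on save"}},
        {"name": "git_push", "arguments": {}},  # forbidden for this task
    ],
}

scores = score_run(run, task)
print(scores)
```

In this example the run completes the task but fails both tool correctness and safety, which is the kind of result a single overall grade would hide.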

Turn traces into release gates

Trace evals are most useful when they block regressions before deployment. Add them to the same workflow where you run unit tests and integration tests.

A simple release gate:

  1. Run the eval set against the current production agent.
  2. Run the same set against the proposed agent change.
  3. Compare success, safety, latency, and cost.
  4. Fail the release if critical tasks regress.
  5. Save failing traces as new regression cases after fixes.

This turns evals into a flywheel. Every incident, bug, and surprising trace becomes a better future test.
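The comparison step of such a gate might look like the sketch below. The per-task result shape (`success`, `safe`, `latency_ms`), the `release_gate` name, and the 20% latency allowance are assumptions; real thresholds depend on your product. A cost check would follow the same pattern as the latency check.

```python
from statistics import mean


def release_gate(baseline: dict, candidate: dict, critical_tasks: set,
                 max_slowdown: float = 1.2) -> list[str]:
    """Compare per-task results and return reasons to block the release."""
    failures = []
    for task_id in sorted(baseline):
        old = baseline[task_id]
        new = candidate.get(task_id)
        if new is None:
            failures.append(f"{task_id}: missing from candidate run")
            continue
        if task_id in critical_tasks and old["success"] and not new["success"]:
            failures.append(f"{task_id}: critical task regressed")
        if old["safe"] and not new["safe"]:
            failures.append(f"{task_id}: safety regressed")

    # Compare mean latency only over tasks present in both runs.
    shared = [t for t in baseline if t in candidate]
    if shared:
        old_latency = mean(baseline[t]["latency_ms"] for t in shared)
        new_latency = mean(candidate[t]["latency_ms"] for t in shared)
        if new_latency > old_latency * max_slowdown:
            failures.append(
                f"mean latency rose from {old_latency:.0f} ms to {new_latency:.0f} ms"
            )
    return failures


baseline = {
    "create_issue": {"success": True, "safe": True, "latency_ms": 1000},
    "summarize_log": {"success": True, "safe": True, "latency_ms": 2000},
}
candidate = {
    "create_issue": {"success": False, "safe": True, "latency_ms": 1100},
    "summarize_log": {"success": True, "safe": True, "latency_ms": 2100},
}

failures = release_gate(baseline, candidate, critical_tasks={"create_issue"})
print(failures)
```

In CI, exiting with a nonzero status whenever the returned list is non-empty is one simple way to fail the build.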

Common mistakes

Avoid these patterns:

  • Only testing happy paths. Agents fail on tool errors, ambiguous tasks, and permission edge cases.
  • Changing eval tasks constantly. You need stable baselines to see regressions.
  • Scoring everything with an LLM judge. Use deterministic checks whenever possible.
  • Ignoring process. Safe systems care how work was done, not only what was returned.
  • Keeping traces too private to debug. Redact secrets, but preserve enough evidence for review.

Practical starting point

If you are building your first production agent, do this today:

  1. Pick 25 real tasks from logs, support tickets, or developer workflows.
  2. Define expected outcomes, allowed tools, and blocked tools for each task.
  3. Capture one trace per run with steps, tool calls, observations, and final output.
  4. Score task success, forbidden actions, tool errors, total steps, and latency.
  5. Review the five worst traces by hand every week.
  6. Add every fixed bug as a regression case.
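Picking the worst traces for the weekly review can be a simple sort. The ordering below (forbidden actions first, then failed tasks, then tool errors, steps, and latency) and the field names are assumptions; adjust the priority to match what hurts most in your system.

```python
def badness(trace: dict) -> tuple:
    """Sort key: larger tuples mean worse runs. The priority order is an assumption."""
    return (
        trace["forbidden_actions"] > 0,  # safety problems rank first
        not trace["success"],            # then failed tasks
        trace["tool_errors"],
        trace["steps"],
        trace["latency_ms"],
    )


def worst_traces(traces: list[dict], top_n: int = 5) -> list[dict]:
    """Return the top_n traces most worth reviewing by hand."""
    return sorted(traces, key=badness, reverse=True)[:top_n]


traces = [
    {"task_id": "a", "success": True, "forbidden_actions": 0,
     "tool_errors": 0, "steps": 4, "latency_ms": 900},
    {"task_id": "b", "success": False, "forbidden_actions": 0,
     "tool_errors": 2, "steps": 9, "latency_ms": 4000},
    {"task_id": "c", "success": True, "forbidden_actions": 1,
     "tool_errors": 0, "steps": 5, "latency_ms": 1200},
]

for trace in worst_traces(traces):
    print(trace["task_id"], badness(trace))
```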

This is not glamorous. It works because it turns agent quality into visible evidence. Good evals make agents less mysterious: you see what changed, why it changed, and whether that change is safe enough to ship.
