Agent Checkpoints for Reliable Runs

Most agent demos show one clean loop: receive goal, call tools, return answer. Production runs are messier. A browser session times out. A code edit works but tests fail. A tool returns partial data. A human approves one step and rejects another. If the agent cannot stop, explain where it is, and resume from known state, every failure becomes a full restart.

Checkpointing solves that problem. A checkpoint is a saved snapshot of an agent run at an important boundary. It does not make the model smarter. It makes the system easier to inspect, retry, debug, and govern.

Checkpointed agent run: save state before expensive or risky steps

What a checkpoint should capture

A useful checkpoint is more than chat history. It records enough context to answer: "What did the agent know, what did it decide, what changed, and what can happen next?"

Field	Purpose	Example
Run identity	Group events from same task	`run_id`, user, repo, environment
Step boundary	Resume from stable point	`planned`, `tool_started`, `tool_finished`, `awaiting_approval`
Inputs	Reconstruct decision context	prompt, retrieved docs, selected files
Plan	Compare intent with action	ordered steps, budgets, stop condition
Tool calls	Audit side effects	tool name, arguments, stdout, exit code
Artifacts	Preserve outputs	patch file, report, generated URL
Approval state	Keep humans in loop	approver, decision, comment, timestamp
Versions	Avoid stale resumes	model, prompt template, tool schema, code SHA

The goal is not to store everything forever. The goal is to store enough to resume safely and review later.

Place checkpoints at boundaries

Do not checkpoint after every token. Use boundaries where risk or cost changes.

Good checkpoint points:

After intake — task, user constraints, allowed tools, and definition of done.
After planning — proposed steps, assumptions, budgets, and required approvals.
Before side effects — file writes, API calls, deploys, emails, payments, or database changes.
After tool calls — command, arguments, result, error, and changed artifacts.
Before human approval — exact item under review and recommended decision.
After final verification — tests, eval result, generated artifact, and final status.

This gives you replay points without turning storage into noise.

Minimal checkpoint shape

Start with a plain JSON object. Add fields when real failures demand them.

{
  "run_id": "run_2026_06_06_001",
  "step_id": "tool_003",
  "status": "tool_finished",
  "created_at": "2026-06-06T09:00:00+03:00",
  "agent": {
    "model": "example-model",
    "prompt_version": "agent-review-v4"
  },
  "state": {
    "goal": "Review pull request for risky changes",
    "plan": ["inspect diff", "run tests", "write findings"],
    "budget": {"tool_calls_remaining": 8, "deadline_seconds": 900}
  },
  "tool_call": {
    "name": "run_tests",
    "arguments": {"command": "pytest -q"},
    "exit_code": 1,
    "summary": "2 tests failed in auth flow"
  },
  "next": {
    "allowed_actions": ["inspect_failure", "request_human_review"],
    "resume_from": "tool_003"
  }
}

Keep secrets out. Store references to sensitive material, not raw credentials. If a tool call needs access tokens, checkpoint the vault key name or permission scope, not the token value.

Resume rules matter more than storage

A checkpoint is only useful if the runner knows how to resume. Define rules before you need them.

Use these defaults:

Retry read-only steps automatically when error is transient and budget remains.
Never replay side effects blindly. A deploy, email, payment, file deletion, or database mutation needs idempotency key or human confirmation.
Re-check external state before resume. The world may have changed since checkpoint was written.
Invalidate checkpoints on incompatible versions. If tool schemas, prompts, or code changed, require review before resume.
Prefer forward recovery. If a file was edited and tests failed, continue from edited artifact instead of restarting from original prompt.

This is where many agent systems fail: they save state but treat resume as "run again." Reliable agents treat resume as a controlled transition.

Checkpoints, traces, and evals work together

OpenTelemetry describes traces as a way to follow work across services. Agent checkpoints are not a replacement for traces; they are the semantic state behind key spans. A trace can show that a tool call took 12 seconds and failed. A checkpoint can show the plan, arguments, approval state, artifact path, and safe next actions.

For evals, checkpoints give you comparable run records. Instead of scoring only final answers, you can inspect:

Did the agent ask for approval before risky action?
Did it stop when budget was exhausted?
Did it use allowed tools only?
Did it recover from a failed tool call?
Did it produce evidence for final claim?

Those questions matter more in production than a single success/failure label.

Implementation checklist

For a small builder project, implement checkpointing in this order:

Create runs and checkpoints tables or append-only JSONL files.
Assign every task a stable run_id.
Save checkpoints at intake, plan, tool result, approval, and final status.
Add idempotency keys to tools with side effects.
Build a show run <id> command for humans.
Add resume run <id> --from <checkpoint> with version checks.
Redact secrets before writing checkpoints.
Use checkpoint records in evals and incident reviews.

The first version can be simple. A directory of JSON files is enough for local coding agents. A database becomes useful when multiple workers, approvals, and dashboards need same state.

Common mistakes

Saving raw prompts but not tool outputs.
Saving tool outputs but not tool arguments.
Allowing resume after code or schema changed without warning.
Treating approval as a chat message instead of structured state.
Keeping huge logs in checkpoint rows instead of linking artifacts.
Forgetting retention rules for user data.

Practical rule

Checkpoint before anything expensive, irreversible, or hard to explain. If a human would ask "what happened here?" after failure, save enough state to answer before failure happens.

That small habit changes agent engineering. You stop treating runs as disposable conversations and start treating them as inspectable workflows.

END OF NOTE