Agent Checkpoints for Reliable Runs
A practical foundation for adding checkpoints to agent runs so builders can inspect decisions, resume safely, retry failed steps, and keep human approval in control.
Most agent demos show one clean loop: receive goal, call tools, return answer. Production runs are messier. A browser session times out. A code edit works but tests fail. A tool returns partial data. A human approves one step and rejects another. If the agent cannot stop, explain where it is, and resume from known state, every failure becomes a full restart.
Checkpointing solves that problem. A checkpoint is a saved snapshot of an agent run at an important boundary. It does not make the model smarter. It makes the system easier to inspect, retry, debug, and govern.
What a checkpoint should capture
A useful checkpoint is more than chat history. It records enough context to answer: "What did the agent know, what did it decide, what changed, and what can happen next?"
| Field | Purpose | Example |
|---|---|---|
| Run identity | Group events from same task | run_id, user, repo, environment |
| Step boundary | Resume from stable point | planned, tool_started, tool_finished, awaiting_approval |
| Inputs | Reconstruct decision context | prompt, retrieved docs, selected files |
| Plan | Compare intent with action | ordered steps, budgets, stop condition |
| Tool calls | Audit side effects | tool name, arguments, stdout, exit code |
| Artifacts | Preserve outputs | patch file, report, generated URL |
| Approval state | Keep humans in loop | approver, decision, comment, timestamp |
| Versions | Avoid stale resumes | model, prompt template, tool schema, code SHA |
The goal is not to store everything forever. The goal is to store enough to resume safely and review later.
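One way to keep these fields together is a small typed record. The sketch below is illustrative only: the class name `Checkpoint` and the exact attribute names are assumptions chosen to mirror the table, not a required schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Checkpoint:
    """Hypothetical record mirroring the fields in the table above."""

    run_id: str                      # run identity
    step_id: str                     # step boundary
    status: str                      # e.g. "planned", "tool_finished"
    inputs: Dict[str, Any] = field(default_factory=dict)
    plan: List[str] = field(default_factory=list)
    tool_call: Optional[Dict[str, Any]] = None
    artifacts: List[str] = field(default_factory=list)    # paths or URLs
    approval: Optional[Dict[str, Any]] = None
    versions: Dict[str, str] = field(default_factory=dict)


cp = Checkpoint(
    run_id="run_001",
    step_id="plan_001",
    status="planned",
    plan=["inspect diff", "run tests", "write findings"],
    versions={"model": "example-model", "prompt_version": "agent-review-v4"},
)

# asdict() gives a plain dict that can be serialized as JSON.
record = asdict(cp)
print(record["status"], record["plan"])
```

Optional fields default to empty values, so early checkpoints (intake, plan) can omit tool and approval data without special cases.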
Place checkpoints at boundaries
Do not checkpoint after every token. Use boundaries where risk or cost changes.
Good checkpoint points:
- After intake — task, user constraints, allowed tools, and definition of done.
- After planning — proposed steps, assumptions, budgets, and required approvals.
- Before side effects — file writes, API calls, deploys, emails, payments, or database changes.
- After tool calls — command, arguments, result, error, and changed artifacts.
- Before human approval — exact item under review and recommended decision.
- After final verification — tests, eval result, generated artifact, and final status.
This gives you replay points without turning storage into noise.
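As a rough sketch of the idea, a runner can filter its event stream against a fixed set of boundary events. The event names below are hypothetical labels for the boundaries listed above; a real runner would use whatever events it already emits.

```python
# Hypothetical event names, one per boundary in the list above.
CHECKPOINT_BOUNDARIES = {
    "intake_done",
    "plan_done",
    "before_side_effect",
    "tool_finished",
    "awaiting_approval",
    "verified",
}


def should_checkpoint(event: str) -> bool:
    """Return True only for events that mark a stable boundary."""
    return event in CHECKPOINT_BOUNDARIES


# A run emits many events; only the boundary events become checkpoints.
events = ["intake_done", "token", "token", "plan_done", "token", "tool_finished"]
saved = [e for e in events if should_checkpoint(e)]
print(saved)
```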
Minimal checkpoint shape
Start with a plain JSON object. Add fields when real failures demand them.
{
  "run_id": "run_2026_06_06_001",
  "step_id": "tool_003",
  "status": "tool_finished",
  "created_at": "2026-06-06T09:00:00+03:00",
  "agent": {
    "model": "example-model",
    "prompt_version": "agent-review-v4"
  },
  "state": {
    "goal": "Review pull request for risky changes",
    "plan": ["inspect diff", "run tests", "write findings"],
    "budget": {"tool_calls_remaining": 8, "deadline_seconds": 900}
  },
  "tool_call": {
    "name": "run_tests",
    "arguments": {"command": "pytest -q"},
    "exit_code": 1,
    "summary": "2 tests failed in auth flow"
  },
  "next": {
    "allowed_actions": ["inspect_failure", "request_human_review"],
    "resume_from": "tool_003"
  }
}
Keep secrets out. Store references to sensitive material, not raw credentials. If a tool call needs access tokens, checkpoint the vault key name or permission scope, not the token value.
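A minimal redaction pass might look like the sketch below. It assumes secrets can be recognized by key name, which is a simplification; the `SENSITIVE_KEYS` set, the `credential_ref` field, and the tool name are all hypothetical.

```python
import json

# Assumption: these key names mark sensitive values. Adjust for your tools.
SENSITIVE_KEYS = {"token", "password", "api_key", "secret", "authorization"}


def redact(value):
    """Return a copy of value with sensitive dict entries masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


checkpoint = {
    "run_id": "run_2026_06_06_001",
    "tool_call": {
        "name": "fetch_report",
        "arguments": {"url": "https://example.com/report", "api_key": "not-a-real-key"},
        # Store a reference to the credential, never the credential itself.
        "credential_ref": "vault:reports/read-only",
    },
}

line = json.dumps(redact(checkpoint), sort_keys=True)
print(line)
```

Key-name matching will miss secrets embedded in free text such as command output, so it works best alongside tools that avoid echoing credentials in the first place.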
Resume rules matter more than storage
A checkpoint is only useful if the runner knows how to resume. Define rules before you need them.
Use these defaults:
- Retry read-only steps automatically when the error is transient and budget remains.
- Never replay side effects blindly. A deploy, email, payment, file deletion, or database mutation needs an idempotency key or human confirmation.
- Re-check external state before resuming. The world may have changed since the checkpoint was written.
- Invalidate checkpoints on incompatible versions. If tool schemas, prompts, or code changed, require review before resuming.
- Prefer forward recovery. If a file was edited and tests failed, continue from the edited artifact instead of restarting from the original prompt.
This is where many agent systems fail: they save state but treat resume as "run again." Reliable agents treat resume as a controlled transition.
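The defaults above can be expressed as a small decision function. This is a sketch under stated assumptions: the `ResumeRequest` fields and the returned labels are invented for illustration, and treating a non-transient read-only failure as `needs_review` is a choice of this example, not a rule from the list.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ResumeRequest:
    """Hypothetical summary of a checkpoint at resume time."""

    side_effect: bool
    transient_error: bool
    budget_remaining: int
    has_idempotency_key: bool
    human_confirmed: bool
    saved_versions: Dict[str, str]
    current_versions: Dict[str, str]


def decide_resume(req: ResumeRequest) -> str:
    # Version drift invalidates the checkpoint until someone reviews it.
    if req.saved_versions != req.current_versions:
        return "needs_review"
    # Side effects are never replayed without a key or a confirmation.
    if req.side_effect:
        if req.has_idempotency_key or req.human_confirmed:
            return "resume"
        return "needs_confirmation"
    # Read-only steps retry automatically on transient errors within budget.
    if req.transient_error and req.budget_remaining > 0:
        return "retry"
    return "needs_review"


versions = {"model": "example-model", "tool_schema": "v2"}
read_only = ResumeRequest(False, True, 3, False, False, versions, versions)
deploy = ResumeRequest(True, True, 3, False, False, versions, versions)
print(decide_resume(read_only), decide_resume(deploy))
```

The function does not re-check external state; that step depends on the tool and would run before acting on a `resume` or `retry` result.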
Checkpoints, traces, and evals work together
OpenTelemetry describes traces as a way to follow work across services. Agent checkpoints are not a replacement for traces; they are the semantic state behind key spans. A trace can show that a tool call took 12 seconds and failed. A checkpoint can show the plan, arguments, approval state, artifact path, and safe next actions.
For evals, checkpoints give you comparable run records. Instead of scoring only final answers, you can inspect:
- Did the agent ask for approval before a risky action?
- Did it stop when the budget was exhausted?
- Did it use only allowed tools?
- Did it recover from a failed tool call?
- Did it produce evidence for its final claim?
Those questions matter more in production than a single success/failure label.
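Some of these questions can be turned into simple checks over a run's checkpoint list. The sketch below assumes each checkpoint is a dict with a `status` and an optional `tool_call`; the status values `approved` and `side_effect_started` are hypothetical.

```python
def approval_preceded_side_effects(checkpoints) -> bool:
    """Every side effect must follow its own approval."""
    approved = False
    for cp in checkpoints:
        if cp["status"] == "approved":
            approved = True
        elif cp["status"] == "side_effect_started":
            if not approved:
                return False
            approved = False  # one approval covers one risky action
    return True


def used_only_allowed_tools(checkpoints, allowed) -> bool:
    names = [cp["tool_call"]["name"] for cp in checkpoints if cp.get("tool_call")]
    return all(name in allowed for name in names)


def stayed_within_budget(checkpoints, max_tool_calls: int) -> bool:
    return sum(1 for cp in checkpoints if cp.get("tool_call")) <= max_tool_calls


run = [
    {"status": "planned"},
    {"status": "tool_finished", "tool_call": {"name": "run_tests"}},
    {"status": "approved"},
    {"status": "side_effect_started", "tool_call": {"name": "deploy"}},
]

report = {
    "approval_first": approval_preceded_side_effects(run),
    "allowed_tools": used_only_allowed_tools(run, {"run_tests", "deploy"}),
    "within_budget": stayed_within_budget(run, max_tool_calls=8),
}
print(report)
```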
Implementation checklist
For a small builder project, implement checkpointing in this order:
- Create `runs` and `checkpoints` tables or append-only JSONL files.
- Assign every task a stable `run_id`.
- Save checkpoints at intake, plan, tool result, approval, and final status.
- Add idempotency keys to tools with side effects.
- Build a `show run <id>` command for humans.
- Add `resume run <id> --from <checkpoint>` with version checks.
- Redact secrets before writing checkpoints.
- Use checkpoint records in evals and incident reviews.
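For the idempotency step, one possible approach is to derive a key from the run, step, and arguments, then skip execution when that key has already completed. The sketch below keeps results in memory for brevity; `SideEffectRunner` and `send_email` are hypothetical names, and a real runner would persist the keys alongside its checkpoints.

```python
import hashlib
import json


def idempotency_key(run_id: str, step_id: str, arguments: dict) -> str:
    """Derive a stable key; sort_keys makes argument order irrelevant."""
    payload = json.dumps(
        {"run_id": run_id, "step_id": step_id, "arguments": arguments},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SideEffectRunner:
    """Runs each keyed action at most once (in-memory sketch only)."""

    def __init__(self):
        self._results = {}

    def run(self, key: str, action):
        if key in self._results:
            return self._results[key]  # replay returns the stored result
        result = action()
        self._results[key] = result
        return result


sent = []


def send_email():
    sent.append("welcome")
    return "sent"


runner = SideEffectRunner()
key = idempotency_key("run_001", "tool_004", {"to": "user@example.com"})
first = runner.run(key, send_email)
second = runner.run(key, send_email)  # resumed run: no second email
print(first, second, len(sent))
```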
The first version can be simple. A directory of JSON files is enough for local coding agents. A database becomes useful when multiple workers, approvals, and dashboards need the same state.
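A directory-based store can be sketched in a few lines. The layout here (one folder per run, sequence-numbered files) and the function names `save_checkpoint` and `show_run` are assumptions for illustration, and the sequence numbering is only safe with a single writer per run.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(root: Path, checkpoint: dict) -> Path:
    """Write one checkpoint as a numbered JSON file under its run folder."""
    run_dir = root / checkpoint["run_id"]
    run_dir.mkdir(parents=True, exist_ok=True)
    seq = len(list(run_dir.glob("*.json"))) + 1  # single-writer assumption
    path = run_dir / f"{seq:04d}_{checkpoint['step_id']}.json"
    path.write_text(json.dumps(checkpoint, indent=2), encoding="utf-8")
    return path


def show_run(root: Path, run_id: str) -> list:
    """Return one summary line per checkpoint, in write order."""
    lines = []
    for path in sorted((root / run_id).glob("*.json")):
        cp = json.loads(path.read_text(encoding="utf-8"))
        lines.append(f"{cp['step_id']}: {cp['status']}")
    return lines


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    save_checkpoint(root, {"run_id": "run_001", "step_id": "intake_001", "status": "intake_done"})
    save_checkpoint(root, {"run_id": "run_001", "step_id": "tool_003", "status": "tool_finished"})
    lines = show_run(root, "run_001")

print("\n".join(lines))
```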
Common mistakes
- Saving raw prompts but not tool outputs.
- Saving tool outputs but not tool arguments.
- Allowing resume after code or schema changed without warning.
- Treating approval as a chat message instead of structured state.
- Keeping huge logs in checkpoint rows instead of linking artifacts.
- Forgetting retention rules for user data.
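To illustrate the approval point, the sketch below records a decision as structured state rather than free text. The `Approval` class and `record_approval` helper are hypothetical; the fields follow the approval row in the table earlier (approver, decision, comment, timestamp).

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

VALID_DECISIONS = {"approved", "rejected"}


@dataclass(frozen=True)
class Approval:
    step_id: str
    approver: str
    decision: str
    comment: str
    timestamp: str


def record_approval(checkpoint: dict, approver: str, decision: str, comment: str = "") -> dict:
    """Return a new checkpoint dict with the decision attached."""
    if decision not in VALID_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    approval = Approval(
        step_id=checkpoint["step_id"],
        approver=approver,
        decision=decision,
        comment=comment,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    # Copy rather than mutate, so the pending checkpoint stays on record.
    return {**checkpoint, "status": decision, "approval": asdict(approval)}


pending = {"run_id": "run_001", "step_id": "deploy_002", "status": "awaiting_approval"}
resolved = record_approval(pending, "alice", "approved", "Looks safe")
print(resolved["status"], resolved["approval"]["approver"])
```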
Practical rule
Checkpoint before anything expensive, irreversible, or hard to explain. If a human would ask "what happened here?" after a failure, save enough state to answer that question before the failure happens.
That small habit changes agent engineering. You stop treating runs as disposable conversations and start treating them as inspectable workflows.