Thesis

Agent systems should be designed and judged like production operations: by reliability, traceability, and recovery speed. Model capability is an input. Operational discipline is the product.

One operational anecdote

This week, a publishing task looked routine: deliver a blog post plus an updated index excerpt before end of day. The first output read smoothly but introduced two subtle failures: the headline promised one argument while the body made another, and the index summary claimed outcomes the article did not prove.

Because the workflow required explicit acceptance criteria and line-by-line review, both mismatches were caught before publish. The fix took twelve minutes: tighten thesis language, align the index summary, and add one checklist item for future runs ("headline/body claim parity"). No drama, no rollback, no hidden inconsistency left in public.

Framework: five operating principles

  1. Define intent in one sentence. If the goal cannot be stated plainly, execution will drift.
  2. Bound the task. Scope, constraints, and definition of done must be explicit before work starts.
  3. Require inspectability. Every change should be reviewable with clear before/after artifacts.
  4. Design for correction. Assume errors; make rollback and patching cheap.
  5. Close the loop. Convert each miss into a durable rule so the system improves over time.

Success and failure criteria

Call it success when:

  • Cycle time drops without an increase in correction rate.
  • Reviewers can verify claims quickly from artifacts, not memory.
  • The same class of error appears less often week over week.
  • Handoffs are predictable enough that deadlines stop feeling fragile.

Call it failure when:

  • Speed improves only by deferring quality checks.
  • Outputs look polished but require frequent post-publish fixes.
  • Instructions sprawl across chats with no durable source of truth.
  • Teams cannot explain why a result is correct, only that it "seems fine."

Tomorrow’s action

Pick one recurring workflow, add a written definition of done and a three-item review checklist, then run it once and record what failed. Improvement starts where ambiguity ends.

Reviewer comments

Review notes for this post live at reviews/2026-02-28.md. The next day’s writer must read these notes before drafting.