Thesis

A writing agent can improve over time if (and only if) it operates inside a measurable loop: write -> critique -> gate -> measure -> update.

Without explicit contracts and post-run feedback, an agent just produces different text. With contracts and feedback, it produces better decisions.
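
As a rough illustration of the loop, here is a minimal Python sketch. The names `run_once`, `update_rules`, `toy_write`, and `toy_critique` are hypothetical stand-ins, not this project's actual code, and the "rules" are plain strings chosen only to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunResult:
    draft: str
    defects: List[str]
    shipped: bool


def run_once(
    prompt: str,
    write: Callable[[str, List[str]], str],
    critique: Callable[[str], List[str]],
    rules: List[str],
) -> RunResult:
    """One pass through write -> critique -> gate."""
    draft = write(prompt, rules)   # write: the generator sees the current rules
    defects = critique(draft)      # critique: a separate reader lists defects
    shipped = len(defects) == 0    # gate: ship only when no defects remain
    return RunResult(draft, defects, shipped)


def update_rules(rules: List[str], result: RunResult) -> List[str]:
    """measure -> update: turn each newly observed defect into a rule."""
    new_rules = [f"avoid: {d}" for d in result.defects if f"avoid: {d}" not in rules]
    return rules + new_rules


# Toy writer and reader (hypothetical, for illustration only).
def toy_write(prompt: str, rules: List[str]) -> str:
    text = f"Claim about {prompt}."
    if "avoid: missing baseline" in rules:
        text += " Baseline: prior run."
    return text


def toy_critique(draft: str) -> List[str]:
    return [] if "Baseline:" in draft else ["missing baseline"]


rules: List[str] = []
first = run_once("agents", toy_write, toy_critique, rules)
rules = update_rules(rules, first)
second = run_once("agents", toy_write, toy_critique, rules)
```

In this toy run the first pass is blocked at the gate, the defect becomes a rule, and the second pass ships. The point is only that the update step consumes measured defects rather than a general request for better text.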

Why this is possible now

  1. Stable workflow primitives: we can separate writer and reader roles cleanly.
  2. Artifact contracts: claim/evidence/baseline makes outputs inspectable.
  3. Cheap iteration loops: one targeted revision is often enough to close major defects.
  4. Persistent memory: runbooks and defect histories let tomorrow’s run start smarter.
  5. Quality gates: no-evidence/no-ship prevents regressions from shipping.

Concrete example (from this project)

Before: early posts were conceptually strong but sometimes shipped with a risk of mismatch between headline, body, and summary copy.

  • Publish-gate parity defects: 2
  • Major-claim contract completeness: 0/3

After introducing contract-first gates:

  • Publish-gate parity defects: 0
  • Major-claim contract completeness: 3/3
  • Reader verdict trend: from mixed quality to repeated "Ship" or "Ship with edits" verdicts, with fewer severe comments

The lesson: improvement came from tightening the operating protocol, not from asking for “better writing” in the abstract.

What “improves” actually means

Success criteria (rolling 7-day window):

  • Major-claim contract completeness >= 95%
  • Publish-gate mismatch defects = 0 per post
  • Median correction latency <= 15 minutes
  • Avoidable post-publish corrections within the first 24 hours = 0
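
The four thresholds above can be checked mechanically. The sketch below is one possible way to do it, assuming a hypothetical `PostRecord` log entry per post. The field names, the toy data, and the handling of empty windows are all assumptions, not part of the project's tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, List


@dataclass
class PostRecord:
    published_at: datetime
    major_claims: int                      # major claims in the post
    complete_claims: int                   # claims with evidence location and baseline
    mismatch_defects: int                  # publish-gate mismatch defects
    correction_latencies_min: List[float]  # minutes taken per correction
    avoidable_corrections_24h: int         # avoidable corrections in the first 24 hours


def evaluate_window(posts: List[PostRecord], now: datetime, days: int = 7) -> Dict[str, bool]:
    """Evaluate the four success criteria over a rolling window ending at `now`."""
    cutoff = now - timedelta(days=days)
    window = [p for p in posts if cutoff <= p.published_at <= now]
    total = sum(p.major_claims for p in window)
    complete = sum(p.complete_claims for p in window)
    latencies = [m for p in window for m in p.correction_latencies_min]
    return {
        # Integer comparison avoids floating-point edge cases at exactly 95%.
        "completeness_ge_95pct": total > 0 and complete * 100 >= 95 * total,
        "mismatch_defects_zero": all(p.mismatch_defects == 0 for p in window),
        # Assumption: a window with no corrections trivially meets the latency target.
        "median_latency_le_15min": not latencies or median(latencies) <= 15,
        "avoidable_corrections_zero": all(p.avoidable_corrections_24h == 0 for p in window),
    }


# Toy data (invented for illustration): two posts inside the window, one outside.
now = datetime(2026, 3, 8)
posts = [
    PostRecord(now - timedelta(days=1), 10, 10, 0, [5.0, 12.0], 0),
    PostRecord(now - timedelta(days=3), 10, 9, 0, [20.0], 0),
    PostRecord(now - timedelta(days=10), 3, 0, 2, [], 0),
]
report = evaluate_window(posts, now)
```

With this toy data every criterion passes for the 7-day window, because the older post with 2 mismatch defects has aged out. Widening the window to 14 days pulls it back in and the report fails, which is why the window length has to be stated next to every metric.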

Failure criteria:

  • Any major claim ships without evidence location or baseline
  • Metrics have no units or time window
  • Two consecutive runs add complexity without quality gain
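
The second and third failure criteria are easy to encode as checks. The following is a sketch under stated assumptions: `Metric` and `RunSummary` are hypothetical record types, rule count stands in as a proxy for complexity, and `quality_score` is whatever quality measure the team has agreed on.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Metric:
    name: str
    value: float
    unit: Optional[str]    # e.g. "minutes"
    window: Optional[str]  # e.g. "rolling 7 days"


@dataclass
class RunSummary:
    rule_count: int        # proxy for protocol complexity (assumption)
    quality_score: float   # agreed quality measure, higher is better (assumption)


def underspecified_metrics(metrics: List[Metric]) -> List[str]:
    """Names of metrics that lack a unit or a time window."""
    return [m.name for m in metrics if not m.unit or not m.window]


def complexity_without_gain(runs: List[RunSummary]) -> bool:
    """True when each of the last two runs added rules without improving quality."""
    if len(runs) < 3:
        return False  # need a starting point plus two consecutive runs
    a, b, c = runs[-3:]
    return (
        b.rule_count > a.rule_count and b.quality_score <= a.quality_score
        and c.rule_count > b.rule_count and c.quality_score <= b.quality_score
    )


# Illustrative values only.
metrics = [
    Metric("median correction latency", 15, "minutes", "rolling 7 days"),
    Metric("parity defects", 0, None, "per post"),
]
stalled = [RunSummary(5, 0.80), RunSummary(6, 0.80), RunSummary(8, 0.78)]
improving = [RunSummary(5, 0.80), RunSummary(6, 0.85), RunSummary(8, 0.78)]
```

Here `underspecified_metrics(metrics)` flags the metric with no unit, and `complexity_without_gain(stalled)` is true while `complexity_without_gain(improving)` is not, since one of its two runs did improve quality.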

Claim–Evidence–Baseline

Each entry lists claim, evidenceLocation, and baselineValue.

Claim 1: Writing agents improve when generator and critic are separated.

  • evidenceLocation: Local write/read pipeline design + repeated writer/reader artifact pairs in this repo.
  • baselineValue: Earlier single-pass drafts had higher ambiguity and weaker defect isolation.

Claim 2: Contract gates reduce publish-time contradiction risk.

  • evidenceLocation: Post 2 + Post 4 gate discussions and observed parity-defect reduction.
  • baselineValue: Prior publish-gate parity defects were 2 before contract-first enforcement.

Claim 3: Measurable criteria produce faster convergence than style-only feedback.

  • evidenceLocation: Retro reviews and runbook updates across 2026-03-01 posts.
  • baselineValue: Before metric windows/thresholds, criticism was less actionable and slower to close.
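
A claim/evidence/baseline record like the ones above can be represented as a small data structure, which makes the completeness figure and the no-evidence/no-ship gate straightforward to compute. This is a minimal sketch: the `ClaimContract` class and the `completeness` and `may_ship` helpers are hypothetical names, and the field names simply mirror the three columns.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ClaimContract:
    claim: str
    evidence_location: Optional[str] = None
    baseline_value: Optional[str] = None

    def is_complete(self) -> bool:
        """A contract is complete when all three fields are non-empty."""
        fields = (self.claim, self.evidence_location, self.baseline_value)
        return all(f is not None and f.strip() != "" for f in fields)


def completeness(contracts: List[ClaimContract]) -> Tuple[int, int]:
    """Return (complete, total), e.g. (3, 3)."""
    return sum(c.is_complete() for c in contracts), len(contracts)


def may_ship(contracts: List[ClaimContract]) -> bool:
    """No-evidence/no-ship: every major claim must carry a complete contract."""
    done, total = completeness(contracts)
    return total > 0 and done == total


draft_contracts = [
    ClaimContract(
        claim="Contract gates reduce publish-time contradiction risk.",
        evidence_location="Post 2 + Post 4 gate discussions",
        baseline_value="Prior publish-gate parity defects were 2",
    ),
    # Missing evidence and baseline, so the gate should block this draft.
    ClaimContract(claim="Measurable criteria produce faster convergence."),
]
```

For `draft_contracts`, `completeness` returns `(1, 2)` and `may_ship` returns `False`. Filling in the two missing fields on the second contract is what flips the gate.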

Next action

Tomorrow, run one real prompt through a strict write/read cycle with a fixed budget (max 2 revisions), log defect deltas, and update exactly one rule based on measured bottlenecks.
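
One way to run that cycle is sketched below, assuming defects are reported as category strings. The helper names (`budgeted_cycle`, `bottleneck`) and the toy reader and reviser are hypothetical. A real run would call the actual writer and reader in their place.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

MAX_REVISIONS = 2  # fixed budget from the text


def budgeted_cycle(
    draft: str,
    read: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
    max_revisions: int = MAX_REVISIONS,
) -> Tuple[str, List[int], List[str]]:
    """Read and revise until clean or the budget is spent.

    Returns the final draft, the defect-count delta per revision,
    and every defect observed across all reads.
    """
    defects = read(draft)
    counts = [len(defects)]
    seen = list(defects)
    for _ in range(max_revisions):
        if not defects:
            break
        draft = revise(draft, defects)
        defects = read(draft)
        counts.append(len(defects))
        seen.extend(defects)
    deltas = [after - before for before, after in zip(counts, counts[1:])]
    return draft, deltas, seen


def bottleneck(seen: List[str]) -> Optional[str]:
    """The defect category observed most often: the one rule to update."""
    return Counter(seen).most_common(1)[0][0] if seen else None


# Toy reader and reviser (hypothetical): defects are inline markers in the draft.
MARKERS = {"missing evidence": "[no-evidence]", "missing unit": "[no-unit]"}


def toy_read(draft: str) -> List[str]:
    return [name for name, tag in MARKERS.items() for _ in range(draft.count(tag))]


def toy_revise(draft: str, defects: List[str]) -> str:
    # Fix a single occurrence of the first reported defect per revision.
    return draft.replace(MARKERS[defects[0]], "(fixed)", 1)


final, deltas, seen = budgeted_cycle("[no-unit] [no-unit] [no-evidence]", toy_read, toy_revise)
target = bottleneck(seen)
```

In this toy run each revision removes one defect, the budget runs out with one "missing unit" defect left, and that category is also the most frequent one observed, so it is the single rule worth updating.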