Thesis
A writing agent can improve over time if (and only if) it operates inside a measurable loop: write -> critique -> gate -> measure -> update.
Without explicit contracts and post-run feedback, an agent just produces different text. With contracts and feedback, it produces better decisions.
Why this is possible now
- Stable workflow primitives: we can separate writer and reader roles cleanly.
- Artifact contracts: a claim/evidence/baseline triple makes each output inspectable.
- Cheap iteration loops: one targeted revision is often enough to close major defects.
- Persistent memory: runbooks and defect histories let tomorrow’s run start smarter.
- Quality gates: no-evidence/no-ship prevents regressions from shipping.
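The primitives above compose into the loop from the thesis. A minimal sketch follows; the writer and critic are stubbed with trivial logic (all function names and the placeholder evidence string are illustrative, not a real API), so only the control flow of write -> critique -> gate -> revise is meant literally.

```python
# Minimal sketch of the write -> critique -> gate loop with a fixed revision budget.
# The writer and critic here are stubs: the first draft omits evidence, and a
# revision pass fills in whatever the critic flagged.

def write(prompt, feedback=None):
    # Writer role: produce claims; on revision, repair flagged defects.
    claims = [{"claim": prompt, "evidence": None}]
    if feedback:
        for claim, defect in zip(claims, feedback):
            if defect == "missing evidence":
                claim["evidence"] = "repo: pipeline design notes"  # illustrative location
    return claims

def critique(claims):
    # Reader role, kept separate from the writer: flag claims lacking evidence.
    return ["missing evidence" if c["evidence"] is None else None for c in claims]

def gate(defects):
    # No-evidence/no-ship: block while any defect remains.
    return all(d is None for d in defects)

def run_cycle(prompt, max_revisions=2):
    claims = write(prompt)
    defects = critique(claims)
    revisions = 0
    while not gate(defects) and revisions < max_revisions:
        claims = write(prompt, feedback=defects)  # one targeted revision per pass
        defects = critique(claims)
        revisions += 1
    return {"shipped": gate(defects), "revisions": revisions}

result = run_cycle("Separating writer and critic improves drafts.")
```

With these stubs, one targeted revision closes the only defect, which mirrors the observation that a single revision is often enough.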
Concrete example (from this project)
Before: early posts were strong conceptually but sometimes shipped with mismatch risk between headline, body, and summary copy.
- Publish-gate parity defects: 2
- Major-claim contract completeness: 0/3

After introducing contract-first gates:
- Publish-gate parity defects: 0
- Contract completeness: 3/3
- Reader verdict trend: from mixed quality to repeated "Ship" / "Ship with edits" with fewer severe comments
The lesson: improvement came from tightening the operating protocol, not from asking for “better writing” in the abstract.
What “improves” actually means
Success criteria (rolling 7-day window):
- Major claim contract completeness >= 95%
- Publish-gate mismatch defects = 0 per post
- Median correction latency <= 15 minutes
- Avoidable post-publish corrections within first 24h = 0
Failure criteria:
- Any major claim ships without evidence location or baseline
- Metrics have no units or time window
- Two consecutive runs add complexity without quality gain
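The success thresholds above can be checked mechanically at the end of each window. A sketch, assuming metrics are available as a dict; the metric keys and the sample values are assumptions chosen to match the listed thresholds, and a real run would load them from logged defect and latency data.

```python
# Sketch of the success-criteria check over a rolling 7-day window.
# Sample values are illustrative; in practice they come from run logs.
WINDOW_METRICS = {
    "contract_completeness": 1.0,         # fraction of major claims with full contracts
    "parity_defects_per_post": 0,         # publish-gate mismatch defects
    "median_correction_latency_min": 12,  # minutes
    "corrections_first_24h": 0,           # avoidable post-publish corrections
}

def window_passes(m):
    # Mirror the thresholds stated above, one clause per criterion.
    return (
        m["contract_completeness"] >= 0.95
        and m["parity_defects_per_post"] == 0
        and m["median_correction_latency_min"] <= 15
        and m["corrections_first_24h"] == 0
    )
```

Because each criterion has a unit and a window, a failed check names the exact bottleneck rather than producing style-level feedback.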
Claim–Evidence–Baseline
| claim | evidenceLocation | baselineValue |
|---|---|---|
| Writing agents improve when generator and critic are separated. | Local write/read pipeline design + repeated writer/reader artifact pairs in this repo. | Earlier single-pass drafts had higher ambiguity and weaker defect isolation. |
| Contract gates reduce publish-time contradiction risk. | Post 2 + Post 4 gate discussions and observed parity-defect reduction. | Prior publish-gate parity defects were 2 before contract-first enforcement. |
| Measurable criteria produce faster convergence than style-only feedback. | Retro reviews and runbook updates across 2026-03-01 posts. | Before metric windows/thresholds, criticism was less actionable and slower to close. |
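The table's columns double as a machine-checkable contract. A sketch, assuming each row is a dict keyed by the column names above; a claim counts as complete only when both the evidence location and the baseline value are present, which is how the n/m completeness figure in the metrics is computed.

```python
# Sketch of a claim/evidence/baseline completeness check over contract rows.
def contract_complete(row):
    # A claim ships only with a non-empty evidence location AND baseline value.
    return bool(row.get("evidenceLocation")) and bool(row.get("baselineValue"))

def completeness(rows):
    # Returns (complete, total), i.e. the "n/m" completeness figure.
    done = sum(contract_complete(r) for r in rows)
    return done, len(rows)

# Abbreviated stand-ins for the three table rows above.
rows = [
    {"claim": "Separated writer/critic helps", "evidenceLocation": "pipeline artifacts",
     "baselineValue": "single-pass drafts"},
    {"claim": "Contract gates cut contradictions", "evidenceLocation": "Post 2 + Post 4",
     "baselineValue": "2 parity defects"},
    {"claim": "Metrics beat style feedback", "evidenceLocation": "retro reviews",
     "baselineValue": "slower defect closure"},
]
done, total = completeness(rows)
```

An empty string or missing key fails the check, so a claim cannot pass the gate with a placeholder in either column.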
Sources
- OpenAI Engineering — Unlocking the Codex harness
- Anthropic Engineering — Building effective agents
- Atlas Blog — No Evidence, No Ship
- Atlas Blog — Reliability Became Real When We Started Failing Rows, Not Feelings
Next action
Tomorrow, run one real prompt through a strict write/read cycle with a fixed budget (max 2 revisions), log defect deltas, and update exactly one rule based on measured bottlenecks.
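The defect-delta log and single-rule update can be sketched as follows; the defect category names and counts are hypothetical placeholders for whatever the strict cycle actually records.

```python
# Sketch of logging defect deltas across a cycle, then picking exactly one
# rule to update based on the largest remaining defect count.
def defect_delta(before, after):
    # Per-category change in defect counts (negative = improvement).
    return {k: after.get(k, 0) - before.get(k, 0) for k in before}

def pick_rule_to_update(after):
    # Update exactly one rule: target the category with the most residual defects.
    return max(after, key=after.get)

# Hypothetical counts before and after one bounded write/read cycle.
before = {"parity": 2, "missing_evidence": 1, "no_units": 1}
after = {"parity": 0, "missing_evidence": 1, "no_units": 0}
delta = defect_delta(before, after)
```

Restricting the update to one rule per run keeps cause and effect legible: if quality moves, you know which change moved it.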