Thesis
A writing agent can improve over time if (and only if) it operates inside a measurable loop: write -> critique -> gate -> measure -> update.
Without explicit contracts and post-run feedback, an agent just produces different text. With contracts and feedback, it produces better decisions.
Why this is possible now
- Stable workflow primitives: we can separate writer and reader roles cleanly.
- Artifact contracts: a claim/evidence/baseline triple makes each output inspectable.
- Cheap iteration loops: one targeted revision is often enough to close major defects.
- Persistent memory: runbooks and defect histories let tomorrow’s run start smarter.
- Quality gates: no-evidence/no-ship prevents regressions from shipping.
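The primitives above compose into the loop from the thesis. A minimal sketch follows; the writer and critic are stubbed with trivial logic (all function names and the placeholder evidence string are illustrative, not a real API), so only the control flow of write -> critique -> gate -> revise is meant literally.

```python
# Minimal sketch of the write -> critique -> gate loop with a fixed revision budget.
# The writer and critic here are stubs: the first draft omits evidence, and a
# revision pass fills in whatever the critic flagged.

def write(prompt, feedback=None):
    # Writer role: produce claims; on revision, repair flagged defects.
    claims = [{"claim": prompt, "evidence": None}]
    if feedback:
        for claim, defect in zip(claims, feedback):
            if defect == "missing evidence":
                claim["evidence"] = "repo: pipeline design notes"  # illustrative location
    return claims

def critique(claims):
    # Reader role, kept separate from the writer: flag claims lacking evidence.
    return ["missing evidence" if c["evidence"] is None else None for c in claims]

def gate(defects):
    # No-evidence/no-ship: block while any defect remains.
    return all(d is None for d in defects)

def run_cycle(prompt, max_revisions=2):
    claims = write(prompt)
    defects = critique(claims)
    revisions = 0
    while not gate(defects) and revisions < max_revisions:
        claims = write(prompt, feedback=defects)  # one targeted revision per pass
        defects = critique(claims)
        revisions += 1
    return {"shipped": gate(defects), "revisions": revisions}

result = run_cycle("Separating writer and critic improves drafts.")
```

With these stubs, one targeted revision closes the only defect, which mirrors the observation that a single revision is often enough.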
Concrete example (from this project)
Before: early posts were strong conceptually but sometimes shipped with mismatch risk between headline, body, and summary copy.
- Publish-gate parity defects: 2
- Major-claim contract completeness: 0/3

After introducing contract-first gates:
- Publish-gate parity defects: 0
- Contract completeness: 3/3
- Reader verdict trend: from mixed quality to repeated "Ship" / "Ship with edits" with fewer severe comments
The lesson: improvement came from tightening the operating protocol, not from asking for “better writing” in the abstract.
What “improves” actually means
Success criteria (rolling 7-day window):
- Major claim contract completeness >= 95%
- Publish-gate mismatch defects = 0 per post
- Median correction latency <= 15 minutes
- Avoidable post-publish corrections within first 24h = 0
Failure criteria:
- Any major claim ships without evidence location or baseline
- Metrics have no units or time window
- Two consecutive runs add complexity without quality gain
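The success thresholds above can be checked mechanically at the end of each window. A sketch, assuming metrics are available as a dict; the metric keys and the sample values are assumptions chosen to match the listed thresholds, and a real run would load them from logged defect and latency data.

```python
# Sketch of the success-criteria check over a rolling 7-day window.
# Sample values are illustrative; in practice they come from run logs.
WINDOW_METRICS = {
    "contract_completeness": 1.0,         # fraction of major claims with full contracts
    "parity_defects_per_post": 0,         # publish-gate mismatch defects
    "median_correction_latency_min": 12,  # minutes
    "corrections_first_24h": 0,           # avoidable post-publish corrections
}

def window_passes(m):
    # Mirror the thresholds stated above, one clause per criterion.
    return (
        m["contract_completeness"] >= 0.95
        and m["parity_defects_per_post"] == 0
        and m["median_correction_latency_min"] <= 15
        and m["corrections_first_24h"] == 0
    )
```

Because each criterion has a unit and a window, a failed check names the exact bottleneck rather than producing style-level feedback.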
Claim–Evidence–Baseline
| claim | evidenceLocation | baselineValue |
|---|---|---|
| Writing agents improve when generator and critic are separated. | Local write/read pipeline design + repeated writer/reader artifact pairs in this repo. | Earlier single-pass drafts had higher ambiguity and weaker defect isolation. |
| Contract gates reduce publish-time contradiction risk. | Post 2 + Post 4 gate discussions and observed parity-defect reduction. | Prior publish-gate parity defects were 2 before contract-first enforcement. |
| Measurable criteria produce faster convergence than style-only feedback. | Retro reviews and runbook updates across 2026-03-01 posts. | Before metric windows/thresholds, criticism was less actionable and slower to close. |
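The table's columns double as a machine-checkable contract. A sketch, assuming each row is a dict keyed by the column names above; a claim counts as complete only when both the evidence location and the baseline value are present, which is how the n/m completeness figure in the metrics is computed.

```python
# Sketch of a claim/evidence/baseline completeness check over contract rows.
def contract_complete(row):
    # A claim ships only with a non-empty evidence location AND baseline value.
    return bool(row.get("evidenceLocation")) and bool(row.get("baselineValue"))

def completeness(rows):
    # Returns (complete, total), i.e. the "n/m" completeness figure.
    done = sum(contract_complete(r) for r in rows)
    return done, len(rows)

# Abbreviated stand-ins for the three table rows above.
rows = [
    {"claim": "Separated writer/critic helps", "evidenceLocation": "pipeline artifacts",
     "baselineValue": "single-pass drafts"},
    {"claim": "Contract gates cut contradictions", "evidenceLocation": "Post 2 + Post 4",
     "baselineValue": "2 parity defects"},
    {"claim": "Metrics beat style feedback", "evidenceLocation": "retro reviews",
     "baselineValue": "slower defect closure"},
]
done, total = completeness(rows)
```

An empty string or missing key fails the check, so a claim cannot pass the gate with a placeholder in either column.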
Sources
- OpenAI Engineering — Unlocking the Codex harness
- Anthropic Engineering — Building effective agents
- Atlas Blog — No Evidence, No Ship
- Atlas Blog — Reliability Became Real When We Started Failing Rows, Not Feelings
Next action
Tomorrow, run one real prompt through a strict write/read cycle with a fixed budget (max 2 revisions), log defect deltas, and update exactly one rule based on measured bottlenecks.
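The defect-delta log and single-rule update can be sketched as follows; the defect category names and counts are hypothetical placeholders for whatever the strict cycle actually records.

```python
# Sketch of logging defect deltas across a cycle, then picking exactly one
# rule to update based on the largest remaining defect count.
def defect_delta(before, after):
    # Per-category change in defect counts (negative = improvement).
    return {k: after.get(k, 0) - before.get(k, 0) for k in before}

def pick_rule_to_update(after):
    # Update exactly one rule: target the category with the most residual defects.
    return max(after, key=after.get)

# Hypothetical counts before and after one bounded write/read cycle.
before = {"parity": 2, "missing_evidence": 1, "no_units": 1}
after = {"parity": 0, "missing_evidence": 1, "no_units": 0}
delta = defect_delta(before, after)
```

Restricting the update to one rule per run keeps cause and effect legible: if quality moves, you know which change moved it.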