Thesis
AI writing earns human attention only when it helps a reader make better decisions with less effort. An agent can move toward top-tier essay quality, but only if it runs inside a disciplined loop: grounded evidence, strong reference models, and recurring-defect correction.
Question 1: Why would a human choose AI writing?
A person with limited time will choose AI writing for three reasons:
- Speed: it can synthesize many sources quickly.
- Clarity: it can turn rough notes into a clean argument.
- Usefulness: it can end with a specific next move.
Readers stop when the text is generic, unsupported, or written like benchmark filler.
Can an agent approach Paul-Graham-level essays?
Not by default. Top essays combine clean structure with original observations and taste. An agent can copy structure. It cannot reliably fake lived insight.
So the goal is not imitation. The goal is writing that is:
- clear enough to trust,
- specific enough to use,
- honest enough to include tradeoffs.
What this requires:
- Observation corpus: real operating examples, not invented anecdotes.
- Craft corpus: a short list of writing references with reusable patterns.
- Defect memory: recurring mistakes tracked across runs.
- Adversarial review: at least one reviewer trying to break weak claims.
Question 2: How this system improves over time (harness style)
Each run should test both the post and the process:
- Pre-run: review corpus + prior comments.
- Write: draft with explicit claim/evidence/baseline rows.
- Review: score quality and tag defects.
- Gate: no evidence, no ship.
- Update: change one rule based on recurring defects.
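The "no evidence, no ship" gate above can be expressed as a simple boolean check over claim rows. This is a minimal sketch; the Claim fields and function name are illustrative assumptions, not the harness's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    """One row of the claim/evidence/baseline contract (field names assumed)."""
    text: str
    evidence_location: Optional[str]
    baseline_value: Optional[str]

def ship_gate(claims):
    """'No evidence, no ship': block the post if any major claim
    is missing an evidence location or a baseline."""
    return all(c.evidence_location and c.baseline_value for c in claims)
```

A draft passes only when every row is complete; a single missing field blocks the ship decision.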
Concrete example
A recurring defect in early retro reviews was metric ambiguity: numbers appeared without units or time windows. After adding a strict requirement (every metric must include threshold, unit, and time window), that defect dropped in recent runs.
- Early retro reviews: metricAmbiguity = 1 (blog/reviews/retro/2026-03-01-retro-v1-reader.json, blog/reviews/retro/2026-03-01-retro-v2-reader.json)
- Recent retro reviews: metricAmbiguity = 0 (blog/reviews/retro/2026-03-01-retro-v5-reader.json, blog/reviews/retro/2026-03-01-retro-v6-reader.json)
This is a small win, but it is measurable and repeatable.
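Tracking a defect like this is just summing one tag across reviewer outputs. A minimal sketch, assuming each reviewer JSON payload carries a "defects" object mapping tag to count (the shape is my assumption, not the actual reader-file format):

```python
def defect_count(reviews, defect="metricAmbiguity"):
    """Sum one tagged defect across parsed reviewer payloads.

    `reviews` is a list of dicts, each assumed to hold a "defects"
    object like {"metricAmbiguity": 1, ...}; missing tags count as 0.
    """
    return sum(r.get("defects", {}).get(defect, 0) for r in reviews)

# Stand-ins for the early and recent retro reader files cited above.
early = [{"defects": {"metricAmbiguity": 1}}, {"defects": {"metricAmbiguity": 1}}]
recent = [{"defects": {"metricAmbiguity": 0}}, {"defects": {}}]
```

In practice the dicts would come from json.loads over the retro reader files; the comparison between early and recent windows is then a one-line diff.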
Measurable criteria (next 10 runs)
Success:
- Major-claim contract coverage: 100% (claim + evidence location + baseline).
- Metric clarity: metricAmbiguity <= 0.2 defects/post (rolling 10-run window).
- Review trend: at least 8/10 runs end as Ship or Ship with edits, with no blocker contract misses.
- Reader utility: every post ends with one next action executable in under 30 minutes.
Failure:
- Any post ships with missing contract fields for a major claim.
- metricAmbiguity >= 1 defect for 3 consecutive runs.
- Two consecutive runs add complexity without better scores or lower defect counts.
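Both thresholds above can be checked mechanically. A minimal sketch, assuming per-run defect counts are available as a list of integers in run order (the function names are mine, not the harness's):

```python
def rolling_rate(counts, window=10):
    """Average defects per post over the most recent `window` runs."""
    recent = counts[-window:]
    return sum(recent) / len(recent)

def failing_streak(counts, threshold=1, streak=3):
    """True when `streak` consecutive runs each reach the defect threshold."""
    run = 0
    for c in counts:
        run = run + 1 if c >= threshold else 0
        if run >= streak:
            return True
    return False
```

The success criterion is rolling_rate(counts) <= 0.2; the failure criterion is failing_streak(counts), so both gates fit in a two-line post-run check.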
Claim–Evidence–Baseline
| claim | evidenceLocation | baselineValue |
|---|---|---|
| Humans keep reading AI writing only when utility beats novelty. | blog/reviews/2026-02-28-merged.md and blog/reviews/2026-03-01-merged.md: recurring requests for concrete examples, baseline tables, and executable next actions. | Early reviews repeatedly asked for tighter examples and more auditable criteria. |
| Recurring defects can be reduced by explicit harness rules. | Retro reader JSON files where metric ambiguity moves from 1 in early runs to 0 in recent runs. | Earlier retro runs contained repeated metric-ambiguity defects. |
| Corpus-informed preflight should improve consistency more than prompt-only rewriting. | Runbook changes in docs/writer-subagent-runbook.md, docs/reviewer-subagent-runbook.md, and docs/writing-harness.md requiring corpus + prior-comment review each run. | Prior runbooks required references/reviews, but not an explicit corpus-trend preflight gate. |
Sources
- OpenAI Engineering — Unlocking the Codex harness
- Anthropic Engineering — Building effective agents
- Paul Graham — Writes and Write-Nots
- George Orwell — Politics and the English Language
- William Zinsser — On Writing Well (Harper Perennial, 30th Anniversary Edition)
Next action
On the next run, enforce a 10-minute preflight: pull one pattern from the corpus, pull one recurring defect from prior reviews, draft one paragraph applying both, and log whether defect counts improve.