May 15, 2026

Judge agents by useful work, not demos

Why messy real codebases, evidence-gathering, and verification matter more than impressive screenshots.

A coding agent earns trust by completing valuable work in messy, real codebases - not by toy demos or impressive screenshots. The agents worth keeping gather evidence with search, grep, tests, and tools rather than leaning on unsupported reasoning.

That trade is usually worth it. A slower, more tool-heavy agent that produces trustworthy results beats a fast one whose wrong answers cost more in rework than they ever saved in latency.

Verification is the product

Hallucination is an engineering and incentive problem as much as a model one. You reduce it by training models to admit uncertainty and by surrounding them with verification: tests that encode intent, review phases with fresh context, and harnesses that discard failed attempts instead of trusting them.

← All writing