Source: Effective Harnesses for Long-Running Agents — Justin Young, 2025

Summary

Long-running AI agents start every session with a blank slate. Without scaffolding, they fail in two ways: they try to do too much at once, run out of context, and leave things half-built; or they see existing progress and prematurely declare the work done. Both problems come down to the agent having no way to know what’s been done and what’s left.

Young proposes a two-agent setup using the Claude Agent SDK. An initializer agent runs once to set up the environment. It creates a startup script, a progress log, an initial git commit, and a structured JSON feature list with 200+ entries, each marked passes: false. A coding agent handles all later sessions. The feature list is the key mechanism: it prevents the agent from trying to do everything at once and stops it from declaring victory while incomplete features remain. The agent can only flip passes to true. It cannot modify feature descriptions or test steps.
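The flip-only rule on the feature list can be sketched in a few lines of Python. This is a hedged illustration, not the article's actual code: the field names (id, description, steps, passes) and the validator are assumptions; the source only specifies structured JSON entries marked passes: false that the agent may flip to true but not otherwise edit.

```python
import copy

# Hypothetical shape of two feature-list entries (the real list has 200+).
FEATURES = [
    {"id": 1, "description": "User can log in", "steps": ["open /login", "submit form"], "passes": False},
    {"id": 2, "description": "User can log out", "steps": ["click logout"], "passes": False},
]

def apply_update(features, feature_id):
    """Return a copy of the list with one feature marked passing."""
    updated = copy.deepcopy(features)
    for feat in updated:
        if feat["id"] == feature_id:
            feat["passes"] = True
            return updated
    raise KeyError(f"unknown feature id {feature_id}")

def validate_edit(before, after):
    """Enforce the harness rule: the only legal change is flipping
    passes from False to True; descriptions and steps are immutable."""
    if len(before) != len(after):
        return False
    for b, a in zip(before, after):
        # Everything except the passes flag must be byte-for-byte identical.
        if {k: v for k, v in b.items() if k != "passes"} != \
           {k: v for k, v in a.items() if k != "passes"}:
            return False
        if b["passes"] and not a["passes"]:
            return False  # un-passing a feature is never allowed
    return True
```

A validator like this could gate the agent's writes to the feature file, which is what makes the list a trustworthy progress record across sessions.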

Each session, the coding agent reads the git log and the progress file to rebuild context, picks the highest-priority incomplete feature, starts the dev server, and runs baseline tests before doing any new work. It works on one feature at a time, commits with descriptive messages, and updates the progress file before ending. A major reliability gain came from having the agent use Puppeteer to test features in a real browser rather than trusting the code it wrote, though some surfaces, such as browser-native alert modals, remain invisible to it.
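The session-start routine above can be sketched as a small helper. This is an assumed implementation, not the article's: claude-progress.txt is named in the source, but features.json, the 20-commit window, and list order standing in for priority are all illustrative guesses.

```python
import json
import subprocess
from pathlib import Path

def next_incomplete(features):
    # Assumption: list order encodes priority, so the first
    # still-failing entry is the next feature to work on.
    return next((f for f in features if not f["passes"]), None)

def rebuild_context(repo_dir="."):
    """Re-read recent git history and the progress file, then pick
    the next feature, before any new work begins."""
    log = subprocess.run(
        ["git", "-C", repo_dir, "log", "--oneline", "-n", "20"],
        capture_output=True, text=True, check=True,
    ).stdout
    progress = Path(repo_dir, "claude-progress.txt").read_text()
    features = json.loads(Path(repo_dir, "features.json").read_text())
    return log, progress, next_incomplete(features)
```

Feeding the returned log, progress text, and chosen feature into the session prompt is what lets a fresh context window pick up where the last shift left off.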

Quotes

“Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift.”

“The key insight here was finding a way for agents to quickly understand the state of work when starting with a fresh context window, which is accomplished with the claude-progress.txt file alongside the git history.”

“Providing Claude with these kinds of testing tools dramatically improved performance, as the agent was able to identify and fix bugs that weren’t obvious from the code alone.”

Notes

The practical question is how repeatable this is. Could the harness components (the feature list, progress file, startup routine, and session prompts) be generated consistently across different projects, or does each one need a custom setup? The approach is compelling, but its adoption at scale depends on how much of the scaffolding can be made generic and easy to generate.