The scene: A three-person engineering team at OpenAI's agent division merges 1,500 pull requests in five months. That's roughly 3.5 PRs per person per day. The team's secret isn't smarter engineers — it's a constraint-driven harness that lets agents self-assign work from a Kanban board, execute tasks, and deliver proof of quality via CI results and walkthrough videos. No individual prompts. No hand-holding. Just a well-designed environment where the agents figure out how to ship.
Meanwhile, at Cursor, an engineering team runs 100 agents simultaneously on the same codebase, hitting 1,000 commits per hour at peak. The architecture that made this possible took four iterations: shared state files failed catastrophically at 20 agents, a message queue introduced coordination overhead that negated parallelism, event sourcing added complexity without solving the core problem, and finally — isolated repo copies with a recursive Planner-Worker pattern unlocked the throughput.
At Anthropic, a single Claude Code instance runs for four-plus hours on a complex task without drifting off-target. The trick: an independent Evaluator agent that operates the running application via Playwright, sharing zero internal state with the generator. When the generator starts going off-rails, the Evaluator catches it within seconds.
Three companies. Three different scaling challenges. One unified insight: the real battle in AI coding isn't about which model is smarter — it's about which organization can design better constraint environments for their agents.
This is the discipline of harness engineering: the art and science of building the environment that surrounds an AI agent, determining what it can perceive, how it can act, and what outcomes count as success. In 2026, harness engineering has become the defining competitive advantage in AI-assisted software development. Models are commoditizing fast. The infrastructure around them is not.
The Problem: When Your Agent Is Smart but Your System Is Dumb
Here's the uncomfortable data point that launched a thousand internal post-mortems: according to internal research by harness tooling vendors, heavy users of AI coding tools have 69% higher deployment failure rates than teams using traditional development workflows. Not lower. Higher.
This is the paradox that blindsided engineering managers in 2024 and 2025. The promise was clear: give developers AI coding agents and watch velocity skyrocket. And the velocity did increase — often by 2x or 3x. But the downstream costs ate those gains. More code written faster meant more integration failures, more production incidents, more time spent debugging agent-generated code that was syntactically correct but architecturally wrong.
Martin Fowler, writing about the emerging agent paradigm, crystallized the problem with a deceptively simple formula:
Agent = Model + Harness
The model is the intelligence. The harness is everything else: the context that gets fed in, the tools that let the agent act, the feedback loops that correct drift, the constraints that bound behavior. Most organizations obsessing over model selection have been asking the wrong question. They're picking the brain but ignoring the nervous system.
The agents themselves aren't failing. The environments holding them are.
What Is Harness Engineering? (Beyond the Buzzword)
Harness engineering is the deliberate design of the constraint environment surrounding an AI agent. It encompasses everything in an AI agent system except the model itself.
This is a narrower definition than you might hear from prompt engineers or context engineers — and that's intentional. Prompt engineering focuses on crafting the inputs to the model. Context engineering focuses on managing what information the model sees. Harness engineering is broader: it's the entire scaffolding that determines what the agent can do, how it does it, and how success is measured.
Think of it as a car:
- Model = the engine. More horsepower helps, but a 1,000-horsepower engine in a car with no steering, no brakes, and no dashboard gets you nowhere useful — just faster.
- Context = the fuel. High-quality fuel helps the engine run better, but you can optimize fuel all day and the car still won't drive itself.
- Harness = the steering, brakes, dashboard, suspension, and safety systems. This is what converts raw engine power into directed motion toward a destination.
A harness includes:
- Tool definitions and the orchestration logic that decides which tools get invoked when
- Success criteria expressed as verifiable constraints, not role descriptions
- Feedback mechanisms that detect drift and trigger correction
- State management that allows long-running tasks to maintain coherence
- Coordination protocols for multi-agent systems
Here's what a minimal harness configuration looks like in practice — an AGENTS.md file that defines the constraint environment for a coding agent:
```markdown
# AGENTS.md — Harness Configuration

## Success Criteria
- All functions must have type annotations
- No TODO comments in submitted code
- Cyclomatic complexity below 15 for all new functions
- All public APIs must have docstrings

## Constraints
- Never use `as any` or `@ts-ignore`
- Never suppress errors with empty catch blocks
- All new files must follow existing directory conventions

## Verification
- Run `npm run lint` before committing
- Run `npm test` before submitting PR
- All PRs must pass CI before merge

## Context Management
- Read WORKSPACE.md before searching for files
- Follow existing patterns in the codebase
- When in doubt, ask for clarification
```
This isn't a prompt — it's a constraint specification. The agent doesn't need to "remember" these rules because they're verified automatically by the linter and CI pipeline. For a deeper exploration of this concept, see our earlier guide to Harness Engineering, which traces the historical origins from the Spinning Jenny to modern AI systems.
The distinction matters because organizations that treat harness engineering as an afterthought end up with agents that are powerful but undirected — generating lots of activity without producing useful outcomes.
The Three Scaling Dimensions Framework
Every AI coding agent system runs into scaling problems. Not because of the model, but because of the infrastructure around it. After studying production deployments at Anthropic, Cursor, OpenAI, and dozens of other organizations, a clear pattern emerges: there are exactly three scaling dimensions, and each one requires a fundamentally different architectural approach.
The Three Dimensions
1. Time Scaling — How do you keep a single agent productive over long time horizons? The problem isn't intelligence — it's coherence. After 4+ hours of continuous work, agents start losing track of the original goal, pursuing sub-tasks that don't aggregate to meaningful progress, or repeating the same error patterns without self-correction. Time scaling is about maintaining direction across extended execution runs.
2. Space Scaling — How do you coordinate multiple agents working on the same problem? The naive approach — shared state, shared context — degrades rapidly as agent count increases. At Cursor, they hit a wall at 20 simultaneous agents. The coordination overhead grew faster than the parallelism benefit. Space scaling is about maintaining throughput as you add agents.
3. Interaction Scaling — How do you steer a large number of agents without writing individual prompts for each? If you have 500 agents and you're crafting instructions for each one, you haven't scaled — you've just moved the bottleneck. Interaction scaling is about specifying intent at the system level and letting agents self-organize.
The three companies in this article converged on these dimensions not through theoretical planning, but through empirical firefighting. Anthropic's primary challenge was Time. Cursor's was Space. OpenAI's was Interaction. Each company developed its solution independently; the three-dimension framework emerged from cross-referencing their architectures.
The mapping: Anthropic → Time Scaling. Cursor → Space Scaling. OpenAI → Interaction Scaling.
This isn't a coincidence. These are the problems each organization encountered first and solved first, based on their specific use cases and organizational context. But the framework is universal — every AI coding system will eventually confront all three dimensions as it scales.
Time Scaling: How Anthropic Keeps a Single Agent Running for Hours
The drift problem is the first thing every long-running agent encounters. Ask an agent to refactor a 50,000-line monolith in a single session, and something predictable happens: around the 2-3 hour mark, the agent starts making locally coherent but globally inconsistent decisions. It loses sight of the original architecture goals. It refactors module A in a way that breaks module C. It introduces patterns that were correct in isolation but wrong in context.
The standard solution in early 2025 was self-reflection: have the agent periodically pause and ask itself "am I still on track?" This doesn't work. Agents are poor judges of their own coherence because they have no external ground truth. They can't see what they've already changed in the codebase — they can only see what they're currently doing.
Anthropic's solution for Claude Code was an independent Evaluator agent that operates the running application via Playwright.
The architecture splits the harness into two completely isolated components:
- Generator — The agent doing the actual code work. It receives context, writes code, proposes changes.
- Evaluator — An independent agent that has zero shared state with the Generator. It watches the running application through Playwright, clicks through the UI, checks that the expected behaviors are present, and reports failures back to the Generator.
The Evaluator doesn't know what the Generator is trying to do. It just knows the success criteria: "when I click this button, this modal should appear." If the modal doesn't appear, it flags a failure. The Generator then uses that failure signal to course-correct, without needing to understand why the failure occurred.
This separation works because the Evaluator and Generator share no internal state. The Generator can't "hallucinate" an excuse for why the modal didn't appear. The Evaluator can't be manipulated by the Generator's self-reports. The only shared artifact is the running application itself.
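In production the probe is Playwright driving the live application; stripped to its essentials (the `AppState` shape and the criterion below are illustrative, not Anthropic's actual types), the Evaluator reduces to a pure function from observed state and declarative criteria to a failure list:

```typescript
// Toy Generator/Evaluator split. The Evaluator sees only observable app
// state plus declarative success criteria, never the Generator's internals
// or its self-reports.
type AppState = { modalVisible: boolean };
type Criterion = { description: string; probe: (app: AppState) => boolean };

// Returns the descriptions of every criterion the running app fails.
function evaluate(app: AppState, criteria: Criterion[]): string[] {
  return criteria.filter(c => !c.probe(app)).map(c => c.description);
}

const criteria: Criterion[] = [
  { description: 'clicking Settings opens the modal', probe: a => a.modalVisible },
];

// The Generator's work is judged purely by the state it produced; the
// returned failure list is the entire feedback signal.
console.log(evaluate({ modalVisible: false }, criteria));
```

Because `evaluate` takes no input from the Generator except the resulting application state, there is nothing for the Generator to argue with: the criterion either holds or it doesn't.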
The Auto-Dream Mechanism
Anthropic also implemented what their engineers internally call auto-dream: a background memory consolidation process that runs during idle cycles. When the agent isn't actively receiving user input, it reviews its recent action history and extracts high-level principles about what worked and what didn't. This is inspired by sleep consolidation in biological systems — the brain replaying the day's experiences and solidifying patterns.
The auto-dream mechanism is particularly valuable for complex, multi-step refactoring tasks where the agent needs to maintain a coherent mental model of the target architecture across hundreds of individual changes.
Speculative Execution and Pipelining
The advanced time-scaling technique Anthropic uses is speculative execution: predicting the user's next command based on context and pre-executing it. If the agent has been running a series of refactors on a specific module, it can anticipate that the next step will be testing or deployment and begin preparing those contexts in parallel.
This requires a careful balance — speculative execution that turns out to be wrong wastes resources and can introduce confusion. But when calibrated correctly, it reduces the perceived latency between user commands from seconds to milliseconds, making the agent feel like it's always one step ahead.
Space Scaling: Cursor's Four Architecture Iterations
When Cursor's engineering team first tried running multiple agents on the same codebase, they expected linear scaling. Ten agents should produce roughly ten times the throughput of one agent. The reality was humbling.
Iteration 1: Shared State File
The first architecture was simple: all agents read from and wrote to a shared state file that tracked the current task list, file ownership, and progress. Agents would acquire locks on files before editing, preventing conflicts.
This failed at 20 agents. The problem wasn't the locking mechanism — it was the coordination overhead. As agent count increased, the probability that any two agents needed the same resource approached certainty. Lock contention caused throughput to collapse from "20 agents = 20x throughput" to "20 agents = 1-3x throughput." The shared state file became a serialization bottleneck.
Iteration 2: Message Queue
The second iteration replaced the shared state file with a message queue (RabbitMQ, in this case). Agents communicated by publishing and subscribing to messages rather than reading shared files. This reduced lock contention but introduced new problems: message ordering guarantees were imperfect, agents sometimes processed stale information, and the queue itself became a single point of failure.
More critically, coordination overhead grew quadratically. With N agents, the number of potential communication paths is N×(N-1)/2. At 100 agents, that's 4,950 potential message flows. The queue couldn't absorb this complexity gracefully.
Iteration 3: Event Sourcing
The third iteration adopted event sourcing: instead of agents communicating directly, they published events to a distributed log (Kafka). Other agents could subscribe to relevant event streams. This provided eventual consistency and better fault tolerance.
But event sourcing added significant complexity without solving the core problem. Debugging became harder — tracing a bug through an event-sourced system meant replaying months of events. The eventual consistency model also made it difficult to guarantee that agents wouldn't make conflicting changes based on stale state.
Iteration 4: Recursive Planner-Worker with Isolated Repo Copies
The breakthrough came from abandoning shared state entirely. The final architecture gives each Worker agent an isolated copy of the repository. Workers don't communicate with each other directly — they only communicate with a central Planner.
Here's how it works:
- Planner receives the high-level task (e.g., "implement user authentication")
- Planner decomposes the task into sub-tasks and assigns each to a Worker with its own repo copy
- Workers execute independently, making commits to their isolated copies
- Planner monitors Worker progress and handles merge conflicts by re-decomposing conflicting areas
- Peak throughput: 1,000 commits per hour with 100+ concurrent Workers
The key insight: coordination overhead grows quadratically with shared state, but linearly with isolated copies. When Workers don't share state, they don't need to coordinate about state — they only need to coordinate about task boundaries, which is a much simpler problem.
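The shape of the final architecture can be sketched in a few dozen lines (all names here are illustrative, and the Planner, which is itself an agent in the real system, is reduced to a stub):

```typescript
// Toy Planner/Worker sketch. Workers never talk to each other; each one
// mutates only its own snapshot of the repo, and only the Planner sees
// all results and decides what needs re-planning.
type Task = { id: string; files: string[] };
type Result = { taskId: string; changedFiles: string[] };

function plan(feature: string): Task[] {
  // Stub Planner: splits a feature into tasks with disjoint file sets.
  return [
    { id: `${feature}:api`, files: ['src/auth/api.ts'] },
    { id: `${feature}:ui`, files: ['src/auth/LoginForm.tsx'] },
  ];
}

function runWorker(task: Task, repoCopy: Map<string, string>): Result {
  // The Worker edits only its isolated copy; no locks, no shared state.
  task.files.forEach(f => repoCopy.set(f, `// edited by ${task.id}`));
  return { taskId: task.id, changedFiles: task.files };
}

function conflicts(results: Result[]): string[] {
  // Planner-side merge check: any file touched by two Workers gets
  // re-decomposed rather than merged automatically.
  const owner = new Map<string, string>();
  const clashes: string[] = [];
  for (const r of results) {
    for (const f of r.changedFiles) {
      if (owner.has(f) && owner.get(f) !== r.taskId) clashes.push(f);
      owner.set(f, r.taskId);
    }
  }
  return clashes;
}

const repo = new Map<string, string>();
const results = plan('auth').map(t => runWorker(t, new Map(repo)));
console.log(conflicts(results)); // disjoint file sets, so no re-planning needed
```

The only coordination surface left is `conflicts`: task boundaries, not shared state.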
Cursor's 1,000 commits/hour figure was measured during a stress test where 150 agents worked simultaneously on a 500,000-line codebase. The architecture has since been refined further, but the core principle remains: isolation is the answer to space scaling.
Interaction Scaling: OpenAI's Symphony and the Ticket-Driven Agent
OpenAI's agent division faced a different scaling problem. They weren't trying to run one agent for hours, or coordinate hundreds of agents on one codebase. They were trying to steer hundreds of agents simultaneously without writing individual prompts for each one.
The solution they built is called Symphony: a constraint-driven harness that uses Linear as the agent job scheduler and Elixir/BEAM as the concurrency backbone.
Why Elixir/BEAM?
BEAM (the Erlang virtual machine that runs Erlang, Elixir, and Gleam) was designed for telecom systems: environments where you need to handle millions of concurrent connections with predictable latency. BEAM's actor model makes it natural to spawn thousands of lightweight processes, each representing an agent, with built-in fault tolerance and message passing.
For Symphony, this means OpenAI can run hundreds of agents in a single BEAM node, with each agent as an independent actor that communicates via message passing. The BEAM handles load balancing, failure recovery, and backpressure automatically.
The Ticket-Driven Workflow
Human engineers on the Symphony team don't write prompts for agents. They write tickets — the same way they'd write tickets for human engineers. The ticket describes a feature, a bug fix, or a refactoring task, with acceptance criteria.
Agents self-assign from a Kanban board (managed in Linear). They pick up tickets, execute the work, and deliver Proof of Work: not a description of what they did, but artifacts that demonstrate completion — passing CI results, complexity analysis reports, walkthrough videos of the changed behavior.
The ticket moves through Kanban states (To Do → In Progress → In Review → Done), and agents update these states just like human engineers would. The human team's job is to review the Proof of Work and decide whether to merge.
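A ticket in this workflow reads like any other engineering ticket. A hypothetical example (the ticket ID, endpoint, and criteria are illustrative, not from Symphony):

```markdown
## AUTH-142: Add password reset flow

**Acceptance criteria**
- `POST /auth/reset` issues a single-use token that expires in 15 minutes
- The reset email passes the existing template snapshot tests
- No new lint errors; CI fully green

**Proof of Work required**
- Passing CI run linked on the PR
- Complexity analysis report
- Walkthrough video of the reset flow in the staging UI
```

Note that nothing in the ticket says how to implement the flow; the acceptance criteria and Proof of Work artifacts are the entire specification.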
The Three-Person Team
The numbers are striking: a three-person human team supported by Symphony merged approximately 1,500 pull requests in five months. That's roughly 3.5 PRs per person per day. The humans spent almost no time writing code — they reviewed agent output, handled edge cases that agents couldn't resolve, and made architectural decisions.
This works because task specification via constraints is more scalable than task specification via instructions. Instead of telling the agent how to implement a feature, you tell it what the feature must do (via acceptance criteria) and what architectural rules it must follow (via linter configuration). The linter errors ARE the repair instructions — when the agent sees a lint error, it knows exactly what to fix.
The key insight: Instructions are ambiguous. Constraints are verifiable. An instruction like "remember to write tests for your code" leaves room for interpretation. A constraint like "no function may exceed 50 lines without a linter error" doesn't.
The One Thing They All Agree On: Constraints Beat Instructions
Anthropic, Cursor, and OpenAI converged on the same fundamental principle from completely different directions: constraints are more effective than instructions for directing agent behavior at scale.
This finding appears in three completely different implementations:
OpenAI: Custom Linters as Architectural Invariants
Symphony enforces architecture through custom linter rules. These aren't generic linters — they're rules specific to the codebase's conventions, written by the engineering team. When an agent violates an architectural invariant, the linter fires and the error message itself serves as the repair guide.
The agent doesn't need to be told "don't introduce circular dependencies." The linter error message tells it exactly which import cycle was created and which file to modify to break it.
Here's a simplified example of a custom ESLint rule that enforces architectural constraints:
```javascript
// custom-lint-rules/no-cross-domain-imports.js
// Enforces domain boundary — agents can't import across domains
module.exports = {
  create(context) {
    const domainPattern = /src\/(auth|billing|catalog)\/.*/;
    return {
      ImportDeclaration(node) {
        const sourceFile = context.getFilename();
        const importPath = node.source.value;
        const sourceDomain = sourceFile.match(domainPattern)?.[1];
        const importDomain = importPath.match(domainPattern)?.[1];
        if (sourceDomain && importDomain && sourceDomain !== importDomain) {
          context.report({
            node,
            message:
              `Cross-domain import: ${sourceDomain} → ${importDomain}. ` +
              `Use the ${importDomain} API interface instead. ` +
              `Fix: replace this import with a call to ${importDomain}/api.ts`
          });
        }
      }
    };
  }
};
```
Notice the error message doesn't just say "don't do this" — it tells the agent exactly how to fix it. This is constraint-driven design: the linter error IS the instruction.
Cursor: No TODOs, No Partial Implementations
Cursor's internal research found that agent productivity improved dramatically when they constrained the output format, not the process. Specifically, a rule that agents could not leave TODOs in code and could not submit partial implementations (code that compiles but doesn't fully implement the described feature) produced better outcomes than explicit instructions like "remember to finish your implementations."
The constraint is verifiable: either the code has no TODOs, or it has a linter error. Either the implementation is complete, or it fails the acceptance tests. This is inherently more reliable than asking agents to remember procedural instructions.
Anthropic: Constrain Deliverables, Not Paths
Anthropic's approach to time scaling explicitly constrains what the Evaluator checks, not what the Generator does. The Evaluator doesn't know the implementation path — it only knows the expected behavior. This means the Generator has freedom to choose how to solve a problem, constrained only by the success criteria at the end.
This mirrors the Unix philosophy: small tools, constrained inputs, predictable outputs. The constraint is the interface.
Why Constraints Work
Constraints are verifiable. Instructions are ambiguous. A constraint can be automatically checked — a linter can verify it, a test can verify it, a human can verify it in seconds. An instruction like "write clean code" cannot be automatically verified. It requires judgment, and agents — like humans — have inconsistent judgment under load.
The other reason constraints work is that they compose. When you add a new constraint to a system, you know exactly what it prohibits. When you add a new instruction, you're expanding the space of possible behaviors in unpredictable directions.
When Models Get Stronger, Harnesses Get Leaner — Or Do They?
The conventional wisdom in 2024 was that as models improved, harness complexity would decrease. Stronger models would need less scaffolding. They would handle edge cases, maintain coherence, and self-correct without explicit feedback loops.
Anthropic's experience with Opus 4.6 suggested otherwise — but in a revealing way.
Some sprint structures became unnecessary with the stronger model. Agents could maintain coherence over longer periods without the Evaluator checkpointing. The auto-dream mechanism ran more efficiently. Certain classes of errors simply disappeared.
But here's what didn't change: the 300-500 step problem. When an agent task requires hundreds of sequential steps — real refactoring projects, large feature implementations, system migrations — errors compound regardless of model strength. Opus 4.6 might make fewer individual errors, but when errors do occur, they're just as damaging to the final outcome.
The Amplification Problem
Research from Google and MIT on multi-agent systems revealed a troubling pattern: independent agents amplify errors by 17.2x, while centralized coordinators reduce error amplification to 4.4x. The numbers aren't about model quality — they're about architecture. When agents operate independently without a coordinating constraint layer, their individual errors compound multiplicatively.
This is why Cursor's Planner-Worker architecture includes the Planner as a central coordinator, not just a task splitter. The Planner doesn't just hand out work — it monitors for cascading errors and re-decomposes tasks when Workers produce conflicting outputs.
The XSD Validation Approach
Some advanced harness designs are borrowing from XML's validation model: formal verification between agent layers. Before an agent's output moves to the next stage, a formal validator checks that the output conforms to a defined schema. This isn't linting — it's a stricter contract.
For AI coding agents, this might mean: before a Generator's code changes are evaluated by an Evaluator, a formal model checker verifies that the changes preserve the architectural invariants. The constraint is explicit, machine-verifiable, and independent of the Generator's self-assessment.
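A hand-rolled sketch of such a layer contract (the `ChangeSet` shape and the repo-relative-path invariant are illustrative; a production harness would use JSON Schema, XSD, or a typed IDL):

```typescript
// A change set must pass this gate before the Evaluator ever sees it.
// The contract is explicit and machine-checkable, independent of anything
// the Generator claims about its own output.
type ChangeSet = { ticketId: string; files: { path: string; diff: string }[] };

function validateChangeSet(x: unknown): x is ChangeSet {
  const c = x as ChangeSet;
  return (
    typeof c === 'object' && c !== null &&
    typeof c.ticketId === 'string' &&
    Array.isArray(c.files) &&
    c.files.every(
      f =>
        typeof f.path === 'string' &&
        typeof f.diff === 'string' &&
        !f.path.startsWith('/') // invariant: repo-relative paths only
    )
  );
}

console.log(validateChangeSet({ ticketId: 'T-1', files: [{ path: 'src/a.ts', diff: '+x' }] })); // true
console.log(validateChangeSet({ ticketId: 'T-1', files: [{ path: '/etc/passwd', diff: '' }] })); // false
```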
The answer to "do stronger models reduce harness complexity?" is: it depends on what kind of complexity. Model improvements reduce the complexity of keeping a single agent on track. They don't reduce the complexity of coordinating multiple agents or verifying correctness at scale.
Token Economics: The Hidden Driver of Harness Design
Here's the number that changes every harness engineering decision: Jensen Huang's $250,000 per year token budget per engineer.
This isn't a hypothetical. Jensen Huang, in internal Nvidia discussions about AI coding adoption, set the benchmark: if an engineer's AI coding tools cost more than 50% of their salary ($250K/year for a $500K engineer), the ROI becomes questionable. Every harness design decision is ultimately a token economics decision.
The Power User Numbers
Power users of AI coding agents — engineers who use Claude Code, Cursor, or Copilot as their primary development environment — are consuming 200-300 million tokens per day. That's not a typo.
At current token prices (approximately $2.50-$3.00 per million tokens for Claude Sonnet), that's $500-$900 per day per power user. Annualized, that's $180,000-$325,000 per power user — already exceeding the $250K threshold.
But the trajectory is what matters. Three years ago, the same token volume would have cost approximately $30 per million tokens. The price dropped 92% in three years. If this trend continues, the same power user consuming 300M tokens/day would cost $54,000/year at 2029 prices — well within budget.
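The arithmetic behind those figures, using the token volumes and prices stated above as assumptions:

```typescript
// Back-of-envelope power-user cost (assumed volumes and prices from the text).
const dailyLow = (200e6 / 1e6) * 2.5;   // 200M tokens/day at $2.50 per million
const dailyHigh = (300e6 / 1e6) * 3.0;  // 300M tokens/day at $3.00 per million

console.log(dailyLow, dailyHigh);             // 500 900 (dollars per day)
console.log(dailyLow * 365, dailyHigh * 365); // 182500 328500 (dollars per year)
```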
Prompt Caching: The 90% Discount
Anthropic's prompt caching bills cached input tokens at roughly a tenth of the base price, an effective 90% discount on cache hits. For repeated or long-running tasks dominated by reused context, this changes the economics dramatically: a task that would cost $100 without caching can come down to something closer to $10 with it.
This creates a direct incentive for harness design that maximizes cache hits: structure tasks so that common components (tool definitions, system prompts, shared context) are cached and reused. The harness that optimizes for cache locality directly reduces token costs.
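One concrete way to engineer for cache locality, sketched against the shape of the Anthropic Messages API (the model name and placeholder text are illustrative): put the large, stable material first and mark it cacheable, so only the small task-specific tail is billed at full price on each call.

```json
{
  "model": "claude-sonnet-4-5",
  "system": [
    {
      "type": "text",
      "text": "LARGE STABLE BLOCK: tool definitions, AGENTS.md rules, codebase conventions",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "Small task-specific request goes last, after the cached prefix" }
  ]
}
```

Caching matches on exact prefixes, which is why ordering matters: anything that changes per request must come after everything that doesn't.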
Harness Design IS Cost Design
Every unnecessary context reload, every redundant tool call, every prompt that includes information the model doesn't need — these aren't just efficiency problems. They're direct costs. A harness that causes a model to re-read 50,000 tokens of codebase context on every tool call is not just slow — it's expensive.
The organizations winning at harness engineering in 2026 are the ones treating token economics as a first-class design constraint, not an afterthought. The harness that wastes 20% more tokens than necessary might be $50K/year per engineer in wasted costs.
The Global Picture: Harness Practices Across Ecosystems
Harness engineering isn't just a Western phenomenon. The Chinese AI Builder community — a fast-growing ecosystem of developers building AI-native applications — has developed distinct approaches to the same scaling problems.
Feishu/DingTalk: CLI-First Agent Interfaces
Major Chinese productivity platforms are launching CLI-first agent interfaces. Feishu (Lark) and DingTalk are both releasing command-line tools that integrate with their existing collaboration suites; notably, MCP (Model Context Protocol) is not the primary integration path.
This diverges from the Western trend, where MCP has become the de facto standard for agent-tool integration. Chinese developers appear to prefer direct CLI integration over protocol-based tool definitions, possibly because CLI tools are more familiar in a development context. (We explored this CLI-first trend in detail in our analysis of why the command line is winning the AI Agent interface war.)
The practical difference: with CLI-first integration, agents invoke tools via shell commands rather than structured protocol messages. This is less expressive but simpler to implement and debug.
The 75,000-Line Skill Document
The Chinese AI Builder community has produced skill documents of remarkable depth — some exceeding 75,000 lines of detailed procedural knowledge. These aren't prompt templates. They're comprehensive guides to integrating AI coding agents into existing engineering workflows, covering everything from repository structure to code review processes.
The community has also produced 1.6 million PRDs (Product Requirement Documents) in their shared knowledge bases, creating a corpus of human-AI collaboration patterns that dwarfs anything in the English-speaking world.
WeChat as Agent Channel
Perhaps the most striking global divergence: WeChat as an agent control channel. With 1.3 billion users, WeChat has become an unexpected platform for agent interaction. Users can message an agent account, delegate coding tasks, and receive progress updates — all within a familiar interface.
This is interaction scaling taken to its logical extreme: instead of building a custom UI for agent control, use an existing communication platform. The agent's "interaction surface" is a WeChat contact.
The CLI vs MCP Divergence
This global divergence deserves attention. The Western ecosystem is converging on MCP (Model Context Protocol) as the standard for agent-tool integration. The Chinese ecosystem is building CLI-first integrations that sidestep protocol standardization.
This is reminiscent of the early internet's divergence on web standards — different ecosystems solving similar problems with different tools, eventually leading to interoperability challenges. The Klarna case offers another lens on this: when Klarna restructured 1,200 SaaS tools into an AI-consumable three-layer architecture, the bottleneck wasn't which protocol to use — it was whether knowledge could be structured for agent consumption at all. (See our complete timeline of Klarna's AI restructuring for the full story.)
The SQL History Repetition: when SQL consolidated as the dominant query standard in the 1980s, ecosystems that had invested in alternative query paradigms spent decades paying for the fragmentation in interoperability problems. MCP vs CLI may be the 2026 version of the same divergence.
Building Your First Harness: A Practical Checklist
Enough theory. Here's how to apply harness engineering principles to your own AI coding setup.
Step 1: Define Success Criteria (Not Roles)
Before you write any prompts, define what success looks like. Not "you are a senior software engineer" — that's a role, not a success criterion. Success criteria are measurable outcomes.
Examples of good success criteria:

- "All functions have type annotations"
- "No TODO comments in submitted code"
- "Cyclomatic complexity below 15 for all new functions"
- "All public APIs have docstrings"
These are verifiable. A linter can check them automatically. You can enforce them without subjective judgment.
Step 2: Choose Your Scaling Dimension
Identify which dimension matters most for your use case:
- Time Scaling: You need this if your agents run for more than 30 minutes continuously
- Space Scaling: You need this if you're running more than 3 agents simultaneously
- Interaction Scaling: You need this if you're trying to steer more than 10 agents without individual prompts
Most use cases need all three eventually, but start with the dimension that matches your current bottleneck.
Step 3: Design Constraints Before Writing Prompts
Write your linter rules, test suites, and acceptance criteria first. These are your constraints. They tell the agent what outcomes to achieve without specifying how.
Only after defining constraints should you write prompts — and prompts should be minimal, focused on explaining the constraint framework, not on instructing the agent step-by-step.
Step 4: Build Verification Loops
Every constraint needs a verification mechanism:
| Constraint Type | Verification Mechanism |
|---|---|
| Code style | Linter (ESLint, Ruff, golangci-lint) |
| UI behavior | Playwright/Cypress tests |
| Logic correctness | Unit tests + property-based tests |
| Performance | Benchmarking suite |
| Architecture | Custom linter rules or formal validators |
Without verification, you have no feedback loop. Without feedback, agents drift.
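A verification loop can be as simple as running each constraint's checker and collecting failures to feed back into the agent's next turn. A minimal sketch (the check commands are illustrative assumptions; swap in whatever the table above prescribes for your stack):

```python
import subprocess

# Map each constraint to the command that verifies it (illustrative).
CHECKS = {
    "code style": ["ruff", "check", "."],
    "logic correctness": ["pytest", "-q"],
}

def run_verification(checks: dict[str, list[str]] = CHECKS) -> dict[str, str]:
    """Run every check; return {failed check name: captured output}.

    An empty dict means all constraints are satisfied. Non-empty output
    is exactly what you feed back to the agent as its next observation.
    """
    failures: dict[str, str] = {}
    for name, cmd in checks.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures[name] = result.stdout + result.stderr
    return failures
```

The key design choice is that the agent receives the checker's raw output, not a human summary — the feedback loop works without anyone in the middle.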
Step 5: Measure Token Economics From Day One
Track your token consumption per task from the beginning. This will reveal:
- Which prompts waste context
- Where caching opportunities exist
- Whether your harness is cost-effective
A simple logging mechanism that records token usage per agent session costs almost nothing to implement and pays dividends in optimization opportunities.
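That logging mechanism can be a single append-only JSONL file. A sketch of one (field names like `cached_tokens` are assumptions — map them to whatever your model provider's usage object reports):

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("token_usage.jsonl")  # one JSON object per agent call

def log_usage(session_id: str, prompt_tokens: int, completion_tokens: int,
              cached_tokens: int = 0, path: Path = LOG_PATH) -> dict:
    """Append one usage record per model call; returns the record."""
    record = {
        "ts": time.time(),
        "session": session_id,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cached_tokens": cached_tokens,  # high values reveal caching wins
        "total": prompt_tokens + completion_tokens,
    }
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Aggregating this file by session answers all three questions above: wasteful prompts show up as outlier `prompt_tokens`, caching opportunities as low `cached_tokens` on repeated prefixes, and cost-effectiveness as total tokens per shipped task.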
FAQ: Harness Engineering
Q: Is harness engineering just a rebrand of prompt engineering?
No. Prompt engineering focuses on crafting the text that goes into the model. Harness engineering encompasses everything surrounding the model: tool definitions, state management, feedback mechanisms, coordination protocols, and cost optimization. Prompt engineering is one component of harness engineering.
Q: Do I need a harness if I'm just using ChatGPT?
For casual use, probably not. The scaling problems that harnesses solve (time drift, space coordination, interaction management) don't manifest at the casual use level. Harnesses become necessary when you're running agents for extended periods, coordinating multiple agents, or trying to steer a large agent fleet with system-level constraints.
Q: Which scaling dimension should I focus on first?
Start with the dimension that matches your current bottleneck. If your agents drift after 30 minutes, focus on Time Scaling (add Evaluator checkpoints). If you're struggling to coordinate 5+ simultaneous agents, focus on Space Scaling (introduce isolation). If you find yourself writing individual prompts for each agent, focus on Interaction Scaling (switch to constraint-driven specifications).
Q: How does harness engineering relate to MCP and agent protocols?
MCP (Model Context Protocol) is a specific protocol for defining how agents communicate with tools. It's a component of harness engineering, not a replacement for it. A harness that uses MCP still needs constraints, verification loops, and token economics optimization. MCP makes tool integration easier, but it doesn't solve the fundamental harness design problems.
Q: Can I use harness engineering with open-source models?
Yes. The principles are model-agnostic. Constraints, verification loops, and coordination protocols work regardless of which model is running underneath. The specific implementation details may vary (some models have different context windows or tool-calling capabilities), but the harness framework applies universally.
Q: What's the difference between a harness and an agent framework?
An agent framework (LangChain, AutoGen, CrewAI) provides infrastructure for building agents: orchestration, memory management, tool integration. A harness is the specific configuration of that infrastructure for a particular task. Think of the framework as the car chassis, and the harness as the steering, brakes, and dashboard that make the car useful for a specific purpose.
The Discipline, Not the Detail
Harness engineering is still a young discipline. The terminology isn't standardized, the best practices are still being discovered, and the tooling is fragmented. But the core insight is sound: the bottleneck in AI-assisted development isn't the model — it's the environment.
Three companies reached this conclusion independently. Anthropic discovered it through time scaling challenges. Cursor discovered it through space scaling challenges. OpenAI discovered it through interaction scaling challenges. The convergence on the three-dimension framework emerged from empirical firefighting, not theoretical design.
The organizations that will win in AI coding aren't the ones with the best models. They're the ones with the best harnesses: the clearest constraints, the fastest feedback loops, the most efficient token economics, and the most scalable coordination architectures.
The model gets the glory. The harness does the work.
References
- Martin Fowler on Agent Architecture — martinfowler.com
- Anthropic Claude Code Documentation — anthropic.com
- Cursor Engineering Blog — cursor.com/blog
- OpenAI Symphony Architecture — openai.com
- Google/MIT Multi-Agent Error Amplification Research — arxiv.org
- Jensen Huang AI ROI Framework — Nvidia internal benchmarks
- Token Price Trajectory Data — Anthropic pricing page, OpenAI pricing page
- Chinese AI Builder Community — Feishu developer documentation, DingTalk Open Platform
- MCP (Model Context Protocol) Specification — modelcontextprotocol.io
- Cursor 1000 Commits/Hour Architecture — cursor.com/engineering