Harness Engineering — The Real Bottleneck Isn't the Model

If 2025 was the year of the agent, 2026 is the year of the harness. The hottest idea in AI agent development right now: the reliability bottleneck of AI agents isn't the model — it's the system around the model. Harness engineering is the discipline of designing environments, constraints, and feedback loops that make agents reliably useful. The metaphor: the model is the engine, but without a steering wheel and brakes, you can't reach the destination.

*Sources: Anthropic: Effective Harnesses for Long-Running Agents · Anthropic: Harness Design for Long-Running Apps · HumanLayer: Skill Issue — Harness Engineering · learn-claude-code (30K+ stars) · NxCode: Complete Guide · Learn Harness Engineering (walkinglabs course) · Weibo highlight by 蚂工厂 · Akshay Pachaar — The Anatomy of an Agent Harness (X, visual primer)*

What Is LangChain?

LangChain is one of the most popular open-source frameworks for building applications with LLMs. Think of it as a toolkit that connects language models to real-world tools — databases, APIs, file systems, code interpreters. It provides the plumbing: chains (sequential steps), agents (autonomous decision-makers), memory (conversation state), and tools (external capabilities). If Claude is the brain, LangChain is the skeleton and nervous system that lets it actually do things.

Why it matters here: LangChain builds and maintains their own coding agent — an AI that writes, runs, and debugs code autonomously. They benchmark it on Terminal Bench, a standardized test suite for coding agents. Their result became the poster child for harness engineering.

The Counterintuitive Evidence

LangChain's Coding Agent on Terminal Bench — same model, only harness optimizations — went from Top 30+ to Top 5:

| Harness Optimization | Impact |
| --- | --- |
| System prompt optimization | 85% of performance gain |
| Tool configuration optimization | 90% |
| Middleware hooks | 95% |

They didn't upgrade the model. They didn't fine-tune anything. They optimized three things around the model: (1) how they instructed it via system prompts, (2) which tools they exposed and how, and (3) automated middleware that caught errors before they compounded. The model was identical — only the harness changed.

This challenges a deeply-held assumption in AI development: that better results require better models. Often, they just require better harnesses.

Agent = Model + Harness

┌─────────────────────┐     ┌──────────────────────┐
│   Model (Engine)    │     │ Harness (Steering +  │
│                     │     │       Brakes)        │
│ • Powerful          │     │                      │
│   intelligence      │     │ • System prompts     │
│ • Fast reasoning    │     │ • Tool constraints   │
│ • Doesn't know      │     │ • Verification loops │
│   where to go       │     │                      │
└─────────┬───────────┘     └──────────┬───────────┘
          │                            │
          └──────────┐  ┌──────────────┘
                     ▼  ▼
                ┌────────────┐
                │   Agent    │
                └────────────┘

"The best engine, without steering and brakes,
 can't get anywhere useful."
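
The engine/steering split above can be sketched as a minimal agent loop, where the harness contributes the tool allowlist (steering) and the step budget (brakes). Everything here (`fake_model`, the tool names) is an illustrative stand-in, not any framework's real API:

```python
# Minimal sketch of "agent = model + harness": the harness wraps every
# model step with validation and feeds tool results back into context.
ALLOWED_TOOLS = {"read_file", "run_tests"}  # harness-level constraint

def fake_model(context):
    # A real model would choose the next action from context;
    # this stub finishes after one successful tool call.
    if "tool_result" in context[-1]:
        return {"type": "finish", "answer": "done"}
    return {"type": "tool_call", "tool": "run_tests", "args": {}}

def run_tool(name, args):
    return {"tool_result": f"{name} ok"}

def agent_loop(model, task, max_steps=10):
    context = [{"task": task}]
    for _ in range(max_steps):                     # brake: hard step budget
        action = model(context)
        if action["type"] == "finish":
            return action["answer"]
        if action["tool"] not in ALLOWED_TOOLS:    # steering: tool allowlist
            context.append({"tool_result": "error: tool not allowed"})
            continue
        context.append(run_tool(action["tool"], action["args"]))
    return None  # budget exhausted

print(agent_loop(fake_model, "fix the failing test"))  # → done
```

The model never executes anything directly: every action passes through harness code that can refuse, rewrite, or budget it.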

Context Engineering vs Harness Engineering

These are related but distinct disciplines:

| Dimension | Context Engineering | Harness Engineering |
| --- | --- | --- |
| Focus | What to show the agent | How to constrain & verify the agent |
| Concern | Context window management | Prevention / measurement / repair |
| Methods | Information filtering & timing | Architectural constraints + verification loops |
| Scope | Single conversation | Across all sessions |
| Goal | Right info at right time | Reliability at scale |
| Metaphor | Choosing what the horse sees | Building the reins, saddle, and fences |

Context engineering is a subset of harness engineering — it's one of six pillars.

The Six Pillars

Pillar 1: Context Architecture — Less Is More

Agent performance vs. context utilization follows an inverted U-curve:

Agent
Performance
    │
    │        ╭──╮
    │      ╭╯    ╰╮
    │    ╭╯        ╰╮
    │  ╭╯            ╰╮
    │╭╯                ╰╮
    │╯                    ╰───
    └─────────────────────────
    10%  30%  40%  60%  80%
         Context Utilization

    Sweet spot: ~30-50%
    After 60%: sharp decline

The key: don't give the agent an encyclopedia. Load context progressively — only what's needed for the current task. Anthropic's Skills system embodies this: domain knowledge loads on demand, not upfront.
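
A minimal sketch of on-demand loading in the spirit of Skills, with made-up skill names and a character budget standing in for the context window:

```python
# Sketch of progressive context loading: skills live as plain text and are
# pulled in only when the current task mentions them, keeping utilization
# in the sweet spot instead of front-loading an encyclopedia.
SKILLS = {
    "pdf": "How to fill PDF forms ...",
    "sql": "How to write safe migrations ...",
    "css": "How to build responsive layouts ...",
}

def build_context(task, budget_chars=2000):
    context = [f"TASK: {task}"]
    used = len(context[0])
    for trigger, knowledge in SKILLS.items():
        if trigger in task.lower():                # load on demand, not upfront
            if used + len(knowledge) > budget_chars:
                break                              # stay under the budget
            context.append(knowledge)
            used += len(knowledge)
    return context

print(len(build_context("Generate a sql migration for the users table")))  # → 2
print(len(build_context("Write a poem")))                                  # → 1
```

A real system would match on embeddings or skill frontmatter rather than keywords, but the shape is the same: context is earned by relevance, not granted by default.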

Pillar 2: Architectural Constraints — Code Rules > Prompt Suggestions

| Prompt Rules (Suggestions) | Coded Constraints (Enforcement) |
| --- | --- |
| Model may or may not follow | Linter runs automatically |
| Must repeat every session | Enforced across all sessions |
| Vercel found: too many tools → confusion | Remove 80% of tools → faster, more reliable |

Counterintuitive insight: constraining the solution space increases output quality. Leading teams use deterministic tools (linters, hooks, type-checkers) for mechanical enforcement, not prompt suggestions the model can ignore.
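
A sketch of the difference: the "linter" below is a toy stand-in, but the point is structural — the check runs in code on every write, so the model cannot ignore it the way it can ignore a prompt rule:

```python
# Sketch of a coded constraint: a PostToolUse-style hook that runs a
# deterministic check on every edit and rejects the write, instead of
# trusting the model to remember a prompt instruction.
def lint(code: str) -> list[str]:
    errors = []
    if "eval(" in code:
        errors.append("banned call: eval")
    if len(max(code.splitlines() or [""], key=len)) > 100:
        errors.append("line too long")
    return errors

def write_file_with_hook(path, code, files):
    errors = lint(code)                 # enforcement, not suggestion
    if errors:
        return {"ok": False, "errors": errors}   # bounced back to the agent
    files[path] = code
    return {"ok": True, "errors": []}

files = {}
print(write_file_with_hook("app.py", "x = eval(user_input)", files)["ok"])  # → False
print(write_file_with_hook("app.py", "x = 1\n", files)["ok"])               # → True
```

Swap the toy `lint` for a real linter, type-checker, or test runner and the pattern is exactly the mechanical enforcement described above.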

Pillar 3: Reasoning Phases — Match Intelligence to Stage

Not every phase needs maximum reasoning. The optimal pattern:

| Phase | Reasoning Level | Why |
| --- | --- | --- |
| 1. Planning | Maximum | Architecture decisions have cascading consequences |
| 2. Execution | High (not max) | Implementation follows the plan; save tokens |
| 3. Verification | Maximum | Catching errors requires the same rigor as planning |
| 4. Delivery | Standard | Packaging and cleanup |

Common failure: agents get stuck in death loops — editing the same file repeatedly without solving the problem. The fix: a middleware hook intercepts the agent before exit, pulling verification back to maximum reasoning to catch errors before delivery.
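
A sketch of the pre-exit hook idea, with illustrative phase/effort names and a trivial verifier standing in for a real verification pass:

```python
# Sketch of phase-matched reasoning plus a pre-exit verification hook:
# the agent cannot declare "done" until a verification pass at maximum
# reasoning has succeeded.
REASONING = {"planning": "max", "execution": "high",
             "verification": "max", "delivery": "standard"}

class Harness:
    def __init__(self, verifier):
        self.verifier = verifier
        self.verified = False

    def request_exit(self, artifact):
        # Middleware intercepts the exit and forces verification first.
        if not self.verified:
            ok = self.verifier(artifact, effort=REASONING["verification"])
            self.verified = ok
            return "exit" if ok else "back_to_execution"
        return "exit"

harness = Harness(verifier=lambda artifact, effort: "tests pass" in artifact)
print(harness.request_exit("draft code"))               # → back_to_execution
print(harness.request_exit("final code, tests pass"))   # → exit
```

The verifier here is a string check; in practice it would be a test run or a second model call at maximum reasoning.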

Pillar 4: Subagents as Context Firewalls

Don't think of subagents as "helpers" — think of them as context firewalls:

          ┌──────────────┐
          │ Parent Agent │
          └──────┬───────┘
          ┌──────┴──────┐
     ┌────┴────┐   ┌────┴─────┐
     │ Child A │   │ Child B  │
     └────┬────┘   └────┬─────┘
 ┌────────┴───┐   ┌─────┴────────┐
 │ Tool Calls │   │ Intermediate │
 │ (isolated) │   │ Products     │
 └────────────┘   │ (isolated)   │
                  └──────────────┘

    Only FINAL RESULTS flow up to parent.
    Intermediate noise stays contained.

This prevents context pollution: tool calls, debugging traces, and intermediate outputs from child agents never enter the parent's context window.
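
The firewall pattern can be sketched in a few lines — the child's tool calls and intermediate results live in a context the parent never reads, and only a bounded summary crosses the boundary:

```python
# Sketch of a subagent as a context firewall: the child runs with its own
# context, and only a final, size-capped summary flows up to the parent.
def run_subagent(task, steps):
    child_context = [task]               # isolated; parent never sees this
    for step in steps:
        child_context.append(f"tool call: {step}")
        child_context.append(f"result of {step}")
    final = f"completed {task} in {len(steps)} steps"
    return final[:200]                   # only the final result flows up

parent_context = ["build the report"]
summary = run_subagent("fetch sales data",
                       ["query_db", "clean_rows", "aggregate"])
parent_context.append(summary)

print(len(parent_context))  # → 2: the six intermediate child entries stayed contained
```

The cap (`[:200]`) is illustrative; the point is that the parent's context grows by one summary per delegation, not by the child's entire trace.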

Pillar 5: Entropy Governance — Self-Maintaining Loops

The longer an agent runs, the more its environment drifts from its assumptions. Entropy governance means documents that serve the agent should be maintained by the agent:

 ┌──────────────┐      ┌───────────────┐
 │ Outdated     │      │ Architecture  │
 │ Docs         │      │ Drift         │
 └──────┬───────┘      └───────┬───────┘
        │                      │
        ▼                      ▼
 ┌─────────────────────────────────────┐
 │       Document Curation Agent       │
 │  (reads Knowledge Base + Codebase)  │
 └──────────┬───────────────┬──────────┘
            │               │
            ▼               ▼
    Continuous Scanning   Auto Repair
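
One way to sketch the continuous-scanning piece: compare document timestamps against the code they describe and flag anything that has drifted. File names, timestamps, and the lag threshold are all illustrative:

```python
# Sketch of an entropy-governance scan: flag docs older than the code they
# cover so a curation agent can refresh them before drift compounds.
DOC_COVERS = {"ARCHITECTURE.md": ["api.py", "db.py"],
              "CLI.md": ["cli.py"]}

def find_stale_docs(doc_mtimes, code_mtimes, max_lag=7):
    """Return docs whose covered code changed more than max_lag days after them."""
    stale = []
    for doc, covers in DOC_COVERS.items():
        newest_code = max(code_mtimes[f] for f in covers)
        if newest_code - doc_mtimes[doc] > max_lag:
            stale.append(doc)
    return stale

doc_mtimes = {"ARCHITECTURE.md": 100, "CLI.md": 130}        # days, illustrative
code_mtimes = {"api.py": 120, "db.py": 95, "cli.py": 132}

print(find_stale_docs(doc_mtimes, code_mtimes))  # → ['ARCHITECTURE.md']
```

Run on a schedule and fed to a curation agent, this closes the loop: documents that serve the agent get maintained by the agent.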

Pillar 6: Modular Middleware — Removable by Design

The best harness architecture is a middleware stack — modular layers that can be added or removed:

| Layer | Function | Removable? |
| --- | --- | --- |
| 1. Agent Core | Core reasoning | No |
| 2. Linter Middleware | Code quality enforcement | When model learns conventions |
| 3. Verification Middleware | Test execution, output validation | When model becomes self-verifying |
| 4. Edit Tracking Middleware | Track all file modifications | When model tracks reliably |
| 5. Safety Guards | Prevent destructive operations | Possibly never |

LangChain's middleware architecture is the best current reference. Each layer is designed to be removable as models evolve — today's harness compensates for today's model weaknesses, but tomorrow's model may not need the same guardrails.
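
A sketch of the removable-stack idea (not LangChain's actual middleware API): each layer wraps the core, and any layer can be disabled by name once the model no longer needs it:

```python
# Sketch of a removable middleware stack: layers wrap the agent core and
# can be dropped by name when a stronger model makes them redundant.
def core(task):
    return f"result for {task}"

def with_linter(next_fn):
    return lambda task: "linted " + next_fn(task)

def with_verification(next_fn):
    return lambda task: "verified " + next_fn(task)

STACK = [("linter", with_linter), ("verification", with_verification)]

def build_agent(stack, disabled=()):
    fn = core
    for name, layer in reversed(stack):
        if name not in disabled:         # removable by design
            fn = layer(fn)
    return fn

print(build_agent(STACK)("fix bug"))                       # → linted verified result for fix bug
print(build_agent(STACK, disabled={"linter"})("fix bug"))  # → verified result for fix bug
```

Because each layer is a plain wrapper, "remove the linter after the model learns conventions" is a one-line config change, not a rewrite.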

Practical Principle

Only invest in harness for errors the agent has actually made.

Don't pre-engineer for hypothetical failures. Watch your agent work, identify real failure patterns, then build harness components that prevent those specific failures from recurring. Every harness addition should trace back to an observed error.
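
The principle reduces to a simple rule of thumb: mine your failure logs and build guards only for patterns that actually recur. A sketch with made-up failure categories:

```python
# Sketch of error-driven harness investment: count observed failure kinds
# and propose guards only for those that recur, per the principle above.
from collections import Counter

observed_failures = ["deleted_untracked_file", "infinite_edit_loop",
                     "infinite_edit_loop", "infinite_edit_loop"]

def guards_to_build(failures, min_count=2):
    counts = Counter(failures)
    return [kind for kind, n in counts.items() if n >= min_count]

print(guards_to_build(observed_failures))  # → ['infinite_edit_loop']
```

A one-off failure gets a note; a recurring one gets a middleware layer.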

Learn It Hands-On: learn-claude-code

learn-claude-code (30K+ stars) is the best hands-on tutorial for harness engineering. It teaches all six pillars through 12 progressive lessons, each adding exactly one mechanism:

| Step | Mechanism | Pillar It Teaches |
| --- | --- | --- |
| s01 | Agent Loop | Foundation |
| s02 | Tool Use | Architectural Constraints |
| s03 | TodoWrite (Planning) | Reasoning Phases |
| s04 | Subagents | Context Firewalls |
| s05 | Skills (on-demand knowledge) | Context Architecture |
| s06 | Context Compact | Context Architecture |
| s07 | File-based Tasks | Entropy Governance |
| s08 | Background Tasks | Modular Middleware |
| s09 | Agent Teams (JSONL mailboxes) | Modular Middleware |
| s10 | Team Protocols (FSM) | Modular Middleware |
| s11 | Autonomous Agents | Entropy Governance |
| s12 | Worktree Isolation | Context Firewalls |

How to start:

  • Interactive web (no login): learn.shareai.run โ€” timelines, visualizations, step-by-step code walkthrough
  • Local: git clone github.com/shareAI-lab/learn-claude-code, add your Anthropic API key to .env, run python agents/s01_agent_loop.py

The core design principle: each step only adds one new capability, and the core loop never changes. By step 12, you've built a full multi-agent system with isolation, persistence, and self-healing — and you understand every layer because you added them one at a time.

The Three-Agent Harness: Planner → Generator → Evaluator

Anthropic's engineering blog describes a full three-agent scaffold that ran autonomously for hours, producing production-quality frontend apps. The core insight, as analyzed by Weibo commentator 爱可可-爱生活: Anthropic engineers borrowed from GANs (Generative Adversarial Networks) — one Claude generates, another Claude critiques. Don't let the same model be both athlete and referee.

*Additional sources: 欧巴聊AI on Weibo (2026-03) · 爱可可-爱生活 on Weibo (2026-03-29) · Anthropic: Harness Design for Long-Running Apps*

Why Multi-Claude? Two Stubborn Problems

Claude has two failure modes during long coding sessions:

  1. Context anxiety — As the context window fills up, Claude starts cutting corners, rushing to finish and delivering half-baked results
  2. Self-review blindness — Claude can't objectively evaluate its own code. Let it self-review and it always finds "looks good," even when the code is broken

The GAN-inspired fix: separate the roles structurally, not with prompts.

┌──────────────┐     ┌──────────────┐            ┌──────────────┐
│   Planner    │────▶│  Generator   │───────────▶│  Evaluator   │
│              │     │              │◀───────────│              │
│ One sentence │     │ Writes code  │  feedback  │ Playwright   │
│ → full PRD   │     │ in chunks    │            │ MCP: clicks, │
│ with specs   │     │ with review  │            │ drags, tests │
└──────────────┘     │ checkpoints  │            │ like a human │
                     └──────────────┘            └──────────────┘
| Agent | Role | Key Detail |
| --- | --- | --- |
| Planner | Scope expander | From a one-sentence prompt, generates a detailed PRD covering animation systems, sound effects, AI-assisted design docs — prevents "forgot to build X" downstream |
| Generator | Code producer | Writes code in chunks. Before continuing, submits each chunk to the evaluator — like signing a detailed contract with a reviewer before proceeding |
| Evaluator | Quality gate | Equipped with Playwright MCP — can click buttons, drag elements, test UI in a real browser. Not just reading code, but using the output like a real person |

The evaluator scores across four dimensions: design quality, originality, craftsmanship, and functionality. This quantitative evaluation breaks through AI's tendency toward mediocre, "flat" design.
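
The loop can be sketched as follows; the evaluate/generate stubs stand in for the two Claudes (the real evaluator drives a browser via Playwright MCP), and the improving scores are faked to show the control flow:

```python
# Sketch of the generator/evaluator loop with four-dimension scoring:
# the generator revises until every dimension clears a threshold.
DIMENSIONS = ["design", "originality", "craftsmanship", "functionality"]

def evaluate(artifact):
    # Stand-in for a second Claude testing the output in a real browser;
    # fake scores improve with each revision to show the control flow.
    base = artifact.count("rev")
    return {d: min(10, 5 + base) for d in DIMENSIONS}

def generate(artifact, feedback):
    return artifact + " rev"          # stand-in for a code-writing Claude

def harness_loop(task, threshold=8, max_rounds=10):
    artifact = task
    for round_no in range(1, max_rounds + 1):
        scores = evaluate(artifact)
        if min(scores.values()) >= threshold:     # quality gate
            return artifact, round_no
        artifact = generate(artifact, scores)     # critique flows back
    return artifact, max_rounds

artifact, rounds = harness_loop("museum website")
print(rounds)  # → 4: fake scores of 5, 6, 7 fail the gate; 8 passes
```

The structural separation matters: the generator never gets to grade itself, so "looks good" has to be earned from a different process.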

Real Example: Digital Audio Workstation

Using a 3-agent scaffold with this pattern, an AI team built a complete browser-based DAW (Digital Audio Workstation) in ~4 hours for $124. The final product could compose, edit, and mix audio, and even use AI to help write melodies and rhythms. A single agent couldn't do this — but with the generator/evaluator loop plus a coordinator, the system produced production-quality output.

In earlier frontend experiments, the evaluator wrote four evaluation criteria (design quality, originality, craftsmanship, functionality) and scored pages directly. In one run on a Dutch art museum website, after 10 iterations and 15 refinement rounds, the AI scrapped all previous approaches and created a 3D spatial experience with CSS perspective — a creative leap that single-pass generation would never produce.

Benchmark: Single Claude vs Full Harness (Game Dev)

A head-to-head comparison on the same game development task:

| | Single Claude | Full Harness (3 agents) |
| --- | --- | --- |
| Time | 20 minutes | 6 hours |
| Cost | $9 | $200 |
| Core gameplay | Broken — controls unresponsive | Fully playable |
| Bonus features | None | AI-assisted card design tool |
| Quality | Unusable | Shippable |

The single Claude was faster and cheaper but produced broken output. The harness was 22x more expensive but actually worked. As one commenter put it: "Selling shovels tells you the way to solve shovel problems is to buy more shovels." Fair criticism — but the benchmark speaks for itself.

When the Model Upgrades, Re-Evaluate the Harness

Every harness component compensates for current model weaknesses. When the model upgrades, re-evaluate which pieces are still useful.

After Opus 4.6 shipped, the author removed sprint-based segmentation ("reset context between sprints") and reduced evaluator overhead — the new model was strong enough that those guardrails became unnecessary friction. As the article honestly notes: as models get stronger, some harness components become unnecessary overhead and should be dropped promptly.

The Evaluator Judgment

One key heuristic worth remembering:

Whether the evaluator has value depends on whether the task exceeds what a single model can reliably complete alone. Within the model's boundary, the evaluator is waste. At the boundary and beyond, it's the critical defense line.

The design space doesn't shrink as models improve — it migrates. Intentional harness design means only keeping components that add value, and continuously finding the next valuable combination. This is what AI engineers actually do: not just prompt engineering, but discovering the right harness for the right model at the right time.

Awesome Harness Engineering — Curated Reading List

walkinglabs/awesome-harness-engineering (1.1K stars) is the best curated collection of harness engineering resources — papers, tools, benchmarks, and reference implementations organized into 8 categories:

| Category | Count | Highlights |
| --- | --- | --- |
| Foundations | 8 | OpenAI Codex field report, Anthropic's harness design article, LangChain's "agent = model + harness" |
| Context, Memory & Working State | 7 | Context engineering patterns, memory architectures |
| Constraints & Guardrails | 8 | Safe autonomy, permission models, tool restrictions |
| Specs & Agent Files | 6 | AGENTS.md standard, agent.md, repo-local instructions |
| Evals & Observability | 10 | Evaluation frameworks, tracing, debugging agent behavior |
| Benchmarks | 38 | SWE-bench, WebArena, OSWorld, and 35 more |
| Runtimes & Implementations | 8 | SWE-agent, Claude Agent SDK, reference harnesses |

Scope rule: only includes resources that address harness design, context management, evaluation, runtime control, or reliability-critical primitives — not generic agent tooling.

Learn Harness Engineering — walkinglabs's Structured Course

In May 2026, the walkinglabs team (same authors as the Awesome list) shipped a structured course on harness engineering, freely available with English, Chinese, Vietnamese, Korean, and Russian translations. It is one of the more cohesive available curricula on harness engineering — a 12-lecture sequence organized like a coding bootcamp, sitting between "harness theory" and "harness practice."

Three Learning Paths

| Path | Content | Use |
| --- | --- | --- |
| Lectures | 12 theoretical modules | Build the mental model |
| Projects | Hands-on labs | Practice the patterns in real repos |
| Resources | Reusable templates + code | Drop into your own projects (AGENTS.md, feature_list.json, etc.) |

The 12 Lecture Themes

  1. Why capable models still fail at reliable execution
  2. Defining harness engineering — fundamentals
  3. The repository as the single source of truth
  4. Task isolation and initialization phases
  5. Preventing premature "task complete" declarations
  6. End-to-end testing and observability for agents
  7. Session state management
  8. Constraint design — prompt vs code-level
  9. Context management — the inverted-U sweet spot
  10. Verification loops — when to trust the agent's "done"
  11. Multi-agent coordination patterns
  12. Knowing when to remove harness as models improve

Project 01: Prompt-Driven vs. Rule-Driven

The first hands-on lab — directly comparable to Pillar 2 of this entry. Two implementations of the same agent task: one steered by prompt instructions, one steered by coded rules (linters, hooks, type checks). Students measure failure rate. The lab's punchline lines up with the Vercel finding: prompt suggestions degrade across sessions; coded constraints don't.

Why This Course Matters

Most harness-engineering content is still fragmented across Anthropic blog posts, X threads, and conference talks. walkinglabs's course is a systematic curriculum that treats harness engineering as a teachable discipline. Combined with their Awesome list (the reading library) and the broader open-source harness reference implementations (like OpenHarness, below), there's now a coherent learn-by-doing stack:

walkinglabs + community stack:
├── awesome-harness-engineering    ← what to read
├── learn-harness-engineering      ← how to learn (this course)
└── OpenHarness (HKUDS, separate)  ← reference implementation to read the code of

OpenHarness (oh) — Open-Source Agent Harness in Python

HKUDS/OpenHarness from HKU's Data Intelligence Lab is an open-source Python implementation of the agent harness pattern — essentially a research-friendly reimplementation of Claude Code's architecture. It hit 1.9K stars in 2 days.

# One-command install
curl -fsSL https://raw.githubusercontent.com/HKUDS/OpenHarness/main/scripts/install.sh | bash
oh  # Launch

10 Subsystems

| Subsystem | What It Does |
| --- | --- |
| Engine | Agent loop with streaming, tool-call cycles, retry logic |
| Tools | 43+ built-in (file I/O, shell, search, web, notebooks) |
| Skills | On-demand knowledge from markdown files |
| Plugins | Extension ecosystem (commands, hooks, agents, MCP) |
| Permissions | Multi-level safety modes with path/command rules |
| Hooks | PreToolUse/PostToolUse lifecycle events |
| Commands | 54 CLI commands for workflow control |
| MCP | Model Context Protocol client integration |
| Memory | Persistent cross-session storage + auto-compression |
| Tasks | Background task lifecycle management |

Works with Anthropic, OpenAI, DeepSeek, Moonshot/Kimi, Ollama, and GitHub Copilot. Python 3.10+ required.

Why it matters: it turns Claude Code's architecture from a black box into a white box. You can read every line, modify every subsystem, and experiment with harness design without reverse-engineering TypeScript.

How LearnAI Team Could Use This

  • Teach agent reliability as system design โ€” frame agents as model plus harness, not just better prompts or better models.
  • Build progressive labs โ€” have learners add context limits, tool constraints, verification hooks, subagents, and worktree isolation one layer at a time.
  • Audit existing AI workflows โ€” identify where failures come from missing constraints, weak feedback loops, or overloaded context.
  • Compare single-agent vs multi-agent work โ€” use planner/generator/evaluator patterns to show when extra harness cost is justified.

Real-World Use Cases

  • Long-running coding agents โ€” keep autonomous work reliable with planning, verification, middleware, and scoped tools.
  • Production AI development teams โ€” create repeatable guardrails for code generation, QA, deployment, and documentation.
  • Agent benchmark improvement โ€” improve results by changing prompts, tools, middleware, and eval loops without changing the base model.
  • Research and teaching platforms โ€” expose agent internals so students can understand, modify, and measure each harness layer.

Key Takeaway

The reliability bottleneck of agents isn't the model — it's the system around the model.

Without steering and brakes, even the most powerful engine can't reach its destination.