These pages cover how to structure agents, configure coding and review workflows, and keep agent sessions small enough for reliable delivery.
This is the multi-page printable view of this section. Click here to print.
Agent Architecture
1 - Agentic Architecture Patterns
Agentic workflow architecture is a software design problem. The same principles that prevent spaghetti code in application software - single responsibility, well-defined interfaces, separation of concerns - prevent spaghetti agent systems. The cost of getting it wrong is measured in token waste, cascading failures, and workflows that break when you swap one model for another.
This page assumes familiarity with Agent Delivery Contract. After reading this page, see Coding & Review Setup for a concrete implementation of these patterns applied to coding and pre-commit review.
Overview
A multi-agent system that was not deliberately designed looks like a distributed monolith: everything depends on everything else, context passes unchecked through every boundary, and no component has clear ownership. Add token costs to the usual distributed systems failure modes and the problem compounds: a carelessly assembled context bundle that reaches a frontier model five times per workflow iteration is not a minor inefficiency, it is a recurring tax on every workflow run.
Three failure patterns appear consistently in poorly structured agentic systems:
Token waste from undisciplined context. Without explicit rules about what passes between components, agents accumulate context until the window fills or costs spike. An agent that receives a 50,000-token context when its actual task requires 5,000 tokens wastes 90% of its input budget on every invocation.
Cascading failures from missing error boundaries. When one agent’s unstructured prose output becomes another agent’s input, parsing ambiguity becomes a failure source. A model that produces a slightly different output format than expected on one run can silently corrupt downstream agent behavior without triggering any explicit error.
Brittle workflows from model-coupled instructions. Skills and commands written for one model’s specific instruction style often degrade when run on a different model. Workflows that hard-code model-specific behaviors - Claude’s particular handling of XML tags, Gemini’s response to certain role descriptions - cannot be handed off or used in multi-model configurations without manual rewriting.
Getting architecture right addresses all three. The sections below give patterns for each component type: skills, agents, commands, hooks, and the cross-cutting concerns that tie them together.
Key takeaways:
- Undisciplined context passing is the primary cost driver in agentic systems.
- Structured outputs at every agent boundary eliminate parsing-based cascade failures.
- Model-agnostic design is achievable by separating task logic from model-specific invocation details.
Skills
What a Skill Is
A skill is a named, reusable procedure that an agent can invoke by name. It encodes a sequence of steps, a set of rules, or a decision procedure that would otherwise need to be re-derived from scratch each time the agent encounters a given situation.
Skills are not plugins or function calls in the API sense. They are instruction documents - typically markdown files - that are injected into an agent’s context when invoked. The agent reads the skill, follows its instructions, and returns a result. The skill has no runtime; it is pure specification.
This distinction matters. Because a skill is just text, it works across models that can read and follow natural language instructions. Claude, Gemini, and any other capable model can follow the same skill document. This is the foundation of model-agnostic workflow design.
Single Responsibility
A skill should do one thing. The temptation to combine related procedures into a single skill (“review code AND write the commit message AND update the changelog”) produces a skill that is hard to test, hard to maintain, and hard to invoke selectively. When a multi-step procedure fails, a single-responsibility skill makes it obvious which step went wrong and where to look.
Signs a skill is doing too much:
- The skill name contains “and”
- The skill has conditional branches that activate completely different code paths depending on input
- Different sub-agents invoke the skill but only use half of it
Signs a skill should be extracted:
- The same sequence of steps appears in two or more larger skills
- A step in a skill has grown to match the complexity of the skill itself
- A sub-agent needs only part of a skill’s behavior but must receive all of it
When to Inline vs. Extract
Inline instructions when a procedure is used exactly once, is tightly coupled to the specific agent’s context, or is too short to justify its own file (under 5-6 lines of instruction). Extract to a skill file when a procedure is reused, when it will be maintained independently of the agent configuration, or when it is long enough that reading the agent’s system prompt requires scrolling past it.
A useful test: if you replaced the inline instruction with a skill reference, would the agent system prompt read more clearly? If yes, extract it.
File and Folder Structure
Organize skills in a flat or two-level hierarchy within a skills/ directory. Avoid deeply nested skill trees - when an agent needs to invoke a skill, it should be obvious where to find it.
Keeping separate skills/ directories per model is not duplication if the skills differ in ways specific to that model’s behavior. It is a problem if the skills differ only because they were written at different times by different people without a shared template. The goal is model-agnostic skills that live in a shared location; model-specific variants should be the exception and should be explicitly labeled as such.
Writing Model-Agnostic Skill Instructions
Skills written to exploit one model’s specific behaviors create lock-in. The following practices produce skills that transfer well:
Use explicit imperative steps, not conversational prose. Both Claude and Gemini follow numbered step lists more reliably than embedded instructions in flowing text.
State output format explicitly. Do not assume a model will infer the desired output format from context. Specify it. “Return a JSON object with the schema shown below” is unambiguous. “Return the results” is not.
Avoid model-specific XML or prompt syntax. Claude responds to <instructions> tags; Gemini does not require them. Skills that depend on XML delimiters need adaptation when moved between models. Use plain markdown structure instead.
State scope and early exit conditions. Both models benefit from explicit scope limits (“analyze only the files in the staged diff”) and early exit conditions (“if the diff contains only comments and whitespace, return an empty findings list immediately”). These reduce unnecessary processing and keep outputs predictable.
Claude Implementation Example
Gemini Implementation Example
The same skill for Gemini. The task logic is identical. The structural differences reflect Gemini’s preference for explicit role framing and its handling of early exit conditions:
The differences are explicit: Gemini benefits from named input fields (bdd_scenario, test_file) and an explicit role statement. Claude handles the simpler inline description of inputs without role framing. Both produce the same JSON output, which means the skill is interchangeable at the orchestration layer even though the instruction text differs.
Key takeaways:
- Skills are instruction documents, not code. They work across any model that can follow natural language instructions.
- Single responsibility prevents unclear failure attribution and oversized context bundles.
- Model-agnostic skills share task logic; model-specific variants differ only in structural framing, not in output contract.
Agents
Defining Agent Boundaries
An agent boundary is a context boundary and a responsibility boundary. What an agent knows, what it can do, and what it must return are determined by what crosses the boundary.
Define boundaries by asking: what is the smallest coherent unit of work this agent can own? “Coherent” means the agent can complete its work without reaching outside its assigned context. An agent that regularly requests additional files, broader system context, or information from other agents mid-task has a boundary problem - its responsibility was scoped incorrectly.
Responsibility and context are coupled. An agent with a narrow responsibility needs a small context. An agent with a broad responsibility needs a large context and likely should be decomposed.
When One Agent Is Enough
Use a single agent when:
- The workflow has one clear task with a well-scoped context requirement
- The work is short enough to complete within a single context window without degradation
- There is no meaningful parallelism available (each step depends on the previous step’s output)
- The cost of the inter-agent communication overhead exceeds the cost of doing the work in a single agent
Decomposing into multiple agents introduces latency, context assembly overhead, and additional failure surfaces. Do not decompose for the sake of architectural elegance. Decompose when there is a concrete benefit: parallelism, context budget enforcement, or specialized model routing.
When to Decompose
Decompose when:
- Parallel execution is possible and would meaningfully reduce latency (review sub-agents running concurrently instead of sequentially)
- Different tasks within a workflow have different model tier requirements (routing cheap coordination to a small model, expensive reasoning to a frontier model)
- A task has grown too large to fit in a single well-scoped context without degrading output quality
- Separation of concerns requires that one agent not be able to see or influence another agent’s domain (the implementation agent must not perform its own review)
Passing Context Without Bloat
Context passed between agents must be explicitly scoped. The default should be “send only what this agent needs,” not “send everything the orchestrator has.”
Rules for inter-agent context:
- Define a schema for what each agent receives. Treat it like an API contract.
- Send structured data (JSON, YAML) rather than prose summaries. Prose requires the receiving agent to parse intent; structured data makes intent explicit.
- Strip conversation history at every boundary. The receiving agent needs the result of prior work, not the reasoning that produced it.
- Send diffs, not full file contents, when the agent’s task is about changes.
Handling Failure Modes
Agent failures fall into three categories, each requiring a different response:
Hard failure (the agent returns an error or a malformed response). Retry once with identical input. If the second attempt fails, escalate to the orchestrator with the raw error; do not attempt to interpret it in the calling agent.
Soft failure (the agent returns a valid response indicating a blocking issue). This is not a failure of the agent - it is the agent doing its job. Route the finding to the appropriate handler (typically returning it to the implementation agent for resolution) without treating it as an error condition.
Silent degradation (the agent returns a valid-looking response that is subtly wrong). This is the hardest failure mode to detect. Defend against it with output schemas and schema validation at every boundary. A response that does not conform to the expected schema should be treated as a hard failure, not silently accepted.
Multi-Agent Pipeline Example: Release Readiness Checks
The following example shows a release readiness pipeline with Claude as orchestrator and Gemini as a specialized long-context sub-agent. A release candidate artifact is routed to three parallel checks - changelog completeness, documentation coverage, and dependency audit - each receiving only what its specific check requires.
This configuration makes sense when the changelog or dependency manifest is large enough that a single-agent approach risks context window degradation. Gemini handles the large-context changelog analysis; Claude handles routing and the two lighter checks.
Orchestrator (Claude) - context assembly and routing:
Changelog review sub-agent (Gemini) - specialized for long changelog analysis:
In this configuration, Claude handles orchestration because routing and context assembly do not require long-context capability. Gemini handles changelog review because a full changelog for a major release can be large enough to crowd out other context in a smaller window. Neither assignment is mandatory - the point is that the structured interface (JSON input, JSON output with a defined schema) makes the sub-agent swappable. Replacing the Gemini changelog agent with a Claude one requires changing only the invocation target, not the orchestration logic.
For a concrete application of this pattern to coding and pre-commit review - including full system prompt rules for each agent - see Coding & Review Setup.
Key takeaways:
- Agent boundaries are context boundaries. Scope responsibility so an agent can complete its task without reaching outside its assigned context.
- Decompose when there is concrete benefit: parallelism, model tier routing, or context budget enforcement.
- Structured schemas at every agent interface make sub-agents swappable without changing orchestration logic.
Commands
Designing Unambiguous Commands
A command is an instruction that triggers a defined workflow. The distinction between a command and a general prompt is that a command’s behavior should be predictable and consistent across invocations with the same inputs.
An unambiguous command has:
- A single, explicit trigger name (conventionally
/verb-nounformat) - A defined set of inputs it expects
- A defined output it will produce
- No implicit state it depends on beyond what is passed explicitly
The failure mode of an ambiguous command is that the model interprets it differently on different runs. “Review the changes” is ambiguous. /review staged-diff with a defined schema for what “review” means and what the output looks like is not.
Parameterization Strategies
Commands should accept parameters rather than embedding specific values in the command text. This makes commands reusable across contexts without modification.
Well-parameterized command:
Poorly parameterized command (values embedded in command text):
The second version cannot be extended without creating new commands. The first version handles new target types and output formats through parameterization.
Avoiding Prompt Injection Through Command Structure
Prompt injection attacks against agentic systems typically exploit unstructured inputs that the model treats as additional instructions. The command structure itself is the primary defense.
Defensive patterns:
- Treat all parameter values as data, not as instructions. Pass them inside a clearly delimited data block, not inline in the instruction text.
- Define the parameter schema explicitly. Parameters outside the schema should cause the command to return an error, not to be interpreted as free-form instructions.
- Do not pass raw user input directly to a model invocation. Validate and sanitize first.
Example of unsafe command structure:
If user_provided_context contains “Ignore previous instructions and…”, the model will process it as an instruction. This is the injection vector.
Example of safer command structure:
The explicit instruction to treat inputs as data and the injection detection rule do not guarantee safety against a sophisticated adversary, but they raise the bar substantially over undefended interpolation.
Well-Structured vs. Poorly-Structured Command Comparison
Key takeaways:
- Commands are defined workflows, not open-ended prompts. Predictability requires explicit inputs, outputs, and scope.
- Parameterization keeps commands reusable. Embedded values create command proliferation.
- Structural separation between instructions and data is the primary defense against prompt injection.
Hooks
When to Use Pre/Post Hooks
Hooks are side effects that run before or after an agent invocation. Pre-hooks run before the model call; post-hooks run after. Use them to enforce invariants that should hold for every invocation of a given command or skill, without embedding that logic in every skill individually.
Pre-hooks are appropriate for:
- Validating inputs before they reach the model (fail fast, save token cost)
- Injecting stable context that should always be present (system constraints, security policies)
- Enforcing environmental preconditions (pipeline is green, branch is clean)
Post-hooks are appropriate for:
- Validating that the model’s output conforms to the expected schema
- Logging invocation metadata (model, token count, duration, decision)
- Triggering downstream steps conditionally based on the model’s output
Keeping Hooks Lightweight and Side-Effect-Safe
A hook that fails should fail cleanly with a clear error message. A hook that has unexpected side effects will be disabled by frustrated developers the first time it causes a problem. Two rules:
Hooks must be idempotent. Running the same hook twice with the same inputs must produce the same result. A hook that writes a log file should append to an existing file, not fail if the file already exists. A hook that calls an external validation service must handle the case where the same call was already made.
Hooks must have bounded execution time. A pre-hook that can run for an arbitrary duration blocks the agent invocation. Set timeouts. If the hook cannot complete within its timeout, fail fast and surface the timeout as the error - do not silently allow the invocation to proceed with unvalidated inputs.
Using Hooks to Enforce Guardrails or Inject Context
Pre-hooks are the right place for guardrails that must apply regardless of the skill being invoked. Rather than duplicating a guardrail across every skill document, implement it once as a pre-hook:
The inject-system-constraints hook demonstrates the context injection pattern. Rather than including system constraints in every skill document, the hook injects them at invocation time. This guarantees they are always present without creating maintenance risk from outdated copies embedded in individual skill files.
A Cross-Model Hook Example
The following hook works identically regardless of whether Claude or Gemini is being invoked. It validates that the agent’s output conforms to the expected JSON schema before the orchestrator processes it.
This hook exits with a non-zero code if the output is malformed, which causes the orchestrator to treat the invocation as a hard failure. The hook does not know or care whether the output came from Claude or Gemini - it validates the contract, not the model.
Key takeaways:
- Pre-hooks enforce preconditions; post-hooks validate outputs. Both must be idempotent and bounded in execution time.
- Guardrails implemented as hooks apply universally without being duplicated across skill documents.
- Output schema validation as a post-hook is the primary defense against silent degradation at agent boundaries.
Cross-Cutting Concerns
Logging and Observability
Every agent invocation should produce a structured log record. Debugging an agentic workflow without structured logs is impractical - invocations are non-deterministic, inputs vary, and failures manifest differently across runs.
Minimum log record per invocation:
Track at the workflow level, not the call level. A single /review command may invoke four sub-agents. The relevant metric is total token cost and duration for the /review command, not the cost of each sub-agent call in isolation.
Both Claude and Gemini expose token counts in their API responses. Claude exposes them under usage.input_tokens and usage.output_tokens with separate fields for cache_read_input_tokens and cache_creation_input_tokens. Gemini exposes them under usageMetadata.promptTokenCount and usageMetadata.candidatesTokenCount. Normalize these into a shared log schema in your orchestration layer.
Idempotency
Agentic workflows will be retried - by developers manually, by CI systems automatically, and by error recovery paths. A workflow that is not idempotent will produce inconsistent state when retried.
Rules for idempotent agent workflows:
- Assign a stable ID to each workflow run at start time. Use this ID for deduplication in any downstream systems the workflow touches.
- Agent invocations that produce the same output for the same input are naturally idempotent. State-changing side effects (writing files, calling external APIs) require explicit deduplication.
- Write-once outputs (session summaries, review findings written to a file) should check for existing output before writing. A retry that overwrites a passing review finding with a new failing one has broken idempotency.
Testing Agentic Workflows
Testing agentic workflows requires testing at multiple levels:
Skill unit tests. Test each skill document in isolation by invoking it with controlled inputs and asserting on the output structure. Use a deterministic input set (a known diff, a known scenario) and verify that the output schema is correct and that the decision matches expectations.
Agent integration tests. Test the full agent with a controlled context bundle. These tests will not be perfectly deterministic across model versions, but they should produce consistent structural outputs (valid JSON, correct schema, plausible decisions) for a given stable input.
Workflow end-to-end tests. Test the full workflow path with a representative scenario. These are slower and more expensive but necessary to catch problems that only emerge at the orchestration layer, such as context assembly bugs or incorrect routing decisions.
A useful heuristic: if a skill cannot be tested with a controlled input-output pair, it is not well-scoped enough. The ability to write a unit test for a skill is a signal that the skill has a clear responsibility and a defined contract.
Model-Agnostic Abstraction Layer
The abstraction layer between your workflow logic and the specific model API is the most important structural decision in a multi-model agentic system. Without it, every change in model availability, pricing, or capability requires changes throughout the orchestration logic.
A minimal abstraction layer defines a ModelClient interface with a single invoke method that accepts a context bundle and returns a structured response:
With this layer in place, the orchestrator does not reference Claude or Gemini directly. It holds a ModelClient reference and calls invoke(). Swapping models means changing the client instantiation at configuration time, not rewriting orchestration logic.
Where Claude and Gemini differ at the API level:
- System prompt placement. Claude separates system content via the
systemparameter. Gemini usessystemInstruction. Your abstraction layer must handle this mapping. - Prompt caching. Claude’s prompt caching uses cache-control annotations on specific message blocks. Gemini’s implicit caching triggers automatically on long stable prefixes. Caching strategies differ and cannot be abstracted into a single identical interface - expose caching as an optional configuration, not a required behavior.
- Structured output support. Claude returns structured outputs through its response format parameter (JSON mode). Gemini supports structured output through
responseMimeTypeandresponseSchemain the generation config. If your workflows require structured output enforcement at the API level (beyond instructing the model in the prompt), handle this in the concrete client implementations, not in the abstraction layer. - Token counting. The field names differ (noted in the Logging section above). Normalize in the abstraction layer.
Key takeaways:
- Every agent invocation should emit a structured log record with token counts and duration.
- Idempotency requires explicit deduplication for any state-changing side effects in a workflow.
- A model-agnostic abstraction layer is the single most important structural investment for multi-model systems.
Anti-patterns
1. The Monolithic Orchestrator
What it looks like: One agent handles orchestration, implementation, review, and summarization. It receives the full project context on every invocation and runs to completion in a single long-running session.
Why it fails: Context accumulates until quality degrades or the window fills. There is no opportunity to route subtasks to cheaper models. A failure anywhere in the monolithic run requires restarting from the beginning. The agent cannot be parallelized.
What to do instead: Decompose into an orchestrator with single-responsibility sub-agents. Each agent receives only the context its task requires. The orchestrator coordinates; it does not execute.
2. Natural Language Agent Interfaces
What it looks like: Agents communicate by passing prose summaries to each other. “The implementation agent completed the login feature. The tests pass and the code looks good. Please proceed with the review.”
Why it fails: Prose is ambiguous. A downstream agent must parse intent from the prose, which introduces a failure point that becomes more likely as model outputs vary between invocations. Prose is also token-inefficient: the same information encoded as JSON takes fewer tokens and is unambiguous.
What to do instead: Define a JSON schema for every agent interface. Agents return structured data. Orchestrators parse structured data. Natural language is reserved for human-readable summaries, not inter-agent communication.
3. Context That Does Not Expire
What it looks like: Session context grows continuously. Prior session conversations are appended rather than summarized. The implementation agent receives the full history of all prior sessions because “it might need it.”
Why it fails: Context that does not expire grows without bound. Token costs increase linearly with context size. Model performance on tasks can degrade as context grows, particularly for tasks in the middle of a large context window. Context that is always present but rarely relevant is a tax on every invocation.
What to do instead: Summarize at session boundaries. A session summary of 100-150 words replaces a full session conversation for future contexts. The summary contains what the next session needs - not what happened, but what exists and what state the system is in.
4. Skills Written for One Model’s Idiosyncrasies
What it looks like: Skills use Claude-specific XML delimiters (<examples>, <context>), or Gemini-specific role framing that other models do not respond to. The skill file has comments like “this only works on Claude Opus.”
Why it fails: Model-specific skills create lock-in. A skill library that cannot be used with a different model cannot survive a pricing change, a capability change, or an organizational decision to switch providers. Testing is harder because the skill cannot be validated against a cheaper model during development.
What to do instead: Write skills using plain markdown structure. Numbered steps, explicit input/output schemas, and early exit conditions work consistently across capable models. When a model-specific variant is genuinely necessary, isolate it in a model-specific subdirectory and document why it differs.
5. Missing Output Schema Validation
What it looks like: The orchestrator passes an agent’s response directly to the next step without validating that the response conforms to the expected schema. If the model produces a slightly malformed JSON object, the downstream step either fails with an opaque error or silently processes incorrect data.
Why it fails: Models do not produce perfectly consistent structured output on every invocation. Occasional schema violations are normal and expected. Without validation, these violations propagate downstream before manifesting as failures, making the root cause hard to trace.
What to do instead: Validate schema at every agent boundary using a post-invoke hook. A non-conforming response is a hard failure at the boundary where it occurred, not an opaque error two steps downstream.
6. Hooks With Unconstrained Side Effects
What it looks like: A pre-hook makes a network call to an external service to validate an input. The external service is occasionally slow or unavailable. On slow runs, the hook blocks the agent invocation for several minutes. On unavailability, the hook fails in a way that leaves partial state in the external service.
Why it fails: Hooks with unconstrained side effects are unpredictable. A hook that can fail in an unclean way, block for an unbounded duration, or write partial state to an external system will be disabled by the team after the first time it causes a production incident or a corrupted workflow run.
What to do instead: Hooks must have explicit timeouts. All external calls in hooks must be idempotent. A hook that cannot complete idempotently within its timeout must fail fast and surface the timeout as a clear error, not silently allow the invocation to proceed.
7. Swapping Models Without Adjusting Context Structure
What it looks like: A workflow designed for Claude is migrated to Gemini by changing only the API call. The skill documents, context assembly order, and prompt structure remain unchanged.
Why it fails: Claude and Gemini have different behaviors around context structure. Prompt caching works differently (Claude requires explicit cache annotations; Gemini uses implicit prefix matching). System prompt handling differs. Some instruction patterns that Claude follows reliably require adjustment for Gemini. A direct swap without validation produces degraded and unpredictable outputs.
What to do instead: Treat a model swap as a migration, not a configuration change. Test each skill against the new model with controlled inputs. Adjust context structure, system prompt placement, and output instructions as needed. Use the model-agnostic abstraction layer so that only the concrete client and the per-model skill variants need to change.
Related Content
- Coding & Review Setup - a concrete orchestrator and sub-agent configuration applying these patterns
- Tokenomics - the full optimization framework for token cost management
- Small-Batch Sessions - how session discipline maps to the skill and hook patterns here
- Pipeline Enforcement and Expert Agents - how the same agent patterns operate as CI pipeline gates
- Agent Delivery Contract - the structured artifacts that flow between agents as defined interfaces
- Pitfalls and Metrics - failure modes and measurement for agentic workflows
2 - Coding & Review Setup
Standard pre-commit tooling catches mechanical defects. The agent configuration described here covers what standard tooling cannot: semantic logic errors, subtle security patterns, missing timeout propagation, and concurrency anti-patterns. Both layers are required. Neither replaces the other.
For the pre-commit gate sequence this configuration enforces, see the Pipeline Reference Architecture. For the defect sources each gate addresses, see the Systemic Defect Fixes catalog.
System Architecture
The coding agent system has two tiers. The orchestrator manages sessions and routes work. Specialized agents execute within a session’s boundaries. Review sub-agents run in parallel as a pre-commit gate, each responsible for exactly one defect concern.
graph TD
classDef orchestrator fill:#224968,stroke:#1a3a54,color:#fff
classDef agent fill:#0d7a32,stroke:#0a6128,color:#fff
classDef review fill:#30648e,stroke:#224968,color:#fff
classDef subagent fill:#6c757d,stroke:#565e64,color:#fff
ORC["Orchestrator<br/><small>Session management · Context control · Routing</small>"]:::orchestrator
IMPL["Implementation Agent<br/><small>One BDD scenario per session</small>"]:::agent
REV["Review Orchestrator<br/><small>Pre-commit gate · Parallel coordination</small>"]:::review
SEM["Semantic Review<br/><small>Logic · Edge cases · Intent alignment</small>"]:::subagent
SEC["Security Review<br/><small>Injection · Auth gaps · Audit trails</small>"]:::subagent
PERF["Performance Review<br/><small>Timeouts · Resource leaks · Degradation</small>"]:::subagent
CONC["Concurrency Review<br/><small>Race conditions · Idempotency</small>"]:::subagent
ORC -->|"implement"| IMPL
ORC -->|"review staged changes"| REV
REV --> SEM & SEC & PERF & CONCSeparation principle: The orchestrator does not write code. The implementation agent does not review code. Review agents do not modify code. Each agent has one responsibility. This is the same separation of concerns that pipeline enforcement applies at the CI level - brought to the pre-commit level.
Every agent boundary is a token budget boundary. What the orchestrator passes to the implementation agent, what it passes to the review orchestrator, and what each sub-agent receives and returns are all token cost decisions. The configuration below applies the tokenomics strategies concretely: model routing by task complexity, structured outputs between agents, prompt caching through stable system prompts placed first in each context, and minimum-necessary-context rules at every boundary.
This page defines the configuration for each component in order: Orchestrator, Implementation Agent, Review Orchestrator, and four Review Sub-Agents. The Skills section defines the session procedures each component uses. The Hooks section defines the pre-commit gate sequence. The Token Budget section applies the tokenomics strategies to this configuration.
The Orchestrator
The orchestrator manages session lifecycle and controls what context each agent receives. It does not generate implementation code. Its job is routing and context hygiene.
Recommended model tier: Small to mid. The orchestrator routes, assembles context, and writes session summaries. It does not reason about code. A frontier model here wastes tokens on a task that does not require frontier reasoning. Claude: Haiku. Gemini: Flash.
Responsibilities:
- Initialize each session with the correct context subset (per Small-Batch Sessions)
- Delegate implementation to the implementation agent
- Trigger the review orchestrator when the implementation agent reports completion
- Write the session summary on commit and reset context for the next session
- Enforce the pipeline-red rule (ACD constraint 8): if the pipeline is failing, route only to pipeline-restore mode; block new feature work
Rules injected into the orchestrator system prompt. The context assembly order below follows the general pattern from Configuration Quick Start: Context Loading Order, applied to this specific agent configuration:
The Implementation Agent
The implementation agent generates test code and production code for the current BDD scenario. It operates within the context the orchestrator provides and does not reach outside that context.
Recommended model tier: Mid to frontier. Code generation and test-first implementation require strong reasoning. This is the highest-value task in the session - invest model capability here. Output verbosity should be controlled explicitly: the agent returns code only, not explanations or rationale, unless the orchestrator requests them. Claude: Sonnet or Opus. Gemini: Pro.
Receives from the orchestrator:
- Intent summary
- The one BDD scenario for this session
- Feature description (constraints, architecture, performance budgets)
- Relevant existing files
- Prior session summary
Rules injected into the implementation agent system prompt:
The Review Orchestrator
The review orchestrator runs between implementation complete and commit. It invokes all four review sub-agents in parallel against the staged diff, collects their findings, and returns a single structured decision.
Recommended model tier: Small. The review orchestrator does no reasoning itself - it invokes sub-agents and aggregates their structured output. A small model handles this coordination cheaply. Claude: Haiku. Gemini: Flash.
Receives:
- The staged diff for this session
- The BDD scenario being implemented (for intent alignment checks)
- The feature description (for architectural constraint checks)
Returns: A JSON object so the orchestrator can parse findings without a natural language step. Structured output here eliminates ambiguity and reduces the token cost of the aggregation step.
An empty findings array with "decision": "pass" means all sub-agents passed. A
non-empty findings array always accompanies "decision": "block".
Rules injected into the review orchestrator system prompt:
Review Sub-Agents
Each sub-agent covers exactly one defect concern from the Systemic Defect Fixes catalog. They receive only the diff and the artifacts relevant to their specific check - not the full session context.
Semantic Review Agent
Recommended model tier: Mid to frontier. Logic correctness and intent alignment require genuine reasoning - a model that can follow execution paths, infer edge cases, and compare implementation against stated intent. Claude: Sonnet or Opus. Gemini: Pro.
Defect sources addressed:
- Reliance on human review to catch preventable defects (Process & Deployment)
- Implicit domain knowledge not in code (Knowledge & Communication)
- Untested edge cases and error paths (Testing & Observability Gaps)
What it checks:
- Logic correctness: does the implementation produce the outputs the scenario specifies?
- Edge case coverage: does the implementation handle boundary values and error paths, or only the happy path the scenario explicitly describes?
- Intent alignment: does the implementation address the problem stated in the intent summary, or does it technically satisfy the test while missing the point?
- Test coupling: does the test verify observable behavior, or does it assert on implementation internals? (See Implementation Coupling Agent)
System prompt rules:
Security Review Agent
Recommended model tier: Mid to frontier. Identifying second-order injection, subtle authorization gaps, and missing audit events requires understanding data flow semantics, not just pattern matching. A smaller model will miss the cases that matter most. Claude: Sonnet or Opus. Gemini: Pro.
Defect sources addressed:
- Injection vulnerabilities (subtle patterns beyond basic SAST) (Security & Compliance)
- Authentication and authorization gaps (Security & Compliance)
- Missing audit trails (Security & Compliance)
What it checks:
- Second-order injection and injection vectors that pattern-matching SAST rules miss
- Code paths that process user-controlled input without validation at the boundary
- State-changing operations that lack an authorization check
- State-changing operations that do not emit a structured audit event
- Privilege escalation patterns
Context it receives:
- Staged diff only; no broader system context needed
System prompt rules:
Performance Review Agent
Recommended model tier: Small to mid. Timeout and resource leak detection is primarily structural pattern recognition: find external calls, check for timeout configuration, trace resource allocations to their cleanup paths. A small to mid model handles this well and runs cheaply enough to be invoked on every commit without concern. Claude: Haiku or Sonnet. Gemini: Flash.
Defect sources addressed:
- Missing timeout and deadline enforcement (Performance & Resilience)
- Resource leaks (Performance & Resilience)
- Missing graceful degradation (Performance & Resilience)
What it checks:
- External calls (HTTP, database, queue, cache) without timeout configuration
- Timeout values that are set but not propagated through the call chain
- Resource allocations (connections, file handles, threads) without corresponding cleanup
- Calls to external dependencies with no fallback or circuit breaker when the feature description specifies a resilience requirement
Context it receives:
- Staged diff
- Feature description (for performance budgets and resilience requirements)
System prompt rules:
Concurrency Review Agent
Recommended model tier: Mid. Concurrency defects require reasoning about execution ordering and shared state - more than pattern matching but less open-ended than security semantics. A mid-tier model balances reasoning depth and cost here. Claude: Sonnet. Gemini: Pro.
Defect sources addressed:
- Race conditions (anti-patterns beyond thread sanitizer detection) (Integration & Boundaries)
- Concurrency and ordering issues (Data & State)
What it checks:
- Shared mutable state accessed from concurrent paths without synchronization
- Operations that assume a specific ordering without enforcing it
- Anti-patterns that thread sanitizers cannot detect at static analysis time: check-then-act sequences, non-atomic read-modify-write operations, and missing idempotency in message consumers
System prompt rules:
Skills
Skills are reusable session procedures invoked by name. They encode the session discipline
from Small-Batch Sessions so the orchestrator does not have to
re-derive it each time. A normal session runs /start-session, then /review, then /end-session. Use /fix only when the pipeline fails mid-session.
/start-session
Loads the session context and prepares the implementation agent.
/review
Invokes the review orchestrator against all staged changes.
/end-session
Closes the session, validates all gates, writes the summary, and commits.
/fix
Enters pipeline-restore mode when the pipeline is red.
Hooks
Hooks run automatically as part of the commit process. They execute standard tooling - fast, deterministic, and free of AI cost - before the review orchestrator runs. The review orchestrator only runs if the hooks pass.
Pre-commit hook sequence:
Why the hook sequence matters: Standard tooling runs first because it is faster and cheaper than AI review. If the linter fails, there is no reason to invoke the review orchestrator. Deterministic checks fail fast; AI review runs only on changes that pass the baseline mechanical checks.
Token Budget
A rising per-session cost with a stable block rate means context is growing unnecessarily. A rising block rate without rising cost means the review agents are finding real issues without accumulating noise. Track these two signals and the cause of any cost increase becomes immediately clear.
The tokenomics strategies apply directly to this configuration. Three decisions have the most impact on cost per session.
Model routing
Matching model tier to task complexity is the highest-leverage cost decision. Applied to this configuration:
| Agent | Recommended Tier | Claude | Gemini | Why |
|---|---|---|---|---|
| Orchestrator | Small to mid | Haiku | Flash | Routing and context assembly; no code reasoning required |
| Implementation Agent | Mid to frontier | Sonnet or Opus | Pro | Core code generation; the task that justifies frontier capability |
| Review Orchestrator | Small | Haiku | Flash | Coordination only; returns structured output from sub-agents |
| Semantic Review | Mid to frontier | Sonnet or Opus | Pro | Logic and intent reasoning; requires genuine inference |
| Security Review | Mid to frontier | Sonnet or Opus | Pro | Security semantics; pattern-matching is insufficient |
| Performance Review | Small to mid | Haiku or Sonnet | Flash | Structural pattern recognition; timeout and resource signatures |
| Concurrency Review | Mid | Sonnet | Pro | Concurrent execution semantics; more than patterns, less than security |
Running the implementation agent on a frontier model and routing the review orchestrator and performance review agent to smaller models cuts the token cost of a full session substantially compared to using one model for everything.
Prompt caching
Each agent’s system prompt rules block is stable across every invocation. Place it at the top of every agent’s context - before the diff, before the session summary, before any dynamic content. This structure allows the server to cache the rules prefix and amortize its input cost across repeated calls.
The /start-session and /review skills assemble context in this order:
- Agent system prompt rules (stable - cached)
- Feature description (stable within a feature - often cached)
- BDD scenario for this session (changes per session)
- Staged diff or relevant files (changes per call)
- Prior session summary (changes per session)
Measuring cost per session
Track token spend at the session level, not the call level. A session that costs 10x the average is a design problem - usually an oversized context bundle passed to the implementation agent, or a review sub-agent receiving more content than its check requires.
Metrics to track per session:
- Total input tokens (implementation agent call + review sub-agent calls)
- Total output tokens (implementation output + review findings)
- Review block rate (how often the session cannot commit on first pass)
- Tokens per retry (cost of each implementation-review-fix cycle)
See Tokenomics for the full measurement framework.
Defect Source Coverage
This table maps each pre-commit defect source to the mechanism that covers it.
| Defect Source | Catalog Section | Covered By |
|---|---|---|
| Code style violations | Process & Deployment | Lint hook |
| Null/missing data assumptions | Data & State | Type-check hook |
| Secrets in source control | Security & Compliance | Secret-scan hook |
| Injection (pattern-matching) | Security & Compliance | SAST hook |
| Accessibility (structural) | Product & Discovery | Accessibility-lint hook |
| Race conditions (detectable) | Integration & Boundaries | Thread sanitizer (language-specific) |
| Logic errors, edge cases | Process & Deployment | Semantic review agent |
| Implicit domain knowledge | Knowledge & Communication | Semantic review agent |
| Untested paths | Testing & Observability Gaps | Semantic review agent |
| Injection (semantic/second-order) | Security & Compliance | Security review agent |
| Auth/authz gaps | Security & Compliance | Security review agent |
| Missing audit trails | Security & Compliance | Security review agent |
| Missing timeouts | Performance & Resilience | Performance review agent |
| Resource leaks | Performance & Resilience | Performance review agent |
| Missing graceful degradation | Performance & Resilience | Performance review agent |
| Race condition anti-patterns | Integration & Boundaries | Concurrency review agent |
| Non-idempotent consumers | Data & State | Concurrency review agent |
Defect sources not in this table are addressed at CI or acceptance test stages, not at pre-commit. See the Pipeline Reference Architecture for the full gate sequence.
Related Content
- Agentic Architecture Patterns - how to structure skills, agents, commands, and hooks for multi-agent systems
- Pipeline Enforcement and Expert Agents - how the same review agents operate as CI pipeline gates, not just pre-commit
- Small-Batch Sessions - the session discipline the orchestrator and skills enforce
- Tokenomics - the full optimization framework: model routing, context hygiene, structured outputs, prompt caching, and workflow-level measurement
- Agent Delivery Contract - the artifacts the implementation agent receives and the review agents verify against
- Pipeline Reference Architecture - the full gate sequence from pre-commit through production verification
- Systemic Defect Fixes - the defect source catalog that defines what each review agent is responsible for catching
3 - Small-Batch Agent Sessions
One BDD scenario. One agent session. One commit. This is the same discipline CI demands of humans, applied to agents. The broad understanding of the feature is established before any session begins. Each session implements exactly one behavior from that understanding.
Stop optimizing your prompts. Start optimizing your decomposition. The biggest variable in agentic development is not model selection or prompt quality. It is decomposition discipline. An agent given a well-scoped, ordered scenario with clear acceptance criteria will outperform a better model given a vague, large-scope instruction.
Establish the Broad Understanding First
Before any implementation session begins, establish the complete understanding of the feature:
- Intent description - why the change exists and what problem it solves
- All BDD scenarios - every behavior to implement, validated by the specification review before any code is written
- Feature description - architectural constraints, performance budgets, integration boundaries
- Scenario order - the sequence in which you will implement the scenarios
The agent-assisted specification workflow is the right tool here - use the agent to sharpen intent, surface missing scenarios, identify architectural gaps, and validate consistency across all four artifacts before any code is written.
Scenario ordering is not optional. Each scenario builds on the state left by the previous one. An agent implementing Scenario 3 depends on the contracts and data structures Scenario 1 and 2 established. Order scenarios so that each one can be implemented cleanly given what came before. Use an agent for this too: give it your complete scenario list and ask it to suggest an implementation order that minimizes the rework cost of each step.
This ordering step also has a human gate. Review the proposed slice sequence before any implementation begins. The ordering determines the shape of every session that follows.
The broad understanding is not in the implementation agent’s context. Each implementation session receives the relevant subset. The full feature scope lives in the artifacts, not in any single session.
This is not big upfront design. The feature scope is a small batch: one story, one thin vertical slice, completable in a day or two. What constitutes a complete slice depends on your team structure - see Work Decomposition for full-stack versus subdomain teams.
Session Structure
Each session follows the same structure:
| Step | What happens |
|---|---|
| Context load | Assemble the session context: intent summary, feature description, the one scenario for this session, the relevant existing code, and a brief summary of completed sessions |
| Implementation | Agent generates test code and production code to satisfy the scenario |
| Validation | Pipeline runs - all scenarios implemented so far must pass |
| Commit | Change committed; commit message references the scenario |
| Context summary | Write a one-paragraph summary of what this session built, for use in the next session |
The session ends at the commit. The next session starts fresh.
What to include in the context load
Include only what the agent needs to implement this specific scenario. Load context in the order defined in Configuration Quick Start: Context Loading Order - stable content first to maximize prompt cache hits, volatile content last.
For each item, apply the context hygiene test: would omitting it change what the agent produces? If not, omit it.
Exclude:
- Full conversation history from previous sessions
- Scenarios not being implemented in this session
- Unrelated system context
- Verbose examples or rationale that does not change what the agent will do
The context summary
At the end of each session, write a summary that future sessions can use. The summary replaces the session’s full conversation history in subsequent contexts. Keep it factual and brief:
Session 1 implemented Scenario 1 (client exceeds rate limit returns 429).
Files created:
- src/redis.ts - Redis client with connection pooling
- src/middleware/rate-limit.ts - middleware that checks request count
against Redis and returns 429 with Retry-After header when exceeded
Tests added:
- src/middleware/rate-limit.test.ts - covers Scenario 1
All pipeline checks pass.This summary is the complete handoff from one session to the next. The next agent starts with this summary plus its own scenario - not with the full conversation that produced the code.
The Parallel with CI
In continuous integration, the commit is the unit of integration. A developer does not write an entire feature and commit at the end. They write one small piece of tested functionality that can be deployed, commit to the trunk, then repeat. The commit creates a checkpoint: the pipeline is green, the change is reviewable, and the next unit can start cleanly.
Agent sessions follow the same discipline. The session is the unit of context. An agent does not implement an entire feature in one session - context accumulates, performance degrades, and the scope of any failure grows. Each session implements one behavior, ends with a commit, and resets context before the next session begins.
The mechanics differ. The principle is identical: small batches, frequent integration, green pipeline as the definition of done.
Worked Example: Rate Limiting
The agent delivery contract page establishes an intent description and two BDD scenarios for rate limiting the /api/search endpoint. Here is what the full session sequence looks like.
Broad understanding (established before any session)
Intent summary:
Limit authenticated clients to 100 requests per minute on
/api/search. Requests exceeding the limit receive 429 with a Retry-After header. Unauthenticated requests are not limited.
All BDD scenarios, in implementation order:
Feature description (excerpt):
Use Redis as the rate limit store with a sliding window counter. The middleware runs after auth and reads the client ID from the JWT. The rate limit key format is
rate_limit:{client_id}:{window_start_minute}. Performance budget: middleware must add less than 5ms to p99 latency.
Session 1 - Scenario 1 (client within rate limit)
Context loaded:
- Intent summary (2 sentences)
- Feature description
- Scenario 1 only
- Existing middleware directory structure
What the agent implements:
- Redis client at
src/redis.ts - Rate limit middleware at
src/middleware/rate-limit.tsthat increments the counter and sets quota headers on successful requests - Acceptance test for Scenario 1
Commit: feat: add rate limit quota headers for requests within limit
Context summary written:
Session 1 implemented Scenario 1 (client within rate limit).
Files created:
- src/redis.ts - Redis client, connection pooling, configured from env
- src/middleware/rate-limit.ts - reads client ID from JWT, increments
Redis counter for current window, sets X-RateLimit-Remaining header
Tests added:
- src/middleware/rate-limit.test.ts - covers Scenario 1
Pipeline is green.Session 2 - Scenario 2 (client exceeds rate limit)
Context loaded:
- Intent summary (2 sentences)
- Feature description
- Scenario 2 only
- Session 1 summary
src/middleware/rate-limit.ts(the file being extended)
What the agent implements:
- Branch in the middleware that returns 429 and sets Retry-After when the counter exceeds 100
- Acceptance test for Scenario 2
- Scenario 1 test continues to pass
Commit: feat: return 429 with Retry-After when rate limit exceeded
Context summary written:
Sessions 1-2 implemented Scenarios 1 and 2.
Files:
- src/redis.ts - Redis client (unchanged from Session 1)
- src/middleware/rate-limit.ts - checks counter against limit of 100;
returns 429 with Retry-After header when exceeded, quota headers when
within limit
Tests:
- src/middleware/rate-limit.test.ts - covers Scenarios 1 and 2
Pipeline is green.Session 3 - Scenario 3 (window reset)
Context loaded:
- Intent summary (2 sentences)
- Feature description
- Scenario 3 only
- Sessions 1-2 summary
src/middleware/rate-limit.ts
What the agent implements:
- TTL set on the Redis key so the counter expires at the window boundary
- Retry-After value calculated from window boundary
- Acceptance test for Scenario 3
Commit: feat: expire rate limit counter at window boundary
Session 4 - Scenario 4 (unauthenticated bypass)
Context loaded:
- Intent summary (2 sentences)
- Feature description
- Scenario 4 only
- Sessions 1-3 summary
src/middleware/rate-limit.ts
What the agent implements:
- Early return in the middleware when no authenticated client ID is present
- Acceptance test for Scenario 4
Commit: feat: bypass rate limiting for unauthenticated requests
What the session sequence produces
Four commits, each independently reviewable. Each commit corresponds to a named, human-defined behavior. The pipeline is green after every commit. The context in each session was small: intent summary, one scenario, one file, a brief summary of prior work.
A reviewer can look at Session 2’s commit and understand exactly what it does and why without reading the full feature history. That is the same property CI produces for human-written code.
The Commit as Context Boundary
The commit is not just a version control operation. In an agent workflow, it is the context boundary.
Before the commit: the agent is building toward a green state. The session context is open.
After the commit: the state is known, captured, and stable. The next session starts from this stable state - not from the middle of an in-progress conversation.
This has a practical implication: do not let an agent session span a commit boundary. A session that starts implementing Scenario 1 and then continues into Scenario 2 accumulates context from both, mixes the conversation history of two distinct units, and produces a commit that cannot be reviewed cleanly. Stop the session at the commit. Start a new session for the next scenario.
When the Pipeline Fails
If the pipeline fails mid-session, the session is not done. Do not summarize completed work and do not start a new session. The agent’s job in this session is to get the pipeline green.
If the pipeline fails in a later session (a prior scenario breaks), the agent must restore the passing state before implementing the new scenario. This is the same discipline as the CI rule: while the pipeline is red, the only valid work is restoring green. See ACD constraint 8.
Related Content
- ACD Workflow - the full workflow these sessions implement, including constraint 8 (pipeline red means restore-only work)
- Agent-Assisted Specification - how to establish the broad understanding before sessions begin
- Small Batches - the same discipline applied to human-authored work
- Work Decomposition - vertical slicing defined for both full-stack product teams and subdomain product teams in distributed systems
- Horizontal Slicing - the anti-pattern that emerges when distributed teams split work by layer instead of by behavior within their domain
- The Four Prompting Disciplines - context engineering and specification engineering applied to session design
- Tokenomics - why context size matters and how to control it
- Agent Delivery Contract - the artifacts that anchor each session’s context
- Pitfalls and Metrics - failure modes including the review queue backup that small sessions prevent