Agentic Architecture Patterns
28 minute read
Agentic workflow architecture is a software design problem. The same principles that prevent spaghetti code in application software - single responsibility, well-defined interfaces, separation of concerns - prevent spaghetti agent systems. The cost of getting it wrong is measured in token waste, cascading failures, and workflows that break when you swap one model for another.
This page assumes familiarity with Agent Delivery Contract. After reading this page, see Coding & Review Setup for a concrete implementation of these patterns applied to coding and pre-commit review.
Overview
A multi-agent system that was not deliberately designed looks like a distributed monolith: everything depends on everything else, context passes unchecked through every boundary, and no component has clear ownership. Add token costs to the usual distributed systems failure modes and the problem compounds: a carelessly assembled context bundle that reaches a frontier model five times per workflow iteration is not a minor inefficiency, it is a recurring tax on every workflow run.
Three failure patterns appear consistently in poorly structured agentic systems:
Token waste from undisciplined context. Without explicit rules about what passes between components, agents accumulate context until the window fills or costs spike. An agent that receives a 50,000-token context when its actual task requires 5,000 tokens wastes 90% of its input budget on every invocation.
Cascading failures from missing error boundaries. When one agent’s unstructured prose output becomes another agent’s input, parsing ambiguity becomes a failure source. A model that produces a slightly different output format than expected on one run can silently corrupt downstream agent behavior without triggering any explicit error.
Brittle workflows from model-coupled instructions. Skills and commands written for one model’s specific instruction style often degrade when run on a different model. Workflows that hard-code model-specific behaviors - Claude’s particular handling of XML tags, Gemini’s response to certain role descriptions - cannot be handed off or used in multi-model configurations without manual rewriting.
Getting architecture right addresses all three. The sections below give patterns for each component type: skills, agents, commands, hooks, and the cross-cutting concerns that tie them together.
Key takeaways:
- Undisciplined context passing is the primary cost driver in agentic systems.
- Structured outputs at every agent boundary eliminate parsing-based cascade failures.
- Model-agnostic design is achievable by separating task logic from model-specific invocation details.
Skills
What a Skill Is
A skill is a named, reusable procedure that an agent can invoke by name. It encodes a sequence of steps, a set of rules, or a decision procedure that would otherwise need to be re-derived from scratch each time the agent encounters a given situation.
Skills are not plugins or function calls in the API sense. They are instruction documents - typically markdown files - that are injected into an agent’s context when invoked. The agent reads the skill, follows its instructions, and returns a result. The skill has no runtime; it is pure specification.
This distinction matters. Because a skill is just text, it works across models that can read and follow natural language instructions. Claude, Gemini, and any other capable model can follow the same skill document. This is the foundation of model-agnostic workflow design.
Single Responsibility
A skill should do one thing. The temptation to combine related procedures into a single skill (“review code AND write the commit message AND update the changelog”) produces a skill that is hard to test, hard to maintain, and hard to invoke selectively. When a multi-step procedure fails, a single-responsibility skill makes it obvious which step went wrong and where to look.
Signs a skill is doing too much:
- The skill name contains “and”
- The skill has conditional branches that activate completely different code paths depending on input
- Different sub-agents invoke the skill but only use half of it
Signs a skill should be extracted:
- The same sequence of steps appears in two or more larger skills
- A step in a skill has grown to match the complexity of the skill itself
- A sub-agent needs only part of a skill’s behavior but must receive all of it
When to Inline vs. Extract
Inline instructions when a procedure is used exactly once, is tightly coupled to the specific agent’s context, or is too short to justify its own file (under 5-6 lines of instruction). Extract to a skill file when a procedure is reused, when it will be maintained independently of the agent configuration, or when it is long enough that reading the agent’s system prompt requires scrolling past it.
A useful test: if you replaced the inline instruction with a skill reference, would the agent system prompt read more clearly? If yes, extract it.
File and Folder Structure
Organize skills in a flat or two-level hierarchy within a skills/ directory. Avoid deeply nested skill trees - when an agent needs to invoke a skill, it should be obvious where to find it.
Keeping separate skills/ directories per model is not duplication if the skills differ in ways specific to that model’s behavior. It is a problem if the skills differ only because they were written at different times by different people without a shared template. The goal is model-agnostic skills that live in a shared location; model-specific variants should be the exception and should be explicitly labeled as such.
Writing Model-Agnostic Skill Instructions
Skills written to exploit one model’s specific behaviors create lock-in. The following practices produce skills that transfer well:
Use explicit imperative steps, not conversational prose. Both Claude and Gemini follow numbered step lists more reliably than embedded instructions in flowing text.
State output format explicitly. Do not assume a model will infer the desired output format from context. Specify it. “Return a JSON object with the schema shown below” is unambiguous. “Return the results” is not.
Avoid model-specific XML or prompt syntax. Claude responds to <instructions> tags; Gemini does not require them. Skills that depend on XML delimiters need adaptation when moved between models. Use plain markdown structure instead.
State scope and early exit conditions. Both models benefit from explicit scope limits (“analyze only the files in the staged diff”) and early exit conditions (“if the diff contains only comments and whitespace, return an empty findings list immediately”). These reduce unnecessary processing and keep outputs predictable.
Claude Implementation Example
Gemini Implementation Example
The same skill for Gemini. The task logic is identical. The structural differences reflect Gemini’s preference for explicit role framing and its handling of early exit conditions:
The differences are explicit: Gemini benefits from named input fields (bdd_scenario, test_file) and an explicit role statement. Claude handles the simpler inline description of inputs without role framing. Both produce the same JSON output, which means the skill is interchangeable at the orchestration layer even though the instruction text differs.
Key takeaways:
- Skills are instruction documents, not code. They work across any model that can follow natural language instructions.
- Single responsibility prevents unclear failure attribution and oversized context bundles.
- Model-agnostic skills share task logic; model-specific variants differ only in structural framing, not in output contract.
Agents
Defining Agent Boundaries
An agent boundary is a context boundary and a responsibility boundary. What an agent knows, what it can do, and what it must return are determined by what crosses the boundary.
Define boundaries by asking: what is the smallest coherent unit of work this agent can own? “Coherent” means the agent can complete its work without reaching outside its assigned context. An agent that regularly requests additional files, broader system context, or information from other agents mid-task has a boundary problem - its responsibility was scoped incorrectly.
Responsibility and context are coupled. An agent with a narrow responsibility needs a small context. An agent with a broad responsibility needs a large context and likely should be decomposed.
When One Agent Is Enough
Use a single agent when:
- The workflow has one clear task with a well-scoped context requirement
- The work is short enough to complete within a single context window without degradation
- There is no meaningful parallelism available (each step depends on the previous step’s output)
- The cost of the inter-agent communication overhead exceeds the cost of doing the work in a single agent
Decomposing into multiple agents introduces latency, context assembly overhead, and additional failure surfaces. Do not decompose for the sake of architectural elegance. Decompose when there is a concrete benefit: parallelism, context budget enforcement, or specialized model routing.
When to Decompose
Decompose when:
- Parallel execution is possible and would meaningfully reduce latency (review sub-agents running concurrently instead of sequentially)
- Different tasks within a workflow have different model tier requirements (routing cheap coordination to a small model, expensive reasoning to a frontier model)
- A task has grown too large to fit in a single well-scoped context without degrading output quality
- Separation of concerns requires that one agent not be able to see or influence another agent’s domain (the implementation agent must not perform its own review)
Passing Context Without Bloat
Context passed between agents must be explicitly scoped. The default should be “send only what this agent needs,” not “send everything the orchestrator has.”
Rules for inter-agent context:
- Define a schema for what each agent receives. Treat it like an API contract.
- Send structured data (JSON, YAML) rather than prose summaries. Prose requires the receiving agent to parse intent; structured data makes intent explicit.
- Strip conversation history at every boundary. The receiving agent needs the result of prior work, not the reasoning that produced it.
- Send diffs, not full file contents, when the agent’s task is about changes.
Handling Failure Modes
Agent failures fall into three categories, each requiring a different response:
Hard failure (the agent returns an error or a malformed response). Retry once with identical input. If the second attempt fails, escalate to the orchestrator with the raw error; do not attempt to interpret it in the calling agent.
Soft failure (the agent returns a valid response indicating a blocking issue). This is not a failure of the agent - it is the agent doing its job. Route the finding to the appropriate handler (typically returning it to the implementation agent for resolution) without treating it as an error condition.
Silent degradation (the agent returns a valid-looking response that is subtly wrong). This is the hardest failure mode to detect. Defend against it with output schemas and schema validation at every boundary. A response that does not conform to the expected schema should be treated as a hard failure, not silently accepted.
Multi-Agent Pipeline Example: Release Readiness Checks
The following example shows a release readiness pipeline with Claude as orchestrator and Gemini as a specialized long-context sub-agent. A release candidate artifact is routed to three parallel checks - changelog completeness, documentation coverage, and dependency audit - each receiving only what its specific check requires.
This configuration makes sense when the changelog or dependency manifest is large enough that a single-agent approach risks context window degradation. Gemini handles the large-context changelog analysis; Claude handles routing and the two lighter checks.
Orchestrator (Claude) - context assembly and routing:
Changelog review sub-agent (Gemini) - specialized for long changelog analysis:
In this configuration, Claude handles orchestration because routing and context assembly do not require long-context capability. Gemini handles changelog review because a full changelog for a major release can be large enough to crowd out other context in a smaller window. Neither assignment is mandatory - the point is that the structured interface (JSON input, JSON output with a defined schema) makes the sub-agent swappable. Replacing the Gemini changelog agent with a Claude one requires changing only the invocation target, not the orchestration logic.
For a concrete application of this pattern to coding and pre-commit review - including full system prompt rules for each agent - see Coding & Review Setup.
Key takeaways:
- Agent boundaries are context boundaries. Scope responsibility so an agent can complete its task without reaching outside its assigned context.
- Decompose when there is concrete benefit: parallelism, model tier routing, or context budget enforcement.
- Structured schemas at every agent interface make sub-agents swappable without changing orchestration logic.
Commands
Designing Unambiguous Commands
A command is an instruction that triggers a defined workflow. The distinction between a command and a general prompt is that a command’s behavior should be predictable and consistent across invocations with the same inputs.
An unambiguous command has:
- A single, explicit trigger name (conventionally
/verb-nounformat) - A defined set of inputs it expects
- A defined output it will produce
- No implicit state it depends on beyond what is passed explicitly
The failure mode of an ambiguous command is that the model interprets it differently on different runs. “Review the changes” is ambiguous. /review staged-diff with a defined schema for what “review” means and what the output looks like is not.
Parameterization Strategies
Commands should accept parameters rather than embedding specific values in the command text. This makes commands reusable across contexts without modification.
Well-parameterized command:
Poorly parameterized command (values embedded in command text):
The second version cannot be extended without creating new commands. The first version handles new target types and output formats through parameterization.
Avoiding Prompt Injection Through Command Structure
Prompt injection attacks against agentic systems typically exploit unstructured inputs that the model treats as additional instructions. The command structure itself is the primary defense.
Defensive patterns:
- Treat all parameter values as data, not as instructions. Pass them inside a clearly delimited data block, not inline in the instruction text.
- Define the parameter schema explicitly. Parameters outside the schema should cause the command to return an error, not to be interpreted as free-form instructions.
- Do not pass raw user input directly to a model invocation. Validate and sanitize first.
Example of unsafe command structure:
If user_provided_context contains “Ignore previous instructions and…”, the model will process it as an instruction. This is the injection vector.
Example of safer command structure:
The explicit instruction to treat inputs as data and the injection detection rule do not guarantee safety against a sophisticated adversary, but they raise the bar substantially over undefended interpolation.
Well-Structured vs. Poorly-Structured Command Comparison
Key takeaways:
- Commands are defined workflows, not open-ended prompts. Predictability requires explicit inputs, outputs, and scope.
- Parameterization keeps commands reusable. Embedded values create command proliferation.
- Structural separation between instructions and data is the primary defense against prompt injection.
Hooks
When to Use Pre/Post Hooks
Hooks are side effects that run before or after an agent invocation. Pre-hooks run before the model call; post-hooks run after. Use them to enforce invariants that should hold for every invocation of a given command or skill, without embedding that logic in every skill individually.
Pre-hooks are appropriate for:
- Validating inputs before they reach the model (fail fast, save token cost)
- Injecting stable context that should always be present (system constraints, security policies)
- Enforcing environmental preconditions (pipeline is green, branch is clean)
Post-hooks are appropriate for:
- Validating that the model’s output conforms to the expected schema
- Logging invocation metadata (model, token count, duration, decision)
- Triggering downstream steps conditionally based on the model’s output
Keeping Hooks Lightweight and Side-Effect-Safe
A hook that fails should fail cleanly with a clear error message. A hook that has unexpected side effects will be disabled by frustrated developers the first time it causes a problem. Two rules:
Hooks must be idempotent. Running the same hook twice with the same inputs must produce the same result. A hook that writes a log file should append to an existing file, not fail if the file already exists. A hook that calls an external validation service must handle the case where the same call was already made.
Hooks must have bounded execution time. A pre-hook that can run for an arbitrary duration blocks the agent invocation. Set timeouts. If the hook cannot complete within its timeout, fail fast and surface the timeout as the error - do not silently allow the invocation to proceed with unvalidated inputs.
Using Hooks to Enforce Guardrails or Inject Context
Pre-hooks are the right place for guardrails that must apply regardless of the skill being invoked. Rather than duplicating a guardrail across every skill document, implement it once as a pre-hook:
The inject-system-constraints hook demonstrates the context injection pattern. Rather than including system constraints in every skill document, the hook injects them at invocation time. This guarantees they are always present without creating maintenance risk from outdated copies embedded in individual skill files.
A Cross-Model Hook Example
The following hook works identically regardless of whether Claude or Gemini is being invoked. It validates that the agent’s output conforms to the expected JSON schema before the orchestrator processes it.
This hook exits with a non-zero code if the output is malformed, which causes the orchestrator to treat the invocation as a hard failure. The hook does not know or care whether the output came from Claude or Gemini - it validates the contract, not the model.
Key takeaways:
- Pre-hooks enforce preconditions; post-hooks validate outputs. Both must be idempotent and bounded in execution time.
- Guardrails implemented as hooks apply universally without being duplicated across skill documents.
- Output schema validation as a post-hook is the primary defense against silent degradation at agent boundaries.
Cross-Cutting Concerns
Logging and Observability
Every agent invocation should produce a structured log record. Debugging an agentic workflow without structured logs is impractical - invocations are non-deterministic, inputs vary, and failures manifest differently across runs.
Minimum log record per invocation:
Track at the workflow level, not the call level. A single /review command may invoke four sub-agents. The relevant metric is total token cost and duration for the /review command, not the cost of each sub-agent call in isolation.
Both Claude and Gemini expose token counts in their API responses. Claude exposes them under usage.input_tokens and usage.output_tokens with separate fields for cache_read_input_tokens and cache_creation_input_tokens. Gemini exposes them under usageMetadata.promptTokenCount and usageMetadata.candidatesTokenCount. Normalize these into a shared log schema in your orchestration layer.
Idempotency
Agentic workflows will be retried - by developers manually, by CI systems automatically, and by error recovery paths. A workflow that is not idempotent will produce inconsistent state when retried.
Rules for idempotent agent workflows:
- Assign a stable ID to each workflow run at start time. Use this ID for deduplication in any downstream systems the workflow touches.
- Agent invocations that produce the same output for the same input are naturally idempotent. State-changing side effects (writing files, calling external APIs) require explicit deduplication.
- Write-once outputs (session summaries, review findings written to a file) should check for existing output before writing. A retry that overwrites a passing review finding with a new failing one has broken idempotency.
Testing Agentic Workflows
Testing agentic workflows requires testing at multiple levels:
Skill unit tests. Test each skill document in isolation by invoking it with controlled inputs and asserting on the output structure. Use a deterministic input set (a known diff, a known scenario) and verify that the output schema is correct and that the decision matches expectations.
Agent integration tests. Test the full agent with a controlled context bundle. These tests will not be perfectly deterministic across model versions, but they should produce consistent structural outputs (valid JSON, correct schema, plausible decisions) for a given stable input.
Workflow end-to-end tests. Test the full workflow path with a representative scenario. These are slower and more expensive but necessary to catch problems that only emerge at the orchestration layer, such as context assembly bugs or incorrect routing decisions.
A useful heuristic: if a skill cannot be tested with a controlled input-output pair, it is not well-scoped enough. The ability to write a unit test for a skill is a signal that the skill has a clear responsibility and a defined contract.
Model-Agnostic Abstraction Layer
The abstraction layer between your workflow logic and the specific model API is the most important structural decision in a multi-model agentic system. Without it, every change in model availability, pricing, or capability requires changes throughout the orchestration logic.
A minimal abstraction layer defines a ModelClient interface with a single invoke method that accepts a context bundle and returns a structured response:
With this layer in place, the orchestrator does not reference Claude or Gemini directly. It holds a ModelClient reference and calls invoke(). Swapping models means changing the client instantiation at configuration time, not rewriting orchestration logic.
Where Claude and Gemini differ at the API level:
- System prompt placement. Claude separates system content via the
systemparameter. Gemini usessystemInstruction. Your abstraction layer must handle this mapping. - Prompt caching. Claude’s prompt caching uses cache-control annotations on specific message blocks. Gemini’s implicit caching triggers automatically on long stable prefixes. Caching strategies differ and cannot be abstracted into a single identical interface - expose caching as an optional configuration, not a required behavior.
- Structured output support. Claude returns structured outputs through its response format parameter (JSON mode). Gemini supports structured output through
responseMimeTypeandresponseSchemain the generation config. If your workflows require structured output enforcement at the API level (beyond instructing the model in the prompt), handle this in the concrete client implementations, not in the abstraction layer. - Token counting. The field names differ (noted in the Logging section above). Normalize in the abstraction layer.
Key takeaways:
- Every agent invocation should emit a structured log record with token counts and duration.
- Idempotency requires explicit deduplication for any state-changing side effects in a workflow.
- A model-agnostic abstraction layer is the single most important structural investment for multi-model systems.
Anti-patterns
1. The Monolithic Orchestrator
What it looks like: One agent handles orchestration, implementation, review, and summarization. It receives the full project context on every invocation and runs to completion in a single long-running session.
Why it fails: Context accumulates until quality degrades or the window fills. There is no opportunity to route subtasks to cheaper models. A failure anywhere in the monolithic run requires restarting from the beginning. The agent cannot be parallelized.
What to do instead: Decompose into an orchestrator with single-responsibility sub-agents. Each agent receives only the context its task requires. The orchestrator coordinates; it does not execute.
2. Natural Language Agent Interfaces
What it looks like: Agents communicate by passing prose summaries to each other. “The implementation agent completed the login feature. The tests pass and the code looks good. Please proceed with the review.”
Why it fails: Prose is ambiguous. A downstream agent must parse intent from the prose, which introduces a failure point that becomes more likely as model outputs vary between invocations. Prose is also token-inefficient: the same information encoded as JSON takes fewer tokens and is unambiguous.
What to do instead: Define a JSON schema for every agent interface. Agents return structured data. Orchestrators parse structured data. Natural language is reserved for human-readable summaries, not inter-agent communication.
3. Context That Does Not Expire
What it looks like: Session context grows continuously. Prior session conversations are appended rather than summarized. The implementation agent receives the full history of all prior sessions because “it might need it.”
Why it fails: Context that does not expire grows without bound. Token costs increase linearly with context size. Model performance on tasks can degrade as context grows, particularly for tasks in the middle of a large context window. Context that is always present but rarely relevant is a tax on every invocation.
What to do instead: Summarize at session boundaries. A session summary of 100-150 words replaces a full session conversation for future contexts. The summary contains what the next session needs - not what happened, but what exists and what state the system is in.
4. Skills Written for One Model’s Idiosyncrasies
What it looks like: Skills use Claude-specific XML delimiters (<examples>, <context>), or Gemini-specific role framing that other models do not respond to. The skill file has comments like “this only works on Claude Opus.”
Why it fails: Model-specific skills create lock-in. A skill library that cannot be used with a different model cannot survive a pricing change, a capability change, or an organizational decision to switch providers. Testing is harder because the skill cannot be validated against a cheaper model during development.
What to do instead: Write skills using plain markdown structure. Numbered steps, explicit input/output schemas, and early exit conditions work consistently across capable models. When a model-specific variant is genuinely necessary, isolate it in a model-specific subdirectory and document why it differs.
5. Missing Output Schema Validation
What it looks like: The orchestrator passes an agent’s response directly to the next step without validating that the response conforms to the expected schema. If the model produces a slightly malformed JSON object, the downstream step either fails with an opaque error or silently processes incorrect data.
Why it fails: Models do not produce perfectly consistent structured output on every invocation. Occasional schema violations are normal and expected. Without validation, these violations propagate downstream before manifesting as failures, making the root cause hard to trace.
What to do instead: Validate schema at every agent boundary using a post-invoke hook. A non-conforming response is a hard failure at the boundary where it occurred, not an opaque error two steps downstream.
6. Hooks With Unconstrained Side Effects
What it looks like: A pre-hook makes a network call to an external service to validate an input. The external service is occasionally slow or unavailable. On slow runs, the hook blocks the agent invocation for several minutes. On unavailability, the hook fails in a way that leaves partial state in the external service.
Why it fails: Hooks with unconstrained side effects are unpredictable. A hook that can fail in an unclean way, block for an unbounded duration, or write partial state to an external system will be disabled by the team after the first time it causes a production incident or a corrupted workflow run.
What to do instead: Hooks must have explicit timeouts. All external calls in hooks must be idempotent. A hook that cannot complete idempotently within its timeout must fail fast and surface the timeout as a clear error, not silently allow the invocation to proceed.
7. Swapping Models Without Adjusting Context Structure
What it looks like: A workflow designed for Claude is migrated to Gemini by changing only the API call. The skill documents, context assembly order, and prompt structure remain unchanged.
Why it fails: Claude and Gemini have different behaviors around context structure. Prompt caching works differently (Claude requires explicit cache annotations; Gemini uses implicit prefix matching). System prompt handling differs. Some instruction patterns that Claude follows reliably require adjustment for Gemini. A direct swap without validation produces degraded and unpredictable outputs.
What to do instead: Treat a model swap as a migration, not a configuration change. Test each skill against the new model with controlled inputs. Adjust context structure, system prompt placement, and output instructions as needed. Use the model-agnostic abstraction layer so that only the concrete client and the per-model skill variants need to change.
Related Content
- Coding & Review Setup - a concrete orchestrator and sub-agent configuration applying these patterns
- Tokenomics - the full optimization framework for token cost management
- Small-Batch Sessions - how session discipline maps to the skill and hook patterns here
- Pipeline Enforcement and Expert Agents - how the same agent patterns operate as CI pipeline gates
- Agent Delivery Contract - the structured artifacts that flow between agents as defined interfaces
- Pitfalls and Metrics - failure modes and measurement for agentic workflows