These pages cover the operational side of ACD: how the pipeline enforces constraints, how to manage token costs, and how to measure whether agentic delivery is working.
Operations & Governance
- 1: Pipeline Enforcement and Expert Agents
- 2: Tokenomics: Optimizing Token Usage in Agent Architecture
- 3: Pitfalls and Metrics
1 - Pipeline Enforcement and Expert Agents
The pipeline is the enforcement mechanism for agentic continuous delivery (ACD). Standard quality gates handle mechanical checks. Expert validation agents handle the judgment calls that standard tools cannot make.
For the framework overview, see ACD. For the artifacts the pipeline enforces, see Agent Delivery Contract.
How Quality Gates Enforce ACD
The Pipeline Verification and Deployment stages of the ACD workflow are where the Pipeline Reference Architecture does the heavy lifting. Each pipeline stage enforces a specific ACD constraint:
- Pre-commit gates (linting, type checking, secret scanning, SAST) catch the mechanical errors agents produce most often: style violations, type mismatches, and accidentally embedded secrets. These run in seconds and give the agent immediate feedback.
- CI Stage 1 (build + unit tests) validates the acceptance criteria. If human-defined tests fail, the agent’s implementation is wrong regardless of how plausible the code looks.
- CD Stage 1 (contract + schema tests) enforces the system constraints artifact at integration boundaries. Agent-generated code is particularly prone to breaking implicit contracts between modules or services.
- CD Stage 2 (mutation testing, performance benchmarks, security integration tests) catches the subtle correctness issues that agents introduce: code that passes tests but violates non-functional requirements or leaves untested edge cases.
- Acceptance tests validate the user-facing behavior artifact in a production-like environment. This is where the BDD scenarios become automated verification.
- Production verification (canary deployment, health checks, SLO monitors with auto-rollback) provides the final safety net. If agent-generated code degrades production metrics, it rolls back automatically.
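The stage sequence above can be sketched as an ordered gate runner that stops at the first failure, so the agent gets feedback as early as possible. This is a minimal illustration; the `GateResult` type and the stub gate functions are hypothetical, not part of any specific CI system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class GateResult:
    gate: str
    passed: bool
    reason: str = ""

def run_pipeline(change: Dict, gates: List[Callable[[Dict], GateResult]]) -> List[GateResult]:
    """Run gates in order; stop at the first failure so feedback is immediate."""
    results = []
    for gate in gates:
        result = gate(change)
        results.append(result)
        if not result.passed:
            break  # fail fast: the agent never waits on later, slower stages
    return results

# Stub gates standing in for the real stages described above.
def lint(change: Dict) -> GateResult:
    return GateResult("pre-commit-lint", "secrets" not in change)

def unit_tests(change: Dict) -> GateResult:
    return GateResult("ci-unit-tests", change.get("tests_pass", False))

results = run_pipeline({"tests_pass": True}, [lint, unit_tests])
```

The ordering mirrors the text: cheap mechanical checks run first, and a failure short-circuits everything downstream.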
The Pre-Feature Baseline
The pre-feature baseline lists the required baseline gates that must be active before any feature work begins. These are a prerequisite for ACD. Without them passing on every commit, agent-generated changes bypass the minimum safety net.
See the pipeline patterns for concrete architectures that implement these gates.
Expert Validation Agents
Standard quality gates cover what conventional tooling can verify: linting, type checking, test execution, vulnerability scanning. But ACD introduces validation needs that standard tools cannot address. No conventional tool can verify that test code faithfully implements a human-defined test specification. No conventional tool can verify that an agent-generated implementation matches the architectural intent in a feature description.
Expert validation agents fill this gap. These are AI agents dedicated to a specific validation concern, running as pipeline gates alongside standard tools. The following are examples, not an exhaustive list - teams should create expert agents for whatever validation concerns their pipeline requires:
| Example Agent | What It Validates | Catches | Artifact It Enforces |
|---|---|---|---|
| Test fidelity agent | Test code exercises the scenarios, edge cases, and assertions defined in the test specification | Agent-generated tests that omit edge cases or weaken assertions | Acceptance Criteria |
| Implementation coupling agent | Test code verifies observable behavior, not internal implementation details | Tests that break when implementation is refactored without any behavior change | Acceptance Criteria |
| Architectural conformance agent | Implementation follows the constraints in the feature description | Code that crosses a module boundary or uses a prohibited dependency | Feature Description |
| Intent alignment agent | The combined change addresses the problem stated in the intent description | Implementations that are technically correct but solve the wrong problem | Intent Description |
| Constraint compliance agent | Code respects system constraints that static analysis cannot check | Violations of logging standards, feature flag requirements, or audit rules | System Constraints |
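An expert agent running as a pipeline gate needs its verdict in a machine-checkable form. One way to do that, sketched below with a stubbed model response, is to prompt the agent for a JSON verdict and parse it into a pass/fail result. The verdict schema here is illustrative, not a standard.

```python
import json

def parse_verdict(raw: str) -> dict:
    """Parse an expert agent's structured verdict into a gate result.

    The agent is prompted to reply with JSON of the form
    {"pass": bool, "reasoning": str, "violations": [str, ...]}.
    """
    verdict = json.loads(raw)
    return {
        "passed": bool(verdict["pass"]),
        "reasoning": verdict.get("reasoning", ""),
        "violations": verdict.get("violations", []),
    }

# A stubbed response, standing in for a real model call from a test
# fidelity agent that found an uncovered edge case.
raw = (
    '{"pass": false,'
    ' "reasoning": "Edge case from the test specification is untested",'
    ' "violations": ["missing-edge-case"]}'
)
result = parse_verdict(raw)
```

Because the verdict is structured, the pipeline can fail the gate and surface the reasoning to the agent or reviewer without any natural language parsing.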
Adopting Expert Agents: The Same Replacement Cycle
Do not deploy expert agents and immediately reduce human review. Expert validation agents need calibration before they can replace human judgment. An agent that flags too many false positives trains the team to ignore it. An agent that misses real issues creates false confidence. Run expert agents in parallel with human review for at least 20 cycles before any reduction in human coverage.
Expert validation agents are new automated checks. Adopt them using the same replacement cycle that drives every brownfield CD migration:
- Identify a manual validation currently performed by a human reviewer. For example, checking whether test code actually tests what the specification requires.
- Automate the check by deploying an expert agent as a pipeline gate. The agent runs on every change and produces a pass/fail result with reasoning.
- Validate by running the expert agent in parallel with the existing human review. Compare results across at least 20 review cycles. If the agent matches human decisions on 90%+ of cases and catches at least one issue the human missed, proceed to the removal step.
- Remove the manual check once the expert agent has proven at least as effective as the human review it replaces.
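The comparison in the validate step can be made mechanical. A minimal sketch, using the thresholds from the text (20+ cycles, 90%+ agreement, at least one issue the human missed); the tuple format is an assumption for illustration.

```python
def ready_to_replace(decisions, min_cycles=20, min_agreement=0.9):
    """Decide whether an expert agent can replace a manual review step.

    `decisions` is a list of (agent_verdict, human_verdict, agent_caught_extra)
    tuples collected while the agent runs in parallel with human review.
    """
    if len(decisions) < min_cycles:
        return False  # not enough parallel cycles to judge
    agreement = sum(a == h for a, h, _ in decisions) / len(decisions)
    caught_extra = any(extra for _, _, extra in decisions)
    return agreement >= min_agreement and caught_extra

# 19 cycles of agreement plus one where the agent caught an issue
# the human reviewer missed.
cycles = [(True, True, False)] * 19 + [(False, False, True)]
```

Encoding the threshold keeps the removal decision auditable rather than a judgment made from memory.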
Expert validation agents run on every change, immediately, eliminating the batching that manual review imposes. Humans steer; agents validate at pipeline speed.
With the pipeline and expert agents in place, the next question is what goes wrong and how to measure progress. See Pitfalls and Metrics.
Related Content
- Agentic Architecture Patterns - multi-agent pipeline patterns and hook design for enforcement workflows
- ACD - the framework overview, eight constraints, and workflow
- Agent Delivery Contract - the artifacts the pipeline enforces
- Pipeline Reference Architecture - the full quality gate sequence
- Replacing Manual Validations - the replacement cycle for adopting automated checks
- Pitfalls and Metrics - what goes wrong and how to measure progress
- AI Adoption Roadmap - the prerequisite sequence, especially Harden Guardrails and Reduce Delivery Friction
2 - Tokenomics: Optimizing Token Usage in Agent Architecture
Token costs are an architectural constraint, not an afterthought. Treating them as a first-class concern alongside latency, throughput, and reliability prevents runaway costs and context degradation in agentic systems.
Every agent boundary is a token budget boundary. What passes between components represents a cost decision. Designing agent interfaces means deciding what information transfers and what gets left behind.
What Is a Token?
A token is roughly three-quarters of a word in English. Billing, latency, and context limits all depend on token consumption rather than word counts or API call counts. Three factors determine your costs:
- Input vs. output pricing - Output tokens cost 2-5x more than input tokens because generating tokens is computationally more expensive than reading them. Instructions to “be concise” yield higher returns than most other optimizations because they directly reduce the expensive side of the equation.
- Context window size - Large context windows (150,000+ tokens) create false confidence. Extended contexts increase latency, increase costs, and can degrade model performance when relevant information is buried mid-context.
- Model tier - Frontier models cost 10-20x more per token than smaller alternatives. Routing tasks to appropriately sized models is one of the highest-leverage cost decisions.
How Agentic Systems Multiply Token Costs
Single-turn interactions have predictable, bounded token usage. Agentic systems do not.
Context grows across orchestrator steps. Sub-agents receive oversized context bundles containing everything the orchestrator knows, not just what the sub-agent needs. Retries and branches multiply consumption - a failed step that retries three times costs four times the tokens of a step that succeeds once. Long-running agent sessions accumulate conversation history until the context window fills or performance degrades.
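The retry arithmetic above is worth making explicit, since it compounds across every step of a workflow. A trivial sketch:

```python
def step_cost(tokens_per_attempt: int, retries: int) -> int:
    """Total tokens for a step: the original attempt plus each retry.

    A step that retries three times costs four attempts' worth of tokens,
    matching the 4x figure in the text.
    """
    return tokens_per_attempt * (1 + retries)

clean = step_cost(5_000, retries=0)
flaky = step_cost(5_000, retries=3)
```

A step whose retry rate drifts from 0 to 3 quadruples its cost without any change to the prompt, which is why retry rates belong in workflow-level measurement.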
Optimization Strategies
1. Context Hygiene
Strip context that does not change agent behavior. Common sources of dead weight:
- Verbose examples that could be summarized
- Repeated instructions across system prompt and user turns
- Full conversation history when only recent turns are relevant
- Raw data dumps when a structured summary would serve
Test whether removing content changes outputs. If behavior is identical with less context, the removed content was not contributing.
2. Target Output Verbosity
Output costs more than input, so reducing output verbosity has compounding returns. Instructions to agents should specify:
- The response format (structured data beats prose for machine-readable outputs)
- The required level of detail
- What to omit
A code generation agent that returns code plus explanation plus rationale plus alternatives costs significantly more than one that returns only code. Add the explanation when needed; do not add it by default.
3. Structured Outputs for Inter-Agent Communication
Natural language prose between agents is expensive and imprecise. JSON or other structured formats reduce token count and eliminate ambiguity in parsing. Compare the two representations of the same finding:
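For illustration, here is a hypothetical review finding expressed both ways. The field names and the finding itself are invented for this example.

```python
import json

# The same finding, first as prose, then as structured data.
prose = (
    "I reviewed the checkout module and found that the discount calculation "
    "does not handle the case where the coupon has expired, which could allow "
    "customers to apply stale discounts. Severity seems high. See line 42 of "
    "pricing.py."
)

finding = {
    "module": "checkout",
    "issue": "expired coupon not rejected in discount calculation",
    "severity": "high",
    "location": "pricing.py:42",
}

as_json = json.dumps(finding)
# The structured form is markedly shorter, and a downstream agent can read
# `severity` directly instead of inferring it from prose.
```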
The JSON version conveys the same information in a fraction of the tokens and requires no natural language parsing step. When one agent’s output becomes another agent’s input, define a schema for that interface the same way you would define an API contract.
This applies directly to the agent delivery contract: intent descriptions, feature descriptions, test specifications, and other artifacts passed between agents should be structured documents with defined fields, not open-ended prose.
4. Strategic Prompt Caching
Prompt caching stores stable prompt sections server-side, reducing input costs on repeated requests. To maximize cache effectiveness:
- Place system prompts, tool definitions, and static instructions at the top of the context
- Group stable content together so cache hits cover the maximum token span
- Keep dynamic content (user input, current state) at the end where it does not invalidate the cached prefix
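The ordering rules above amount to a simple assembly discipline: the stable sections are identical across calls and come first, so a provider-side cache can reuse the prefix. A minimal sketch, with invented section contents:

```python
def assemble_prompt(system: str, tools: str, instructions: str, dynamic: str) -> list:
    """Order prompt sections for cache friendliness.

    Stable content (system prompt, tool definitions, static instructions)
    comes first and is byte-identical across calls; dynamic content goes
    last so it never invalidates the cached prefix.
    """
    return [system, tools, instructions, dynamic]

a = assemble_prompt("system prompt", "tool defs", "style rules", "user question 1")
b = assemble_prompt("system prompt", "tool defs", "style rules", "user question 2")
# Everything before the dynamic tail is a shared, cacheable prefix.
shared_prefix = a[:3]
```

Interleaving even one dynamic value into the stable sections would break the shared prefix for every subsequent call.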
For agents that run repeatedly against the same codebase or documentation, caching the shared context can reduce effective input costs substantially.
5. Model Routing by Task Complexity
Not every task requires a frontier model. Match model tier to task requirements:
| Task type | Appropriate tier | Relative cost |
|---|---|---|
| Classification, routing, extraction | Small model | 1x |
| Summarization, formatting, simple Q&A | Small to mid-tier | 2-5x |
| Code generation, complex reasoning | Mid to frontier | 10-20x |
| Architecture review, novel problem solving | Frontier | 15-30x |
An orchestrator using a frontier model to decide which sub-agent to call, when a small classifier would suffice, wastes tokens on both the decision and the overhead of a larger model.
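A routing table like the one above can live as plain configuration. The tier names and cost multipliers below follow the table and are not tied to any specific provider; the task taxonomy is an assumption for illustration.

```python
# task type -> (model tier, relative cost multiplier)
ROUTES = {
    "classification": ("small", 1),
    "extraction": ("small", 1),
    "summarization": ("mid", 3),
    "code_generation": ("frontier", 15),
    "architecture_review": ("frontier", 25),
}

def route(task_type: str) -> str:
    """Pick the cheapest adequate model tier for a task.

    Unknown task types default to mid-tier, not frontier: defaulting to the
    most expensive model is exactly the waste the text describes.
    """
    tier, _cost = ROUTES.get(task_type, ("mid", 3))
    return tier
```

Keeping the table as data means routing decisions can be reviewed and changed without touching orchestrator logic.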
6. Summarization Cadence
Long-running agents accumulate conversation history. Rather than passing the full transcript to each step, replace completed work with a compact summary:
- Summarize completed steps before starting the next phase
- Archive raw history but pass only the summary forward
- Include only the summary plus current task context in each agent call
This limits context growth without losing the information needed for the next step. Apply this pattern whenever an agent session spans more than a few turns.
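The cadence above reduces to a small context-building rule: pass the summary, a few recent turns, and the current task, and archive the rest. A sketch, with the `keep_last` window as an assumption:

```python
def next_context(history, summary, current_task, keep_last=2):
    """Build the context for the next agent call.

    Completed work travels forward only as a compact summary; raw history
    is archived elsewhere. Only the most recent turns and the current task
    ride along verbatim.
    """
    return [f"summary: {summary}"] + list(history[-keep_last:]) + [current_task]

history = [f"turn {i}" for i in range(50)]
ctx = next_context(history, "steps 1-48 done; schema migrated", "write tests for step 49")
```

Context size is now bounded by the summary length and the window, regardless of how long the session has run.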
7. Workflow-Level Measurement
Per-call token counts hide the true cost drivers. Measure token spend at the workflow level - aggregate consumption for a complete execution from trigger to completion.
Workflow-level metrics expose:
- Which orchestration steps consume disproportionate tokens
- Whether retry rates are multiplying costs
- Which sub-agents receive more context than their output justifies
- How costs scale with input complexity
Track cost per workflow execution the same way you track latency and error rates. Set budgets and alert when executions exceed them. A workflow that occasionally costs 10x the average is a design problem, not a billing detail.
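A workflow-level meter needs very little machinery: aggregate per execution, compare against a budget. A minimal sketch; the budget figure and workflow IDs are invented.

```python
from collections import defaultdict

class WorkflowMeter:
    """Aggregate token spend per workflow execution and flag budget overruns."""

    def __init__(self, budget: int):
        self.budget = budget
        self.spend = defaultdict(int)

    def record(self, workflow_id: str, step: str, tokens: int) -> None:
        self.spend[workflow_id] += tokens

    def over_budget(self):
        return [wid for wid, total in self.spend.items() if total > self.budget]

meter = WorkflowMeter(budget=50_000)
meter.record("wf-1", "plan", 8_000)
meter.record("wf-1", "implement", 30_000)
meter.record("wf-1", "retry-implement", 30_000)  # retries surface in the aggregate
meter.record("wf-2", "plan", 5_000)
```

Per-call dashboards would show four unremarkable numbers here; the workflow view shows one execution at 136% of budget, driven by a retry.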
8. Code Quality as a Token Cost Driver
Poorly structured or poorly named code is expensive in both token cost and output quality. When code does not express intent, agents must infer it from surrounding code, comments, and call sites - all of which consume context budget. The worse the naming and structure, the more context must load before the agent can do useful work.
Naming as context compression:
- A function named `processData` requires surrounding code, comments, and call sites before an agent can understand its purpose. A function named `calculateOrderTax` is self-documenting - intent is resolved by the name, not from the context budget.
- Generic names (`temp`, `result`, `data`) and single-letter variables shift the cost of understanding from the identifier to the surrounding code. That surrounding code must load into every prompt that touches the function.
- Inconsistent terminology across a codebase - the same concept called `user`, `account`, `member`, or `customer` in different files - forces agents to spend tokens reconciling vocabulary before applying logic.
Structure as context scope:
- Large functions that do many things cannot be understood in isolation. The agent must load more of the file, and often more files, to reason about a single change.
- Deep nesting and high cyclomatic complexity require agents to track multiple branches simultaneously, consuming context budget that would otherwise go toward the actual task.
- Tight coupling between modules means a change to one file requires loading several others to understand impact. A loosely coupled module can be provided as complete, self-contained context.
- Duplicate logic scattered across the codebase forces agents to either load redundant context or miss instances when making changes.
The correction loop multiplier:
A correction loop where the agent’s first output is wrong, reviewed, and re-prompted uses roughly three times the tokens of a successful first attempt. Poor code quality increases agent error rates, multiplying both the per-request token cost and the number of iterations required.
Refactoring for token efficiency:
Refactoring for human readability and refactoring for token efficiency are the same work. The changes that help a human understand code at a glance help an agent understand it with minimal context.
- Use domain language in identifiers. Names should match the language of the business domain. `calculateMonthlyPremium` is better than `calcPrem` or `compute`.
- Establish a ubiquitous language - a consistent glossary of terms used uniformly across code, tests, tickets, and documentation. Agents generalize more accurately when terminology is consistent.
- Extract functions until each has a single, nameable purpose. A function that can be described in one sentence can usually be understood without loading its callers.
- Apply responsibility separation at the module level. A module that owns one concept can be passed to an agent as complete, self-contained context.
- Define explicit interfaces at module boundaries. An agent working inside a module needs only the interface contract for its dependencies, not the implementation.
- Consolidate duplicate logic into one authoritative location. One definition is one context load; ten copies are ten opportunities for inconsistency.
Treat AI interaction quality as feedback on code quality. When an interaction requires more context than expected or produces worse output than expected, treat that as a signal that the code needs naming or structure improvement. Prioritize the most frequently changed files - use code churn data to identify where structural investment has the highest leverage.
Enforcing these improvements through the pipeline:
Structural and naming improvements degrade without enforcement. Two pipeline mechanisms keep them from slipping back:
- The architectural conformance agent catches code that crosses module boundaries or introduces prohibited dependencies. Running it as a pipeline gate means architecture decisions made during refactoring are protected on every subsequent change, not just until the next deadline.
- Pre-commit linting and style enforcement (part of the pre-feature baseline) catches naming violations before they reach review. Rules can encode domain language standards - rejecting generic names, enforcing consistent terminology - so that the ubiquitous language is maintained automatically rather than by convention.
Without pipeline enforcement, naming and structure improvements are temporary. With it, the token cost reductions they deliver compound over the lifetime of the codebase.
Self-correction through gate feedback:
When an agent generates code, gate failures from the architectural conformance agent or linting checks become structured feedback the agent can act on directly. Rather than routing violations to a human reviewer, the pipeline returns the failure reason to the agent, which corrects the violation and resubmits. This self-correction cycle keeps naming and structure improvements in place without human intervention on each change - the pipeline teaches the agent what the codebase standards require, one correction at a time. Over repeated cycles, the correction rate drops as the agent internalizes the constraints, reducing both rework tokens and review burden.
Applying Tokenomics to ACD Architecture
Agentic CD (ACD) creates predictable token cost patterns because the workflow is structured. Apply optimization at each stage:
Specification stages (Intent Description through Acceptance Criteria): These are human-authored. Keep them concise and structured. Verbose intent descriptions do not produce better agent outputs - they produce more expensive ones. A bloated intent description that takes 2,000 tokens to say what 200 tokens would cover costs 10x more at every downstream stage that receives it.
Test Generation: The agent receives the user-facing behavior, feature description, and acceptance criteria. Pass only these three artifacts, not the full conversation history or unrelated system context. An agent that receives the full conversation history instead of just the three specification artifacts consumes 3-5x more tokens with no quality improvement.
Implementation: The implementation agent receives the test specification and feature description. It does not need the intent description (that informed the specification). Pass what the agent needs for this step only.
Expert validation agents: Validation agents running in parallel as pipeline gates should receive the artifact being validated plus the specification it must conform to - not the complete pipeline context. A test fidelity agent checking whether generated tests match the specification does not need the implementation or deployment history. For a concrete application of model routing, structured outputs, prompt caching, and per-session measurement to a specific agent configuration, see Coding & Review Setup.
Review queues: Agent-generated change volume can inflate review-time token costs when reviewers use AI-assisted review tools. WIP limits on the agent’s change queue (see Pitfalls) also function as a cost control on downstream AI review consumption.
The Constraint Framing
Tokenomics is a design constraint, not a post-hoc optimization. Teams that treat it as a constraint make different architectural decisions:
- Agent interfaces are designed to pass the minimum necessary context
- Output formats are chosen for machine consumption, not human readability
- Model selection is part of the architecture decision, not the implementation detail
- Cost per workflow execution is a metric with an owner, not a line item on a cloud bill
Ignoring tokenomics produces the same class of problems as ignoring latency: systems that work in development but fail under production load, accumulate costs that outpace value delivered, and require expensive rewrites to fix architectural mistakes.
Related Content
- Agentic Architecture Patterns - cross-cutting concerns including idempotency, model-agnostic abstraction, and structured inter-agent communication
- ACD - the framework overview, constraints, and workflow
- Agent Delivery Contract - the structured artifacts that token-efficient inter-agent communication depends on
- Pipeline Enforcement and Expert Agents - expert agents that run as pipeline gates and whose own token costs should be managed
- Pitfalls and Metrics - failure modes including review queue backup that compound token costs
- AI Adoption Roadmap - the sequence of prerequisites before optimizing agentic workflows
- Coding & Review Setup - a concrete application of model routing, structured outputs, prompt caching, and per-session measurement
Content contributed by Bryan Finster
3 - Pitfalls and Metrics
Each pitfall below has a root cause in the same two gaps: skipped agent delivery contract and absent pipeline enforcement. Fix those two things and most of these failures become impossible.
Key Pitfalls
1. Agent defines its own test scenarios
The failure is not the agent writing test code. It is the agent deciding what to test. When the agent defines both the test scenarios and the implementation, the tests are shaped to pass the code rather than verify the intent.
Humans define the test specifications before implementation begins. Scenarios, edge cases, acceptance criteria. The agent generates the test code from those specifications.
Validate agent-generated test code for two properties. First, it must test observable behavior, not implementation internals. Second, it must faithfully cover what the human specified. Skipping this validation is the most common way ACD fails.
What to do: Define test specifications (BDD scenarios and acceptance criteria) before any code generation. Use a test fidelity agent to validate that generated test code matches the specification. Review agent-generated test code for implementation coupling before approving it.
2. Review queue backs up from agent-generated volume
Agent speed should not pressure humans to review faster. If unreviewed changes accumulate, the temptation is to rubber-stamp reviews or merge without looking.
What to do: Apply WIP limits to the agent’s change queue. If three changes are awaiting review, the agent stops generating new changes until the queue drains. Treat agent-generated review queue depth as a pipeline metric. Consider adopting expert validation agents to handle mechanical review checks, reserving human review for judgment calls.
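The WIP limit described here is a one-line predicate in the orchestrator. A sketch, with the limit of three taken from the text:

```python
MAX_PENDING_REVIEWS = 3  # WIP limit from the text; tune per team

def agent_may_generate(pending_reviews: int, limit: int = MAX_PENDING_REVIEWS) -> bool:
    """Gate new agent work on review queue depth.

    When the queue is at the limit, the agent pauses until a review
    completes and the queue drains below it.
    """
    return pending_reviews < limit
```

The orchestrator checks this before starting each new change; queue depth itself is the pipeline metric the text asks you to track.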
3. Tests pass so the change must be correct
Passing tests is necessary but not sufficient. Tests cannot verify intent, architectural fitness, or maintainability. A change can pass every test and still introduce unnecessary complexity, violate unstated conventions, or solve the wrong problem.
What to do: Human review remains mandatory for agent-generated changes. Focus reviews on intent alignment and architectural fit rather than mechanical correctness (the pipeline handles that). Track how often human reviewers catch issues that tests missed to calibrate your test coverage.
4. No provenance tracking for agent-generated changes
Without provenance tracking, you cannot learn from agent-generated failures, audit agent behavior, or improve the agent’s constraints over time. When a production incident involves agent-generated code, you need to know which agent, which prompt, and which intent description produced it.
What to do: Tag every agent-generated commit with the agent identity, the intent description, and the prompt or context used. Include provenance metadata in your deployment records. Review agent provenance data during incident retrospectives.
5. Agent improves code outside the session scope
Agents trained to write good code will opportunistically refactor, rename, or improve things they encounter while implementing a scenario. The intent is not wrong. The scope is.
A session implementing Scenario 2 that also cleans up the module from Scenario 1 produces a commit that cannot be cleanly reviewed. The scenario change and the cleanup are mixed. If the cleanup introduces a regression, the bisect trail is contaminated. The Boy Scout Rule (leave the code better than you found it) is sound engineering, but applying it within a feature session conflicts with the small-batch discipline that makes agent-generated work reviewable.
What to do: Define scope boundaries explicitly in the system prompt and context. Cleanup is valid work - but as a separate, explicitly scoped session with its own intent description and commit.
Example scope constraint to include in every implementation session:
Implement the behavior described in this scenario and only that behavior.
If you encounter code that could be improved, note it in your summary
but do not change it. Any refactoring, renaming, or cleanup must happen
in a separate session with its own commit. The only code that may change
in this session is the code required to make the acceptance test pass.

When cleanup is warranted, schedule it explicitly: create a session scoped to that specific cleanup, commit it separately, and include the cleanup rationale in the intent description. This keeps the bisect trail clean and the review scope bounded.
6. Agent resumes mid-feature without a context reset
When a session is interrupted - by a pipeline failure, a context limit, or an agent timeout - there is a temptation to continue the session rather than close it out. The agent “already knows” what it was doing.
This is a reliability trap. Agent state is not durable in the way a commit is durable. A session that continues past an interruption carries implicit assumptions about what was completed that may not match the actual committed state. The next session should always start from the committed state, not from the memory of a previous session.
What to do: Treat any interruption as a session boundary. Before the next session begins, write the context summary based on what is actually committed, not what the agent believed it completed. If nothing was committed, the session produced nothing - start fresh from the last green state.
7. Review agent precision is miscalibrated
Miscalibration is not visible until an incident reveals it. The team does not know the review agent is generating false positives until developers stop reading its output. They do not know it is missing issues until a production failure traces back to something the agent approved. Miscalibration breaks in both directions:
Too many false positives: the review agent flags issues that are not real problems. Developers learn to dismiss the agent’s output without reading it. Real issues get dismissed alongside noise. The agent becomes a checkbox rather than a check.
Too few flags: the review agent misses issues that human reviewers would catch. The team gains confidence in the agent and reduces human review depth. Issues that should have been caught are not caught.
What to do: During the replacement cycle for review agents, track disagreements between the agent and human reviewers, not just agreement. When the agent flags something the human dismisses as noise, that is a false positive. When the human catches something the agent missed, that is a false negative. Track both. Set a threshold for acceptable false positive and false negative rates before reducing human review coverage. Review these rates monthly.
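Both rates fall out of the disagreement data directly. A minimal sketch; the pair encoding is an assumption for illustration:

```python
def calibration_rates(reviews):
    """Compute review-agent false positive and false negative rates.

    `reviews` is a list of (agent_flagged, human_confirmed) pairs:
    flagged-but-dismissed is a false positive; missed-but-human-caught
    is a false negative. Acceptable thresholds are a team decision set
    before human review coverage is reduced.
    """
    fp = sum(1 for agent, human in reviews if agent and not human)
    fn = sum(1 for agent, human in reviews if not agent and human)
    n = len(reviews)
    return fp / n, fn / n

rates = calibration_rates([(True, True), (True, False), (False, True), (False, False)])
```

Tracking only the agreement rate would hide that the two error directions have opposite failure modes: noise trains dismissal, silence trains overconfidence.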
8. Skipped the prerequisite delivery practices
Teams jump to ACD without the delivery foundations: no deterministic pipeline, no automated tests, no fast feedback loops. AI amplifies whatever system it is applied to. Without guardrails, agents generate defects at machine speed.
What to do: Follow the AI Adoption Roadmap sequence. The first four stages (Quality Tools, Clarify Work, Harden Guardrails, Reduce Delivery Friction) are prerequisites, not optional. Do not expand AI to code generation until the pipeline is deterministic and fast.
After Adoption: Sustaining Quality Over Time
Agents generate code faster than humans refactor it. Without deliberate maintenance practice, the codebase drifts toward entropy faster than it would with human-paced development.
Keep skills and prompts under version control
The system prompt, session templates, agent configuration, and any skills used in your pipeline are first-class artifacts. They belong in version control alongside the code they produce. An agent operating from an outdated skill file or an untracked system prompt is an unreviewed change to your delivery process.
Review your agent configuration on the same cadence you review the pipeline. When an agent produces unexpected output, check the configuration before assuming the model changed.
Schedule refactoring as explicit sessions
The rule against out-of-scope changes (pitfall 5 above) applies to feature sessions. It does not mean cleanup never happens. It means cleanup is planned and scoped like any other work.
A practical pattern: after every three to five feature sessions, schedule a maintenance session scoped to the files touched during those sessions. The intent description names what to clean up and why. The session produces a single commit with no behavior change. The acceptance criteria are that all existing tests still pass.
Example maintenance session prompt:
Refactor the files listed below. The goal is to improve readability and
reduce duplication introduced during the last four feature sessions.
Constraints:
- No behavior changes. All existing tests must pass unchanged.
- No new features, even small ones.
- No changes outside the listed files.
- If you find something that requires a behavior change to fix properly,
note it but do not fix it in this session.
Files in scope:
[list files]

Track skill effectiveness over time
Agent skills accumulate technical debt the same way code does. A skill written six months ago may no longer reflect the current page structure, template conventions, or style rules. Review each skill when the templates or conventions it references change. Add an “updated” date to each skill’s front matter so you can identify which ones are stale.
When a skill produces output that requires significant correction, update the skill before running it again. Unaddressed skill drift means every future session repeats the same corrections.
Prune dead context
Agent sessions accumulate context over time: outdated summaries, resolved TODOs, stale notes about work that was completed months ago. This dead context increases session startup cost and can mislead the agent about current state.
Review the context documents for each active workstream quarterly. Archive or delete summaries for completed work. Update the “current state” description to reflect what is actually true about the codebase, not what was true when the session was first created.
Measuring Success
| Metric | Target | How to Measure |
|---|---|---|
| Agent-generated change failure rate | Equal to or lower than human-generated | Tag agent-generated deployments in your deployment tracker. Compare rollback and incident rates between agent and human changes over rolling 30-day windows. |
| Review time for agent-generated changes | Comparable to human-generated changes | Measure time from “change ready for review” to “review complete” for both agent and human changes. If agent reviews are significantly faster, reviewers may be rubber-stamping. |
| Test coverage for agent-generated code | Higher than baseline | Run coverage reports filtered by agent-generated files. Compare against team baseline. If agent code coverage is lower, the test generation step is not working. |
| Agent-generated changes with complete artifacts | 100% | Audit a sample of recent agent-generated changes monthly. Check whether each has an intent description, test specification, feature description, and provenance metadata. |
Related Content
- ACD - the framework overview, eight constraints, and workflow
- Agent Delivery Contract - the artifacts that prevent these pitfalls
- Pipeline Enforcement and Expert Agents - the automated checks that catch failures
- AI Adoption Roadmap - the prerequisite sequence that prevents most of these pitfalls
- Code Coverage Mandates - an anti-pattern especially dangerous when agents optimize for coverage rather than intent
- Pressure to Skip Testing - an anti-pattern that ACD counters by making test-first workflow mandatory
- High Coverage but Ineffective Tests - a testing symptom that undermines the acceptance criteria agents depend on