Agent Delivery Contract
Each artifact has a defined authority. When an agent detects a conflict between artifacts, it must not resolve that conflict by modifying an artifact it does not own. The feature description wins over the implementation. The intent description wins over the feature description.
For the framework overview and the eight constraints, see ACD.
1. Intent Description
What it is: A self-contained problem statement, written by a human, that defines what the change should accomplish and why.
An agent (or a new team member) receiving only this document should understand the problem without asking clarifying questions. It defines what the change should accomplish, not how. Without a clear intent description, the agent may generate technically correct code that does not match what was needed. See the self-containment test for how to verify completeness.
Include a hypothesis. The intent should state what outcome the change is expected to produce and why. A useful format: “We believe [this change] will result in [this outcome] because [this reason].” The hypothesis makes the “why” testable, not just stated. After deployment, the team can check whether the predicted outcome actually occurred - connecting each change to the metrics-driven improvement cycle.
Example:
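A sketch of what an intent description might look like, using the rate-limiting change referenced later on this page. The problem framing and the out-of-scope list are illustrative; the hypothesis figures come from the acceptance-criteria discussion below.

```markdown
## Intent: Protect the public API from abusive request bursts

**Problem.** A small number of clients send request bursts that degrade
latency for everyone else. Well-behaved clients see elevated p99 latency
during these bursts.

**Hypothesis.** We believe that limiting each client to 100 requests per
minute will result in roughly a 40% drop in p99 latency for well-behaved
clients, because burst traffic from a few clients is currently consuming
shared capacity.

**Out of scope.** Billing-tier quotas; per-endpoint limits.
```

Note that the document states the problem and the predicted outcome, but says nothing about how rate limiting should be implemented.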
Key property: The intent description is authored and owned by a human. The agent does not write or modify it.
2. User-Facing Behavior
What it is: A description of how the system should behave from the user’s perspective, expressed as observable outcomes.
Agents can generate code that satisfies tests but does not produce the expected user experience. User-facing behavior descriptions bridge the gap between technical correctness and user value. BDD scenarios work well here:
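For the rate-limiting change used as a running example on this page, the scenarios might look like the following. The limit of 100 requests per window comes from the acceptance-criteria discussion below; the exact wording is illustrative.

```gherkin
Feature: Per-client rate limiting

  Scenario: A client within its limit is served normally
    Given a client that has made 99 requests in the current window
    When the client sends one more request
    Then the response status is 200

  Scenario: A client over its limit is rejected
    Given a client that has made 100 requests in the current window
    When the client sends one more request
    Then the response status is 429
```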
Key property: Humans define the scenarios. The agent generates code to satisfy them but does not decide what scenarios to include.
3. Feature Description (Constraint Architecture)
What it is: The architectural constraints, dependencies, and trade-off boundaries that govern the implementation.
Agents need explicit architectural context that human developers often carry in their heads. The feature description tells the agent where the change fits in the system, what components it touches, and what constraints apply. It separates hard boundaries (musts, must nots) from soft preferences and escalation triggers so the agent knows which constraints are non-negotiable.
Example:
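A sketch of a feature description for the rate-limiting change. The specific musts, must-nots, and triggers below are illustrative; what matters is the structure, which separates hard boundaries from soft preferences and escalation triggers as described above.

```markdown
## Constraints: rate limiting middleware

**Must**
- Live in the HTTP middleware layer, before request handlers run
- Key limits by authenticated client ID

**Must not**
- Call external services on the per-request hot path
- Change any response to a request that is not rate limited

**Prefer**
- Reusing the existing middleware chain over adding a new layer

**Escalate if**
- The limit cannot be enforced without introducing a shared datastore
- Any existing client integration would break
```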
Key property: Engineering owns the architectural decisions. The agent implements within these constraints but does not change them. When the agent encounters a condition listed as an escalation trigger, it must stop and ask rather than deciding autonomously.
4. Acceptance Criteria
What it is: Concrete expectations that can be executed as deterministic tests or evaluated by review agents. These are the authoritative source of truth for what the code should do.
This artifact has two parts: the done definition (observable outcomes an independent observer could verify) and the evaluation design (test cases with known-good outputs that catch regressions). Together they constrain the agent. If the criteria are comprehensive, the agent cannot generate incorrect code that passes. If the criteria are shallow, the agent can generate code that passes tests but does not satisfy the intent.
Acceptance criteria
Write acceptance criteria as observable outcomes, not internal implementation details. Each criterion should be verifiable by someone who has never seen the code:
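Continuing the rate-limiting example, the criteria might read as follows. Only the 101st-request/429 expectation appears elsewhere on this page; the other criteria are illustrative.

```markdown
- The first 100 requests from a client within a one-minute window succeed
- The 101st request from that client within the same window returns HTTP 429
- Requests from one client never count against another client's limit
```

Each of these can be checked from outside the system, with no knowledge of how the limiter is built.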
Evaluation design
Define test cases with known-good outputs so the agent (and the pipeline) can verify correctness mechanically:
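For the rate-limiting example, the evaluation design might be a small table of inputs and known-good outputs. The per-client isolation case is illustrative; the 429 expectation comes from the acceptance criteria above.

| Test case | Setup | Known-good output |
|---|---|---|
| Under the limit | 100 requests from one client in one window | All return 200 |
| Over the limit | 101st request from the same client, same window | Returns 429 |
| Per-client isolation | Request from a second client while the first is limited | Returns 200 |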
Humans define the done definition and evaluation design. An agent can generate the test code, but the resulting tests must be decoupled from implementation (verify observable behavior, not internal details) and faithful to the specification (actually exercise what the human defined, without quietly omitting edge cases or weakening assertions). The test fidelity and implementation coupling agents enforce these two properties at pipeline speed.
Connecting acceptance criteria to hypothesis validation
Acceptance criteria answer “does the code work?” The hypothesis in the intent description asks a broader question: “did the change achieve its purpose?” These are different checks that happen at different times.
Acceptance criteria run in the pipeline on every commit. Hypothesis validation happens after deployment, using production data. In the rate-limiting example, the acceptance criteria verify that the 101st request returns a 429 status. The hypothesis - that p99 latency for well-behaved clients drops by 40% - is validated by observing production metrics after the change is live.
This connection matters because a change can pass all acceptance criteria and still fail its hypothesis. Rate limiting might work perfectly and yet not reduce latency because the root cause was something else entirely. When that happens, the team has learned something valuable: the problem is not what they thought it was. That learning feeds back into the next intent description.
The metrics-driven improvement page describes the full post-deployment validation loop. Hypothesis framing in the specification connects each individual change to the team’s continuous improvement cycle - every deployed change either confirms or refutes a prediction, producing a feedback signal whether it “succeeds” or not.
Key property: The pipeline enforces these tests on every commit. If they fail, the agent’s implementation is rejected regardless of how plausible the code looks.
5. Implementation
What it is: The actual code that implements the feature. In ACD, this may be generated entirely by the agent, co-authored by agent and human, or authored by a human with agent assistance.
The implementation is the artifact most likely to be agent-generated. It must satisfy the acceptance criteria (tests), conform to the feature description (architecture), and achieve the intent description (purpose).
Example - agent-generated rate limiting middleware that satisfies the acceptance criteria above:
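A minimal sketch in Python, assuming a fixed-window limiter of 100 requests per client per minute. The class and function names are hypothetical, and a production implementation would need shared state across server instances and a more careful windowing strategy (fixed windows allow bursts at window boundaries).

```python
import time
from collections import defaultdict


class RateLimiter:
    """Fixed-window rate limiter: at most `limit` requests per client per window."""

    def __init__(self, limit=100, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        # (client_id, window_index) -> request count in that window
        self.counts = defaultdict(int)

    def allow(self, client_id, now=None):
        """Record one request and return True if it is within the limit."""
        now = time.time() if now is None else now
        window_index = int(now // self.window)
        key = (client_id, window_index)
        self.counts[key] += 1
        return self.counts[key] <= self.limit


def handle_request(limiter, client_id):
    """Middleware decision as an HTTP status: 200 if allowed, 429 if limited."""
    return 200 if limiter.allow(client_id) else 429
```

Against the acceptance criteria above: the 101st call to `allow` for the same client in the same window returns `False`, so `handle_request` returns 429, while a different client's requests are counted separately and still succeed.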
Review requirements: Agent-generated implementation must be reviewed by a human before merging to trunk. The review focuses on:
- Does the implementation match the intent? (Not just “does it pass tests?”)
- Does it follow the architectural constraints in the feature description?
- Does it introduce unnecessary complexity, dependencies, or security risks?
- Would a human developer on the team understand and maintain this code?
Key property: The implementation has the lowest authority of any artifact. When it conflicts with the feature description, tests, or intent, the implementation changes.
6. System Constraints
What it is: Non-functional requirements, security policies, performance budgets, and organizational rules that apply to all changes. Agents need these stated explicitly because they cannot infer organizational norms from context.
Example:
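The specific rules below are illustrative of the kinds of organization-wide constraints this artifact captures; actual values would come from the team's own policies.

```markdown
- All external input is validated before use; no secrets in source control
- p99 latency budget for API endpoints: 300 ms
- New runtime dependencies require license approval and a security review
- All changes ship through the standard deployment pipeline; no manual deploys
```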
Key property: System constraints apply globally. Unlike other artifacts that are per-change, these rules apply to every change in the system.
Artifact Authority Hierarchy
When an agent detects a conflict between artifacts, it must know which one wins. The hierarchy below defines precedence. A higher-priority artifact overrides a lower-priority one:
| Priority | Artifact | Authority |
|---|---|---|
| 1 (highest) | Intent Description | Defines the why; all other artifacts conform to it |
| 2 | User-Facing Behavior | Defines observable outcomes from the user’s perspective; feeds into Acceptance Criteria |
| 3 | Feature Description (Constraint Architecture) | Defines architectural constraints; implementation must conform |
| 4 | Acceptance Criteria | Pipeline-enforced; implementation must pass. Derived from User-Facing Behavior (functional) and Feature Description (non-functional requirements stated as architectural constraints) |
| 5 | System Constraints | Global; applies to every change in the system |
| 6 (lowest) | Implementation | Must satisfy all other artifacts |
Acceptance Criteria are derived from two sources. User-Facing Behavior defines the functional expectations (BDD scenarios). Non-functional requirements (latency budgets, resilience, security) must be stated explicitly as architectural constraints in the Feature Description. Both feed into Acceptance Criteria, which the pipeline enforces.
These Artifacts Are Pipeline Inputs, Not Reference Documents
The pipeline and agents consume these artifacts as inputs. They are not outputs for humans to read after the fact.
Without them, an agent that detects a conflict between what the acceptance criteria expect and what the feature description says has no way to determine which is authoritative. It guesses, and it guesses wrong. With explicit authority on each artifact, the agent knows which artifact wins.
These artifacts are valuable in any project. In ACD they are mandatory, because the workflow cannot function without them.
With the artifacts defined, the next question is how the pipeline enforces consistency between them. See Pipeline Enforcement and Expert Agents.
Related Content
- ACD - the framework overview, eight constraints, and workflow
- Pipeline Enforcement and Expert Agents - how the pipeline enforces artifact consistency
- Pitfalls and Metrics - common failure modes when artifacts are incomplete
- AI Adoption Roadmap - the prerequisite sequence before adopting artifact-driven workflows
- Agent-Assisted Specification - how to write clear intent descriptions and BDD scenarios that agents can implement reliably
- The Four Prompting Disciplines - the skills that produce these artifacts
- Testing - testing strategies that inform acceptance criteria
- Hypothesis-Driven Development - the foundational practice of treating every change as an experiment