
Specification & Contracts

The delivery artifacts that define intent, behavior, and constraints for agent-generated changes - framed as hypotheses so each change validates whether it achieved its purpose.

Every ACD change is anchored by structured delivery artifacts. When each change is framed as a hypothesis - “We believe [this change] will produce [this outcome]” - the artifacts do double duty: they define what to build and how to validate whether building it achieved its purpose. These pages define the artifacts agents must respect and explain how agents help sharpen specifications before any code is written.

1 - Agent Delivery Contract

Detailed definitions and examples for the artifacts that agents and humans should maintain in an ACD pipeline.

Each artifact has a defined authority. When an agent detects a conflict between artifacts, it cannot resolve that conflict by modifying the artifact it does not own. The feature description wins over the implementation. The intent description wins over the feature description.

For the framework overview and the eight constraints, see ACD.

1. Intent Description

What it is: A self-contained problem statement, written by a human, that defines what the change should accomplish and why.

An agent (or a new team member) receiving only this document should understand the problem without asking clarifying questions. It defines what the change should accomplish, not how. Without a clear intent description, the agent may generate technically correct code that does not match what was needed. See the self-containment test for how to verify completeness.

Include a hypothesis. The intent should state what outcome the change is expected to produce and why. A useful format: “We believe [this change] will result in [this outcome] because [this reason].” The hypothesis makes the “why” testable, not just stated. After deployment, the team can check whether the predicted outcome actually occurred - connecting each change to the metrics-driven improvement cycle.

Example:

Intent description: add rate limiting to /api/search
## Intent: Add rate limiting to the /api/search endpoint

We are receiving complaints about slow response times during peak hours.
Analysis shows that a small number of clients are making thousands of
requests per minute. We need to limit each authenticated client to 100
requests per minute on the /api/search endpoint. Requests that exceed
the limit should receive a 429 response with a Retry-After header.

**Hypothesis:** We believe rate limiting will reduce p99 latency for
well-behaved clients by 40% because abusive clients currently consume
60% of search capacity.

Key property: The intent description is authored and owned by a human. The agent does not write or modify it.

2. User-Facing Behavior

What it is: A description of how the system should behave from the user’s perspective, expressed as observable outcomes.

Agents can generate code that satisfies tests but does not produce the expected user experience. User-facing behavior descriptions bridge the gap between technical correctness and user value. BDD scenarios work well here:

BDD scenarios: rate limit user-facing behavior
Scenario: Client exceeds rate limit
  Given an authenticated client
  And the client has made 100 requests in the current minute
  When the client makes another request to /api/search
  Then the response status should be 429
  And the response should include a Retry-After header
  And the Retry-After value should indicate when the limit resets

Scenario: Client within rate limit
  Given an authenticated client
  And the client has made 50 requests in the current minute
  When the client makes a request to /api/search
  Then the request should be processed normally
  And the response should include rate limit headers showing remaining quota

Key property: Humans define the scenarios. The agent generates code to satisfy them but does not decide what scenarios to include.

3. Feature Description (Constraint Architecture)

What it is: The architectural constraints, dependencies, and trade-off boundaries that govern the implementation.

Agents need explicit architectural context that human developers often carry in their heads. The feature description tells the agent where the change fits in the system, what components it touches, and what constraints apply. It separates hard boundaries (musts, must nots) from soft preferences and escalation triggers so the agent knows which constraints are non-negotiable.

Example:

Feature description: rate limiting constraint architecture
## Feature: Rate Limiting for Search API

### Musts
- Rate limit middleware sits between authentication and the search handler
- Rate limit state is stored in Redis (shared across application instances)
- Rate limit configuration is read from the application config, not hardcoded
- Must work correctly with horizontal scaling (3-12 instances)
- Must be configurable per-endpoint (other endpoints may have different limits later)

### Must Nots
- Must not add more than 5ms of latency to the request path
- Must not introduce new external dependencies (Redis client library already in use for session storage)

### Preferences
- Prefer middleware pattern over decorator pattern for request interception
- Prefer sliding window counter over fixed window for smoother rate distribution

### Escalation Triggers
- If Redis is unavailable, stop and ask whether to fail open (allow all requests) or fail closed (reject all requests)
- If the existing auth middleware does not expose the client ID, stop and ask rather than modifying the auth layer

Key property: Engineering owns the architectural decisions. The agent implements within these constraints but does not change them. When the agent encounters a condition listed as an escalation trigger, it must stop and ask rather than deciding autonomously.

4. Acceptance Criteria

What it is: Concrete expectations that can be executed as deterministic tests or evaluated by review agents. These are the authoritative source of truth for what the code should do.

This artifact has two parts: the done definition (observable outcomes an independent observer could verify) and the evaluation design (test cases with known-good outputs that catch regressions). Together they constrain the agent. If the criteria are comprehensive, the agent cannot generate incorrect code that passes. If the criteria are shallow, the agent can generate code that passes tests but does not satisfy the intent.

Acceptance criteria

Write acceptance criteria as observable outcomes, not internal implementation details. Each criterion should be verifiable by someone who has never seen the code:

Acceptance criteria: rate limiting done definition
1. An authenticated client making 100 requests in one minute receives normal
   responses with rate limit headers showing remaining quota
2. An authenticated client making a 101st request in the same minute receives
   a 429 response with a Retry-After header indicating when the limit resets
3. After the rate limit window expires, the previously limited client can make
   requests again normally
4. A different authenticated client is unaffected by another client's rate
   limit status
5. The rate limit middleware adds less than 5ms to p99 request latency

Evaluation design

Define test cases with known-good outputs so the agent (and the pipeline) can verify correctness mechanically:

Evaluation design: rate limiting test cases
**Test Case 1 (Happy Path):** Client sends 50 requests in one minute.
Result: All return 200 with X-RateLimit-Remaining headers counting down.

**Test Case 2 (Limit Exceeded):** Client sends 101 requests in one minute.
Result: Request 101 returns 429 with Retry-After header.

**Test Case 3 (Window Reset):** Client exceeds limit, then the window expires.
Result: Next request returns 200.

**Test Case 4 (Per-Client Isolation):** Client A exceeds limit. Client B sends
a request. Result: Client B receives 200.

**Test Case 5 (Latency Budget):** Single request with rate limit check.
Result: Middleware adds less than 5ms.

Humans define the done definition and evaluation design. An agent can generate the test code, but the resulting tests must be decoupled from implementation (verify observable behavior, not internal details) and faithful to the specification (actually exercise what the human defined, without quietly omitting edge cases or weakening assertions). The test-fidelity and implementation-coupling agents enforce these two properties at pipeline speed.
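A behavior-decoupled test for the rate-limiting example might look like the following sketch. It inspects only observable outputs (status code, headers), never the Redis keys or counter internals, so the agent remains free to change the implementation. The in-memory Redis stand-in and the req/res doubles are illustrative assumptions, as is the condensed copy of the middleware included to keep the example self-contained.

```javascript
// Condensed copy of the middleware from this page, for a self-contained example.
function rateLimitMiddleware(redisClient, config) {
  return async function (req, res, next) {
    if (!req.user) return next();
    const limit = config.getLimit(req.path);
    if (!limit) return next();
    const key = `rate_limit:${req.user.id}:${req.path}`;
    const current = await redisClient.incr(key);
    if (current === 1) await redisClient.expire(key, 60);
    const ttl = await redisClient.ttl(key);
    if (current > limit) {
      res.set("Retry-After", String(ttl));
      return res.status(429).end();
    }
    res.set("X-RateLimit-Remaining", String(limit - current));
    next();
  };
}

// In-memory stand-in for the Redis client (incr/expire/ttl only).
function fakeRedis() {
  const counts = new Map();
  const ttls = new Map();
  return {
    async incr(key) {
      const next = (counts.get(key) || 0) + 1;
      counts.set(key, next);
      return next;
    },
    async expire(key, seconds) { ttls.set(key, seconds); },
    async ttl(key) { return ttls.get(key) ?? -1; },
  };
}

// Minimal response double exposing only what the middleware touches.
function fakeRes() {
  const headers = {};
  return {
    statusCode: 200,
    headers,
    set(name, value) { headers[name] = value; },
    status(code) { this.statusCode = code; return this; },
    end() {},
  };
}

async function run() {
  const middleware = rateLimitMiddleware(fakeRedis(), {
    getLimit: (path) => (path === "/api/search" ? 100 : null),
  });
  const req = { user: { id: "client-a" }, path: "/api/search" };

  // Requests 1..100 pass through with a remaining-quota header.
  let lastOk;
  for (let i = 0; i < 100; i++) {
    lastOk = fakeRes();
    await middleware(req, lastOk, () => {});
  }
  console.log(lastOk.headers["X-RateLimit-Remaining"]); // "0"

  // Request 101 is rejected with 429 and a Retry-After header.
  const limited = fakeRes();
  await middleware(req, limited, () => {});
  console.log(limited.statusCode); // 429
}

run();
```

The test asserts exactly what acceptance criteria 1 and 2 describe and nothing more, so it stays valid across any implementation that satisfies the contract.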

Connecting acceptance criteria to hypothesis validation

Acceptance criteria answer “does the code work?” The hypothesis in the intent description asks a broader question: “did the change achieve its purpose?” These are different checks that happen at different times.

Acceptance criteria run in the pipeline on every commit. Hypothesis validation happens after deployment, using production data. In the rate-limiting example, the acceptance criteria verify that the 101st request returns a 429 status. The hypothesis - that p99 latency for well-behaved clients drops by 40% - is validated by observing production metrics after the change is live.
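The post-deployment check can be as simple as the following sketch. The `before` and `after` sample arrays are hypothetical latency exports from a metrics system, and the nearest-rank percentile method is one of several valid choices; the 40% threshold comes from the hypothesis above.

```javascript
// Sketch: validate the latency hypothesis against production samples.
// `before` and `after` are hypothetical arrays of per-request latencies
// (ms) for well-behaved clients, exported from a metrics system.

function p99(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank percentile: smallest value with 99% of samples at or below it.
  return sorted[Math.ceil(0.99 * sorted.length) - 1];
}

function hypothesisConfirmed(before, after, expectedReduction = 0.4) {
  const reduction = (p99(before) - p99(after)) / p99(before);
  return reduction >= expectedReduction;
}

// Synthetic example: p99 falls from 2000ms to 900ms, a 55% reduction,
// which exceeds the 40% the hypothesis predicted.
const before = Array.from({ length: 1000 }, (_, i) => (i < 980 ? 200 : 2000));
const after = Array.from({ length: 1000 }, (_, i) => (i < 980 ? 180 : 900));
console.log(hypothesisConfirmed(before, after)); // true
```

If the check returns false even though all acceptance criteria passed, the hypothesis was refuted, which is exactly the feedback signal the next paragraph describes.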

This connection matters because a change can pass all acceptance criteria and still fail its hypothesis. Rate limiting might work perfectly and yet not reduce latency because the root cause was something else entirely. When that happens, the team has learned something valuable: the problem is not what they thought it was. That learning feeds back into the next intent description.

The metrics-driven improvement page describes the full post-deployment validation loop. Hypothesis framing in the specification connects each individual change to the team’s continuous improvement cycle - every deployed change either confirms or refutes a prediction, producing a feedback signal whether it “succeeds” or not.

Key property: The pipeline enforces these tests on every commit. If they fail, the agent’s implementation is rejected regardless of how plausible the code looks.

5. Implementation

What it is: The actual code that implements the feature. In ACD, this may be generated entirely by the agent, co-authored by agent and human, or authored by a human with agent assistance.

The implementation is the artifact most likely to be agent-generated. It must satisfy the acceptance criteria (tests), conform to the feature description (architecture), and achieve the intent description (purpose).

Example - agent-generated rate limiting middleware that satisfies the acceptance criteria above:

Implementation: agent-generated rate limiting middleware
function rateLimitMiddleware(redisClient, config) {
  return async function (req, res, next) {
    // Unauthenticated requests are the auth layer's concern; skip here.
    if (!req.user) {
      return next();
    }

    // Endpoints without a configured limit pass through untouched.
    const limit = config.getLimit(req.path);
    if (!limit) {
      return next();
    }

    // One counter per client per endpoint, reset every 60 seconds.
    const key = `rate_limit:${req.user.id}:${req.path}`;
    const current = await redisClient.incr(key);
    if (current === 1) {
      // First request in the window: start the 60-second expiry.
      await redisClient.expire(key, 60);
    }

    const ttl = await redisClient.ttl(key);
    if (current > limit) {
      // Over the limit: 429 with the seconds until the window resets.
      res.set("Retry-After", String(ttl));
      return res.status(429).end();
    }

    res.set("X-RateLimit-Remaining", String(limit - current));
    next();
  };
}
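A usage sketch may help show how the middleware plugs into an application. The config shape (`limits`, `getLimit`) and the Express-style wiring in the comment are illustrative assumptions, not part of the contract; the feature description only requires that limits come from application config and be configurable per endpoint.

```javascript
// Sketch: per-endpoint limits read from application config, as the
// feature description requires. Names (`limits`, `getLimit`) are
// illustrative, not a prescribed interface.
const rateLimitConfig = {
  limits: { "/api/search": 100 }, // requests per minute, per client
  getLimit(path) {
    return this.limits[path] || null; // null => endpoint not rate limited
  },
};

// With an Express-style app and a shared Redis client, the middleware
// sits between authentication and the route handlers, as the feature
// description's "Musts" dictate:
//
//   app.use(authMiddleware);
//   app.use(rateLimitMiddleware(redisClient, rateLimitConfig));
//   app.get("/api/search", searchHandler);

console.log(rateLimitConfig.getLimit("/api/search")); // 100
console.log(rateLimitConfig.getLimit("/api/other")); // null
```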

Review requirements: Agent-generated implementation must be reviewed by a human before merging to trunk. The review focuses on:

  • Does the implementation match the intent? (Not just “does it pass tests?”)
  • Does it follow the architectural constraints in the feature description?
  • Does it introduce unnecessary complexity, dependencies, or security risks?
  • Would a human developer on the team understand and maintain this code?

Key property: The implementation has the lowest authority of any artifact. When it conflicts with the feature description, tests, or intent, the implementation changes.

6. System Constraints

What it is: Non-functional requirements, security policies, performance budgets, and organizational rules that apply to all changes. Agents need these stated explicitly because they cannot infer organizational norms from context.

Example:

System constraints: global non-functional requirements
system_constraints:
  security:
    - No secrets in source code
    - All user input must be sanitized
    - Authentication required for all API endpoints
  performance:
    - API p99 latency < 500ms
    - No N+1 query patterns
    - Database queries must use indexes
  architecture:
    - No circular dependencies between modules
    - External service calls must use circuit breakers
    - All new dependencies require team approval
  operations:
    - All new features must have monitoring dashboards
    - Log structured data, not strings
    - Feature flags required for user-visible changes

Key property: System constraints apply globally. Unlike other artifacts that are per-change, these rules apply to every change in the system.
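One way a global constraint becomes mechanically enforceable is a pipeline step that scans every change. The sketch below turns "No secrets in source code" into such a check; the regex patterns are illustrative assumptions only, and production teams typically rely on a dedicated secret scanner.

```javascript
// Sketch: turn the "No secrets in source code" constraint into a
// pipeline check. Patterns here are illustrative; a real rule set is
// a team decision, usually delegated to a dedicated scanner.
const SECRET_PATTERNS = [
  /api[_-]?key\s*[:=]\s*['"][A-Za-z0-9]{16,}['"]/i,
  /password\s*[:=]\s*['"][^'"]+['"]/i,
  /-----BEGIN (RSA |EC )?PRIVATE KEY-----/,
];

// Returns the line numbers and text of any suspected hardcoded secrets.
function findSecretViolations(source) {
  return source
    .split("\n")
    .flatMap((line, i) =>
      SECRET_PATTERNS.some((p) => p.test(line))
        ? [{ line: i + 1, text: line.trim() }]
        : []
    );
}

const sample = `
const url = config.get("search.url");
const apiKey = "sk1234567890abcdef";
`;
console.log(findSecretViolations(sample).length); // 1
```

A failed check rejects the change the same way a failed acceptance test does, regardless of which artifact the agent was implementing.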

Artifact Authority Hierarchy

When an agent detects a conflict between artifacts, it must know which one wins. The hierarchy below defines precedence. A higher-priority artifact overrides a lower-priority one:

| Priority | Artifact | Authority |
|---|---|---|
| 1 (highest) | Intent Description | Defines the why; all other artifacts conform to it |
| 2 | User-Facing Behavior | Defines observable outcomes from the user's perspective; feeds into Acceptance Criteria |
| 3 | Feature Description (Constraint Architecture) | Defines architectural constraints; implementation must conform |
| 4 | Acceptance Criteria | Pipeline-enforced; implementation must pass. Derived from User-Facing Behavior (functional) and Feature Description (non-functional requirements stated as architectural constraints) |
| 5 | System Constraints | Global; applies to every change in the system |
| 6 (lowest) | Implementation | Must satisfy all other artifacts |

Acceptance Criteria are derived from two sources. User-Facing Behavior defines the functional expectations (BDD scenarios). Non-functional requirements (latency budgets, resilience, security) must be stated explicitly as architectural constraints in the Feature Description. Both feed into Acceptance Criteria, which the pipeline enforces.
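Because precedence is a strict ordering, it can be encoded so an agent or pipeline step resolves conflicts mechanically. The artifact identifiers in this sketch are illustrative, not a prescribed schema.

```javascript
// Sketch: encode the authority hierarchy as an ordered list, highest
// priority first. Artifact names are illustrative.
const AUTHORITY = [
  "intent-description",   // 1 (highest)
  "user-facing-behavior", // 2
  "feature-description",  // 3
  "acceptance-criteria",  // 4
  "system-constraints",   // 5
  "implementation",       // 6 (lowest)
];

// Given two conflicting artifacts, the lower-priority one must change.
function artifactToChange(a, b) {
  return AUTHORITY.indexOf(a) < AUTHORITY.indexOf(b) ? b : a;
}

console.log(artifactToChange("implementation", "feature-description"));
// "implementation" -- the feature description wins
```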

These Artifacts Are Pipeline Inputs, Not Reference Documents

The pipeline and agents consume these artifacts as inputs. They are not outputs for humans to read after the fact.

Without them, an agent that detects a conflict between what the acceptance criteria expect and what the feature description says has no way to determine which is authoritative. It guesses, and it guesses wrong. With explicit authority on each artifact, the agent knows which artifact wins.

These artifacts are valuable in any project. In ACD they are mandatory, because pipeline enforcement depends on them.

With the artifacts defined, the next question is how the pipeline enforces consistency between them. See Pipeline Enforcement and Expert Agents.

2 - Agent-Assisted Specification

How to use agents as collaborators during specification and why small-scope specification is not big upfront design.

The specification stages of the ACD workflow (Intent Description, User-Facing Behavior, Feature Description, and Acceptance Criteria) ask humans to define intent, behavior, constraints, and acceptance criteria before any code generation begins. This page explains how agents accelerate that work and why the effort stays small.

The Pattern

Every use of an agent in the specification stages follows the same four-step cycle:

  1. Human drafts - write the first version based on your understanding
  2. Agent critiques - ask the agent to find gaps, ambiguity, or inconsistency
  3. Human decides - accept, reject, or modify the agent’s suggestions
  4. Agent refines - generate an updated version incorporating your decisions

This is not the agent doing specification for you. It is the agent making your specification more thorough than it would be without help, in less time than it would take without help. The sections below show how this cycle applies at each specification stage.

This Is Not Big Upfront Design

The specification stages look heavy if you imagine writing them for an entire feature set. That is not what happens.

You specify the next single unit of work. One thin vertical slice of functionality - a single scenario, a single behavior. A user story may decompose into multiple such units worked in parallel across services. The scope of each unit stays small because continuous delivery requires it: every change must be small enough to deploy safely and frequently. A detailed specification for three months of work does not reduce risk - it amplifies it. Small-scope specification front-loads clarity on one change and gets production feedback before specifying the next.

If your specification effort for a single change takes more than 15 minutes, the change is too large. Split it.

How Agents Help with the Intent Description

The intent description does not need to be perfect on the first draft. Write a rough version and use an agent to sharpen it.

Ask the agent to find ambiguity. Give it your draft intent and ask it to identify anything vague, any assumption that a developer might interpret differently than you intended, or any unstated constraint.

Example prompt:

Prompt: identify ambiguity in intent description
Here is the intent description for my next change. Identify any
ambiguity, unstated assumptions, or missing context that could
lead to an implementation that technically satisfies this description
but does not match what I actually want.

[paste intent description]

Ask the agent to suggest edge cases. Agents are good at generating boundary conditions you might not think of, because they can quickly reason through combinations.

Ask the agent to simplify. If the intent covers too much ground, ask the agent to suggest how to split it into smaller, independently deliverable changes.

Ask the agent to sharpen the hypothesis. If the intent includes a hypothesis (“We believe X will produce Y because Z”), the agent can pressure-test it before any code is written.

Example prompt:

Prompt: sharpen the hypothesis in the intent description
Review this hypothesis. Is the expected outcome measurable with data
we currently collect? Is the causal reasoning plausible? What
alternative explanations could produce the same outcome without this
change being the cause?

[paste intent description with hypothesis]

A weak hypothesis - one with an unmeasurable outcome or implausible causal link - will not produce useful feedback after deployment. Catching that now costs a prompt. Catching it after implementation costs a cycle.

The human still owns the intent. The agent is a sounding board that catches gaps before they become defects.

How Agents Help with User-Facing Behavior

Writing BDD scenarios from scratch is slow. Agents can draft them and surface gaps you would otherwise miss.

Generate initial scenarios from the intent. Give the agent your intent description and ask it to produce Gherkin scenarios covering the expected behavior.

Example prompt:

Prompt: generate BDD scenarios from intent description
Based on this intent description, generate BDD scenarios in Gherkin
format. Cover the primary success path, key error paths, and edge
cases. For each scenario, explain why it matters.

[paste intent description]

Review for completeness, not perfection. The agent’s first draft will cover the obvious paths. Your job is to read through them and ask: “What is missing?” The agent handles volume. You handle judgment.

Ask the agent to find gaps. After reviewing the initial scenarios, ask the agent explicitly what scenarios are missing.

Example prompt:

Prompt: identify missing BDD scenarios
Here are the BDD scenarios for this feature. What scenarios are
missing? Consider boundary conditions, concurrent access, failure
modes, and interactions with existing behavior.

[paste scenarios]

Ask the agent to challenge weak scenarios. Some scenarios may be too vague to constrain an implementation. Ask the agent to identify any scenario where two different implementations could both pass while producing different user-visible behavior.

The human decides which scenarios to keep. The agent ensures you considered more scenarios than you would have on your own.

How Agents Help with the Feature Description and Acceptance Criteria

The Feature Description and Acceptance Criteria stages define the technical boundaries: where the change fits in the system, what constraints apply, and what non-functional requirements must be met.

Ask the agent to suggest architectural considerations. Give it the intent, the BDD scenarios, and a description of the current system architecture. Ask what integration points, dependencies, or constraints you should document.

Example prompt:

Prompt: identify architectural considerations before implementation
Given this intent and these BDD scenarios, what architectural
decisions should I document before implementation begins? Consider
where this change fits in the existing system, what components it
touches, and what constraints an implementer needs to know.

Current system context: [brief architecture description]

Ask the agent to draft non-functional acceptance criteria. Agents can suggest performance thresholds, security requirements, and resource limits based on the type of change and its context.

Example prompt:

Prompt: draft non-functional acceptance criteria
Based on this feature description, suggest non-functional acceptance
criteria I should define. Consider latency, throughput, security,
resource usage, and operational requirements. For each criterion,
explain why it matters for this specific change.

[paste feature description]

Ask the agent to check consistency. Once you have the intent, BDD scenarios, feature description, and acceptance criteria, ask the agent to identify any contradictions or gaps between them.

The human makes the architectural decisions and sets the thresholds. The agent makes sure you did not leave anything out.

Validating the Complete Specification Set

The four specification stages produce four artifacts: intent description, user-facing behavior (BDD scenarios), feature description (constraint architecture), and acceptance criteria. Each can look reasonable in isolation but still conflict with the others. Before moving to test generation and implementation, validate them as a set.

Use an agent as a specification reviewer. Give it all four artifacts and ask it to check for internal consistency.

The human gates on this review before implementation begins. If the review agent identifies issues, resolve them before generating any test code or implementation. A conflict caught in specification costs minutes. The same conflict caught during implementation costs a session.

This review is not a bureaucratic checkpoint. It is the last moment where the cost of a change is near zero. After this gate, every issue becomes more expensive to fix.

The Discovery Loop: From Conversation to Specification

The prompts above work well when you already know what to specify. When you do not, you need a different starting point. Instead of writing a draft and asking the agent to critique it, treat the agent as a principal architect who interviews you to extract context you did not know was missing.

This is the shift from “order taker” to “architectural interview.” The sections above describe what to do at each specification stage. The discovery loop describes how to get there through conversation when you are starting from a vague idea.

Phase 1: Initial Framing (Intent)

Describe the outcome, not the application. Set the agent’s role and the goal of the conversation explicitly.

Prompt: start the discovery loop
I want to build a Software Value Stream Mapping application. Before we
write a single line of code, I want you to act as a Principal Architect.
Your goal is to help me write a self-contained specification that an
autonomous agent can execute. Do not start writing the spec yet. First,
interview me to uncover the technical implementation details, edge cases,
and trade-offs I have not considered.

This prompt does three things: it states intent, it assigns a role that produces the right kind of questions, and it prevents the agent from jumping to implementation.

Even at this early stage, include a rough hypothesis about what outcome you expect: “I believe this tool will reduce the time teams spend on manual value stream analysis by 80%.” The hypothesis does not need to be precise yet - the discovery interview will sharpen it - but stating one early forces you to think about measurable outcomes from the start.

Phase 2: Deep-Dive Interview (Context)

Let the agent ask three to five high-signal questions at a time. The goal is to surface the implicit knowledge in your head: domain definitions, data schemas, failure modes, and trade-off preferences.

What the agent should ask: “How are we defining Lead Time versus Cycle Time for this specific organization? What is the schema of the incoming JSON? How should the system handle missing data points?”

Your role: Answer with as much raw context as possible. Do not worry about formatting. Get the “why” and “how” out. The agent will structure it later.

This is context engineering in practice: you are building the information environment the specification will formalize.

Phase 3: Drafting (Specification)

Once the agent has enough context, ask it to synthesize the conversation into a structured specification.

Prompt: synthesize into specification
Based on our discussion, generate the first draft of the specification
document. Structure it as: Intent Description, User-Facing Behavior
(BDD scenarios), Feature Description (architectural constraints),
Task Decomposition, and Acceptance Criteria (including evaluation
design with test cases). Ensure the Task Decomposition follows a
planner-worker pattern where tasks are broken into sub-two-hour chunks.

The sections map to the agent delivery contract and the specification engineering skill set. The agent drafts. You review using the same four-step cycle described at the top of this page.

Phase 4: Stress-Test Review

Before finalizing, ask the agent to find gaps in its own output.

Prompt: stress-test the specification
Critique this specification. Where would a junior developer or an
autonomous agent get confused? What constraints are still too vague?
What edge cases are missing from the evaluation design?

This is the same validation step as the specification consistency check, applied to the discovery loop’s output.

How This Differs from Turn-by-Turn Prompting

| Step | Turn-by-turn prompting | Discovery loop |
|---|---|---|
| Beginning | Write a long prompt and hope for the best | State a high-level goal and ask to be interviewed |
| Development | Fix the agent's code mistakes turn by turn | Fix the specification until it is agent-proof |
| Quality | Eyeball the result | Define evaluation design (test cases) up front |
| Hand-off | Copy-paste code into the editor | Hand the specification to a long-running worker |

The discovery loop front-loads the work where it is cheapest: in conversation, before any code exists.

The complete specification example below shows the output this workflow produces.

Complete Specification Example

The four specification stages produce concise, structured documents. The example below shows what a complete specification looks like when all four disciplines from The Four Prompting Disciplines are applied. This is a real-scale example, not a simplified illustration.

Notice what makes this specification agent-executable: every section is self-contained, acceptance criteria are verifiable by an independent observer, the decomposition defines clear module boundaries, and test cases include known-good outputs.

What to notice:

  • Self-contained: An agent receiving only this document can implement without asking clarifying questions. That is the self-containment test.
  • Decomposed with boundaries: Each module has explicit inputs and outputs. An orchestrator can route each module to a separate agent session (see Small-Batch Sessions).
  • Acceptance criteria are observable: Each criterion describes a user-visible outcome, not an internal implementation detail. These map directly to Acceptance Criteria.
  • Test cases include expected outputs: The evaluation design gives the agent known-good results to verify against, which is the specification engineering skill of evaluation design.