Pipeline reference architectures for single-team, multi-team, and distributed service delivery, with quality gates sequenced by defect detection priority.
This section defines quality gates sequenced by defect detection priority and three
pipeline patterns that apply them. Quality gates are derived from the
Systemic Defect Fixes catalog and sequenced so the cheapest, fastest
checks run first.
Gates marked with [Pre-Feature] must be in place and passing before any new feature
work begins. They form the baseline safety net that every commit runs through. Adding
features without these gates means defects accumulate faster than the team can detect them.
Gates marked with ▲ are enhanced by AI - the AI shifts
detection earlier or catches issues that rule-based tools miss. See the
Systemic Defect Fixes catalog for details.
Quality Gates in Priority Sequence
The gate sequence follows a single principle: fail fast, fail cheap. Gates that catch
the most common defects with the least execution time run first. Each gate listed below
maps to one or more defect sources from the catalog.
Pre-commit Gates
These run on the developer’s machine before code leaves the workstation. They provide
sub-second to sub-minute feedback.
These gates must be active before starting feature work
Without these gates passing on every commit to trunk, defects accumulate faster than the
team can detect them. If any are missing, add them before writing new features. The
Foundations phase covers how to establish
this baseline.
Linting and formatting
Static type checking
Secret scanning
SAST for injection patterns
Compilation / build
Unit tests
Dependency vulnerability scan
Contract tests at every integration boundary
Schema migration validation
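As an illustration of the fail-fast ordering, the sketch below runs local gates in sequence and stops at the first failure. The make targets are placeholders, not part of any specific toolchain; substitute your own lint, type-check, secret-scan, and unit-test commands.

```python
#!/usr/bin/env python3
"""Fail-fast local gate runner (sketch). The make targets are placeholders;
substitute your own lint, type-check, secret-scan, and unit-test commands."""
import subprocess
import sys
import time

# Cheapest, fastest checks first so feedback arrives as early as possible.
GATES = [
    ("lint + format", ["make", "lint"]),
    ("static types", ["make", "typecheck"]),
    ("secret scan", ["make", "secrets"]),
    ("unit tests", ["make", "test-unit"]),
]

def main() -> int:
    for name, cmd in GATES:
        start = time.monotonic()
        result = subprocess.run(cmd)
        elapsed = time.monotonic() - start
        if result.returncode != 0:
            print(f"gate FAILED: {name} ({elapsed:.1f}s) - fix before committing")
            return result.returncode  # fail fast: later gates never run
        print(f"gate passed: {name} ({elapsed:.1f}s)")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```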
Pipeline Patterns
These three patterns apply the quality gates above to progressively more complex team
and deployment topologies. Most organizations start with Pattern 1 and evolve toward
Pattern 3 as team count and deployment independence requirements grow.
Single Team, Single Deployable - a single team owns a modular monolith with one
linear pipeline from commit to production
Multiple Teams, Single Deployable - multiple teams own
sub-domain modules within a shared modular monolith, each with its own sub-pipeline
feeding a thin integration pipeline
Independent Teams, Independent Deployables - each team owns an independently
deployable service with its own pipeline, with API contract verification replacing
integration testing
Each quality gate above is derived from the Systemic Defect Fixes
catalog. The catalog organizes defects by origin - product and discovery, integration,
knowledge, change and complexity, testing gaps, process, data, dependencies, security, and
performance. The pipeline gates are the automated enforcement points for the systemic
prevention strategies described in the catalog.
Gates marked with ▲ correspond to catalog entries where AI
shifts detection earlier than current rule-based automation. For expert agent patterns that
implement these gates in an agentic CD context, see
ACD Pipeline Enforcement.
When adding or removing gates, consult the catalog to ensure that no defect category loses
its detection point. A gate that seems redundant may be the only automated check for a
specific defect source.
Further Reading
For a deeper treatment of pipeline design, stage sequencing, and deployment strategies, see
Dave Farley’s
Continuous Delivery Pipelines, which covers pipeline
architecture patterns in detail.
Phase 2: Pipeline - the migration phase that establishes the pipeline
Slow Pipelines - what happens when pipeline architecture is not optimized
ACD - additional pipeline constraints when AI agents contribute changes
1.1 - Single Team, Single Deployable
A linear pipeline pattern for a single team owning a modular monolith.
This architecture suits a team of up to 8-10 people owning a
modular monolith - a single deployable
application with well-defined internal module boundaries. The codebase is organized by
domain, not by technical layer. Each module encapsulates its own data, logic, and
interfaces, communicating with other modules through explicit internal APIs. The
application deploys as one unit, but its internal structure makes it possible to reason
about, test, and change one module without understanding the entire codebase. The pipeline
is linear with parallel stages where dependencies allow.
graph TD
classDef prefeature fill:#0d7a32,stroke:#0a6128,color:#fff
classDef ci fill:#224968,stroke:#1a3a54,color:#fff
classDef parallel fill:#30648e,stroke:#224968,color:#fff
classDef accept fill:#6c757d,stroke:#565e64,color:#fff
classDef prod fill:#a63123,stroke:#8a2518,color:#fff
A["Pre-commit Gates<br/><small>Lint, Types, Secrets, SAST</small>"]:::prefeature
B["Build + Unit Tests"]:::prefeature
C["Contract + Schema Tests"]:::prefeature
D["Security Scans"]:::parallel
E["Performance Benchmarks"]:::parallel
F["Acceptance Tests<br/><small>Production-Like Env</small>"]:::accept
G["Create Immutable Artifact"]:::ci
H["Deploy Canary / Progressive"]:::prod
I["Health Checks + SLO Monitors<br/>Auto-Rollback"]:::prod
A -->|"commit to trunk"| B
B --> C
C --> D & E
D --> F
E --> F
F --> G
G --> H
H --> I
Key Characteristics
One pipeline, one artifact: The entire application builds and deploys as a single
immutable artifact. There is no fan-out or fan-in.
Linear with parallel branches: Security scans and performance benchmarks run in
parallel because neither depends on the other. Everything else is sequential.
Trunk-based development: All developers commit to trunk at least daily. The pipeline
runs on every commit.
Total target time: Under 15 minutes from commit to production-ready artifact.
Acceptance tests may extend this to 20 minutes for complex applications.
Ownership: The team owns the pipeline definition, which lives in the same repository
as the application code.
When This Architecture Breaks Down
This architecture stops working when:
The system becomes too large for a single team to manage.
Build times exceed the target even after optimization, slowing the team's ability to respond quickly
Different parts of the application need different deployment cadences
When these symptoms appear, consider splitting into the
multi-team architecture or decomposing the application into
independently deployable services with their
own pipelines.
Related Content
Quality Gates - the full gate sequence this pipeline applies
Pipeline Architecture - how to evolve pipeline architecture from entangled to loosely coupled
1.2 - Multiple Teams, Single Deployable
A sub-pipeline pattern for multiple teams contributing domain modules to a shared modular monolith.
This architecture suits organizations where multiple teams contribute to a single
deployable modular monolith - a common
pattern for large applications, mobile apps, or platforms where the final artifact must
be assembled from team contributions.
The modular monolith structure is what makes multi-team ownership possible. Each team
owns a specific module representing a bounded sub-domain of the application. Team A
might own checkout and payments, Team B owns inventory and fulfillment, Team C owns
user accounts and authentication. Modules communicate through explicit internal APIs,
not by reaching into each other’s database tables or calling private methods. Each
team’s sub-pipeline validates only their module. A shared integration pipeline assembles
and verifies the combined result.
This ownership model is critical. Without clear module boundaries, teams step on each
other’s code, sub-pipelines trigger on unrelated changes, and merge conflicts replace
pipeline contention as the bottleneck. The module split must follow the application’s
domain boundaries, not its technical layers. A team that owns “the database layer” or
“the API controllers” will always be coupled to every other team. A team that owns
“payments” can change its database, API, and UI independently. If the codebase is not
yet structured as a modular monolith, restructure it before adopting this architecture;
otherwise the sub-pipelines will constantly interfere with each other.
Module ownership by domain: Each team owns a bounded module of the application’s
functionality. Ownership is defined by domain, not by technical layer. The team is
responsible for all code, tests, and pipeline configuration within their module.
Team-owned sub-pipelines: Each team runs their own pre-commit, build, unit test,
contract test, and security gates independently. A team’s sub-pipeline validates only
their module and is their fast feedback loop.
Contract tests at both levels: Teams run contract tests in their sub-pipeline to
catch boundary issues at the module edges. The integration pipeline runs cross-module
contract tests to verify the assembled result.
Integration pipeline is thin: The integration pipeline does not re-run each team’s
tests. It validates only what cannot be validated in isolation - cross-module
integration, the assembled artifact, and end-to-end acceptance tests.
Sub-pipeline target time: Under 10 minutes. This is the team’s primary feedback loop
and must stay fast.
Integration pipeline target time: Under 15 minutes. If it grows beyond this, the
integration test suite needs decomposition or the application needs architectural changes
to enable independent deployment.
Trunk-based development with path filters: All teams commit to the same trunk.
Sub-pipelines trigger based on path filters aligned to module boundaries, so a
change to the payments module does not trigger the inventory sub-pipeline.
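Most CI platforms support path filters natively; as an illustration of the underlying logic, the sketch below maps changed file paths to the sub-pipelines that should run. The module directory layout and pipeline names are assumptions for the example.

```python
"""Sketch: decide which sub-pipelines to trigger from the changed file paths.
The module directory layout and pipeline names are illustrative assumptions."""
from pathlib import PurePosixPath

# Module boundary -> owning team's sub-pipeline.
MODULE_PIPELINES = {
    "modules/payments": "payments-sub-pipeline",
    "modules/inventory": "inventory-sub-pipeline",
    "modules/accounts": "accounts-sub-pipeline",
}

def pipelines_to_trigger(changed_files: list[str]) -> set[str]:
    triggered: set[str] = set()
    for path in changed_files:
        prefix = "/".join(PurePosixPath(path).parts[:2])  # e.g. "modules/payments"
        pipeline = MODULE_PIPELINES.get(prefix)
        if pipeline is None:
            # Shared code outside any module boundary: run every sub-pipeline.
            return set(MODULE_PIPELINES.values())
        triggered.add(pipeline)
    return triggered

# A change to the payments module triggers only the payments sub-pipeline.
assert pipelines_to_trigger(["modules/payments/api.py"]) == {"payments-sub-pipeline"}
```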
Preventing the Integration Pipeline from Becoming a Bottleneck
The integration pipeline is a shared resource and the most likely bottleneck in this
architecture. To keep it fast:
Move tests left into sub-pipelines: Every test that can run in a sub-pipeline should
run there. The integration pipeline should only contain tests that require the full
assembled artifact.
Use contract tests aggressively: Contract tests in sub-pipelines catch most
integration issues without needing the full system. The integration pipeline’s contract
tests are a verification layer, not the primary detection point.
Run the integration pipeline on every commit to trunk: Do not batch. Batching
creates large changesets that are harder to debug when they fail.
Parallelize acceptance tests: Group acceptance tests by feature area and run groups
in parallel.
Monitor integration pipeline duration: Set an alert if it exceeds 15 minutes. Treat
this the same as a failing test - fix it immediately.
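As a minimal sketch of that duration alert, assuming pipeline durations are already available from your CI platform and notify() is a placeholder for your chat or paging integration:

```python
"""Sketch: flag integration pipeline runs that exceed the duration budget.
notify() is a placeholder for your chat or paging integration."""
from datetime import timedelta

BUDGET = timedelta(minutes=15)

def check_duration(pipeline_name: str, duration: timedelta, notify) -> None:
    if duration > BUDGET:
        overrun = (duration - BUDGET).total_seconds() / 60
        notify(f"{pipeline_name} ran {overrun:.0f} min over the 15-minute budget - "
               "treat this like a failing test and fix it immediately")
```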
When to Move Away from This Architecture
This architecture is a pragmatic pattern for organizations that cannot yet decompose their
monolith into independently deployable services. The long-term goal is
loose coupling -
independent services with independent pipelines that do not need a shared integration step.
Signs you are ready to decompose:
Contract tests catch virtually all integration issues in sub-pipelines
The integration pipeline adds little value beyond what sub-pipelines already verify
Teams are blocked by integration pipeline queuing more than once per week
Different parts of the application need different deployment cadences
Related Content
Quality Gates - the full gate sequence this pipeline applies
Team Alignment to Code - how to structure teams around domain boundaries so this pipeline pattern works
1.3 - Independent Teams, Independent Deployables
A fully independent pipeline pattern for teams deploying their own services in any order, with API contract verification replacing integration testing.
This is the target architecture for continuous delivery at scale. Each team owns an
independently deployable service with its own pipeline, its own release cadence, and
its own path to production. No team waits for another team to deploy. No integration
pipeline serializes their work. The only shared infrastructure is the API contract
layer that defines how services communicate.
This architecture demands disciplined API management. Without it, independent deployment
is an illusion - teams deploy whenever they want, but they break each other constantly.
Fully independent deployment: Each team deploys on its own schedule. Team A can
deploy ten times a day while Team C deploys once a week. No coordination is required.
No shared integration pipeline: There is no fan-in step. Each pipeline goes
straight from artifact creation to production. This eliminates the integration bottleneck
entirely.
Contract tests replace integration tests: Instead of testing all services together,
each team verifies its API contracts independently. The level of contract verification
depends on how much coordination is possible between teams (see
contract verification approaches below).
Each team owns its full pipeline: From pre-commit to production monitoring. No
shared pipeline definitions, no central platform team gating deployments.
Why API Management Is Critical
Independent deployment only works when teams can change their service without breaking
others. This requires a shared understanding of API boundaries that is enforced
automatically, not through meetings or documents that drift.
Without API management, independent pipelines create independent failures. Teams
deploy incompatible changes, discover the breakage in production, and revert to
coordinated releases to stop the bleeding. This is worse than the multi-team architecture
because it creates the illusion of independence while delivering the reliability of chaos.
What API Management Requires
Published API schemas: Every service publishes its API contract (OpenAPI, AsyncAPI,
Protobuf, or equivalent) as a versioned artifact. The schema is the source of truth for
what the service provides.
Contract verification (see approaches below):
At minimum, providers verify backward compatibility against their own published schema.
Where cross-team coordination is feasible, consumer-driven contracts add stronger
guarantees.
Backward compatibility enforcement: Every API change is checked for backward
compatibility against the published schema. Breaking changes require a new API version
using the expand-then-contract pattern:
Deploy the new version alongside the old
Migrate consumers to the new version
Remove the old version only after all consumers have migrated
Schema registry: A central registry (Confluent Schema Registry, a simple artifact
repository, or a Pact Broker where consumer-driven contracts are used) stores published
schemas. Pipelines pull from this registry to run compatibility checks. The registry is
shared infrastructure, but it does not gate deployments - it provides data that each
team’s pipeline uses to make its own go/no-go decision.
API versioning strategy: Teams agree on a versioning convention (URL path versioning,
header versioning, or semantic versioning for message schemas) and enforce it through
pipeline gates. The convention must be simple enough that every team follows it without
deliberation.
Contract Verification Approaches
Not all teams can coordinate on shared contract tooling. The right approach depends on
the relationship between provider and consumer teams. These approaches are listed from
least to most coordination required. Use the strongest approach your context supports.
| Approach | How It Works | Coordination Required | Best When |
|---|---|---|---|
| Provider schema compatibility | Provider's pipeline checks every change for backward compatibility against its own published schema (e.g., OpenAPI diff). No consumer involvement needed. | None between teams | Teams are in different organizations, or consumers are external/unknown |
| Provider-maintained consumer tests | Provider team writes tests that exercise known consumer usage patterns based on API analytics, documentation, or past breakage. | Minimal - provider observes consumers | Provider can see consumer traffic patterns but cannot require consumer participation |
| Consumer-driven contracts | Consumers publish pacts describing the subset of the provider API they depend on. Provider runs these pacts in its pipeline. See Contract Tests. | High - shared tooling, broker, and agreement to maintain pacts | Teams are in the same organization with shared tooling and willingness to maintain pacts |
Most organizations use a mix. Internal teams with shared tooling can adopt consumer-driven
contracts. Teams consuming third-party or cross-organization APIs use provider schema
compatibility checks and provider-maintained consumer tests.
The critical requirement is not which approach you use but that every provider pipeline
verifies backward compatibility before deployment. The minimum viable contract
verification is an automated schema diff against the published API - if the diff contains
a breaking change, the pipeline fails.
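A minimal sketch of that gate, assuming the old and new OpenAPI documents are already parsed into dictionaries; it flags removed paths and operations as breaking. Purpose-built schema-diff tools cover many more breaking-change cases and are preferable in practice.

```python
"""Sketch: fail the pipeline when an API change is not backward compatible.
Assumes both OpenAPI documents are already parsed into dicts (e.g. from YAML)."""

def breaking_changes(old_spec: dict, new_spec: dict) -> list[str]:
    problems = []
    old_paths = old_spec.get("paths", {})
    new_paths = new_spec.get("paths", {})
    for path, old_ops in old_paths.items():
        if path not in new_paths:
            # Removing a path breaks every consumer still calling it.
            problems.append(f"removed path: {path}")
            continue
        for method in ("get", "post", "put", "patch", "delete"):
            if method in old_ops and method not in new_paths[path]:
                problems.append(f"removed operation: {method.upper()} {path}")
    # A fuller check would also cover new required parameters, removed response
    # fields, and changed types; dedicated schema-diff tools handle these cases.
    return problems

def compatibility_gate(old_spec: dict, new_spec: dict) -> None:
    problems = breaking_changes(old_spec, new_spec)
    if problems:
        raise SystemExit("Breaking API change(s): " + "; ".join(problems))
```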
Additional Quality Gates for Distributed Architectures
This architecture is the goal for organizations with:
Multiple teams that need different deployment cadences
Services with well-defined, stable API boundaries
Teams mature enough to own their full delivery pipeline
Investment in contract testing tooling and API governance
When This Architecture Fails
Shared database schemas: Multiple services can share a database engine without
problems. The failure mode is shared schemas - when Service A and Service B both read
from and write to the same tables, a schema migration by one service can break the
other’s queries. Each service must own its own schema. If two services need the same
data, expose it through an API or event, not through direct table access.
Synchronous dependency chains: If Service A calls Service B which calls Service C
in the request path, a deployment of C can break A through B. Circuit breakers and
fallbacks are required at every boundary, and contract tests must cover failure modes,
not just success paths. A minimal circuit-breaker sketch follows this list.
No contract verification discipline: If teams skip backward compatibility checks
or let contract test failures slide, breakage shifts from the pipeline to production.
The architecture degrades into uncoordinated deployments with production as the
integration environment. At minimum, every provider must run automated schema
compatibility checks - even without consumer-driven contracts.
Missing observability: When services deploy independently, debugging production
issues requires distributed tracing, correlated logging, and SLO monitoring across
service boundaries. Without this, independent deployment means independent
troubleshooting with no way to trace cause and effect.
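For the synchronous dependency chain failure mode above, a circuit breaker is the standard defense. The sketch below is an illustrative minimal breaker, not a substitute for a hardened library: after repeated failures it stops calling the downstream service and returns a fallback until a cooldown elapses.

```python
"""Sketch: a minimal circuit breaker guarding calls to a downstream service.
Illustrative only - production systems should use a hardened library."""
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker tripped

    def call(self, func, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback  # breaker open: skip the downstream call entirely
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback
        self.failures = 0  # success closes the breaker again
        return result
```

Wrapping the Service B client in a breaker like this means a bad deployment of Service C degrades A's responses to fallbacks rather than cascading timeouts through the chain.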
Relationship to the Other Architectures
Architecture 3 is where Architecture 2 teams evolve to: from a single team with one deployable, to multiple teams sharing one deployable, to independent teams with independent deployables.
The move from 2 to 3 happens incrementally. Extract one service at a time. Give it
its own pipeline. Establish contract tests between it and the monolith. When the contract
tests are reliable, stop running the extracted service’s code through the integration
pipeline. Repeat until the integration pipeline is empty.
Related Content
Quality Gates - the full gate sequence this pipeline applies
A catalog of defect sources across the delivery value stream with earliest detection points, AI shift-left opportunities, and systemic prevention strategies.
Defects do not appear randomly. They originate from specific, predictable sources in the delivery
value stream. This reference catalogs those sources so teams can shift detection left, automate
where possible, and apply AI where it adds real value to the feedback loop.
The goal is systems thinking: detect issues as early as possible in the value stream so feedback informs continuous improvement in how we work, not just reactive fixes to individual defects.
▲ AI shifts detection earlier than current automation alone
Dark cells = current automation is sufficient; AI adds no additional value
No marker = AI assists at the current detection point but does not shift it earlier
How to Use This Catalog
Pick your pain point. Find the category where your team loses the most time to defects or rework. Start there, not at the top.
Focus on the Systemic Prevention column. Automated detection catches defects faster, but systemic prevention eliminates entire categories. Prioritize the prevention fix for each issue you selected.
Measure before and after. Track defect escape rate by category and time-to-detection. If the systemic fix is working, both metrics improve within weeks.
AI adds the most value where detection requires reasoning across multiple signals that existing
tools cannot correlate: ambiguous requirements, undocumented assumptions, semantic code impact,
and knowledge gaps. Where deterministic tools already solve the problem (infrastructure drift,
null safety, branch age), AI adds cost without benefit. Look for the ▲ markers to find the highest-value AI opportunities.
Related Content
ACD - Extend continuous delivery with constraints for AI agent-generated changes
Defects that originate before a single line of code is written - the most expensive category because they compound through every downstream phase.
These defects originate before a single line of code is written. They are the most expensive to
fix because they compound through every downstream phase.
| Issue | Earliest Detection (Automation) | Automated Detection | Earlier Detection with AI | Systemic Prevention |
|---|---|---|---|---|
| Building the wrong thing | Discovery | Product analytics platforms, usage trend alerts | ▲ Synthesize user feedback, support tickets, and usage data to surface misalignment earlier than production metrics | Validated user research before backlog entry; dual-track agile |
| Solving a problem nobody has | Discovery | Support ticket clustering tools, feature adoption tracking | ▲ Semantic analysis of interview transcripts, forums, and support tickets to identify real vs. assumed pain | Problem validation as a stage gate; publish problem brief before solution |
| Correct problem, wrong solution | Discovery | A/B testing frameworks, feature flag cohort comparison | Evaluate prototypes against problem definitions; generate alternative approaches | Prototype multiple approaches; measurable success criteria first |
| Meets spec but misses user intent | Requirements | Session replay tools, rage-click and error-loop detection | ▲ Review acceptance criteria against user behavior data to flag misalignment | Acceptance criteria focused on user outcomes, not checklists |
| Over-engineering beyond need | Design | Static analysis for dead code and unused abstractions | ▲ Flag unnecessary abstraction layers and premature optimization in code review | YAGNI principle; justify every abstraction layer |
| Prioritizing wrong work | Discovery | DORA metrics versus business outcomes, WSJF scoring | Synthesize roadmap, customer data, and market signals to surface opportunity costs | WSJF prioritization with outcome data |
| Inaccessible UI excludes users | Pre-commit | axe-core, pa11y, Lighthouse accessibility audits | Current tooling sufficient | WCAG compliance as acceptance criteria; automated accessibility checks in pipeline |
Related Content
Defect Sources - full catalog overview and how to use it
Anti-Patterns - patterns that undermine delivery performance
2.2 - Integration & Boundaries Defects
Defects at system boundaries that are invisible to unit tests and often survive until production. Contract testing and deliberate boundary design are the primary defenses.
Defects at system boundaries are invisible to unit tests and often survive until production.
Contract testing and deliberate boundary design are the primary defenses.
Contract Tests - verify that your test doubles still match reality
2.3 - Knowledge & Communication Defects
Defects that emerge from gaps between what people know and what the code expresses - the hardest to detect with automated tools and the easiest to prevent with team practices.
These defects emerge from gaps between what people know and what the code expresses.
They are the hardest to detect with automated tools and the easiest to prevent with team practices.
| Issue | Earliest Detection (Automation) | Automated Detection | Earlier Detection with AI | Systemic Prevention |
|---|---|---|---|---|
| Implicit domain knowledge not in code | Coding | Magic number detection, code ownership analytics | ▲ Identify undocumented business rules and knowledge gaps from code and test analysis | Domain-Driven Design with ubiquitous language; embed rules in code |
| Ambiguous requirements | Requirements | Flag stories without acceptance criteria, BDD spec coverage tracking | ▲ Review requirements for ambiguity, missing edge cases, and contradictions; generate test scenarios | Three Amigos before work; example mapping; executable specs |
| Tribal knowledge loss | Coding | Bus factor analysis from commit history, single-author concentration alerts | ▲ Generate documentation from code and tests; flag documentation drift from implementation | Pair/mob programming as default; rotate on-call; living docs |
| Divergent mental models across teams | Design | Divergent naming detection, contract test failures | ▲ Compare terminology and domain models across codebases to detect semantic mismatches | Shared domain models; explicit bounded contexts |
Related Content
Defect Sources - full catalog overview and how to use it
Anti-Patterns - patterns that undermine delivery performance
2.4 - Change & Complexity Defects
Defects caused by the act of changing existing code. The larger the change and the longer it lives outside trunk, the higher the risk.
These defects are caused by the act of changing existing code. The larger the change and the
longer it lives outside trunk, the higher the risk.
Anti-Patterns - patterns that undermine delivery performance
2.5 - Testing & Observability Gap Defects
Defects that survive because the safety net has holes. The fix is not more testing - it is better-targeted testing and observability that closes the specific gaps.
These defects survive because the safety net has holes. The fix is not more testing: it is
better-targeted testing and observability that closes the specific gaps.
Anti-Patterns - patterns that undermine delivery performance
2.7 - Data & State Defects
Data defects are particularly dangerous because they can corrupt persistent state. Unlike code defects, data corruption often cannot be fixed by deploying a new version.
Data defects are particularly dangerous because they can corrupt persistent state. Unlike code
defects, data corruption often cannot be fixed by deploying a new version.
Issue
Earliest Detection (Automation)
Automated Detection
Earlier Detection with AI
Systemic Prevention
Schema migration and backward compatibility failures
Security and compliance defects are silent until they are catastrophic. The gap between what the code does and what policy requires is invisible without deliberate, automated verification at every stage.
Security and compliance defects are silent until they are catastrophic. They share a pattern:
the gap between what the code does and what policy requires is invisible without deliberate,
automated verification at every stage.
Anti-Patterns - patterns that undermine delivery performance
2.10 - Performance & Resilience Defects
Performance defects degrade gradually, often hiding behind averages until a threshold tips and the system fails under real load. Detection requires baselines, budgets, and automated enforcement - not periodic manual testing.
Performance defects are rarely binary. They degrade gradually, often hiding behind averages
until a threshold tips and the system fails under real load. Detection requires baselines,
budgets, and automated enforcement - not periodic manual testing.
Concise definitions of the core continuous delivery practices from MinimumCD.
These pages define the minimum practices required for continuous delivery. Each page covers
what the practice is, why it matters, and what the minimum criteria are. For migration
guidance and tactical how-to content, follow the links to the corresponding phase pages.
Integrate work to trunk at least daily with automated testing to maintain a releasable codebase.
Definition
Continuous Integration (CI) is the activity of each developer integrating work to the trunk of version control at least daily and verifying that the work is, to the best of our knowledge, releasable.
CI is not just about tooling - it is fundamentally about team workflow and working agreements.
All changes integrate into a single shared trunk with no intermediate branches.
“Trunk-based development has been shown to be a predictor of high performance in software development and delivery. It is characterized by fewer than three active branches in a code repository; branches and forks having very short lifetimes (e.g., less than a day) before being merged; and application teams rarely or never having ‘code lock’ periods when no one can check in code or do pull requests due to merging conflicts, code freezes, or stabilization phases.”
Accelerate by Nicole Forsgren Ph.D., Jez Humble & Gene Kim
Definition
Trunk-based development (TBD) is a team workflow where changes are integrated into the trunk with no intermediate integration (develop, test, etc.) branch. The two common workflows are making changes directly to the trunk or using very short-lived branches that branch from the trunk and integrate back into the trunk.
Release branches are an intermediate step that some choose on their path to continuous delivery while improving their quality processes in the pipeline. True CD releases from the trunk.
Minimum Activities Required
All changes integrate into the trunk
If branches from the trunk are used:
They originate from the trunk
They re-integrate to the trunk
They are short-lived and removed after the merge
What Is Improved
Smaller changes: TBD emphasizes small, frequent changes that are easier for the team to review and more resistant to impactful merge conflicts. Conflicts become rare and trivial.
We must test: TBD requires us to implement tests as part of the development process.
Better teamwork: We need to work more closely as a team. This has many positive impacts, not least we will be more focused on getting the team’s highest priority done.
Better work definition: Small changes require us to decompose the work into a level of detail that helps uncover things that lack clarity or do not make sense. This provides much earlier feedback on potential quality issues.
Replaces process with engineering: Instead of controlling the release of features with branches and process, we can control it with engineering techniques known as evolutionary coding methods. These techniques bring additional stability benefits that process controls cannot provide.
Reduces risk: Long-lived branches carry two common risks. First, the change will not integrate cleanly and the merge conflicts result in broken or lost features. Second, the branch will be abandoned, usually because of the first reason.
Migration Guidance
For detailed guidance on adopting TBD during your CD migration, see:
All deployments flow through one automated pipeline - no exceptions.
Definition
The deployment pipeline is the single, standardized path for all changes to reach any environment - development, testing, staging, or production. No manual deployments, no side channels, no “quick fixes” bypassing the pipeline. If it is not deployed through the pipeline, it does not get deployed.
Key Principles
Single path: All deployments flow through the same pipeline
No exceptions: Even hotfixes and rollbacks go through the pipeline
Automated: Deployment is triggered automatically after pipeline validation
Auditable: Every deployment is tracked and traceable
Consistent: The same process deploys to all environments
What Is Improved
Reliability: Every deployment is validated the same way
Traceability: Clear audit trail from commit to production
Consistency: Environments stay in sync
Speed: Automated deployments are faster than manual
Safety: Quality gates are never bypassed
Confidence: Teams trust that production matches what was tested
Recovery: Rollbacks are as reliable as forward deployments
Migration Guidance
For detailed guidance on establishing a single path to production, see:
Single Path to Production - Phase 2 pipeline practice with anti-patterns, code examples, and getting started steps
The same inputs to the pipeline always produce the same outputs.
Definition
A deterministic pipeline produces consistent, repeatable results. Given the same inputs (code, configuration, dependencies), the pipeline will always produce the same outputs and reach the same pass/fail verdict. The pipeline’s decision on whether a change is releasable is definitive - if it passes, deploy it; if it fails, fix it.
Key Principles
Repeatable: Running the pipeline twice with identical inputs produces identical results
Authoritative: The pipeline is the final arbiter of quality, not humans
Immutable: No manual changes to artifacts or environments between pipeline stages
Trustworthy: Teams trust the pipeline’s verdict without second-guessing
What Makes a Pipeline Deterministic
Version control everything: Source code, IaC, pipeline definitions, test data, dependency lockfiles, tool versions
Lock dependency versions: Always use lockfiles. Never rely on latest or version ranges.
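As one example of enforcing that rule, the sketch below fails the build when a Python requirements file contains anything other than exact pins. The file name and pinning convention are assumptions; adapt the same idea to whatever package manager your stack uses.

```python
"""Sketch: fail the build if any dependency is not pinned to an exact version.
Assumes a requirements.txt-style file; adapt the idea to your package manager."""
import re
import sys

EXACT_PIN = re.compile(r"^[A-Za-z0-9._\[\]-]+==\S+$")

def unpinned_dependencies(path: str) -> list[str]:
    bad = []
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            if not EXACT_PIN.match(line):
                bad.append(line)  # version ranges, "latest", or bare names
    return bad

if __name__ == "__main__":
    problems = unpinned_dependencies("requirements.txt")
    if problems:
        sys.exit("Unpinned dependencies break determinism: " + ", ".join(problems))
```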
Automated criteria that determine when a change is ready for production.
Definition
The “definition of deployable” is your organization’s agreed-upon set of non-negotiable quality criteria that every artifact must pass before it can be deployed to any environment. This definition should be automated, enforced by the pipeline, and treated as the authoritative verdict on whether a change is ready for deployment.
Key Principles
Pipeline is definitive: If the pipeline passes, the artifact is deployable - no exceptions
Automated validation: All criteria are checked automatically, not manually
Consistent across environments: The same standards apply whether deploying to test or production
Fails fast: The pipeline rejects artifacts that do not meet the standard immediately
What Should Be in Your Definition
Your definition of deployable should include automated checks for:
Accelerate - Nicole Forsgren, Jez Humble, Gene Kim
3.6 - Immutable Artifacts
Build once, deploy everywhere. The artifact is never modified after creation.
Definition
Central to CD is that we are validating the artifact with the pipeline. It is built once and deployed to all environments. A common anti-pattern is building an artifact for each environment. The pipeline should generate immutable, versioned artifacts.
Immutable Pipeline: Failures should be addressed by changes in version control so that two executions with the same configuration always yield the same results. Never go to the failure point, make adjustments in the environment, and re-start from that point.
Immutable Artifacts: Some package management systems allow the creation of release candidate versions. For example, it is common to find -SNAPSHOT versions in Java. However, this means the artifact’s behavior can change without modifying the version. Version numbers are cheap. If we are to have an immutable pipeline, it must produce an immutable artifact. Never use or produce -SNAPSHOT versions.
Immutability provides the confidence to know that the results from the pipeline are real and repeatable.
What Is Improved
Everything must be version controlled: source code, environment configurations, application configurations, and even test data. This reduces variability and improves the quality process.
Confidence in testing: The artifact validated in pre-production is byte-for-byte identical to what runs in production.
Faster rollback: Previous artifacts are unchanged in the artifact repository, ready to be redeployed.
Audit trail: Every artifact is traceable to a specific commit and pipeline run.
Migration Guidance
For detailed guidance on implementing immutable artifacts, see:
Immutable Artifacts - Phase 2 pipeline practice with anti-patterns, good patterns, and getting started steps
Test in environments that mirror production to catch environment-specific issues early.
Definition
It is crucial to leverage pre-production environments in your CD pipeline to run all of your tests (unit, integration, UAT, manual QA, E2E) early and often. Test environments increase interaction with new features and exposure to bugs - both of which are important prerequisites for reliable software.
Types of Pre-Production Environments
Most organizations employ both static and short-lived environments and utilize them for case-specific stages of the SDLC:
Staging environment: The last environment that teams run automated tests against prior to deployment, particularly for testing interaction between all new features after a merge. Its infrastructure reflects production as closely as possible.
Ephemeral environments: Full-stack, on-demand environments spun up on every code change. Each ephemeral environment is leveraged in your pipeline to run E2E, unit, and integration tests on every code change. These environments are defined in version control, created and destroyed automatically on demand. They are short-lived by definition but should closely resemble production. They replace long-lived “static” environments and the maintenance required to keep those stable.
What Is Improved
Infrastructure is kept consistent: Test environments deliver results that reflect real-world performance. Fewer unprecedented bugs reach production since using prod-like data and dependencies allows you to run your entire test suite earlier.
Test against latest changes: These environments rebuild upon code changes with no manual intervention.
Test before merge: Attaching an ephemeral environment to every PR enables E2E testing in your CI before code changes get deployed to staging.
Migration Guidance
For detailed guidance on implementing production-like environments, see:
Production-Like Environments - Phase 2 pipeline practice with environment parity, ephemeral environments, and getting started steps
Rollback on-demand means the ability to quickly and safely revert to a previous working version of your application at any time, without requiring special approval, manual intervention, or complex procedures. It should be as simple and reliable as deploying forward.
Key Principles
Fast: Rollback completes in minutes, not hours. Target < 5 minutes.
Automated: No manual steps or special procedures. Single command or click.
Safe: Rollback is validated just like forward deployment.
Simple: Any team member can execute it without specialized knowledge.
Tested: Rollback mechanism is regularly tested, not just used in emergencies.
What Is Improved
Mean Time To Recovery (MTTR): Drops from hours to minutes
Deployment frequency: Increases due to reduced risk
Team confidence: Higher willingness to deploy
Customer satisfaction: Faster incident resolution
On-call burden: Reduced stress for on-call engineers
Migration Guidance
For detailed guidance on implementing rollback capability, see:
Rollback - Phase 2 pipeline practice with blue-green, canary, feature flag, and database-safe rollback patterns
Separate what varies between environments from what does not.
Definition
Application configuration defines the internal behavior of your application and is bundled with the artifact. It does not vary between environments. This is distinct from environment configuration (secrets, URLs, credentials) which varies by deployment.
Detailed definitions for key delivery metrics. Understand what to measure and why.
These metrics help you assess your current delivery performance and track improvement
over time. Start with the metrics most relevant to your current phase.
How often developers integrate code changes to the trunk. A leading indicator of CI maturity and small batch delivery.
Definition
Integration Frequency measures the average number of production-ready pull requests
a team merges to trunk per day, normalized by team size. On a team of five
developers, healthy continuous integration practice produces at least five
integrations per day, roughly one per developer.
This metric is a direct indicator of how well a team practices
Continuous Integration.
Teams that integrate frequently work in small batches, receive fast feedback, and
reduce the risk associated with large, infrequent merges.
Integration Frequency formula
integrationFrequency = mergedPullRequests / day / numberOfDevelopers
A value of 1.0 or higher per developer per day indicates that work is being
decomposed into small, independently deliverable increments.
How to Measure
Count trunk merges. Track the number of pull requests (or direct commits)
merged to main or trunk each day.
Normalize by team size. Divide the daily count by the number of developers
actively contributing that day.
Calculate the rolling average. Use a 5-day or 10-day rolling window to
smooth daily variation and surface meaningful trends.
Most source control platforms expose this data through their APIs:
GitHub: list merged pull requests via the REST or GraphQL API.
GitLab: query merged merge requests per project.
Bitbucket: use the pull request activity endpoint.
Alternatively, count commits to the default branch if pull requests are not used.
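Once the merge timestamps are collected, the calculation is small. The sketch below assumes a list of merge dates pulled from one of the APIs above and computes the rolling per-developer frequency.

```python
"""Sketch: integration frequency = merges to trunk per developer per day."""
from datetime import date, timedelta

def integration_frequency(merge_dates: list[date], developers: int,
                          window_days: int = 10) -> float:
    """Average merges per developer per day over the trailing window."""
    cutoff = max(merge_dates) - timedelta(days=window_days)
    recent = [d for d in merge_dates if d > cutoff]
    return len(recent) / window_days / developers

# Example: 40 merges in the last 10 days from a team of 5 -> 0.8 per developer
# per day, just below the 1.0 target for healthy continuous integration.
```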
Targets
| Level | Integration Frequency (per developer per day) |
|---|---|
| Low | Less than 1 per week |
| Medium | A few times per week |
| High | Once per day |
| Elite | Multiple times per day |
The elite target aligns with trunk-based development, where developers push small
changes to the trunk multiple times daily and rely on automated testing and feature
flags to manage risk.
Common Pitfalls
Meaningless commits. Teams may inflate the count by integrating trivial or
empty changes. Pair this metric with code review quality and defect rate.
Breaking the trunk. Pushing faster without adequate test coverage leads to a
red build and slows the entire team. Always pair Integration Frequency with build
success rate and Change Fail Rate.
Counting the wrong thing. Merges to long-lived feature branches do not count.
Only merges to the trunk or main integration branch reflect true CI practice.
Ignoring quality. If defect rates rise as integration
frequency increases, the team is skipping quality steps. Use defect rate as a
guardrail metric.
Connection to CD
Integration Frequency is the foundational metric for Continuous Delivery. Without
frequent integration, every downstream metric suffers:
Smaller batches reduce risk. Each integration carries less change, making
failures easier to diagnose and fix.
Faster feedback loops. Frequent integration means the CI pipeline runs more
often, catching issues within minutes instead of days.
Enables trunk-based development. High integration frequency is incompatible
with long-lived branches. Teams naturally move toward short-lived branches or
direct trunk commits.
Reduces merge conflicts. The longer code stays on a branch, the more likely
it diverges from trunk. Frequent integration keeps the delta small.
Prerequisite for deployment frequency. You cannot deploy more often than you
integrate. Improving this metric directly unblocks improvements to
Release Frequency.
Time from code commit to a deployable artifact. A critical constraint on feedback speed and mean time to repair.
Definition
Build Duration measures the elapsed time from when a developer pushes a commit
until the CI pipeline produces a deployable artifact and all automated quality
gates have passed. This includes compilation, unit tests, integration tests, static
analysis, security scans, and artifact packaging.
Build Duration represents the minimum possible time between deciding to make a
change and having that change ready for production. It sets a hard floor on
Lead Time and directly constrains how quickly a team can
respond to production incidents.
This metric is sometimes referred to as “pipeline cycle time” or “CI cycle time.”
The book Accelerate references it as part of “hard lead time.”
How to Measure
Record the commit timestamp. Capture when the commit arrives at the CI
server (webhook receipt or pipeline trigger time).
Record the artifact-ready timestamp. Capture when the final pipeline stage
completes successfully and the deployable artifact is published.
Calculate the difference. Subtract the commit timestamp from the
artifact-ready timestamp.
Track the median and p95. The median shows typical performance. The 95th
percentile reveals worst-case builds that block developers.
Most CI platforms expose build duration natively:
GitHub Actions: createdAt and updatedAt on workflow runs.
GitLab CI: pipeline created_at and finished_at.
Jenkins: build start time and duration fields.
CircleCI: workflow duration in the Insights dashboard.
Set up alerts when builds exceed your target threshold so the team can investigate
regressions immediately.
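With the two timestamps captured, the median and p95 fall out directly. The sketch below assumes commit and artifact-ready timestamps have already been collected as pairs.

```python
"""Sketch: median and p95 build duration from (commit, artifact-ready) pairs."""
from datetime import datetime
from statistics import median, quantiles

def report(runs: list[tuple[datetime, datetime]]) -> str:
    durations = sorted(
        (ready - commit).total_seconds() / 60 for commit, ready in runs
    )
    p95 = quantiles(durations, n=20)[-1]  # 95th percentile (needs >= 2 samples)
    return f"median={median(durations):.1f} min  p95={p95:.1f} min"
```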
Targets
| Level | Build Duration |
|---|---|
| Low | More than 30 minutes |
| Medium | 10 to 30 minutes |
| High | 5 to 10 minutes |
| Elite | Less than 5 minutes |
The ten-minute threshold is a widely recognized guideline. Builds longer than ten
minutes break developer flow, discourage frequent integration, and increase the
cost of fixing failures.
Common Pitfalls
Removing tests to hit targets. Reducing test count or skipping test types
(integration, security) lowers build duration but degrades quality. Always pair
this metric with Change Fail Rate and defect rate.
Ignoring queue time. If builds wait in a queue before execution, the
developer experiences the queue time as part of the feedback delay even though it
is not technically “build” time. Measure wall-clock time from commit to result.
Optimizing the wrong stage. Profile the pipeline before optimizing. Often a
single slow test suite or a sequential step that could run in parallel dominates
the total duration.
Flaky tests. Tests that intermittently fail cause retries, effectively
doubling or tripling build duration. Track flake rate alongside build duration.
Connection to CD
Build Duration is a critical bottleneck in the Continuous Delivery pipeline:
Constrains Mean Time to Repair. When production is down, the build pipeline
is the minimum time to get a fix deployed. A 30-minute build means at least 30
minutes of downtime for any fix, no matter how small. Reducing build duration
directly improves MTTR.
Enables frequent integration. Developers are unlikely to integrate multiple
times per day if each integration takes 30 minutes to validate. Short builds
encourage higher Integration Frequency.
Shortens feedback loops. The sooner a developer learns that a change broke
something, the less context they have lost and the cheaper the fix. Builds under
ten minutes keep developers in flow.
Supports continuous deployment. Automated deployment pipelines cannot deliver
changes rapidly if the build stage is slow. Build duration is often the largest
component of Lead Time.
To improve Build Duration:
Parallelize stages. Run unit tests, linting, and security scans concurrently
rather than sequentially.
Replace slow end-to-end tests. Move heavyweight end-to-end tests to an
asynchronous post-deploy verification stage. Use contract tests and service
virtualization in the main pipeline.
Decompose large services. Smaller codebases compile and test faster. If build
duration is stubbornly high, consider breaking the service into smaller domains.
Cache aggressively. Cache dependencies, Docker layers, and compilation
artifacts between builds.
Set a build time budget. Alert the team whenever a new test or step pushes
the build past your target, so test efficiency is continuously maintained.
Average time from when work starts until it is running in production. A key flow metric for identifying delivery bottlenecks.
Definition
Development Cycle Time measures the elapsed time from when a developer begins work
on a story or task until that work is deployed to production and available to users.
It captures the full construction phase of delivery: coding, code review, testing,
integration, and deployment.
This is distinct from Lead Time, which includes the time a request
spends waiting in the backlog before work begins. Development Cycle Time focuses
exclusively on the active delivery phase.
The Accelerate research uses “lead time for changes” (measured from commit to
production) as a key DORA metric. Development Cycle Time extends this slightly
further back to when work starts, capturing the full development process including
any time between starting work and the first commit.
How to Measure
Record when work starts. Capture the timestamp when a story moves to
“In Progress” in your issue tracker, or when the first commit for the story
appears.
Record when work reaches production. Capture the timestamp of the
production deployment that includes the completed story.
Calculate the difference. Subtract the start time from the production
deploy time.
Report the median and distribution. The median provides a typical value.
The distribution (or a control chart) reveals variability and outliers that
indicate process problems.
Sources for this data include:
Issue trackers (Jira, GitHub Issues, Azure Boards): status transition
timestamps.
Source control: first commit timestamp associated with a story.
Deployment logs: timestamp of production deployments linked to stories.
Linking stories to deployments is essential. Use commit message conventions (e.g.,
story IDs in commit messages) or deployment metadata to create this connection.
Targets
| Level | Development Cycle Time |
|---|---|
| Low | More than 2 weeks |
| Medium | 1 to 2 weeks |
| High | 2 to 7 days |
| Elite | Less than 2 days |
Elite teams deliver completed work to production within one to two days of starting
it. This is achievable only when work is decomposed into small increments, the
pipeline is fast, and deployment is automated.
Common Pitfalls
Marking work “Done” before it reaches production. If “Done” means “code
complete” rather than “deployed,” the metric understates actual cycle time. The
Definition of Done must include production deployment.
Skipping the backlog. Moving items from “Backlog” directly to “Done” after
deploying hides the true wait time and development duration. Ensure stories pass
through the standard workflow stages.
Splitting work into functional tasks. Breaking a story into separate
“development,” “testing,” and “deployment” tasks obscures the end-to-end cycle
time. Measure at the story or feature level.
Ignoring variability. A low average can hide a bimodal distribution where
some stories take hours and others take weeks. Use a control chart or histogram
to expose the full picture.
Optimizing for speed without quality. If cycle time drops but
Change Fail Rate rises, the team is cutting corners.
Use quality metrics as guardrails.
Connection to CD
Development Cycle Time is the most comprehensive measure of delivery flow and sits
at the heart of Continuous Delivery:
Exposes bottlenecks. A long cycle time reveals where work gets stuck:
waiting for code review, queued for testing, blocked by a manual approval, or
delayed by a slow pipeline. Each bottleneck is a target for improvement.
Drives smaller batches. The only way to achieve a cycle time under two days
is to decompose work into very small increments. This naturally leads to smaller
changes, less risk, and faster feedback.
Reduces waste from changing priorities. Long cycle times mean work in progress
is exposed to priority changes, context switches, and scope creep. Shorter cycles
reduce the window of vulnerability.
Improves feedback quality. The sooner a change reaches production, the sooner
the team gets real user feedback. Short cycle times enable rapid learning and
course correction.
Total time from when a change is committed until it is running in production. A DORA key metric for delivery throughput.
Definition
Lead Time measures the total elapsed time from when a code change is committed to
the version control system until that change is successfully running in production.
This is one of the four key metrics identified by the DORA (DevOps Research and
Assessment) team as a predictor of software delivery performance.
In the broader value stream, “lead time” can also refer to the time from a customer
request to delivery. The DORA definition focuses specifically on the segment from
commit to production, which the Accelerate research calls “lead time for changes.”
This narrower definition captures the efficiency of your delivery pipeline and
deployment process.
Lead Time includes Build Duration plus any additional time
for deployment, approval gates, environment provisioning, and post-deploy
verification. It is a superset of build time and a subset of
Development Cycle Time, which also includes the
coding phase before the first commit.
How to Measure
Record the commit timestamp. Use the timestamp of the commit as recorded in
source control (not the local author timestamp, but the time it was pushed or
merged to the trunk).
Record the production deployment timestamp. Capture when the deployment
containing that commit completes successfully in production.
Calculate the difference. Subtract the commit time from the deploy time.
Aggregate across commits. Report the median lead time across all commits
deployed in a given period (daily, weekly, or per release).
Data sources:
Source control: commit or merge timestamps from Git, GitHub, GitLab, etc.
Pipeline platform: pipeline completion times from Jenkins, GitHub Actions,
GitLab CI, etc.
Deployment tooling: production deployment timestamps from Argo CD, Spinnaker,
Flux, or custom scripts.
For teams practicing continuous deployment, lead time may be nearly identical to
build duration. For teams with manual approval gates or scheduled release windows,
lead time will be significantly longer.
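If your deployment records list the commits they contain, the median lead time can be computed with a simple join, as in the sketch below; the record shapes are assumptions for the example.

```python
"""Sketch: median lead time = production deploy time minus trunk commit time.
Assumes each deployment record carries the commit SHAs it contains."""
from datetime import datetime
from statistics import median

def median_lead_time_hours(commit_times: dict[str, datetime],
                           deployments: list[tuple[datetime, list[str]]]) -> float:
    lead_times = []
    seen: set[str] = set()
    for deploy_time, shas in sorted(deployments, key=lambda d: d[0]):
        for sha in shas:
            if sha in commit_times and sha not in seen:
                seen.add(sha)  # attribute each commit to its first production deploy
                hours = (deploy_time - commit_times[sha]).total_seconds() / 3600
                lead_times.append(hours)
    return median(lead_times)
```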
Targets
| Level | Lead Time for Changes |
|---|---|
| Low | More than 6 months |
| Medium | 1 to 6 months |
| High | 1 day to 1 week |
| Elite | Less than 1 hour |
These levels are drawn from the DORA State of DevOps research. Elite performers
deliver changes to production in under an hour from commit, enabled by fully
automated pipelines and continuous deployment.
Common Pitfalls
Measuring only build time. Lead time includes everything after the commit,
not just the CI pipeline. Manual approval gates, scheduled deployment windows,
and environment provisioning delays must all be included.
Ignoring waiting time. A change may sit in a queue waiting for a release
train, a change advisory board (CAB) review, or a deployment window. This wait
time is part of lead time and often dominates the total.
Tracking requests instead of commits. Some teams measure from customer request
to delivery. While valuable, this conflates backlog prioritization with delivery
efficiency. Keep this metric focused on the commit-to-production segment.
Hiding items from the backlog. Requests tracked in spreadsheets or side
channels before entering the backlog distort lead time measurements. Ensure all
work enters the system of record promptly.
Reducing quality to reduce lead time. Shortening approval processes or
skipping test stages reduces lead time at the cost of quality. Pair this metric
with Change Fail Rate as a guardrail.
Connection to CD
Lead Time is one of the four DORA metrics and a direct measure of your delivery
pipeline’s end-to-end efficiency:
Reveals pipeline bottlenecks. A large gap between build duration and lead time
points to manual processes, approval queues, or deployment delays that the team
can target for automation.
Measures the cost of failure recovery. When production breaks, lead time is
the minimum time to deliver a fix (unless you roll back). This makes lead time
a direct input to Mean Time to Repair.
Drives automation. The primary way to reduce lead time is to automate every
step between commit and production: build, test, security scanning, environment
provisioning, deployment, and verification.
Reflects deployment strategy. Teams using continuous deployment have lead
times measured in minutes. Teams using weekly release trains have lead times
measured in days. The metric makes the cost of batching visible.
Connects speed and stability. The DORA research shows that elite performers
achieve both low lead time and low Change Fail Rate.
Speed and quality are not trade-offs. They reinforce each other when the
delivery system is well-designed.
To improve Lead Time:
Automate the deployment pipeline end to end, eliminating manual gates.
Replace change advisory board (CAB) reviews with automated policy checks and
peer review.
Deploy on every successful build rather than batching changes into release trains.
Reduce Build Duration to shrink the largest component of
lead time.
Monitor and eliminate environment provisioning delays.
Percentage of production deployments that cause a failure or require remediation. A DORA key metric for delivery stability.
Definition
Change Fail Rate measures the percentage of deployments to production that result
in degraded service, negative customer impact, or require immediate remediation
such as a rollback, hotfix, or patch. A deployment counts as a failed change when it:
Requires a hotfix deployed within a short window (commonly 24 hours).
Triggers a production incident attributed to the change.
Requires manual intervention to restore service.
This is one of the four DORA key metrics. It measures the stability side of
delivery performance, complementing the throughput metrics of
Lead Time and Release Frequency.
How to Measure
Count total production deployments over a defined period (weekly, monthly).
Count deployments classified as failures using the criteria above.
Divide failures by total deployments and express as a percentage.
Data sources:
Deployment logs: total deployment count from your CD platform.
Incident management: incidents linked to specific deployments (PagerDuty,
Opsgenie, ServiceNow).
Rollback records: deployments that were reverted, either manually or by
automated rollback.
Hotfix tracking: deployments tagged as hotfixes or emergency changes.
Automate the classification where possible. For example, if a deployment is
followed by another deployment of the same service within a defined window (e.g.,
one hour), flag the original as a potential failure for review.
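As a rough illustration of that heuristic, the sketch below scans a hypothetical array of
{ service, timestamp } deployment records pulled from your CD platform and flags any
deployment that was followed by another deployment of the same service within one hour:
const WINDOW_MS = 60 * 60 * 1000; // one hour

function flagPotentialFailures(deployments) {
  // deployments: [{ service, timestamp }] sorted by timestamp (ms since epoch)
  return deployments.filter((deploy, i) =>
    deployments.slice(i + 1).some(
      (next) =>
        next.service === deploy.service &&
        next.timestamp - deploy.timestamp <= WINDOW_MS
    )
  );
}
Flagged deployments still need human review - a quick follow-up deployment is not always
a remediation.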
Targets
Level    Change Fail Rate
Low      46 to 60%
Medium   16 to 45%
High     0 to 15%
Elite    0 to 5%
These levels are drawn from the DORA State of DevOps research. Elite performers
maintain a change fail rate below 5%, meaning fewer than 1 in 20 deployments causes
a problem.
Common Pitfalls
Not recording failures. Deploying fixes without logging the original failure
understates the true rate. Ensure every incident and rollback is tracked.
Reclassifying defects. Creating review processes that reclassify production
defects as “feature requests” or “known limitations” hides real failures.
Inflating deployment count. Re-deploying the same working version to increase
the denominator artificially lowers the rate. Only count deployments that contain
new changes.
Pursuing zero defects at the cost of speed. An obsessive focus on eliminating
all failures can slow Release Frequency to a crawl. A
small failure rate with fast recovery is preferable to near-zero failures with
monthly deployments.
Ignoring near-misses. Changes that cause degraded performance but do not
trigger a full incident are still failures. Define clear criteria for what
constitutes a failed change and apply them consistently.
Connection to CD
Change Fail Rate is the primary quality signal in a Continuous Delivery pipeline:
Validates pipeline quality gates. A rising change fail rate indicates that
the automated tests, security scans, and quality checks in the pipeline are not
catching enough defects. Each failure is an opportunity to add or improve a
quality gate.
Enables confidence in frequent releases. Teams will only deploy frequently
if they trust the pipeline. A low change fail rate builds this trust and
supports higher Release Frequency.
Smaller changes fail less. The DORA research consistently shows that smaller,
more frequent deployments have lower failure rates than large, infrequent
releases. Improving Integration Frequency naturally
improves this metric.
Drives root cause analysis. Each failed change should trigger a blameless
investigation: what automated check could have caught this? The answers feed
directly into pipeline improvements.
Balances throughput metrics. Change Fail Rate is the essential guardrail for
Lead Time and Release Frequency. If
those metrics improve while change fail rate worsens, the team is trading quality
for speed.
To improve Change Fail Rate:
Deploy smaller changes more frequently to reduce the blast radius of failures.
Identify the root cause of each failure and add automated checks to prevent
recurrence.
Strengthen the test suite, particularly integration and contract tests that
validate interactions between services.
Implement progressive delivery (canary releases, feature flags) to limit the
impact of defective changes before they reach all users.
Conduct blameless post-incident reviews and feed learnings back into the
delivery pipeline.
Average time from when a production incident is detected until service is restored. A DORA key metric for recovery capability.
Definition
Mean Time to Repair (MTTR) measures the average elapsed time between when a
production incident is detected and when it is fully resolved and service is
restored to normal operation.
MTTR reflects an organization’s ability to recover from failure. It encompasses
detection, diagnosis, fix development, build, deployment, and verification. A
short MTTR depends on the entire delivery system working well: fast builds,
automated deployments, good observability, and practiced incident response.
The Accelerate research identifies MTTR as one of the four key DORA metrics and
notes that “software delivery performance is a combination of lead time, release
frequency, and MTTR.” It is the stability counterpart to the throughput metrics.
How to Measure
Record the detection timestamp. This is when the team first becomes aware of
the incident, typically when an alert fires, a customer reports an issue, or
monitoring detects an anomaly.
Record the resolution timestamp. This is when the incident is resolved and
service is confirmed to be operating normally. Resolution means the customer
impact has ended, not merely that a fix has been deployed.
Calculate the duration for each incident.
Compute the average across all incidents in a given period.
Data sources:
Incident management platforms: PagerDuty, Opsgenie, ServiceNow, or
Statuspage provide incident lifecycle timestamps.
Monitoring and alerting: alert trigger times from Datadog, Prometheus
Alertmanager, CloudWatch, or equivalent.
Deployment logs: timestamps of rollbacks or hotfix deployments.
Report both the mean and the median. The mean can be skewed by a single long
outage, so the median gives a better sense of typical recovery time. Also track
the maximum MTTR per period to highlight worst-case incidents.
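A minimal sketch of that calculation, assuming each incident record carries detectedAt and
resolvedAt timestamps in milliseconds:
function repairStats(incidents) {
  // Durations in minutes, sorted ascending
  const durations = incidents
    .map((incident) => (incident.resolvedAt - incident.detectedAt) / 60000)
    .sort((a, b) => a - b);

  const mean = durations.reduce((sum, d) => sum + d, 0) / durations.length;
  const mid = Math.floor(durations.length / 2);
  const median =
    durations.length % 2 === 1
      ? durations[mid]
      : (durations[mid - 1] + durations[mid]) / 2;

  return { mean, median, max: durations[durations.length - 1] };
}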
Targets
Level    Mean Time to Repair
Low      More than 1 week
Medium   1 day to 1 week
High     Less than 1 day
Elite    Less than 1 hour
Elite performers restore service in under one hour. This requires automated
rollback or roll-forward capability, fast build pipelines, and well-practiced
incident response processes.
Common Pitfalls
Closing incidents prematurely. Marking an incident as resolved before the
customer impact has actually ended artificially deflates MTTR. Define “resolved”
clearly and verify that service is truly restored.
Not counting detection time. If the team discovers a problem informally
(e.g., a developer notices something odd) and fixes it before opening an
incident, the time is not captured. Encourage consistent incident reporting.
Ignoring recurring incidents. If the same issue keeps reappearing, each
individual MTTR may be short, but the cumulative impact is high. Track recurrence
as a separate quality signal.
Conflating MTTR with MTTD. Mean Time to Detect (MTTD) and Mean Time to
Repair overlap but are distinct. If you only measure from alert to resolution,
you miss the detection gap, the time between when the problem starts and when
it is detected. Both matter.
Optimizing MTTR without addressing root causes. Getting faster at fixing
recurring problems is good, but preventing those problems in the first place is
better. Pair MTTR with Change Fail Rate to ensure the
number of incidents is also decreasing.
Connection to CD
MTTR is a direct measure of how well the entire Continuous Delivery system supports
recovery:
Pipeline speed is the floor. The minimum possible MTTR for a roll-forward
fix is the Build Duration plus deployment time. A 30-minute
build means you cannot restore service via a code fix in less than 30 minutes.
Reducing build duration directly reduces MTTR.
Automated deployment enables fast recovery. Teams that can deploy with one
click or automatically can roll back or roll forward in minutes. Manual
deployment processes add significant time to every incident.
Feature flags accelerate mitigation. If a failing change is behind a feature
flag, the team can disable it in seconds without deploying new code. This can
reduce MTTR from minutes to seconds for flag-protected changes.
Observability shortens detection and diagnosis. Good logging, metrics, and
tracing help the team identify the cause of an incident quickly. Without
observability, diagnosis dominates the repair timeline.
Practice improves performance. Teams that deploy frequently have more
experience responding to issues. High Release Frequency
correlates with lower MTTR because the team has well-rehearsed recovery
procedures.
Trunk-based development simplifies rollback. When trunk is always deployable,
the team can roll back to the previous commit. Long-lived branches and complex
merge histories make rollback risky and slow.
To improve MTTR:
Keep the pipeline always deployable so a fix can be deployed at any time.
Reduce Build Duration so a roll-forward fix can reach production quickly.
Automate deployment and rollback so recovery does not depend on manual steps.
Put risky changes behind feature flags so they can be disabled without a new deployment.
Invest in observability (logging, metrics, and tracing) to shorten detection and diagnosis.
Practice incident response so recovery procedures are well-rehearsed.
How often changes are deployed to production. A DORA key metric for delivery throughput and team capability.
Definition
Release Frequency (also called Deployment Frequency) measures how often a team
successfully deploys changes to production. It is expressed as deployments per day,
per week, or per month, depending on the team’s current cadence.
This is one of the four DORA key metrics. It measures the throughput side of
delivery performance: how rapidly the team can get completed work into the hands
of users. Higher release frequency enables faster feedback, smaller batch sizes,
and reduced deployment risk.
Each deployment should deliver a meaningful change. Re-deploying the same artifact
or deploying empty changes does not count.
How to Measure
Count production deployments. Record each successful deployment to the
production environment over a defined period.
Exclude non-changes. Do not count re-deployments of unchanged artifacts,
infrastructure-only changes (unless relevant), or deployments to non-production
environments.
Calculate frequency. Divide the count by the time period. Express as
deployments per day (for high performers) or per week/month (for teams earlier
in their journey).
Data sources:
CD platforms: Argo CD, Spinnaker, Flux, Octopus Deploy, or similar tools
track every deployment.
Pipeline logs: GitHub Actions, GitLab CI, Jenkins, and CircleCI
record deployment job executions.
Custom deployment scripts: Add a logging line that records the timestamp,
service name, and version to a central log or metrics system.
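For the custom-script option above, a minimal sketch of such a logging line - the endpoint,
payload shape, and environment variables are assumptions to adapt to your own log or
metrics system:
async function recordDeployment() {
  await fetch("https://metrics.internal/deployments", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      service: process.env.SERVICE_NAME, // hypothetical environment variables
      version: process.env.GIT_SHA,
      deployedAt: new Date().toISOString(),
    }),
  });
}

recordDeployment();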
Targets
Level    Release Frequency
Low      Less than once per 6 months
Medium   Once per month to once per 6 months
High     Once per week to once per month
Elite    Multiple times per day
These levels are drawn from the DORA State of DevOps research. Elite performers
deploy on demand, multiple times per day, with each deployment containing a small
set of changes.
Common Pitfalls
Counting empty deployments. Re-deploying the same artifact or building
artifacts that contain no changes inflates the metric without delivering value.
Count only deployments with meaningful changes.
Ignoring failed deployments. If you count deployments that are immediately
rolled back, the frequency looks good but the quality is poor. Pair with
Change Fail Rate to get the full picture.
Equating frequency with value. Deploying frequently is a means, not an end.
Deploying 10 times a day delivers no value if the changes do not meet user needs.
Release Frequency measures capability, not outcome.
Batch releasing to hit a target. Combining multiple changes into a single
release to deploy “more often” defeats the purpose. The goal is small, individual
changes flowing through the pipeline independently.
Focusing on speed without quality. If release frequency increases but
Change Fail Rate also increases, the team is releasing
faster than its quality processes can support. Slow down and improve the pipeline.
Connection to CD
Release Frequency is the ultimate output metric of a Continuous Delivery pipeline:
Validates the entire delivery system. High release frequency is only possible
when the pipeline is fast, tests are reliable, deployment is automated, and the
team has confidence in the process. It is the end-to-end proof that CD is working.
Reduces deployment risk. Each deployment carries less change when deployments
are frequent. Less change means less risk, easier rollback, and simpler
debugging when something goes wrong.
Enables rapid feedback. Frequent releases get features and fixes in front of
users sooner. This shortens the feedback loop and allows the team to course-correct
before investing heavily in the wrong direction.
Exercises recovery capability. Teams that deploy frequently practice the
deployment process daily. When a production incident occurs, the deployment
process is well-rehearsed and reliable, directly improving
Mean Time to Repair.
Decouples deploy from release. At high frequency, teams separate the act of
deploying code from the act of enabling features for users. Feature flags,
progressive delivery, and dark launches become standard practice.
Number of work items started but not yet completed. A leading indicator of flow problems, context switching, and delivery delays.
Definition
Work in Progress (WIP) is the total count of work items that have been started but
not yet completed and delivered to production. This includes all types of work:
stories, defects, tasks, spikes, and any other items that a team member has begun
but not finished.
Work in Progress formula
wip = countOf(items where status is between "started" and "done")
WIP is a leading indicator from Lean manufacturing. Unlike trailing metrics such as
Development Cycle Time or
Lead Time, WIP tells you about problems that are happening right
now. High WIP predicts future delivery delays, increased cycle time, and lower
quality.
Little’s Law provides the mathematical relationship:
Little’s Law: cycle time as a function of WIP
cycleTime = wip / throughput
If throughput (the rate at which items are completed) stays constant, increasing WIP
directly increases cycle time. The only way to reduce cycle time without working
faster is to reduce WIP.
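A worked example with hypothetical numbers: a team with 10 items in progress and a
throughput of 2 completed items per day has an average cycle time of 10 / 2 = 5 days.
Cutting WIP to 5 while throughput stays at 2 per day brings average cycle time down to
2.5 days, without anyone working faster.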
How to Measure
Count all in-progress items. At a regular cadence (daily or at each standup),
count the number of items in any active state on your team’s board. Include
everything between “To Do” and “Done.”
Normalize by team size. Divide WIP by the number of team members to get a
per-person ratio. This makes the metric comparable across teams of different sizes.
Track over time. Record the WIP count daily and observe trends. A rising WIP
count is an early warning of delivery problems.
Data sources:
Kanban boards: Jira, Azure Boards, Trello, GitHub Projects, or physical
boards. Count cards in any column between the backlog and done.
Issue trackers: Query for items with an “In Progress,” “In Review,”
“In QA,” or equivalent active status.
Manual count: At standup, ask: “How many things are we actively working on
right now?”
The simplest and most effective approach is to make WIP visible by keeping the team
board up to date and counting active items daily.
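A minimal sketch of that daily count, assuming a board export with a status field and
statuses named as below:
const ACTIVE_STATUSES = ["In Progress", "In Review", "In QA"];

function wipReport(items, teamSize) {
  // items: [{ title, status }] exported from the team board
  const wip = items.filter((item) => ACTIVE_STATUSES.includes(item.status)).length;
  return { wip, perPerson: wip / teamSize };
}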
Targets
Level    WIP per Team
Low      More than 2x team size
Medium   Between 1x and 2x team size
High     Equal to team size
Elite    Less than team size (ideally half)
The guiding principle is that WIP should never exceed team size. A team of five
should have at most five items in progress at any time. Elite teams often work
in pairs, bringing WIP to roughly half the team size.
Common Pitfalls
Hiding work. Not moving items to “In Progress” when working on them keeps
WIP artificially low. The board must reflect reality. If someone is working on
it, it should be visible.
Marking items done prematurely. Moving items to “Done” before they are
deployed to production understates WIP. The Definition of Done must include
production deployment.
Creating micro-tasks. Splitting a single story into many small tasks
(development, testing, code review, deployment) and tracking each separately
inflates the item count without changing the actual work. Measure WIP at the
story or feature level.
Ignoring unplanned work. Production support, urgent requests, and
interruptions consume capacity but are often not tracked on the board. If the
team is spending time on it, it is WIP and should be visible.
Setting WIP limits but not enforcing them. WIP limits only work if the team
actually stops starting new work when the limit is reached. Treat WIP limits as
a hard constraint, not a suggestion.
Connection to CD
WIP is the most actionable flow metric and directly impacts every aspect of
Continuous Delivery:
Predicts cycle time. Per Little’s Law, WIP and cycle time are directly
proportional. Reducing WIP is the fastest way to reduce
Development Cycle Time without changing anything
else about the delivery process.
Reduces context switching. When developers juggle multiple items, they lose
time switching between contexts. Research consistently shows that each additional
item in progress reduces effective productivity. Low WIP means more focus and
faster completion.
Exposes blockers. When WIP limits are in place and an item gets blocked, the
team cannot simply start something new. They must resolve the blocker first. This
forces the team to address systemic problems rather than working around them.
Enables continuous flow. CD depends on a steady flow of small changes moving
through the pipeline. High WIP creates irregular, bursty delivery. Low WIP
creates smooth, predictable flow.
Improves quality. When teams focus on fewer items, each item gets more
attention. Code reviews happen faster, testing is more thorough, and defects are
caught sooner. This naturally reduces Change Fail Rate.
Supports trunk-based development. High WIP often correlates with many
long-lived branches. Reducing WIP encourages developers to complete and integrate
work before starting something new, which aligns with
Integration Frequency goals.
To reduce WIP:
Set explicit WIP limits for the team and enforce them. Start with a limit equal
to team size and reduce it over time.
Prioritize finishing work over starting new work. At standup, ask “What can I
help finish?” before “What should I start?”
Prioritize code review and pairing to unblock teammates over picking up new items.
Make the board visible and accurate. Use it as the single source of truth for
what the team is working on.
Identify and address recurring blockers that cause items to stall in progress.
Test architecture, types, and best practices for building confidence in your delivery pipeline.
A reliable test suite is essential for continuous delivery. This page describes the test
architecture that gives your pipeline the confidence to deploy any change - even when
dependencies outside your control are unavailable. The child pages cover each test type
in detail.
Beyond the Test Pyramid
The test pyramid - many unit tests at the base, fewer integration tests in the middle, a handful
of end-to-end tests at the top - has been the dominant mental model for test strategy since Mike
Cohn introduced it. The core insight is sound: push testing as low as possible. Lower-level
tests are faster, more deterministic, and cheaper to maintain. Higher-level tests are slower,
more brittle, and more expensive.
But as a prescriptive model, the pyramid is overly simplistic. Teams that treat it as a rigid
ratio end up in unproductive debates about whether they have “too many” integration tests or “not
enough” unit tests. The shape of your test distribution matters far less than whether your tests,
taken together, give you the confidence to deploy.
What actually matters
The pyramid’s principle - write tests with different granularity - remains correct. But for
CD, the question is not “do we have the right pyramid shape?” The question is:
Can our pipeline determine that a change is safe to deploy without depending on any system we
do not control?
This reframes the testing conversation. Instead of counting tests by type and trying to match a
diagram, you design a test architecture where:
Fast, deterministic tests catch the vast majority of defects and run on every commit.
These tests use test doubles for anything outside
the team’s control. They give you a reliable go/no-go signal in minutes.
Contract tests verify that your test doubles still match reality. They run asynchronously
and catch drift between your assumptions and the real world - without blocking your pipeline.
A small number of non-deterministic tests validate that the fully integrated system works.
These run post-deployment and provide monitoring, not gating.
This structure means your pipeline can confidently say “yes, deploy this” even if a downstream
API is having an outage, a third-party service is slow, or a partner team hasn’t deployed their
latest changes yet. Your ability to deliver is decoupled from the reliability of systems you do
not own.
The anti-pattern: the ice cream cone
Most teams that struggle with CD have an inverted test distribution - too many slow, expensive
end-to-end tests and too few fast, focused tests.
The ice cream cone makes CD impossible. Manual testing gates block every release. End-to-end tests
take hours, fail randomly, and depend on external systems being healthy. The pipeline cannot give
a fast, reliable answer about deployability, so deployments become high-ceremony events.
Test Architecture
A test architecture is the deliberate structure of how different test types work together across
your pipeline to give you deployment confidence. Each layer has a specific role, and the layers
reinforce each other.
Layer 4 - End-to-end tests: verify complete user journeys through the fully integrated
system. Not deterministic. Role: monitoring, not gating - runs post-deployment. Layers 1-3
(unit, integration, and functional tests) are the deterministic gates summarized in the
table below.
Static Analysis runs alongside layers 1-3, catching code quality, security, and
style issues without executing the code. Test Doubles are used throughout
layers 1-3 to isolate external dependencies.
How the layers work together
Test layers by pipeline stage
Pipeline stage    Test layer            Deterministic?   Blocks deploy?
────────────────────────────────────────────────────────────────────────
On every commit   Unit tests            Yes              Yes
                  Integration tests     Yes              Yes
                  Functional tests      Yes              Yes
Asynchronous      Contract tests        No               No (triggers review)
Post-deployment   E2E smoke tests       No               Triggers rollback if critical
                  Synthetic monitoring  No               Triggers alerts
The critical insight: everything that blocks deployment is deterministic and under your
control. Everything that involves external systems runs asynchronously or post-deployment. This
is what gives you the independence to deploy any time, regardless of the state of the world
around you.
Pre-merge vs post-merge
The table above maps to two distinct phases of your pipeline, each with different goals and
constraints.
Pre-merge (before code lands on trunk): Run unit, integration, and functional tests. These
must all be deterministic and fast. Target: under 10 minutes total. This is the quality gate that
every change must pass. If pre-merge tests are slow, developers batch up changes or skip local
runs, both of which undermine continuous integration.
Post-merge (after code lands on trunk, before or after deployment): Re-run the full
deterministic suite against the integrated trunk to catch merge-order interactions. Run contract
tests, E2E smoke tests, and synthetic monitoring. Target: under 30 minutes for the full
post-merge cycle.
Why re-run pre-merge tests post-merge? Two changes can each pass pre-merge independently but
conflict when combined on trunk. The post-merge run catches these integration effects. If a
post-merge failure occurs, the team fixes it immediately - trunk must always be releasable.
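A hypothetical example: one change renames a shared helper function while another change,
branched before the rename, adds a call to the old name. Each passes its own pre-merge run,
but the combined trunk no longer builds; the post-merge run is what surfaces the conflict.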
Testing Matrix
Use this reference to decide what type of test to write and where it runs in your pipeline.
Do
Run tests on every commit. If tests do not run automatically, they will be skipped.
Keep the deterministic suite under 10 minutes. If it is slower, developers will stop
running it locally.
Fix broken tests immediately. A broken test is equivalent to a broken build.
Delete tests that do not provide value. A test that never fails and tests trivial behavior
is maintenance cost with no benefit.
Test behavior, not implementation. Use a
black box approach - verify what the code
does, not how it does it. As Ham Vocke advises: “if I enter values x and y, will the
result be z?” - not the sequence of internal calls that produce z. Avoid
white box testing that asserts on internals.
Use test doubles for external dependencies. Your deterministic tests should run without
network access to external systems.
Validate test doubles with contract tests. Test doubles that drift from reality give false
confidence.
Treat test code as production code. Give it the same care, review, and refactoring
attention.
Run automated accessibility checks on every commit. WCAG compliance scans are fast,
deterministic, and catch violations that are invisible to sighted developers. Treat them
like security scans: automate the detectable rules and reserve manual review for
subjective judgment.
Do Not
Do not tolerate flaky tests. Quarantine or delete them immediately.
Do not gate your pipeline on non-deterministic tests. E2E and contract test failures
should trigger review or alerts, not block deployment.
Do not couple your deployment to external system availability. If a third-party API being
down prevents you from deploying, your test architecture has a critical gap.
Do not write tests after the fact as a checkbox exercise. Tests written without
understanding the behavior they verify add noise, not value.
Do not test private methods directly. Test the public interface; private methods are tested
indirectly.
Do not share mutable state between tests. Each test should set up and tear down its own
state.
Do not use sleep/wait for timing-dependent tests. Use explicit waits, polling, or
event-driven assertions.
Do not require a running database or external service for unit tests. That makes them
integration tests - which is fine, but categorize them correctly.
Fast, deterministic tests that verify a unit of behavior through its public interface, asserting on what the code does rather than how it works.
Definition
A unit test is a deterministic test that exercises a unit of behavior (a single
meaningful action or decision your code makes) and verifies that the observable outcome is
correct. The “unit” is not a function, method, or class. It is a behavior: given these inputs,
the system produces this result. A single behavior may involve one function or several
collaborating objects. What matters is that the test treats the code as a
black box and asserts only on what it produces,
not on how it produces it.
All external dependencies are replaced with test doubles so the test runs
quickly and produces the same result every time.
White box testing (asserting on internal method
calls, call order, or private state) creates change-detector tests that break during routine
refactoring without catching real defects. Prefer testing through the public interface (methods,
APIs, exported functions) and asserting on return values, state changes visible to consumers,
or observable side effects.
The purpose of unit tests is to:
Verify that a unit of behavior produces the correct observable outcome.
Cover high-complexity logic where many input permutations exist, such as business rules, calculations, and state transitions.
Keep cyclomatic complexity visible and manageable through good separation of concerns.
When to Use
During development: run the relevant subset of unit tests continuously while writing
code. TDD (Red-Green-Refactor) is the most effective workflow.
On every commit: use pre-commit hooks or watch-mode test runners so broken tests never
reach the remote repository.
In CI: execute the full unit test suite on every pull request and on the trunk after
merge to verify nothing was missed locally.
Unit tests are the right choice when the behavior under test can be exercised without network
access, file system access, or database connections. If you need any of those, you likely need
an integration test or a functional test instead.
Characteristics
Property       Value
Speed          Milliseconds per test
Determinism    Always deterministic
Scope          A single unit of behavior
Dependencies   All replaced with test doubles
Network        None
Database       None
Breaks build   Yes
Examples
A JavaScript unit test verifying a pure utility function:
JavaScript unit test for castArray utility
// castArray.test.js
describe("castArray", () => {
  it("should wrap non-array items in an array", () => {
    expect(castArray(1)).toEqual([1]);
    expect(castArray("a")).toEqual(["a"]);
    expect(castArray({ a: 1 })).toEqual([{ a: 1 }]);
  });

  it("should return array values by reference", () => {
    const array = [1];
    expect(castArray(array)).toBe(array);
  });

  it("should return an empty array when no arguments are given", () => {
    expect(castArray()).toEqual([]);
  });
});
A Java unit test using Mockito to isolate the system under test:
Java unit test with Mockito stub isolating the controller
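A minimal sketch, with hypothetical UserController and UserService classes, showing the
collaborator stubbed out so the assertion targets only the observable result:
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.jupiter.api.Test;

class UserControllerTest {

    @Test
    void returnsGreetingForKnownUser() {
        // Stub the collaborator so the test stays fast and deterministic
        UserService userService = mock(UserService.class);
        when(userService.findName("u-42")).thenReturn("Ada");

        UserController controller = new UserController(userService);

        // Assert on the observable outcome, not on internal calls
        assertEquals("Hello, Ada", controller.greet("u-42"));
    }
}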
Anti-Patterns
White box testing: asserting on internal
state, call order, or private method behavior rather than observable output. These
change-detector tests break during refactoring without catching real defects. Test through
the public interface instead.
Testing private methods: private implementations are meant to change. They are
exercised indirectly through the behavior they support. Test the public interface instead.
No assertions: a test that runs code without asserting anything provides false
confidence. Lint rules can catch this automatically.
Disabling or skipping tests: skipped tests erode confidence over time. Fix or remove
them.
Confusing “unit” with “function”: a unit of behavior may span multiple collaborating
objects. Forcing one-test-per-function creates brittle tests that mirror the implementation
structure rather than verifying meaningful outcomes.
Ice cream cone testing: relying primarily on slow E2E tests while neglecting fast unit
tests inverts the test pyramid and slows feedback.
Chasing coverage numbers: gaming coverage metrics (e.g., running code paths without
meaningful assertions) creates a false sense of confidence. Focus on behavior coverage
instead.
Connection to CD Pipeline
Unit tests occupy the base of the test pyramid. They run in the earliest stages of the
CD pipeline and provide the fastest feedback loop:
Local development: watch mode reruns tests on every save.
Pre-commit: hooks run the suite before code reaches version control.
PR verification: CI runs the full suite and blocks merge on failure.
Trunk verification: CI reruns tests on the merged HEAD to catch integration issues.
Because unit tests are fast and deterministic, they should always break the build on failure.
A healthy CD pipeline depends on a large, reliable suite of
black box unit tests that verify behavior
rather than implementation, giving developers the confidence to refactor freely and ship
small changes frequently.
Deterministic tests that verify how units interact together or with external system boundaries using test doubles for non-deterministic dependencies.
Definition
An integration test is a deterministic test that verifies how the unit under test interacts
with other units without directly accessing external sub-systems. It may validate multiple
units working together (sometimes called a “sociable unit test”) or the portion of the code
that interfaces with an external network dependency while using a test double to represent
that dependency.
For clarity: an “integration test” is not a test that broadly integrates multiple
sub-systems. That is an end-to-end test.
When to Use
Integration tests provide the best balance of speed, confidence, and cost. Use them when:
You need to verify that multiple units collaborate correctly (for example, a service
calling a repository that calls a data mapper).
You need to validate the interface layer to an external system (HTTP client, message
producer, database query) while keeping the external system replaced by a test double.
You want to confirm that a refactoring did not break behavior. Integration tests that
avoid testing implementation details survive refactors without modification.
You are building a front-end component that composes child components and needs to verify
the assembled behavior from the user’s perspective.
If the test requires a live network call to a system outside localhost, it is either a
contract test or an E2E test.
Characteristics
Property       Value
Speed          Milliseconds to low seconds
Determinism    Always deterministic
Scope          Multiple units or a unit plus its boundary
Dependencies   External systems replaced with test doubles
Network        Localhost only
Database       Localhost / in-memory only
Breaks build   Yes
Examples
A JavaScript integration test verifying that a connector returns structured data:
Integration test - connector returning structured data
describe("retrieving Hygieia data",()=>{it("should return counts of merged pull requests per day",async()=>{const result =await hygieiaConnector.getResultsByDay(
hygieiaConfigs.integrationFrequencyRoute,
testTeam,
startDate,
endDate
);expect(result.status).toEqual(200);expect(result.data).toBeInstanceOf(Array);expect(result.data[0]).toHaveProperty("value");expect(result.data[0]).toHaveProperty("dateStr");});it("should return an empty array if the team does not exist",async()=>{const result =await hygieiaConnector.getResultsByDay(
hygieiaConfigs.integrationFrequencyRoute,0,
startDate,
endDate
);expect(result.data).toEqual([]);});});
Subcategories
Service integration tests validate how the system under test responds to information
from an external service. Use virtual services or static mocks; pair with
contract tests to keep the doubles current.
Database integration tests validate query logic against a controlled data store. Prefer
in-memory databases, isolated DB instances, or personalized datasets over shared live data.
Front-end integration tests render the component tree and interact with it the way a
user would. Follow the accessibility order of operations for element selection: visible text
and labels first, ARIA roles second, test IDs only as a last resort.
Anti-Patterns
Peeking behind the curtain: using tools that expose component internals (e.g.,
Enzyme’s instance() or state()) instead of testing from the user’s perspective.
Mocking too aggressively: replacing every collaborator turns an integration test into a
unit test and removes the value of testing real interactions. Only mock what is necessary to
maintain determinism.
Testing implementation details: asserting on internal state, private methods, or call
counts rather than observable output.
Introducing a test user: creating an artificial actor that would never exist in
production. Write tests from the perspective of a real end-user or API consumer.
Tolerating flaky tests: non-deterministic integration tests erode trust. Fix or remove
them immediately.
Duplicating E2E scope: if the test integrates multiple deployed sub-systems with live
network calls, it belongs in the E2E category, not here.
Connection to CD Pipeline
Integration tests form the largest portion of a healthy test suite (the “trophy” or the
middle of the pyramid). They run alongside unit tests in the earliest CI stages:
Local development: run in watch mode or before committing.
PR verification: CI executes the full suite; failures block merge.
Trunk verification: CI reruns on the merged HEAD.
Because they are deterministic and fast, integration tests should always break the build.
A team whose refactors break many tests likely has too few integration tests and too many
fine-grained unit tests. As Kent C. Dodds advises: “Write tests, not too many, mostly
integration.”
Deterministic tests that verify all modules of a sub-system work together from the actor’s perspective, using test doubles for external dependencies.
Definition
A functional test is a deterministic test that verifies all modules of a sub-system are
working together. It introduces an actor (typically a user interacting with the UI or a
consumer calling an API) and validates the ingress and egress of that actor within the
system boundary. External sub-systems are replaced with test doubles to
keep the test deterministic.
Functional tests cover broad-spectrum behavior: UI interactions, presentation logic, and
business logic flowing through the full sub-system. They differ from
end-to-end tests in that side effects are mocked and never cross boundaries
outside the system’s control.
Functional tests are sometimes called component tests. Martin Fowler calls them
sociable unit tests
to distinguish them from solitary unit tests that stub all collaborators: a sociable test
allows real collaborators within the sub-system boundary while still replacing external
dependencies with test doubles.
When to Use
You need to verify a complete user-facing feature from input to output within a single
deployable unit (e.g., a service or a front-end application).
You want to test how the UI, business logic, and data layers interact without depending
on live external services.
You need to simulate realistic user workflows (filling in forms, navigating pages,
submitting API requests) while keeping the test fast and repeatable.
You are validating acceptance criteria for a user story and want a test that maps
directly to the specified behavior.
You need to verify keyboard navigation, focus management, and screen reader
announcements as part of feature verification. Accessibility behavior is user-facing
behavior and belongs in functional tests.
If the test needs to reach a live external dependency, it is an E2E test. If it
tests a single unit in isolation, it is a unit test.
Examples
A functional test for a REST API using an in-process server and mocked downstream services:
REST API functional test - order creation with mocked inventory service
describe("POST /orders",()=>{it("should create an order and return 201",async()=>{// Arrange: mock the inventory service responsenock("https://inventory.internal").get("/stock/item-42").reply(200,{available:true,quantity:10});// Act: send a request through the full application stackconst response =awaitrequest(app).post("/orders").send({itemId:"item-42",quantity:2});// Assert: verify the user-facing responseexpect(response.status).toBe(201);expect(response.body.orderId).toBeDefined();expect(response.body.status).toBe("confirmed");});it("should return 409 when inventory is insufficient",async()=>{nock("https://inventory.internal").get("/stock/item-42").reply(200,{available:true,quantity:0});const response =awaitrequest(app).post("/orders").send({itemId:"item-42",quantity:2});expect(response.status).toBe(409);expect(response.body.error).toMatch(/insufficient/i);});});
A front-end functional test exercising a login flow with a mocked auth service:
Front-end functional test - login flow with mocked auth service
describe("Login page",()=>{it("should redirect to the dashboard after successful login",async()=>{
mockAuthService.login.mockResolvedValue({token:"abc123"});render(<App />);await userEvent.type(screen.getByLabelText("Email"),"ada@example.com");await userEvent.type(screen.getByLabelText("Password"),"s3cret");await userEvent.click(screen.getByRole("button",{name:"Sign in"}));expect(await screen.findByText("Dashboard")).toBeInTheDocument();});});
Accessibility Verification
Functional tests already exercise the UI from the actor’s perspective, making them the
natural place to verify that interactions work for all users. Accessibility assertions
fit alongside existing functional assertions rather than in a separate test suite.
A functional test verifying keyboard-only interaction and running axe-core assertions
against the rendered page:
Accessibility functional test - keyboard navigation and axe-core WCAG assertions
import { axe, toHaveNoViolations } from "jest-axe";

expect.extend(toHaveNoViolations);

describe("Checkout flow", () => {
  it("should be completable using only the keyboard", async () => {
    render(<CheckoutPage />);

    // Navigate to the first form field using Tab
    await userEvent.tab();
    expect(screen.getByLabelText("Card number")).toHaveFocus();

    // Fill in the form using keyboard only
    await userEvent.type(screen.getByLabelText("Card number"), "4111111111111111");
    await userEvent.tab();
    await userEvent.type(screen.getByLabelText("Expiry"), "12/27");
    await userEvent.tab();

    // Submit with Enter
    await userEvent.keyboard("{Enter}");
    expect(await screen.findByText("Order confirmed")).toBeInTheDocument();

    // Verify no accessibility violations in the final state
    const results = await axe(document.body);
    expect(results).toHaveNoViolations();
  });
});
Anti-Patterns
Using live external services: this makes the test non-deterministic and slow. Use test
doubles for anything outside the sub-system boundary.
Testing through the database: sharing a live database between tests introduces ordering
dependencies and flakiness. Use in-memory databases or mocked data layers.
Ignoring the actor’s perspective: functional tests should interact with the system the
way a user or consumer would. Reaching into internal APIs or bypassing the UI defeats the
purpose.
Duplicating unit test coverage: functional tests should focus on feature-level behavior
and happy/critical paths, not every edge case. Leave permutation testing to unit tests.
Slow test setup: if spinning up the sub-system takes too long, invest in faster
bootstrapping (in-memory stores, lazy initialization) rather than skipping functional tests.
Deferring accessibility testing to a manual audit phase: accessibility defects caught
in a quarterly audit are weeks or months old. Automated WCAG checks in functional tests
catch violations on every commit, just like any other regression.
Connection to CD Pipeline
Functional tests run after unit and integration tests in the pipeline, typically as part of
the same CI stage:
Pre-commit: functional tests run locally before every commit. Because they are
deterministic and scoped to the sub-system, they are fast enough to give immediate
feedback without slowing the development loop.
PR verification: functional tests run in CI against the sub-system in isolation,
giving confidence that the feature works before merge.
Trunk verification: the same tests run on the merged HEAD to catch conflicts.
Pre-deployment gate: functional tests can serve as the final deterministic gate before
a build artifact is promoted to a staging environment.
Because functional tests are deterministic, they should break the build on failure.
They are more expensive than unit and integration tests, so teams should focus on
happy-path and critical-path scenarios while keeping the total count manageable.
Non-deterministic tests that validate the entire software system along with its integration with external interfaces and production-like scenarios.
Definition
End-to-end (E2E) tests validate the entire software system, including its integration with
external interfaces. They exercise complete production-like scenarios using real (or
production-like) data and environments to simulate real-time settings. No test doubles are
used. The test hits live services, databases, and third-party integrations just as a real
user would.
Because they depend on external systems, E2E tests are typically non-deterministic: they
can fail for reasons unrelated to code correctness, such as network instability or
third-party outages.
When to Use
E2E tests should be the least-used test type due to their high cost in execution time and
maintenance. Use them for:
Happy-path validation of critical business flows (e.g., user signup, checkout, payment
processing).
Smoke testing a deployed environment to verify that key integrations are functioning.
Cross-team workflows that span multiple sub-systems and cannot be tested any other way.
Do not use E2E tests to cover edge cases, error handling, or input validation. Those
scenarios belong in unit, integration, or
functional tests.
Vertical vs. Horizontal E2E Tests
Vertical E2E tests target features under the control of a single team:
Favoriting an item and verifying it persists across refresh.
Creating a saved list and adding items to it.
Horizontal E2E tests span multiple teams:
Navigating from the homepage through search, item detail, cart, and checkout.
Horizontal tests are significantly more complex and fragile. Due to their large failure
surface area, they are not suitable for blocking release pipelines.
Characteristics
Property       Value
Speed          Seconds to minutes per test
Determinism    Typically non-deterministic
Scope          Full system including external integrations
Dependencies   Real services, databases, third-party APIs
Network        Full network access
Database       Live databases
Breaks build   Generally no (see guidance below)
Examples
A vertical E2E test verifying user lookup through a live web interface:
Vertical E2E test - user lookup via live web interface
@Test
public void verifyValidUserLookup() throws Exception {
    // Act -- interact with the live application
    homePage.getUserData("validUserId");
    waitForElement(By.xpath("//span[@id='name']"));

    // Assert -- verify real data returned from the live backend
    assertEquals("Ada Lovelace", homePage.getName());
    assertEquals("Engineering", homePage.getOrgName());
    assertEquals("Grace Hopper", homePage.getManagerName());
}
A browser-based E2E test using a tool like Playwright:
Browser-based E2E test - add to cart and checkout with Playwright
test("user can add an item to cart and check out",async({ page })=>{await page.goto("https://staging.example.com");await page.getByRole("link",{name:"Running Shoes"}).click();await page.getByRole("button",{name:"Add to Cart"}).click();await page.getByRole("link",{name:"Cart"}).click();awaitexpect(page.getByText("Running Shoes")).toBeVisible();await page.getByRole("button",{name:"Checkout"}).click();awaitexpect(page.getByText("Order confirmed")).toBeVisible();});
Anti-Patterns
Using E2E tests as the primary safety net: this is the “ice cream cone” anti-pattern.
E2E tests are slow and fragile; the majority of your confidence should come from unit and
integration tests.
Blocking the pipeline with horizontal E2E tests: these tests span too many teams and
failure surfaces. Run them asynchronously and review failures out of band.
Ignoring flaky failures: E2E tests often fail for environmental reasons. Track the
frequency and root cause of failures. If a test is not providing signal, fix it or remove
it.
Testing edge cases in E2E: exhaustive input validation and error-path testing should
happen in cheaper, faster test types.
Not capturing failure context: E2E failures are expensive to debug. Capture
screenshots, network logs, and video recordings automatically on failure.
Connection to CD Pipeline
E2E tests run in the later stages of the delivery pipeline, after the build artifact has
passed all deterministic tests and has been deployed to a staging or pre-production
environment:
Post-deployment smoke tests: a small, fast suite of vertical E2E tests verifies that
the deployment succeeded and critical paths work.
Scheduled regression suites: broader E2E suites (including horizontal tests) run on a
schedule rather than on every commit.
Production monitoring: customer experience alarms (synthetic monitoring) are a form of
continuous E2E testing that runs in production.
Because E2E tests are non-deterministic, they should not break the build in most cases. A
team may choose to gate on a small set of highly reliable vertical E2E tests, but must invest
in reducing false positives to make this valuable. CD pipelines should be optimized for rapid
recovery of production issues rather than attempting to prevent all defects with slow,
fragile E2E gates.
Non-deterministic tests that validate test doubles by verifying API contract format against live external systems.
Definition
A contract test validates that the test doubles used in
integration tests still accurately represent the real external system.
Contract tests run against the live external sub-system and exercise the portion of the
code that interfaces with it. Because they depend on live services, contract tests are
non-deterministic and should not break the build. Instead, failures should trigger a
review to determine whether the contract has changed and the test doubles need updating.
A contract test validates contract format, not specific data. It verifies that response
structures, field names, types, and status codes match expectations, not that particular
values are returned.
Contract tests have two perspectives:
Provider: the team that owns the API verifies that all changes are backwards compatible
(unless a new API version is introduced). Every build should validate the provider contract.
Consumer: the team that depends on the API verifies that they can still consume the
properties they need, following
Postel’s Law: “Be conservative in
what you do, be liberal in what you accept from others.”
When to Use
You have integration tests that use test doubles (mocks, stubs, recorded
responses) to represent external services, and you need assurance those doubles remain
accurate.
You consume a third-party or cross-team API that may change without notice.
You provide an API to other teams and want to ensure that your changes do not break their
expectations (consumer-driven contracts).
You are adopting contract-driven development, where contracts are defined during design
so that provider and consumer teams can work in parallel using shared mocks and fakes.
Characteristics
Property       Value
Speed          Seconds (depends on network latency)
Determinism    Non-deterministic (hits live services)
Scope          Interface boundary between two systems
Dependencies   Live external sub-system
Network        Yes (calls the real dependency)
Database       Depends on the provider
Breaks build   No (failures trigger review, not build failure)
Examples
A provider contract test verifying that an API response matches the expected schema:
Provider contract test - schema validation
describe("GET /users/:id contract",()=>{it("should return a response matching the user schema",async()=>{const response =awaitfetch("https://api.partner.com/users/1");const body =await response.json();// Validate structure, not specific dataexpect(response.status).toBe(200);expect(body).toHaveProperty("id");expect(typeof body.id).toBe("number");expect(body).toHaveProperty("name");expect(typeof body.name).toBe("string");expect(body).toHaveProperty("email");expect(typeof body.email).toBe("string");});});
A consumer-driven contract test using Pact:
Consumer-driven contract test with Pact
describe("Order Service - Inventory Provider Contract",()=>{it("should receive stock availability in the expected format",async()=>{// Define the expected interactionawait provider.addInteraction({state:"item-42 is in stock",uponReceiving:"a request for item-42 stock",withRequest:{method:"GET",path:"/stock/item-42"},willRespondWith:{status:200,body:{available: Matchers.boolean(true),quantity: Matchers.integer(10),},},});// Exercise the consumer code against the mock providerconst result =await inventoryClient.checkStock("item-42");expect(result.available).toBe(true);});});
Anti-Patterns
Using contract tests to validate business logic: contract tests verify structure and
format, not behavior. Business logic belongs in functional tests.
Breaking the build on contract test failure: because these tests hit live systems,
failures may be caused by network issues or temporary outages, not actual contract changes.
Treat failures as signals to investigate.
Neglecting to update test doubles: when a contract test fails because the upstream API
changed, the test doubles in your integration tests must be updated to match. Ignoring
failures defeats the purpose.
Running contract tests too infrequently: the frequency should be proportional to the
volatility of the interface. Highly active APIs need more frequent contract validation.
Testing specific data values: asserting that name equals "Alice" makes the test
brittle. Assert on types, required fields, and response codes instead.
Connection to CD Pipeline
Contract tests run asynchronously from the main CI build, typically on a schedule:
Provider side: provider contract tests (schema validation, response code checks) are
often implemented as deterministic unit tests and run on every commit as part of the
provider’s CI pipeline.
Consumer side: consumer contract tests run on a schedule (e.g., hourly or daily)
against the live provider. Failures are reviewed and may trigger updates to test doubles
or conversations between teams.
Consumer-driven contracts: when using tools like Pact, the consumer publishes
contract expectations and the provider runs them continuously. Both teams communicate when
contracts break.
Contract tests are the bridge that keeps your fast, deterministic integration test suite
honest. Without them, test doubles can silently drift from reality, and your integration
tests provide false confidence.
Code analysis tools that evaluate non-running code for security vulnerabilities, complexity, and best practice violations.
Definition
Static analysis (also called static testing) evaluates non-running code against rules for
known good practices. Unlike other test types that execute code and observe behavior, static
analysis inspects source code, configuration files, and dependency manifests to detect
problems before the code ever runs.
Static analysis serves several key purposes:
Catches errors that would otherwise surface at runtime.
Warns of excessive complexity that degrades the ability to change code safely.
Identifies security vulnerabilities and coding patterns that provide attack vectors.
Enforces coding standards by removing subjective style debates from code reviews.
Alerts to dependency issues such as outdated packages, known CVEs, license
incompatibilities, or supply-chain compromises.
When to Use
Static analysis should run continuously, at every stage where feedback is possible:
In the IDE: real-time feedback as developers type, via editor plugins and language
server integrations.
On save: format-on-save and lint-on-save catch issues immediately.
Pre-commit: hooks prevent problematic code from entering version control.
In CI: the full suite of static checks runs on every PR and on the trunk after merge,
verifying that earlier local checks were not bypassed.
Static analysis is always applicable. Every project, regardless of language or platform,
benefits from linting, formatting, and dependency scanning.
Characteristics
Property       Value
Speed          Seconds (typically the fastest test category)
Determinism    Always deterministic
Scope          Entire codebase (source, config, dependencies)
Dependencies   None (analyzes code at rest)
Network        None (except dependency scanners)
Database       None
Breaks build   Yes
Examples
Linting
A .eslintrc.json configuration enforcing test quality rules:
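A minimal sketch using eslint-plugin-jest rules - the specific rules are a suggested
starting point, not a complete configuration:
{
  "plugins": ["jest"],
  "rules": {
    "jest/expect-expect": "error",
    "jest/no-disabled-tests": "warn",
    "jest/no-focused-tests": "error",
    "jest/no-identical-title": "error"
  }
}
These rules catch assertion-free tests, skipped tests, and accidentally committed .only
filters.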
Type Checking
Statically typed languages catch type mismatches at compile time, eliminating entire classes
of runtime errors. Java, for example, rejects incompatible argument types before the code runs:
Java type checking example
public static double calculateTotal(double price, int quantity) {
    return price * quantity;
}

// Compiler error: incompatible types: String cannot be converted to double
calculateTotal("19.99", 3);
Dependency Scanning
Tools like npm audit, Snyk, or Dependabot scan for known vulnerabilities:
npm audit output example
$ npm audit
found 2 vulnerabilities (1 moderate, 1 high)
moderate: Prototype Pollution in lodash <4.17.21
high: Remote Code Execution in log4j <2.17.1
Common static analysis categories:
Category                Purpose
Complexity analysis     Flags overly deep or long code blocks that breed defects
Type checking           Prevents type-related bugs, replacing some unit tests
Security scanning       Detects known vulnerabilities and dangerous coding patterns
Dependency scanning     Checks for outdated, hijacked, or insecurely licensed deps
Accessibility linting   Detects missing alt text, ARIA violations, contrast failures, semantic HTML issues
Accessibility Linting
Accessibility linting catches deterministic WCAG violations the same way a security scanner
catches known vulnerability patterns. Automated checks cover structural issues (missing alt
text, invalid ARIA attributes, insufficient contrast ratios, broken heading hierarchy) while
manual review covers subjective aspects like whether alt text is actually meaningful.
A .pa11yci configuration running WCAG 2.1 AA checks against rendered pages:
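A minimal sketch - the URLs are placeholders for your own rendered pages:
{
  "defaults": {
    "standard": "WCAG2AA",
    "timeout": 30000
  },
  "urls": [
    "http://localhost:3000/",
    "http://localhost:3000/login",
    "http://localhost:3000/checkout"
  ]
}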
An axe-core unit test asserting that a rendered component has no accessibility violations:
axe-core accessibility test with jest-axe
import { axe, toHaveNoViolations } from "jest-axe";

expect.extend(toHaveNoViolations);

it("should have no accessibility violations", async () => {
  const { container } = render(<LoginForm />);
  const results = await axe(container);
  expect(results).toHaveNoViolations();
});
Anti-Patterns
Disabling rules instead of fixing code: suppressing linter warnings or ignoring
security findings erodes the value of static analysis over time.
Not customizing rules: default rulesets are a starting point. Write custom rules for
patterns that come up repeatedly in code reviews.
Running static analysis only in CI: by the time CI reports a formatting error, the
developer has context-switched. IDE plugins and pre-commit hooks provide immediate feedback.
Ignoring dependency vulnerabilities: known CVEs in dependencies are a direct attack
vector. Treat high-severity findings as build-breaking.
Treating static analysis as optional: static checks should be mandatory and enforced.
If developers can bypass them, they will.
Connection to CD Pipeline
Static analysis is the first gate in the CD pipeline, providing the fastest feedback:
IDE / local development: plugins run in real time as code is written.
Pre-commit: hooks run linters, formatters, and accessibility checks on changed
components, blocking commits that violate rules.
PR verification: CI runs the full static analysis suite (linting, type checking,
security scanning, dependency auditing, accessibility linting) and blocks merge on
failure.
Trunk verification: the same checks re-run on the merged HEAD to catch anything
missed.
Scheduled scans: dependency and security scanners run on a schedule to catch newly
disclosed vulnerabilities in existing dependencies.
Because static analysis requires no running code, no test environment, and no external
dependencies, it is the cheapest and fastest form of quality verification. A mature CD
pipeline treats static analysis failures the same as test failures: they break the build.
Patterns for isolating dependencies in tests: stubs, mocks, fakes, spies, and dummies.
Definition
Test doubles are stand-in objects that replace real production dependencies during testing.
The term comes from the film industry’s “stunt double.” Just as a stunt double replaces an
actor for dangerous scenes, a test double replaces a costly or non-deterministic dependency
to make tests fast, isolated, and reliable.
Test doubles allow you to:
Remove non-determinism by replacing network calls, databases, and file systems with
predictable substitutes.
Control test conditions by forcing specific states, error conditions, or edge cases that
would be difficult to reproduce with real dependencies.
Increase speed by eliminating slow I/O operations.
Isolate the system under test so that failures point directly to the code being tested,
not to an external dependency.
Types of Test Doubles
Dummy: Passed around but never actually used; fills parameter lists. Example use case: a required logger parameter in a constructor.
Stub: Provides canned answers to calls made during the test and does not respond to anything outside what is programmed. Example use case: returning a fixed user object from a repository.
Spy: A stub that also records information about how it was called (arguments, call count, order). Example use case: verifying that an analytics event was sent once.
Mock: Pre-programmed with expectations about which calls will be made; verification happens on the mock itself. Example use case: asserting that sendEmail() was called with specific arguments.
Fake: Has a working implementation, but takes shortcuts not suitable for production. Example use case: an in-memory database replacing PostgreSQL.
Choosing the Right Double
Use stubs when you need to supply data but do not care how it was requested.
Use spies when you need to verify call arguments or call count.
Use mocks when the interaction itself is the primary thing being verified.
Use fakes when you need realistic behavior but cannot use the real system.
Use dummies when a parameter is required by the interface but irrelevant to the test.
When to Use
Test doubles are used in every layer of deterministic testing:
Unit tests: nearly all dependencies are replaced with test doubles to
achieve full isolation.
Integration tests: external sub-systems (APIs, databases, message
queues) are replaced, but internal collaborators remain real.
Functional tests: dependencies that cross the sub-system boundary
are replaced to maintain determinism.
Test doubles should be used less in later pipeline stages.
End-to-end tests use no test doubles by design.
Examples
A JavaScript stub providing a canned response:
JavaScript stub returning a fixed user
// Stub: return a fixed user regardless of input
const userRepository = {
  findById: jest.fn().mockResolvedValue({
    id: "u1",
    name: "Ada Lovelace",
    email: "ada@example.com",
  }),
};

const user = await userService.getUser("u1");
expect(user.name).toBe("Ada Lovelace");
A Java spy verifying interaction:
Java spy verifying call count with Mockito
@Test
public void shouldCallUserServiceExactlyOnce() {
    UserService spyService = Mockito.spy(userService);
    doReturn(testUser).when(spyService).getUserInfo("u123");

    User result = spyService.getUserInfo("u123");

    verify(spyService, times(1)).getUserInfo("u123");
    assertEquals("Ada", result.getName());
}
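A fake, by contrast, has real working behavior without production infrastructure. A minimal
JavaScript sketch, assuming a hypothetical user repository interface:
// Fake: an in-memory repository with real save/find behavior, but no persistence
class InMemoryUserRepository {
  constructor() {
    this.users = new Map();
  }
  async save(user) {
    this.users.set(user.id, user);
    return user;
  }
  async findById(id) {
    return this.users.get(id) ?? null;
  }
}

const repo = new InMemoryUserRepository();
await repo.save({ id: "u1", name: "Ada Lovelace" });
const found = await repo.findById("u1");
expect(found.name).toBe("Ada Lovelace");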
Anti-Patterns
Mocking what you do not own: wrapping a third-party API in a thin adapter and mocking
the adapter is safer than mocking the third-party API directly. Direct mocks couple your
tests to the library’s implementation.
Over-mocking: replacing every collaborator with a mock turns the test into a mirror of
the implementation. Tests become brittle and break on every refactor. Only mock what is
necessary to maintain determinism.
Not validating test doubles: if the real dependency changes its contract, your test
doubles silently drift. Use contract tests to keep doubles honest.
Complex mock setup: if setting up mocks requires dozens of lines, the system under test
may have too many dependencies. Consider refactoring the production code rather than adding
more mocks.
Using mocks to test implementation details: asserting on the exact sequence and count
of internal method calls creates change-detector tests. Prefer asserting on observable
output.
Connection to CD Pipeline
Test doubles are a foundational technique that enables the fast, deterministic tests required
for continuous delivery:
Early pipeline stages (static analysis, unit tests, integration tests) rely heavily on
test doubles to stay fast and deterministic. This is where the majority of defects are
caught.
Later pipeline stages (E2E tests, production monitoring) use fewer or no test doubles,
trading speed for realism.
Contract tests run asynchronously to validate that test doubles still match reality,
closing the gap between the deterministic and non-deterministic stages of the pipeline.
The guiding principle from Justin Searls applies: “Don’t poke too many holes in reality.”
Use test doubles when you must, but prefer real implementations when they are fast and
deterministic.
Why test suite speed matters for developer effectiveness and how cognitive limits set the targets.
Why speed has a threshold
The 10-minute CI target and the preference for sub-second unit tests are not arbitrary. They come
from how human cognition handles interrupted work. When a developer makes a change and waits for
test results, three things determine whether that feedback is useful: whether the developer still
holds the mental model of the change, whether they can act on the result immediately, and whether
the wait is short enough that they do not context-switch to something else.
Research on task interruption and working memory consistently shows that context switches are
expensive. Gloria Mark’s research at UC Irvine found that it takes an average of 23 minutes for
a person to fully regain deep focus after being interrupted during a task, and that interrupted
tasks take twice as long and contain twice as many errors as uninterrupted
ones [1]. If the test suite itself takes 30 minutes, the total cost of a single
feedback cycle approaches an hour - and most of that time is spent re-loading context, not fixing
code.
The cognitive breakpoints
Jakob Nielsen’s foundational research on response times identified three thresholds that govern
how users perceive and respond to system delays: 0.1 seconds (feels instantaneous), 1 second
(noticeable but flow is maintained), and 10 seconds (attention limit - the user starts thinking
about other things) [2]. These thresholds, rooted in human perceptual and
cognitive limits, apply directly to developer tooling.
Different feedback speeds produce fundamentally different developer behaviors:
Under 1 second: Feels instantaneous. The developer stays in flow, treating the test result as part of the editing cycle [2]. Working memory is fully intact; the change and the result are experienced as a single action.
1 to 10 seconds: The developer waits. Attention may drift briefly but returns without effort. Working memory is intact, and the developer can act on the result immediately.
10 seconds to 2 minutes: The developer starts to feel the wait. They may glance at another window or check a message, but they do not start a new task. Working memory begins to decay; the developer can still recover context quickly, but each additional second increases the chance of distraction [2].
2 to 10 minutes: The developer context-switches. They check email, review a PR, or start thinking about a different problem. When the result arrives, they must actively return to the original task. Working memory is partially lost, and rebuilding context takes several minutes depending on the complexity of the change [1].
Over 10 minutes: The developer fully disengages and starts a different task. The test result arrives as an interruption to whatever they are now doing. Working memory of the original change is gone, and rebuilding it takes upward of 23 minutes [1]. Investigating a failure means re-reading code they wrote an hour ago.
The 10-minute CI target exists because it is the boundary between “developer waits and acts on
the result” and “developer starts something else and pays a full context-switch penalty.” Below
10 minutes, feedback is actionable. Above 10 minutes, feedback becomes an interruption. DORA’s
research on continuous integration reinforces this: tests should complete in under 10 minutes to
support the fast feedback loops that high-performing teams depend on [3].
What this means for test architecture
These cognitive breakpoints should drive how you structure your test suite:
Local development (under 1 second). Unit tests for the code you are actively changing should
run in watch mode, re-executing on every save. At this speed, TDD becomes natural - the test
result is part of the writing process, not a separate step. This is where you test complex logic
with many permutations.
Pre-push verification (under 2 minutes). The full unit test suite and the functional tests
for the component you changed should complete before you push. At this speed, the developer
stays engaged and acts on failures immediately. This is where you catch regressions.
CI pipeline (under 10 minutes). The full deterministic suite - all unit tests, all functional
tests, all integration tests - should complete within 10 minutes of commit. At this speed, the
developer has not yet fully disengaged from the change. If CI fails, they can investigate while
the code is still fresh.
Post-deploy verification (minutes to hours). E2E smoke tests and contract test validation
run after deployment. These are non-deterministic, slower, and less frequent. Failures at this
level trigger investigation, not immediate developer action.
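One way to make these tiers concrete in a JavaScript project is to give each tier its own command.
The script names and directory layout below are hypothetical and assume Jest:
{
  "scripts": {
    "test:watch": "jest --watch test/unit",
    "test:unit": "jest test/unit",
    "test:pre-push": "jest test/unit test/functional",
    "test:ci": "jest test/unit test/functional test/integration"
  }
}
The watch script covers the inner editing loop, the pre-push script covers the under-2-minute tier,
and the CI script runs the full deterministic suite within the 10-minute budget.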
When a test suite exceeds 10 minutes, the solution is not to accept slower feedback. It is to
redesign the suite: replace E2E tests with functional tests using test doubles, parallelize test
execution, and move non-deterministic tests out of the gating path.
Impact on application architecture
Test feedback speed is not just a testing concern - it puts pressure on how you design your
systems. A monolithic application with a single test suite that takes 40 minutes to run forces
every developer to pay the full context-switch penalty on every change, regardless of which
module they touched.
Breaking a system into smaller, independently testable components is often motivated as much by
test speed as by deployment independence. When a component has its own focused test suite that
runs in under 2 minutes, the developer working on that component gets fast, relevant feedback.
They do not wait for tests in unrelated modules to finish.
This creates a virtuous cycle: smaller components with clear boundaries produce faster test
suites, which enable more frequent integration, which encourages smaller changes, which are
easier to test. Conversely, a tightly coupled monolith produces a slow, tangled test suite that
discourages frequent integration, which leads to larger changes, which are harder to test and
more likely to fail.
Architecture decisions that improve test feedback speed include:
Clear component boundaries with well-defined interfaces, so each component can be tested
in isolation with test doubles for its dependencies.
Separating business logic from infrastructure so that core rules can be unit tested in
milliseconds without databases, queues, or network calls (see the sketch after this list).
Independently deployable services with their own test suites, so a change to one service
does not require running the entire system’s tests.
Avoiding shared mutable state between components, which forces integration tests and
introduces non-determinism.
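As a sketch of the business-logic separation point above, assuming a hypothetical discount rule in
JavaScript: the core calculation takes plain data and returns plain data, so it can be unit tested
in milliseconds, while a thin handler deals with persistence.
// Pure business logic: no I/O, unit-testable in milliseconds
function applyDiscount(order, customer) {
  const rate = customer.loyaltyYears >= 5 ? 0.1 : 0;
  return { ...order, total: order.total * (1 - rate) };
}

// Infrastructure-facing wrapper: loads data, delegates to the pure function, persists the result
async function applyDiscountHandler(orderId, customerId, { orderRepo, customerRepo }) {
  const order = await orderRepo.findById(orderId);
  const customer = await customerRepo.findById(customerId);
  const discounted = applyDiscount(order, customer);
  await orderRepo.save(discounted);
  return discounted;
}

// The unit test needs no database, queue, or network call
expect(applyDiscount({ total: 100 }, { loyaltyYears: 7 }).total).toBe(90);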
If your test suite is slow and you cannot make it faster by optimizing test execution alone, the
architecture is telling you something. A system that is hard to test quickly is also hard to
change safely - and both problems have the same root cause.
The compounding cost of slow feedback
Slow feedback does not just waste time - it changes behavior. When the suite takes 40 minutes,
developers adapt:
They batch changes to avoid running the suite more than necessary, creating larger and riskier
commits.
They stop running tests locally because the wait is unacceptable during active development.
They push to CI and context-switch, paying the full rebuild penalty on every cycle.
They rerun failures instead of investigating, because re-reading the code they wrote an hour
ago is expensive enough that “maybe it was flaky” feels like a reasonable bet.
Each of these behaviors degrades quality independently. Together, they make continuous integration
impossible. A team that cannot get feedback on a change within 10 minutes cannot sustain the
practice of integrating changes multiple times per day [4].
Sources
Further reading
Build Duration - Measuring and improving CI pipeline speed
Nicole Forsgren, Jez Humble, and Gene Kim, Accelerate: The Science of Lean Software and DevOps, IT Revolution Press, 2018.
5.9 - Testing Glossary
Definitions for testing terms as they are used on this site.
These definitions reflect how this site uses each term. They are not universal definitions -
other communities may use the same words differently.
Black Box Testing
A testing approach where the test exercises code through its public interface and asserts
only on observable outputs - return values, state changes visible to consumers, or side
effects such as messages sent. The test has no knowledge of internal implementation details.
Black box tests are resilient to refactoring because they verify what the code does, not
how it does it. Contrast with white box testing.
Functional Acceptance Tests
Automated tests that verify a system behaves as specified. Functional acceptance tests
exercise end-to-end user workflows in a
production-like environment and confirm the implementation
matches the acceptance criteria. They answer “did we build what was specified?” rather than
“does the code work?” They do not validate whether the specification itself is correct -
only real user feedback can confirm we are building the right thing.
Test-Driven Development (TDD)
A development practice where tests are written before the production code that makes them
pass. TDD supports CD by ensuring high test coverage, driving simple design, and producing
a fast, reliable test suite. TDD feeds into the testing fundamentals
required in Phase 1.
Virtual Service
A test double that simulates a real external service over the network, responding to HTTP
requests with pre-configured or recorded responses. Unlike in-process stubs or mocks, a
virtual service runs as a standalone process and is accessed via real network calls, making
it suitable for functional testing and integration testing where your application needs to
make actual HTTP requests against a dependency. Tools such as WireMock, Mountebank, and
Hoverfly can create virtual services from recorded traffic or API specifications. See
Test Doubles.
White Box Testing
A testing approach where the test has knowledge of and asserts on internal implementation
details - specific methods called, call order, internal state, or code paths taken. White
box tests verify how the code works, not what it produces. These tests are fragile
because any refactoring of internals breaks them, even when behavior is unchanged. Avoid
white box testing in unit tests; prefer black box testing that asserts
on observable outcomes.
The practices that drive software delivery performance, as identified by DORA research.
The DevOps Research and Assessment (DORA) research program has identified practices that
predict high software delivery performance. These practices are not tools or technologies.
They are cultural conditions and behaviors that enable teams to deliver software quickly,
reliably, and sustainably.
This page organizes the DORA recommended practices by their relevance to each migration phase. Use it
as a reference to understand which practices you are building at each stage of your journey
and which ones to focus on next.
Using This Table
“Primary” means the phase where the practice is the main focus of improvement work.
“Ongoing” means the practice is relevant in every phase and should be continuously
nurtured. “Started” or “Expanded” means the practice is introduced or deepened in that
phase. No entry means the practice is not a primary concern in that phase, though it may
still be relevant.
These practices directly support the mechanics of getting software from commit to production.
They are the primary focus of Phases 1 and 2 of the migration.
Version Control
All production artifacts (application code, test code, infrastructure configuration,
deployment scripts, and database schemas) are stored in version control and can be
reproduced from a single source of truth.
Migration relevance: This is a prerequisite for Phase 1. If any part of your delivery
process depends on files stored on a specific person’s machine or a shared drive, address that
before beginning the migration.
Continuous Integration
Developers integrate their work to trunk at least daily. Each integration triggers an
automated build and test process. Broken builds are fixed within minutes.
Trunk-Based Development
Developers work in small batches and merge to trunk at least daily. Branches, if used, are
short-lived (less than one day). There are no long-lived feature branches.
Test Automation
A comprehensive suite of automated tests provides confidence that the software is deployable.
Tests are reliable, fast, and maintained as carefully as production code.
Test Data Management
Test data is managed in a way that allows automated tests to run independently, repeatably,
and without relying on shared mutable state. Tests can create and clean up their own data.
Shift Left on Security
Security is integrated into the development process rather than added as a gate at the end.
Automated security checks run in the pipeline. Security requirements are part of the
definition of deployable.
Migration relevance: Integrated during Phase 2: Pipeline Architecture
as automated quality gates rather than manual review steps.
Architecture Practices
These practices address the structural characteristics of your system that enable or prevent
independent, frequent deployment.
Loosely Coupled Architecture
Teams can deploy their services independently without coordinating with other teams. Changes
to one service do not require changes to other services. APIs have well-defined contracts.
These practices address how work is planned, prioritized, and delivered.
Customer Feedback
Product decisions are informed by direct feedback from customers. Teams can observe how
features are used in production and adjust accordingly.
Migration relevance: Becomes fully enabled in Phase 4: Deliver on Demand
when every change reaches production quickly enough for real customer feedback to inform
the next change.
Value Stream Visibility
The team has a clear view of the entire delivery process from request to production, including
wait times, handoffs, and rework loops.
Migration relevance: Phase 0: Value Stream Mapping.
This is the first activity in the migration because it informs every decision that follows.
Working in Small Batches
Work is broken down into small increments that can be completed, tested, and deployed
independently. Each increment delivers measurable value or validated learning.
Work-in-Process (WIP) Limits
Teams have explicit WIP limits that constrain the number of items in any stage of the delivery
process. WIP limits are enforced and respected.
Migration relevance: Phase 3: Limiting WIP. Reducing WIP
is one of the most effective ways to improve lead time and delivery predictability.
Visual Management
The state of all work is visible to the entire team through dashboards, boards, or other
visual tools. Anyone can see what is in progress, what is blocked, and what has been deployed.
Migration relevance: All phases. Visual management supports the identification of
constraints in Phase 0 and the enforcement of WIP limits in Phase 3.
Monitoring and Observability
Teams have access to production metrics, logs, and traces that allow them to understand system
behavior, detect issues, and diagnose problems quickly.
Proactive Notification
Teams are alerted to problems before customers are affected. Monitoring thresholds and
anomaly detection trigger notifications that enable rapid response.
Migration relevance: Becomes critical in Phase 4 when deployments are continuous and
automated. Proactive notification is what makes continuous deployment safe.
Collaboration Among Teams
Development, operations, security, and product teams work together rather than in silos.
Handoffs are minimized. Shared responsibility replaces blame.
Migration relevance: All phases, but especially Phase 2: Pipeline
where the pipeline must encode the quality criteria from all disciplines (security, testing,
operations) into automated gates.
Practices Relevant in Every Phase
The following practices are not tied to a specific migration phase. They are conditions
that support every phase and should be cultivated continuously throughout the migration.
Empowered Teams. Teams choose their own tools, technologies, and approaches within
organizational guardrails. Teams that cannot make local decisions about their pipeline, test
strategy, or deployment approach will be unable to iterate quickly enough to make progress.
Team Experimentation. Teams can try new ideas, tools, and approaches without requiring
lengthy approval. Failed experiments are treated as learning, not waste. The migration itself
is an experiment that requires psychological safety and organizational support.
Generative Culture. Following Ron Westrum’s typology, a generative culture is characterized
by high cooperation, shared risk, and focus on the mission. Teams in pathological or
bureaucratic cultures will struggle with every phase because practices like TBD and CI require
trust and psychological safety.
Learning Culture. The organization invests in learning. Teams have time for experimentation,
training, and knowledge sharing. The CD migration is a learning journey that requires time and
space to learn new practices, make mistakes, and improve.
Job Satisfaction. Team members find their work meaningful and have the autonomy and resources
to do it well. The migration should improve job satisfaction by reducing
toil and giving teams faster feedback. If the migration is experienced as a
burden, something is wrong with the approach.
Transformational Leadership. Leaders support the migration with vision, resources, and
organizational air cover. Without leadership support, the migration will stall when it
encounters the first organizational blocker.
Visual guide showing how CD practices depend on and build upon each other.
The full interactive dependency tree is at
practices.minimumcd.org. This page summarizes the key
dependency chains and how they map to the migration phases in this guide.
Continuous delivery is not a single practice you adopt. It is a system of interdependent
practices where each one supports and enables others. Understanding these dependencies helps
you plan your migration in the right order, addressing foundational practices before building
on them.
Using the Tree to Diagnose Problems
When something in your delivery process is not working, trace it through the dependency tree
to find the root cause.
Deployments keep failing.
Look at what feeds CD in the tree. Is your pipeline deterministic? Are you using immutable
artifacts? Is your application config externalized? The failure is likely in one of the
pipeline practices.
CI builds are constantly broken.
Look at what feeds CI. Are developers actually practicing TBD (integrating daily)? Is the test
suite reliable, or is it full of flaky tests? Is the build automated end-to-end? The broken
builds are a symptom of a problem in the development practices layer.
You cannot reduce batch size.
Look at what feeds small batches. Is work being decomposed into vertical slices? Are feature
flags available so partial work can be deployed safely? Is the architecture decoupled enough
to allow independent deployment? The batch size problem originates in one of these upstream
practices.
Every feature requires cross-team coordination to deploy.
Look at team structure. Are teams organized around domains they can deliver independently, or
around technical layers that force handoffs for every feature? If deploying a feature requires
the frontend team, backend team, and DBA team to coordinate a release window, the team
structure is preventing independent delivery. No amount of pipeline automation fixes this.
The team boundaries need to change.
Migration Tip
When you encounter a problem, resist the urge to fix the symptom. Use the
dependency tree to trace the problem to its root cause.
Fixing the symptom (for example, adding more manual testing to catch deployment failures) will
not solve the underlying issue and often adds toil that makes things worse. Fix the dependency
that is broken, and the downstream problem resolves itself.
Mapping to Migration Phases
The dependency tree directly informs the sequencing of migration phases: each dependency layer
maps to the migration phase where it becomes the primary focus, and the phases are ordered to
respect those dependencies. Development practices (BDD, trunk-based development) are
cross-cutting and support every phase, while team structure should be addressed early because
it constrains architecture and work decomposition.
Understanding the Dependency Model
How Dependencies Work
CD sits at the top of the tree. It depends directly on many practices, each of which has its own
dependencies. When practice A depends on practice B, it means B is a prerequisite or enabler
for A. You cannot reliably adopt A without B in place.
For example, continuous delivery depends directly on:
Continuous testing, automated database changes, and test environments
Integration: continuous integration
Environment: automated environment provisioning, monitoring and alerting
Organizational: cross-functional product teams, developer-driven support, prioritized features
Development: ATDD and modular system design
Each of these has its own dependency chain. The application pipeline alone depends on automated
testing, deployment automation, automated artifact versioning, and quality gates. Automated
testing in turn depends on build automation. Build automation depends on version control and
dependency management. The chain runs deep.
Key Dependency Chains
BDD enables testing enables CI enables CD
Behavior-Driven Development produces clear, testable acceptance criteria. Those criteria drive
functional testing and acceptance test-driven development. A comprehensive, fast test suite
enables Continuous Integration with confidence. And CI is the foundational prerequisite for CD.
If your team skips BDD, stories are ambiguous. If stories are ambiguous, tests are incomplete
or wrong. If tests are unreliable, CI is unreliable. And if CI is unreliable, CD is impossible.
Trunk-Based Development enables CI
CI requires that all developers integrate to a shared trunk at least once per day. If your team
uses long-lived feature branches, you are not doing CI regardless of how often your build server
runs. TBD is not optional for CD. It is a prerequisite.
Cross-functional teams enable component ownership enables modular systems
How teams are organized determines what they can deliver independently. A team organized around a
domain (owning the services, data, and interfaces for that domain) can decompose work into
vertical slices within their boundary and deploy without
coordinating with other teams. A team organized around a technical layer (the “frontend team,”
the “DBA team”) cannot. Every feature requires handoffs across layer teams, and deployment
requires coordinating all of them.
Conway’s Law makes this structural: the system’s architecture will mirror the team structure.
In the dependency tree, cross-functional product teams enable component ownership, which enables
the modular system design that CD requires.
Version control is the root of everything
Nearly every automation practice traces back to version control. Build automation, configuration
management, infrastructure automation, and component ownership all depend on it. If your version
control practices are weak (infrequent commits, poor branching discipline, configuration stored
outside version control), the entire tree above it is compromised.
8 - Glossary
Key terms and definitions used throughout this guide.
This glossary defines the terms used across every phase of the CD migration guide. Where a term
has a specific meaning within a migration phase, the relevant phase is noted.
A
Acceptance Criteria
Concrete expectations for a change, expressed as observable outcomes that can be used as fitness
functions - executed as deterministic tests or evaluated by review agents. In
ACD, acceptance criteria include a done definition (what
“done” looks like from an observer’s perspective) and an evaluation design (test cases with
known-good outputs). They constrain the agent: comprehensive criteria prevent incorrect code
from passing, while shallow criteria allow code that passes tests but violates intent. See
Acceptance Criteria.
The application of continuous delivery in environments where software changes are proposed by
AI agents. ACD extends CD with additional constraints, delivery artifacts, and pipeline
enforcement to reliably constrain agent autonomy without slowing delivery. ACD assumes the
team already practices continuous delivery. Without that foundation, the agentic extensions
have nothing to extend. See Agentic Continuous Delivery.
An AI system that uses tool calls in a loop to complete multi-step tasks autonomously. Unlike a
single LLM call that returns a response, an agent can invoke tools, observe results, and decide
what to do next until a goal is met or a stopping condition is reached. An agent’s behavior is
shaped by its prompt - the complete set of instructions, context, and constraints it receives at
the start of a session. See Agentic CD.
A packaged, versioned output of a build process (e.g., a container image, JAR file, or binary).
In a CD pipeline, artifacts are built once and promoted through environments without
modification. See Immutable Artifacts.
The set of delivery measurements taken before beginning a migration, used as the benchmark
against which improvement is tracked. See Phase 0 - Baseline Metrics.
The amount of change included in a single deployment. Smaller batches reduce risk, simplify
debugging, and shorten feedback loops. Reducing batch size is a core focus of
Phase 3 - Small Batches.
A collaboration practice where developers, testers, and product representatives define expected
behavior using structured examples before code is written. BDD produces executable
specifications that serve as both documentation and automated tests. BDD supports effective
work decomposition by forcing clarity about what a
story actually means before development begins.
A deployment strategy that maintains two identical production environments. New code is deployed
to the inactive environment, verified, and then traffic is switched. See
Progressive Rollout.
The elapsed time between creating a branch and merging it to trunk. CD requires branch lifetimes
measured in hours, not days or weeks. Long branch lifetimes are a symptom of poor work
decomposition or slow code review. See Trunk-Based Development.
A deployment strategy where a new version is rolled out to a small subset of users or servers
before full rollout. If the canary shows no issues, the deployment proceeds to 100%. See
Progressive Rollout.
The practice of ensuring that every change to the codebase is always in a deployable state and
can be released to production at any time through a fully automated pipeline. Continuous
delivery does not require that every change is deployed automatically, but it requires that
every change could be deployed automatically. This is the primary goal of this migration
guide.
The percentage of deployments to production that result in a degraded service and require
remediation (e.g., rollback, hotfix, or patch). One of the four DORA metrics. See
Metrics - Change Fail Rate.
The practice of integrating code changes to a shared trunk at least once per day, where each
integration is verified by an automated build and test suite. CI is a prerequisite for CD, not
a synonym. A team that runs automated builds on feature branches but merges weekly is not doing
CI. See Build Automation.
In the Theory of Constraints, the single factor most limiting the throughput of a system.
During a CD migration, your job is to find and fix constraints in order of impact. See
Identify Constraints.
The complete assembled input provided to an LLM for a single inference call. Context includes
the system prompt, tool definitions, any reference material or documents, conversation history,
and the current user request. “Context” and “prompt” are often used interchangeably; the
distinction is that “context” emphasizes what information is present, while “prompt” emphasizes
the structured input as a whole. Context is measured in tokens. As context grows, costs
and latency increase and performance can degrade when relevant information is buried far from
the end of the context. See Tokenomics.
The maximum number of tokens an LLM can process in a single call, spanning both input and
output. The context window is a hard limit; exceeding it requires truncation or a redesigned
approach. Large context windows (150,000+ tokens) create false confidence - more available
space does not mean better performance, and filling the window increases both latency and cost.
See Tokenomics.
An extension of continuous delivery where every change that passes the automated pipeline is
deployed to production without manual intervention. Continuous delivery ensures every change
can be deployed; continuous deployment ensures every change is deployed. See
Phase 4 - Deliver on Demand.
A change that has passed all automated quality gates defined by the team and is ready for
production deployment. The definition of deployable is codified in the pipeline, not decided
by a person at deployment time. See Deployable Definition.
The elapsed time from the first commit on a change to that change being deployable. This
measures the efficiency of your development and pipeline process, excluding upstream wait times.
See Metrics - Development Cycle Time.
Dependency
Code, service, or resource whose behavior is not defined in the current module. Dependencies
vary by location and ownership:
Internal dependency - code in another file or module within the same repository, or in
another repository your team controls. Internal dependencies share your release cycle and
your team can change them directly.
External dependency - a third-party library, external API, or
managed service outside your team’s direct control.
The distinction matters for testing. Internal dependencies are part of your own codebase and
should be exercised through real code paths in tests. Replacing them with
test doubles couples your tests to
implementation details and causes rippling failures during routine refactoring. Reserve test
doubles for external dependencies and runtime connections where real
invocation is impractical or non-deterministic.
The four key metrics identified by the DORA (DevOps Research and Assessment) research program
as predictive of software delivery performance: deployment frequency, lead time for changes,
change failure rate, and mean time to restore service. See DORA Recommended Practices.
A dependency on code or services outside your team’s direct control. External
dependencies include third-party libraries, public APIs, managed cloud services, and any
resource whose release cycle and availability your team cannot influence.
External dependencies are the primary case where test doubles add value. A test double for an
external API verifies your integration logic without relying on network availability or
third-party rate limits. By contrast, mocking internal code - another class in the same
repository or a module your team owns - creates fragile tests that break whenever the internal
implementation changes, even when the behavior is correct.
When evaluating whether to mock something, ask: “Can my team change this code and release it
in our pipeline?” If yes, it is an internal dependency and should be tested through real code
paths. If no, it is an external dependency and a test double is appropriate.
A team organized around user-facing features or customer journeys rather than owned product
subdomains. A feature team is cross-functional - it contains the skills to deliver a feature
end-to-end - but it does not own a stable domain of code. Multiple feature teams may modify
the same components, with no single team accountable for quality or consistency within them.
In practice: feature teams must re-orient on code they do not continuously maintain each time
a feature requires it; quality agreements cannot be enforced within the team because other
teams also modify the same code; and while feature teams appear to minimize inter-team
dependencies, they produce the opposite - everyone who can change a component is effectively
on the same large, loosely communicating team. Feature teams are structurally equivalent to
long-lived project teams.
A mechanism that allows code to be deployed to production with new functionality disabled,
then selectively enabled for specific users, percentages of traffic, or environments. Feature
flags decouple deployment from release. See Feature Flags.
The ratio of active work time to total elapsed time in a delivery process. A flow efficiency of
15% means that for every hour of actual work, roughly 5.7 hours are spent waiting. Value stream
mapping reveals your flow efficiency. See Value Stream Mapping.
A team that owns every layer of a user-facing capability - UI, API, and data store - and whose
public interface is designed for human users. A vertical slice for a full-stack product team
delivers one observable behavior from the user interface through to the database. The slice is
done when a user can observe the behavior through that interface. Contrast with
subdomain product team.
A branching model created by Vincent Driessen in 2010 that uses multiple long-lived branches
(main, develop, release/*, hotfix/*, feature/*) with specific merge rules and
directions. GitFlow was designed for infrequent, scheduled releases and is fundamentally
incompatible with continuous delivery because it defers integration, creates multiple paths
to production, and adds merge complexity. See the
TBD Migration Guide
for a step-by-step path from GitFlow to trunk-based development.
A dependency that must be resolved before work can proceed. In delivery, hard dependencies
include things like waiting for another team’s API, a shared database migration, or an
infrastructure provisioning request. Hard dependencies create queues and increase lead time.
Eliminating hard dependencies is a focus of
Architecture Decoupling.
A sprint dedicated to stabilizing and fixing defects before a release. The existence of
hardening sprints is a strong signal that quality is not being built in during regular
development. Teams practicing CD do not need hardening sprints because every commit is
deployable. See Testing Fundamentals.
An approach that frames every change as an experiment with a predicted outcome. Instead of
specifying a change as a requirement to implement, the team states a hypothesis: “We believe
[this change] will produce [this outcome] because [this reason].” After deployment, the team
validates whether the predicted outcome occurred. Changes that confirm the hypothesis build
confidence. Changes that refute it produce learning that informs the next hypothesis. This
creates a feedback loop where every deployed change generates a signal, whether it “succeeds”
or not. See Hypothesis-Driven Development
for the full lifecycle and
Agent Delivery Contract
for how hypotheses integrate with specification artifacts.
A build artifact that is never modified after creation. The same artifact that is tested in the
pipeline is the exact artifact that is deployed to production. Configuration differences between
environments are handled externally. See Immutable Artifacts.
The elapsed time from when a production incident is detected to when service is restored. One
of the four DORA metrics. Teams practicing CD have short MTTR because deployments are small,
rollback is automated, and the cause of failure is easy to identify. See
Metrics - Mean Time to Repair.
A single deployable application whose codebase is organized into well-defined modules with
explicit boundaries. Each module encapsulates a bounded domain and communicates with other
modules through defined interfaces, not by reaching into shared database tables or calling
internal methods directly. The application deploys as one unit, but its internal structure
allows teams to reason about, test, and change one module independently. See
Pipeline Reference Architecture and
Premature Microservices.
An agent that coordinates the work of other agents. The orchestrator receives a high-level goal,
breaks it into sub-tasks, delegates those sub-tasks to specialized sub-agents, and
assembles the results. Because orchestrators accumulate context across multiple steps, context
hygiene at agent boundaries is especially important - what the orchestrator passes to each
sub-agent is a cost and quality decision. See Tokenomics.
A test or staging environment that matches production in configuration, infrastructure, and
data characteristics. Testing in environments that differ from production is a common source
of deployment failures. See Production-Like Environments.
The complete structured input provided to an LLM for a single inference call. A prompt is not
a one- or two-sentence question. In production agentic systems, a prompt is a composed document
that typically includes: a system instruction block (role definition, constraints, output format
requirements), tool definitions, relevant context (documents, code, conversation history), and
the user’s request or task description. The system instruction block and tool definitions alone
can consume thousands of tokens before any user content is included. Understanding what a prompt
actually contains is a prerequisite for effective tokenomics. See
Tokenomics.
A server-side optimization where stable portions of a prompt are stored and reused across
repeated calls instead of being processed as new input each time. Effective caching requires
placing static content (system instructions, tool definitions, reference documents) at the
beginning of the prompt so cache hits cover the maximum token span. Dynamic content (user
request, current state) goes at the end where it does not invalidate the cached prefix.
See Tokenomics.
The ability to revert a production deployment to a previous known-good state. CD requires
automated rollback that takes minutes, not hours. See Rollback.
A dependency that can be worked around or deferred. Unlike hard dependencies, soft dependencies
do not block work but may influence sequencing or design decisions. Feature flags can turn many
hard dependencies into soft dependencies by allowing incomplete integrations to be deployed in
a disabled state.
Story Points
A relative estimation unit used by some teams to forecast effort. Story points are frequently
misused as a productivity metric, which creates perverse incentives to inflate estimates and
discourages the small work decomposition that CD requires. If your organization uses story
points as a velocity target, see Metrics-Driven Improvement.
A specialized agent invoked by an orchestrator to perform a specific,
well-defined task. Sub-agents should receive only the context relevant to their task - not
the orchestrator’s full accumulated context. Passing oversized context bundles to sub-agents
is a common source of unnecessary token consumption and can degrade performance by burying
relevant information. See Tokenomics.
A team that owns a bounded subdomain within a larger distributed system - full-stack within
their service (API, business logic, data store) but not directly user-facing. Their public
interface is designed for machines: other services or teams consume it through a defined API
contract. A vertical slice for a subdomain product team delivers one observable behavior
through that contract. The slice is done when the API satisfies the agreed behavior for its
service consumers. Contrast with full-stack product team.
The static, stable instruction block placed at the start of a prompt that establishes
the model’s role, constraints, output format requirements, and tool definitions. Unlike the
user-provided portion of the prompt, system prompts change rarely between calls and are the
primary candidates for prompt caching. Keeping the system prompt concise and
placing it first maximizes cache effectiveness and reduces per-call input costs.
See Tokenomics.
A source-control branching model where all developers integrate to a single shared branch
(trunk) at least once per day. Short-lived feature branches (less than a day) are acceptable.
Long-lived feature branches are not. TBD is a prerequisite for CI, which is in turn a
prerequisite for CD. See Trunk-Based Development.
The billing and capacity unit for LLMs. A token is roughly three-quarters of an English word.
All LLM costs, latency, and context limits are measured in tokens, not words, sentences, or
API calls. Input and output tokens are priced and counted separately. Output tokens typically
cost 2-5x more than input tokens because generating tokens is computationally more expensive
than reading them. Frontier models cost 10-20x more per token than smaller alternatives.
See Tokenomics.
Repetitive, manual work related to maintaining a production service that is automatable, has
no lasting value, and scales linearly with service size. Examples include manual deployments,
manual environment provisioning, and manual test execution. Eliminating toil is a primary
benefit of building a CD pipeline.
Work that arrives outside the planned backlog - production incidents, urgent bug fixes,
ad hoc requests. High levels of unplanned work indicate systemic quality or operational
problems. Teams with high change failure rates generate their own unplanned work through
failed deployments. Reducing unplanned work is a natural outcome of improving change failure
rate through CD practices.
A visual representation of every step required to deliver a change from request to production,
showing process time, wait time, and percent complete and accurate at each step. The
foundational tool for Phase 0 - Assess.
A user story that delivers a thin slice of functionality across all layers of the system
(UI, API, database, etc.) rather than a horizontal slice that implements one layer completely.
Vertical slices are independently deployable and testable, which is essential for CD. Vertical
slicing is a core technique in Work Decomposition.
The number of work items that have been started but not yet completed. High WIP increases lead
time, reduces focus, and increases context-switching overhead. Limiting WIP is a key practice
in Phase 3 - Limiting WIP.
An explicit, documented set of team norms covering how work is defined, reviewed, tested, and
deployed. Working agreements create shared expectations and reduce friction. See
Working Agreements.
Frequently asked questions about continuous delivery and this migration guide.
About This Guide
Why does this migration guide exist?
Many teams say they want to adopt continuous delivery but do not know where to start. The CD
landscape is full of tools, frameworks, and advice, but there is no clear, sequenced path from
“we deploy monthly” to “we can deploy any change at any time.” This guide provides that path.
It is built on the MinimumCD definition of continuous delivery and
draws on practices from the Dojo Consortium and the
DORA research. The content is organized as a phased migration journey
from your current state to continuous delivery rather than as a description of what CD looks
like when you are already there.
Who is this guide for?
This guide is for development teams, tech leads, and engineering managers who want to improve
their software delivery practices. It is designed for teams that are currently deploying
infrequently (monthly, quarterly, or less) and want to reach a state where any change can be
deployed to production at any time.
You do not need to be starting from zero. If your team already has CI in place, you can begin
with Phase 2: Pipeline. If you have a pipeline but deploy infrequently, start
with Phase 3: Optimize. Use the Phase 0 assessment to find your
starting point.
Should we adopt this guide as an organization or as a team?
Start with a single team. CD adoption works best when a team can experiment, learn, and iterate
without waiting for organizational consensus. Once one team demonstrates results (shorter lead
times, lower change failure rate, more frequent deployments), other teams will have a concrete
example to follow.
Organizational adoption comes after team adoption, not before. The role of organizational
leadership is to create the conditions for teams to succeed: stable team composition, tool
funding, policy flexibility for deployment processes, and protection from pressure to cut
corners on quality.
How do we use this guide for improvement?
Start with Phase 0: Assess. Map your value stream, measure your current
performance, and identify your top constraints. Then work through the phases in order, focusing
on one constraint at a time.
The guide is not a checklist to complete in sequence. It is a reference that helps you decide
what to work on next. Some teams will spend months in Phase 1 building testing fundamentals.
Others will move quickly to Phase 2 because they already have strong development practices.
Your value stream map and metrics tell you where to invest.
Revisit your assessment periodically. As you improve, new constraints will emerge. The phases
give you a framework for addressing them.
Continuous Delivery Concepts
What is the difference between continuous delivery and continuous deployment?
Continuous delivery means every change to the codebase is always in a deployable state and
can be released to production at any time through a fully automated pipeline. The decision to
deploy may still be made by a human, but the capability to deploy is always present.
Continuous deployment is an extension of continuous delivery where every change that passes
the automated pipeline is deployed to production without manual intervention.
This migration guide takes you through continuous delivery (Phases 0-3) and then to continuous
deployment (Phase 4). Continuous delivery is the prerequisite. You cannot safely automate
deployment decisions until your pipeline reliably determines what is deployable.
Is continuous delivery the same as having a CD pipeline?
No. Many teams have a CD pipeline tool (Jenkins, GitHub Actions, GitLab CI, etc.) but are
not practicing continuous delivery. A pipeline tool is necessary but not sufficient.
Continuous delivery also requires trunk-based development, comprehensive test automation, a
single path to production, immutable artifacts, and the ability to deploy any green build.
If your team has a pipeline but uses long-lived feature branches, deploys only at the end of a
sprint, or requires manual testing before a release, you have a pipeline tool but you are not
practicing continuous delivery. The current-state checklist
in Phase 0 helps you assess the gap.
What does “the pipeline is the only path to production” mean?
It means there is exactly one way for any change to reach production: through the automated
pipeline. No one can SSH into a server and make a change. No one can skip the test suite for
an “urgent” fix. No one can deploy from their local machine.
This constraint is what gives you confidence. If every change in production has been through
the same build, test, and deployment process, you know what is running and how it got there.
If exceptions are allowed, you lose that guarantee, and your ability to reason about production
state degrades.
During your migration, establishing this single path is a key milestone in
Phase 2.
What does “application configuration” mean in the context of CD?
Application configuration refers to values that change between environments but are not part of
the application code: database connection strings, API endpoints, feature flag states, logging
levels, and similar settings.
In a CD pipeline, configuration is externalized. It lives outside the artifact and is injected
at deployment time. This is what makes immutable artifacts
possible. You build the artifact once and deploy it to any environment by providing the
appropriate configuration.
If configuration is embedded in the artifact (for example, hardcoded URLs or environment-specific
config files baked into a container image), you must rebuild the artifact for each environment,
which means the artifact you tested is not the artifact you deploy. This breaks the immutability
guarantee. See Application Config.
What is an “immutable artifact” and why does it matter?
An immutable artifact is a build output (container image, binary, package) that is never
modified after it is created. The exact artifact that passes your test suite is the exact
artifact that is deployed to staging, and then to production. Nothing is recompiled, repackaged,
or patched between environments.
This matters because it eliminates an entire category of deployment failures: “it worked in
staging but not in production” caused by differences in the build. If the same bytes are
deployed everywhere, build-related discrepancies are impossible.
Immutability requires externalizing configuration (see above) and storing artifacts in a
registry or repository. See Immutable Artifacts.
What does “deployable” mean?
A change is deployable when it has passed all automated quality gates defined in the pipeline.
The definition is codified in the pipeline itself, not decided by a person at deployment time.
Typical gates include:
The artifact is built and stored in the artifact registry
Deployment to a production-like environment succeeds
Smoke tests in the production-like environment pass
If any of these gates fail, the change is not deployable. The pipeline makes this determination
automatically and consistently. See Deployable Definition.
What is the difference between deployment and release?
Deployment is the act of putting code into a production environment.
Release is the act of making functionality available to users.
These are different events, and decoupling them is one of the most powerful techniques in CD.
You can deploy code to production without releasing it to users by using
feature flags. The code is running in production, but the new
functionality is disabled. When you are ready, you enable the flag and the feature is released.
This decoupling is important because it separates the technical risk (will the deployment
succeed?) from the business risk (will users like the feature?). You can manage each risk
independently. Deployments become routine technical events. Releases become deliberate business
decisions.
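A minimal sketch of the mechanism, assuming a hypothetical feature flag client in JavaScript:
// Deployed to production, but the new flow is released only where the flag is enabled
async function checkout(cart, user, flags) {
  if (await flags.isEnabled("new-checkout-flow", user)) {
    return newCheckoutFlow(cart, user); // released to the users the flag targets
  }
  return legacyCheckoutFlow(cart, user); // everyone else keeps the current behavior
}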
Migration Questions
How long does the migration take?
It depends on where you start and how much organizational support you have. As a rough guide:
Phase 0 (Assess): 1-2 weeks
Phase 1 (Foundations): 1-6 months, depending on current testing and TBD maturity
Phase 2 (Pipeline): 1-3 months
Phase 3 (Optimize): 2-6 months
Phase 4 (Deliver on Demand): 1-3 months
These ranges assume a single team working on the migration alongside regular delivery work.
The biggest variable is Phase 1: teams with no test automation or TBD practice will spend
longer building foundations than teams that already have these in place.
Do not treat these timelines as commitments. The migration is an iterative improvement process,
not a project with a deadline.
Do we stop delivering features during the migration?
No. The migration is done alongside regular delivery work, not instead of it. Each migration
practice is adopted incrementally: you do not stop the world to rewrite your test suite or
redesign your pipeline.
For example, in Phase 1 you adopt trunk-based development by reducing branch lifetimes
gradually: from two weeks to one week to two days to same-day. You add automated tests
incrementally, starting with the highest-risk code paths. You decompose work into smaller
stories one sprint at a time.
The migration practices themselves improve your delivery speed, so the investment pays off
as you go. Teams that have completed Phase 1 typically report delivering features faster than
before, not slower.
What if our organization requires manual change approval (CAB)?
Many organizations have Change Advisory Board (CAB) processes that require manual approval
before production deployments. This is one of the most common organizational blockers for CD.
The path forward is to replace the manual approval with automated evidence: a mature CD
pipeline provides stronger safety guarantees than a committee meeting, and your DORA metrics
can demonstrate this. Most CAB processes were designed for monthly releases with hundreds of
changes per batch; when you deploy daily with one or two changes, the risk profile is
fundamentally different. See CAB Gates
for a detailed approach to this transition.
What if we have a monolithic architecture?
You can practice continuous delivery with a monolith. CD does not require microservices. Many
of the highest-performing teams in the DORA research deploy monolithic applications multiple
times per day.
What matters is that your architecture supports independent testing and deployment. A
well-structured monolith with a comprehensive test suite and a reliable pipeline can achieve
CD. A poorly structured collection of microservices with shared databases and coordinated
releases cannot.
Architecture decoupling is addressed in Phase 3, but
it is about enabling independent deployment and reducing coordination costs, not about adopting
any particular architectural style.
What if our tests are slow or unreliable?
This is one of the most common starting conditions. A slow or flaky test suite undermines
every CD practice: developers stop trusting the tests, broken builds are ignored, and the
pipeline becomes a bottleneck rather than an enabler. The fix is incremental: quarantine
flaky tests, parallelize execution, rebalance toward fast unit tests, and set a pipeline
time budget (under 10 minutes). See
Testing Fundamentals and the
Testing reference section for detailed guidance.
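As one way to apply the quarantine advice, here is a hedged pytest sketch using a custom marker; the marker name is an assumption, and any marker you register in your pytest configuration works the same way. The main pipeline runs `pytest -m "not quarantine"` so flaky tests cannot block the build, while a separate job keeps running them until they are fixed or deleted.

```python
import pytest

# Hypothetical convention: known-flaky tests carry a "quarantine" marker and are
# excluded from the trunk pipeline; deterministic tests stay in the main suite.
@pytest.mark.quarantine
def test_inventory_sync_retries_on_timeout():
    ...  # intermittently fails; quarantined until stabilized

def test_inventory_sync_happy_path():
    assert True  # fast and deterministic; runs on every commit
```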
Where do I start if I am not sure which phase applies to us?
If you do not have time for a full assessment, ask yourself these questions:
Do all developers integrate to trunk at least daily? If no, start with Phase 1.
Do you have a single automated pipeline that every change goes through? If no, start with Phase 2.
Can you deploy any green build to production on demand? If no, focus on the gap between your current state and Phase 2 completion criteria.
Do you deploy at least weekly? If no, look at Phase 3 for batch size and flow optimization.
Is CD about speed or quality?
Quality. The purpose of the pipeline is to validate that an artifact is production-worthy or
reject it. Do not chase daily deployments without first building confidence in your ability to
detect failure. Move validation as close to the developer as possible: run it on the desktop,
run it again on merge to trunk, run it again when the trunk changes.
Testing is not limited to functional tests. You need to test for security, compliance,
performance, and everything else required in your context. Set error budgets and do not exceed
them. When your error budget is spent, stop shipping features and invest in pipeline
hardening. When something breaks in production, harden the pipeline. When exploratory testing
uncovers an edge case, harden the pipeline. The primary goal is to build efficient and
effective quality gates. Only then can you move quickly.
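As a worked illustration of the error-budget idea, a 99.9% availability target over a 30-day window leaves roughly 43 minutes of budget; the target and window here are illustrative, not prescribed by this guide.

```python
# Illustrative error-budget arithmetic: when the budget is spent, feature work
# pauses and pipeline or production hardening takes priority.
slo = 0.999
window_minutes = 30 * 24 * 60                      # 43,200 minutes in a 30-day window
error_budget_minutes = (1 - slo) * window_minutes
print(f"{error_budget_minutes:.1f} minutes of error budget")  # 43.2
```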
10 - Resources
Books, videos, and further reading on continuous delivery and deployment.
This page collects the books, websites, and videos that inform the practices in this migration
guide. Resources are organized by topic and annotated with which migration phase they are most
relevant to.
Books
Continuous Delivery and Deployment
Modern Software Engineering by Dave Farley
Farley’s broader take on what it means to do software engineering well. Covers the principles
behind CD - iterating toward a goal, getting fast feedback, working in small steps - and
connects them to test-driven development, managing complexity, and designing for testability.
Useful for teams that want to understand the why behind CD practices, not just the how.
Most relevant to: All phases
Continuous Delivery Pipelines by Dave Farley
A practical, focused guide to building CD pipelines. Farley covers pipeline design, testing
strategies, and deployment patterns in a direct, implementation-oriented style. Start here
if you want a concise guide to the pipeline practices in Phase 2.
Continuous Delivery by Jez Humble and David Farley
The foundational text on CD. Published in 2010, it remains the most comprehensive treatment
of the principles and practices that make continuous delivery work. Covers version control
patterns, build automation, testing strategies, deployment pipelines, and release management.
If you read one book before starting your migration, read this one.
Most relevant to: All phases
Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim
Presents the DORA research findings that link technical practices to organizational
performance. Covers the four key metrics (deployment frequency, lead time, change failure
rate, MTTR) and the capabilities that predict high performance. Essential reading for anyone
who needs to make the business case for a CD migration.
Engineering the Digital Transformation by Gary Gruver
Addresses the organizational and leadership challenges of large-scale delivery
transformation. Gruver draws on his experience leading transformations at HP and other large
enterprises. Particularly valuable for leaders sponsoring a migration who need to understand
the change management, communication, and sequencing challenges ahead.
Most relevant to: Organizational leadership across all phases
Release It! by Michael T. Nygard
Covers the design and architecture patterns that make production systems resilient. Topics
include stability patterns (circuit breakers, bulkheads, timeouts), deployment patterns, and
the operational realities of running software at scale. Essential reading before entering
Phase 4, where the team has the capability to deploy any change on demand.
The DevOps Handbook by Gene Kim, Jez Humble, Patrick Debois, and John Willis
A practical companion to The Phoenix Project. Covers the Three Ways (flow, feedback, and
continuous learning) and provides detailed guidance on implementing DevOps practices. Useful
as a reference throughout the migration.
Most relevant to: All phases
The Phoenix Project by Gene Kim, Kevin Behr, and George Spafford
A novel that illustrates DevOps principles through the story of a fictional IT organization
in crisis. Useful for building organizational understanding of why delivery improvement
matters, especially for stakeholders who will not read a technical book.
Most relevant to: Building organizational buy-in during Phase 0
Testing
Growing Object-Oriented Software, Guided by Tests by Steve Freeman and Nat Pryce
The definitive guide to test-driven development in practice. Goes beyond unit testing to
cover acceptance testing, test doubles, and how TDD drives design. Essential reading for
Phase 1 testing fundamentals.
Working Effectively with Legacy Code by Michael Feathers
Practical techniques for adding tests to untested code, breaking dependencies, and
incrementally improving code that was not designed for testability. Indispensable if your
migration starts with a codebase that has little or no automated testing.
User Story Mapping by Jeff Patton
A practical guide to breaking features into deliverable increments using story maps. Patton’s
approach directly supports the vertical slicing discipline required for small batch delivery.
The Principles of Product Development Flow by Donald Reinertsen
A rigorous treatment of flow economics in product development. Covers queue theory, batch
size economics, WIP limits, and the cost of delay. Dense but transformative. Reading this
book will change how you think about every aspect of your delivery process.
Making Work Visible by Dominica DeGrandis
Focuses on identifying and eliminating the “time thieves” that steal productivity: too much
WIP, unknown dependencies, unplanned work, conflicting priorities, and neglected work. A
practical companion to the WIP limiting practices in Phase 3.
Refactoring Databases: Evolutionary Database Design by Scott Ambler and Pramod Sadalage
The definitive guide to managing database schema changes incrementally. Covers expand-contract
migrations, backward-compatible schema changes, and techniques for evolving databases without
downtime. Essential reading for teams whose deployment pipeline includes database changes.
Covers the architectural patterns that enable independent deployment, including service
boundaries, API design, data management, and testing strategies for distributed systems.
Team Topologies by Matthew Skelton and Manuel Pais
Addresses the relationship between team structure and software architecture (Conway’s Law in
practice). Covers team types, interaction modes, and how to evolve team structures to support
fast flow. Valuable for addressing the organizational blockers that surface throughout the
migration.
Most relevant to: Organizational design across all phases
Websites
Defines the minimum set of practices required to claim you are doing continuous delivery.
This migration guide uses the MinimumCD definition as its target state. Start here to
understand what CD actually requires.
A community-maintained collection of CD practices, metrics definitions, and improvement
patterns. Many of the definitions and frameworks in this guide are adapted from the Dojo
Consortium’s work.
The DevOps Research and Assessment site, which publishes the annual State of DevOps report
and provides resources for measuring and improving delivery performance.
The comprehensive reference for trunk-based development patterns. Covers short-lived
feature branches, feature flags, branch by abstraction, and release branching strategies.
Martin Fowler’s site contains authoritative articles on continuous integration, continuous
delivery, microservices, refactoring, and software design. Key articles include
“Continuous Integration” and “Continuous Delivery.”
Videos
Dave Farley’s YouTube channel provides weekly videos covering CD practices, pipeline design,
testing strategies, and software engineering principles. Accessible and practical.
Most relevant to: All phases
“Continuous Delivery” by Jez Humble (various conference talks)
Jez Humble’s conference presentations cover the principles and research behind CD. His talk
“Why Continuous Delivery?” is an excellent introduction for teams and stakeholders who are
new to the concept.
Most relevant to: Building understanding during Phase 0
“Refactoring” and “TDD” talks by Martin Fowler and Kent Beck
Foundational talks on the development practices that support CD. Understanding TDD and
refactoring is essential for Phase 1 testing fundamentals.
“The Smallest Thing That Could Possibly Work” by Bryan Finster
Covers the work decomposition and small batch delivery practices that are central to this
migration guide. Focuses on practical techniques for breaking work into vertical slices.
A concrete walkthrough of a production deployment pipeline in a regulated financial services
environment. Demonstrates that CD practices are compatible with compliance requirements.
An article-length overview of deployment pipeline structure, covering commit stage, acceptance
testing, and release stages. A good companion to the pipeline phase of this guide.
If you are starting your migration and want to read in the most useful order:
Accelerate, to understand the research and build the business case
Continuous Delivery (Humble & Farley), to understand the full picture
Continuous Delivery Pipelines (Farley), for practical pipeline implementation
Working Effectively with Legacy Code, if your codebase lacks tests
The Principles of Product Development Flow, to understand flow optimization
Release It!, before moving to continuous deployment
Migration Tip
You do not need to read all of these before starting your migration. Start with the practices
in Phase 1, read Accelerate for the business case, and refer to the other resources as you
reach the relevant migration phase. The most important thing is to start delivering
improvements, not to finish a reading list.