This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Migrate to CD

A phased approach to adopting continuous delivery, from assessing your current state through full continuous deployment.

1: Phase 0: Assess

1.1: Value Stream Mapping
1.2: Baseline Metrics
1.3: Identify Constraints
1.4: Current State Checklist

2: Phase 1: Foundations

2.1: Trunk-Based Development

2.1.1: TBD Migration Guide

2.2: Testing Fundamentals
2.3: Build Automation
2.4: Work Decomposition
2.5: Code Review
2.6: Working Agreements
2.7: Everything as Code

3: Phase 2: Pipeline

3.1: Single Path to Production
3.2: Deterministic Pipeline
3.3: Deployable Definition
3.4: Immutable Artifacts
3.5: Application Configuration
3.6: Production-Like Environments
3.7: Pipeline Architecture
3.8: Rollback

4: Phase 3: Optimize

4.1: Small Batches
4.2: Feature Flags
4.3: Limiting Work in Progress
4.4: Metrics-Driven Improvement
4.5: Retrospectives
4.6: Architecture Decoupling
4.7: Team Alignment to Code
4.8: Hypothesis-Driven Development

5: Phase 4: Deliver on Demand

5.1: Deploy on Demand
5.2: Progressive Rollout
5.3: Experience Reports

6: Migrating Brownfield to CD

6.1: Document Your Current Process
6.2: Replacing Manual Validations with Automation

7: CD for Greenfield Projects

Choose the path that matches your situation. Brownfield teams migrating existing systems and greenfield teams building from scratch each have a dedicated guide. The phases below provide the roadmap both approaches follow.

The Phases

Phase	Focus	Key Question
0 - Assess	Understand your current state	How far are we from CD?
1 - Foundations	Daily integration, testing, small batches	Can we integrate safely every day?
2 - Pipeline	Automated path from commit to production	Can we deploy any commit automatically?
3 - Optimize	Reduce batch size, limit WIP, measure	Can we deliver small changes quickly?
4 - Deliver on Demand	Deploy any change when the business needs it	Can we deliver any change to production when needed?

Where to Start

If you are unsure where to begin, start with Phase 0: Assess to understand your current state and identify the constraints holding you back.

For Developers - Common pain points developers face before CD adoption
For Managers - How delivery problems appear from a management perspective
Brownfield CD - Migrating an existing system
Greenfield CD - Building CD from day one
FAQ - Frequently asked questions about continuous delivery
DORA Recommended Practices - The research-backed capabilities that drive delivery performance

1 - Phase 0: Assess

Understand where you are today. Map your delivery process, measure what matters, and identify the constraints holding you back.

Key question: “How far are we from CD?”

Before changing anything, you need to understand your current state. This phase helps you create a clear picture of your delivery process, establish baseline metrics, and identify the constraints that will guide your improvement roadmap.

Team activity: The pages in this phase work as a facilitated team exercise. Run Current State Checklist as a retrospective to align on where your delivery process stands today before measuring baselines.

What You’ll Do

Map your value stream - Visualize the flow from idea to production
Establish baseline metrics - Measure your current delivery performance
Identify constraints - Find the bottlenecks limiting your flow
Complete the current-state checklist - Self-assess against MinimumCD practices

Why This Phase Matters

Teams that skip assessment often invest in the wrong improvements. A team with a 3-week manual testing cycle doesn’t need better deployment automation first - they need testing fundamentals. Understanding your constraints ensures you invest effort where it will have the biggest impact.

When You’re Ready to Move On

You’re ready for Phase 1: Foundations when you can answer:

What does our value stream look like end-to-end?
What are our current lead time, deployment frequency, and change failure rate?
What are the top 3 constraints limiting our delivery flow?
Which MinimumCD practices are we missing?

Next: Phase 1 - Foundations - establish the technical and team practices that make CD possible.

For Managers - how to recognize delivery problems from a leadership perspective
Phase 1: Foundations - the next phase after assessment is complete
DORA Recommended Practices - industry-recognized capabilities that underpin delivery performance
Deployment Frequency - one of the key metrics you will baseline in this phase
Lead Time for Changes - the metric that reveals how long changes spend in the system
Infrequent Releases - a common symptom that assessment helps quantify
Systemic Defect Sources - understand where defects originate before you start measuring them.

1.1 - Value Stream Mapping

Visualize your delivery process end-to-end to identify waste and constraints before starting your CD migration.

Phase 0 - Assess

Before you change anything about how your team delivers software, you need to see how it works today. Value Stream Mapping (VSM) is the single most effective tool for making your delivery process visible. It reveals the waiting, the rework, and the handoffs that you have learned to live with but that are silently destroying your flow.

In the context of a CD migration, a value stream map is not an academic exercise. It is the foundation for every decision you will make in the phases ahead. It tells you where your time goes, where quality breaks down, and which constraint to attack first.

What Is a Value Stream Map?

A value stream map is a visual representation of every step required to deliver a change from request to production. For each step, you capture:

Process time - the time someone is actively working on that step
Wait time - the time the work sits idle between steps (in a queue, awaiting approval, blocked on an environment)
Percent Complete and Accurate (%C/A) - the percentage of work arriving at this step that is usable without rework

The ratio of process time to total time (process time + wait time) is your flow efficiency. Most teams are shocked to discover that their flow efficiency is below 15%, meaning that for every hour of actual work, there are nearly six hours of waiting.

Prerequisites

Before running a value stream mapping session, make sure you have:

An established, repeatable process. You are mapping what actually happens, not what should happen. If every change follows a different path, start by agreeing on the current “most common” path.
All stakeholders in the room. You need representatives from every group involved in delivery: developers, testers, operations, security, product, change management. Each person knows the wait times and rework loops in their part of the stream that others cannot see.
A shared understanding of wait time vs. process time. Wait time is when work sits idle. Process time is when someone is actively working. A code review that takes “two days” but involves 30 minutes of actual review has 30 minutes of process time and roughly 15.5 hours of wait time.

Choose Your Mapping Approach

Value stream maps can be built from two directions. Most organizations benefit from starting bottom-up and then combining into a top-down view, but the right choice depends on where your delivery pain is concentrated.

Bottom-Up: Map at the Team Level First

Each delivery team maps its own process independently - from the moment a developer is ready to push a change to the moment that change is running in production. This is the approach described in Document Your Current Process, elevated to a formal value stream map with measured process times, wait times, and %C/A.

When to use bottom-up:

You have multiple teams that each own their own deployment process (or think they do).
Teams have different pain points and different levels of CD maturity.
You want each team to own its improvement work rather than waiting for an organizational initiative.

How it works:

Each team maps its own value stream using the session format described below.
Teams identify and fix their own constraints. Many constraints are local - flaky tests, manual deployment steps, slow code review - and do not require cross-team coordination.
After teams have mapped and improved their own streams, combine the maps to reveal cross-team dependencies. Lay the team-level maps side by side and draw the connections: shared environments, shared libraries, shared approval processes, upstream/downstream dependencies.

The combined view often reveals constraints that no single team can see: a shared staging environment that serializes deployments across five teams, a security review team that is the bottleneck for every release, or a shared library with a release cycle that blocks downstream teams for weeks.

Advantages: Fast to start, builds team ownership, surfaces team-specific friction that a high-level map would miss. Teams see results quickly, which builds momentum for the harder cross-team work.

Top-Down: Map Across Dependent Teams

Start with the full flow from a customer request (or business initiative) entering the system to the delivered outcome in production, mapping across every team the work touches. This produces a single map that shows the end-to-end flow including all inter-team handoffs, shared queues, and organizational boundaries.

When to use top-down:

Delivery pain is concentrated at the boundaries between teams, not within them.
A single change routinely touches multiple teams (front-end, back-end, platform, data, etc.) and the coordination overhead dominates cycle time.
Leadership needs a full picture of organizational delivery performance to prioritize investment.

How it works:

Identify a representative value stream - a type of work that flows through the teams you want to map. For example: “a customer-facing feature that requires API changes, a front-end update, and a database migration.”
Get representatives from every team in the room. Each person maps their team’s portion of the flow, including the handoff to the next team.
Connect the segments. The gaps between teams - where work queues, waits for prioritization, or gets lost in a ticket system - are usually the largest sources of delay.

Advantages: Reveals organizational constraints that team-level maps cannot see. Shows the true end-to-end lead time including inter-team wait times. Essential for changes that require coordinated delivery across multiple teams.

Combining Both Approaches

The most effective strategy for large organizations:

Start bottom-up. Have each team document its current process and then run its own value stream mapping session. Fix team-level quick wins immediately.
Combine into a top-down view. Once team-level maps exist, connect them to see the full organizational flow. The team-level detail makes the top-down map more accurate because each segment was mapped by the people who actually do the work.
Fix constraints at the right level. Team-level constraints (flaky tests, manual deploys) are fixed by the team. Cross-team constraints (shared environments, approval bottlenecks, dependency coordination) are fixed at the organizational level.

This layered approach prevents two common failure modes: mapping at too high a level (which misses team-specific friction) and mapping only at the team level (which misses the organizational constraints that dominate end-to-end lead time).

How to Run the Session

Step 1: Start From Delivery, Work Backward

Begin at the right side of your map - the moment a change reaches production. Then work backward through every step until you reach the point where a request enters the system. This prevents teams from getting bogged down in the early stages and never reaching the deployment process, which is often where the largest delays hide.

Typical steps you will uncover include:

Request intake and prioritization
Story refinement and estimation
Development (coding)
Code review
Build and unit tests
Integration testing
Manual QA / regression testing
Security review
Staging deployment
User acceptance testing (UAT)
Change advisory board (CAB) approval
Production deployment
Production verification

Step 2: Capture Process Time and Wait Time for Each Step

For each step on the map, record the process time and the wait time. Use averages if exact numbers are not available, but prefer real data from your issue tracker, CI system, or deployment logs when you can get it.

Migration Tip

Pay close attention to these migration-critical delays:

Handoffs that block flow - Every time work passes from one team or role to another (dev to QA, QA to ops, ops to security), there is a queue. Count the handoffs. Each one is a candidate for elimination or automation.
Manual gates - CAB approvals, manual regression testing, sign-off meetings. These often add days of wait time for minutes of actual value.
Environment provisioning delays - If developers wait hours or days for a test environment, that is a constraint you will need to address in Phase 2.
Rework loops - Any step where work frequently bounces back to a previous step. Track the percentage of times this happens. These loops are destroying your cycle time.

Step 3: Calculate %C/A at Each Step

Percent Complete and Accurate measures the quality of the handoff. Ask each person: “What percentage of the work you receive from the previous step is usable without needing clarification, correction, or rework?”

A low %C/A at a step means the upstream step is producing defective output. This is critical information for your migration plan because it tells you where quality needs to be built in rather than inspected after the fact.

Step 4: Identify Constraints (Kaizen Bursts)

Mark the steps with the largest wait times and the lowest %C/A with a “kaizen burst” - a starburst symbol indicating an improvement opportunity. These are your constraints. They will become the focus of your migration roadmap.

Common constraints teams discover during their first value stream map:

Constraint	Typical Impact	Migration Phase to Address
Long-lived feature branches	Days of integration delay, merge conflicts	Phase 1 (Trunk-Based Development)
Manual regression testing	Days to weeks of wait time	Phase 1 (Testing Fundamentals)
Environment provisioning	Hours to days of wait time	Phase 2 (Production-Like Environments)
CAB / change approval boards	Days of wait time per deployment	Phase 2 (Pipeline Architecture)
Manual deployment process	Hours of process time, high error rate	Phase 2 (Single Path to Production)
Large batch releases	Weeks of accumulation, high failure rate	Phase 3 (Small Batches)

Reading the Results

Once your map is complete, calculate these summary numbers:

Total lead time = sum of all process times + all wait times
Total process time = sum of just the process times
Flow efficiency = total process time / total lead time * 100
Number of handoffs = count of transitions between different teams or roles
Rework percentage = percentage of changes that loop back to a previous step

These numbers become part of your baseline metrics and feed directly into your work to identify constraints.

What Good Looks Like

You are not aiming for a perfect value stream map. You are aiming for a shared, honest picture of reality that the whole team agrees on. The map should be:

Visible - posted on a wall or in a shared digital tool where the team sees it daily
Honest - reflecting what actually happens, including the workarounds and shortcuts
Actionable - with constraints clearly marked so the team knows where to focus

You will revisit and update this map as you progress through each migration phase. It is a living document, not a one-time exercise.

Next Step

With your value stream map in hand, proceed to Baseline Metrics to quantify your current delivery performance.

Content contributed by Dojo Consortium, licensed under CC BY 4.0.

Slow Pipelines - a flow symptom that value stream mapping often quantifies
No Fast Feedback - a symptom frequently revealed by long wait times on the map
Coordinated Deployments - a deployment symptom visible as cross-team handoffs in the value stream
Hardening Sprints - a symptom that appears as a large testing phase on the map
Development Cycle Time - a metric that value stream mapping helps you measure
Identify Constraints - the next step that uses your value stream map to find the biggest bottleneck

1.2 - Baseline Metrics

Establish baseline measurements for your current delivery performance before making any changes.

Phase 0 - Assess

You cannot improve what you have not measured. Before making any changes to your delivery process, you need to capture baseline measurements of your current performance. These baselines serve two purposes: they help you identify where to focus your migration effort, and they give you an honest “before” picture so you can demonstrate progress as you improve.

This is not about building a sophisticated metrics dashboard. It is about getting four numbers written down so you have a starting point.

Why Measure Before Changing

Teams that skip baseline measurement fall into predictable traps:

They cannot prove improvement. Six months into a migration, leadership asks “What has gotten better?” Without a baseline, the answer is a shrug and a feeling.
They optimize the wrong thing. Without data, teams default to fixing what is most visible or most annoying rather than what is the actual constraint.
They cannot detect regression. A change that feels like an improvement may actually make things worse in ways that are not immediately obvious.

Baselines do not need to be precise to the minute. A rough but honest measurement is vastly more useful than no measurement at all.

The Four Essential Metrics

The DORA research program (now part of Google Cloud) identified four key metrics that predict software delivery performance and organizational outcomes. These are the metrics you should baseline first.

1. Deployment Frequency

What it measures: How often your team deploys to production.

How to capture it: Count the number of production deployments in the last 30 days. Check your deployment logs, pipeline system, or change management records. If deployments are rare enough that you remember each one, count from memory.

What it tells you:

Frequency	What It Suggests
Multiple times per day	You may already be practicing continuous delivery
Once per week	You have a regular cadence but likely batch changes
Once per month or less	Large batches, high risk per deployment, likely manual process
Varies wildly	No consistent process; deployments are event-driven

Record your number: ______ deployments in the last 30 days.

2. Lead Time for Changes

What it measures: The elapsed time from when code is committed to when it is running in production.

How to capture it: Pick your last 5-10 production deployments. For each one, find the commit timestamp of the oldest change included in that deployment and subtract it from the deployment timestamp. Take the median.

If your team uses feature branches, the clock starts at the first commit on the branch, not when the branch is merged. This captures the true elapsed time the change spent in the system.

What it tells you:

Lead Time	What It Suggests
Less than 1 hour	Fast flow, likely small batches and good automation
1 day to 1 week	Reasonable but with room for improvement
1 week to 1 month	Significant queuing, likely large batches or manual gates
More than 1 month	Major constraints in testing, approval, or deployment

Record your number: ______ median lead time for changes.

3. Change Failure Rate

What it measures: The percentage of deployments to production that result in a degraded service requiring remediation (rollback, hotfix, patch, or incident).

How to capture it: Look at your last 20-30 production deployments. Count how many caused an incident, required a rollback, or needed an immediate hotfix. Divide by the total number of deployments.

What it tells you:

Failure Rate	What It Suggests
0-5%	Strong quality practices and small change sets
5-15%	Typical for teams with some automation
15-30%	Quality gaps, likely insufficient testing or large batches
Above 30%	Systemic quality problems; changes are frequently broken

Record your number: ______ % of deployments that required remediation.

4. Mean Time to Restore (MTTR)

What it measures: How long it takes to restore service after a production failure caused by a deployment.

How to capture it: Look at your production incidents from the last 3-6 months. For each incident caused by a deployment, note the time from detection to resolution. Take the median. If you have not had any deployment-caused incidents, note that - it either means your quality is excellent or your deployment frequency is so low that you have insufficient data.

What it tells you:

MTTR	What It Suggests
Less than 1 hour	Good incident response, likely automated rollback
1-4 hours	Manual but practiced recovery process
4-24 hours	Significant manual intervention required
More than 1 day	Serious gaps in observability or rollback capability

Record your number: ______ median time to restore service.

Capturing Your Baselines

You do not need specialized tooling to capture these four numbers. Here is a practical approach:

Check your pipeline system. Most pipeline tools (Jenkins, GitHub Actions, GitLab CI, Azure DevOps) have deployment history. Export the last 30-90 days of deployment records.
Check your incident tracker. Pull incidents from the last 3-6 months and filter for deployment-caused issues.
Check your version control. Git log data combined with deployment timestamps gives you lead time.
Ask the team. If data is scarce, have a conversation with the team. Experienced team members can provide reasonable estimates for all four metrics.

Record these numbers somewhere the whole team can see them. A wiki page, a whiteboard, a shared document - the format does not matter. What matters is that they are written down and dated.

What About Automation?

If you already have a pipeline system that tracks deployments, you can extract most of these numbers programmatically. But do not let the pursuit of automation delay your baseline. A spreadsheet with manually gathered numbers is perfectly adequate for Phase 0. You will build more sophisticated measurement into your pipeline in Phase 2.

What Your Baselines Tell You About Where to Focus

Your baseline metrics point toward specific constraints:

Signal	Likely Constraint	Where to Look
Low deployment frequency + high lead time	Large batches, manual process	Value Stream Map for queue times
High change failure rate	Insufficient testing, poor quality practices	Testing Fundamentals
High MTTR	No rollback capability, poor observability	Rollback
High lead time + low change failure rate	Excessive manual gates adding delay but not value	Identify Constraints

Use these signals alongside your value stream map to identify your top constraints.

A Warning About Metrics

Goodhart's Law

“When a measure becomes a target, it ceases to be a good measure.”

These metrics are diagnostic tools, not performance targets. The moment you use them to compare teams, rank individuals, or set mandated targets, people will optimize for the metric rather than for actual delivery improvement. A team can trivially improve their deployment frequency number by deploying empty changes, or reduce their change failure rate by never deploying anything risky.

Use these metrics within the team, for the team. Share trends with leadership if needed, but never publish team-level metrics as a leaderboard. The goal is to help each team understand their own delivery health, not to create competition.

Next Step

With your baselines recorded, proceed to Identify Constraints to determine which bottleneck to address first.

Content contributed by Dojo Consortium, licensed under CC BY 4.0.

Deployment Frequency - detailed guidance on measuring how often you deploy
Lead Time for Changes - how to measure the time from commit to production
Change Failure Rate - measuring the percentage of deployments that cause failures
Mean Time to Restore - measuring recovery speed after production incidents
DORA Recommended Practices - the research-backed capabilities these metrics reflect
Infrequent Releases - a symptom that low deployment frequency baselines often reveal

1.3 - Identify Constraints

Use your value stream map and baseline metrics to find the bottlenecks that limit your delivery flow.

Phase 0 - Assess

Your value stream map shows you where time goes. Your baseline metrics tell you how fast and how safely you deliver. Now you need to answer the most important question in your migration: What is the one thing most limiting your delivery flow right now?

This is not a question you answer by committee vote or gut feeling. It is a question you answer with the data you have already collected.

The Theory of Constraints

Eliyahu Goldratt’s Theory of Constraints offers a simple and powerful insight: every system has exactly one constraint that limits its overall throughput. Improving anything other than that constraint does not improve the system.

Consider a delivery process where code review takes 30 minutes but the queue to get a review takes 2 days, and manual regression testing takes 5 days after that. If you invest three months building a faster build pipeline that saves 10 minutes per build, you have improved something that is not the constraint. The 5-day regression testing cycle still dominates your lead time. You have made a non-bottleneck more efficient, which changes nothing about how fast you deliver.

The implication for your CD migration is direct: you must find and address constraints in order of impact. Fix the biggest one first. Then find the next one. Then fix that. This is how you make sustained, measurable progress rather than spreading effort across improvements that do not move the needle.

Common Constraint Categories

Software delivery constraints tend to cluster into a few recurring categories. As you review your value stream map, look for these patterns.

Testing Bottlenecks

Symptoms: Large wait time between “code complete” and “verified.” Manual regression test cycles measured in days or weeks. Low %C/A at the testing step, indicating frequent rework. High change failure rate in your baseline metrics despite significant testing effort.

What is happening: Testing is being done as a phase after development rather than as a continuous activity during development. Manual test suites have grown to cover every scenario ever encountered, and running them takes longer with every release. The test environment is shared and frequently broken.

Migration path: Phase 1 - Testing Fundamentals

Deployment Gates

Symptoms: Wait times of days or weeks between “tested” and “deployed.” Change Advisory Board (CAB) meetings that happen weekly or biweekly. Multiple sign-offs required from people who are not involved in the actual change.

What is happening: The organization has substituted process for confidence. Because deployments have historically been risky (large batches, manual processes, poor rollback), layers of approval have been added. These approvals add delay but rarely catch issues that automated testing would not. They exist because the deployment process is not trustworthy, and they persist because removing them feels dangerous.

Migration path: Phase 2 - Pipeline Architecture and building the automated quality evidence that makes manual approvals unnecessary.

Environment Provisioning

Symptoms: Developers waiting hours or days for a test or staging environment. “Works on my machine” failures when code reaches a shared environment. Environments that drift from production configuration over time.

What is happening: Environments are manually provisioned, shared across teams, and treated as pets rather than cattle. There is no automated way to create a production-like environment on demand. Teams queue for shared environments, and environment configuration has diverged from production.

Migration path: Phase 2 - Production-Like Environments

Code Review Delays

Symptoms: Pull requests sitting open for more than a day. Review queues with 5 or more pending reviews. Developers context-switching because they are blocked waiting for review.

What is happening: Code review is being treated as an asynchronous handoff rather than a collaborative activity. Reviews happen when the reviewer “gets to it” rather than as a near-immediate response. Large pull requests make review daunting, which increases queue time further.

Migration path: Phase 1 - Code Review and Trunk-Based Development to reduce branch lifetime and review size.

Manual Handoffs

Symptoms: Multiple steps in your value stream map where work transitions from one team to another. Tickets being reassigned across teams. “Throwing it over the wall” language in how people describe the process.

What is happening: Delivery is organized as a sequence of specialist stages (dev, test, ops, security) rather than as a cross-functional flow. Each handoff introduces a queue, a context loss, and a communication overhead. The more handoffs, the longer the lead time and the more likely that information is lost.

Migration path: This is an organizational constraint, not a technical one. It is addressed gradually through cross-functional team formation and by automating the specialist activities into the pipeline so that handoffs become automated checks rather than manual transfers.

Using Your Value Stream Map to Find the Constraint

Pull out your value stream map and follow this process:

Step 1: Rank Steps by Wait Time

List every step in your value stream and sort them by wait time, longest first. Your biggest constraint is almost certainly in the top three. Wait time is more important than process time because wait time is pure waste - nothing is happening, no value is being created.

Step 2: Look for Rework Loops

Identify steps where work frequently loops back. A testing step with a 40% rework rate means that nearly half of all changes go through the development-to-test cycle twice. The effective wait time for that step is nearly doubled when you account for rework.

Step 3: Count Handoffs

Each handoff between teams or roles is a queue point. If your value stream has 8 handoffs, you have 8 places where work waits. Look for handoffs that could be eliminated by automation or by reorganizing work within the team.

Step 4: Cross-Reference with Metrics

Check your findings against your baseline metrics:

High lead time with low process time = the constraint is in the queues (wait time), not in the work itself
High change failure rate = the constraint is in quality practices, not in speed
Low deployment frequency with everything else reasonable = the constraint is in the deployment process itself or in organizational policy

Prioritizing: Fix the Biggest One First

One Constraint at a Time

Resist the temptation to tackle multiple constraints simultaneously. The Theory of Constraints is clear: improving a non-bottleneck does not improve the system. Identify the single biggest constraint, focus your migration effort there, and only move to the next constraint when the first one is no longer the bottleneck.

This does not mean the entire team works on one thing. It means your improvement initiatives are sequenced to address constraints in order of impact.

Once you have identified your top constraint, map it to a migration phase:

If Your Top Constraint Is…	Start With…
Integration and merge conflicts	Phase 1 - Trunk-Based Development
Manual testing cycles	Phase 1 - Testing Fundamentals
Large work items that take weeks	Phase 1 - Work Decomposition
Code review bottlenecks	Phase 1 - Code Review
Manual or inconsistent deployments	Phase 2 - Single Path to Production
Environment availability	Phase 2 - Production-Like Environments
Change approval processes	Phase 2 - Pipeline Architecture
Large batch sizes	Phase 3 - Small Batches

The Next Constraint

Fixing your first constraint will improve your flow. It will also reveal the next constraint. This is expected and healthy. A delivery process is a chain, and strengthening the weakest link means a different link becomes the weakest.

This is why the migration is organized in phases. Phase 1 addresses the foundational constraints that nearly every team has (integration practices, testing, small work). Phase 2 addresses pipeline constraints. Phase 3 optimizes flow. You will cycle through constraint identification and resolution throughout your migration.

Plan to revisit your value stream map and metrics after addressing each major constraint. Your map from today will be outdated within weeks of starting your migration - and that is a sign of progress.

Next Step

Complete the Current State Checklist to assess your team against specific MinimumCD practices and confirm your migration starting point.

Work Items Take Too Long - a flow symptom often traced back to the constraints this guide helps identify
Too Much WIP - a symptom that constraint analysis frequently uncovers
Unbounded WIP - an anti-pattern that shows up as a queue constraint in your value stream
CAB Gates - an organizational anti-pattern that commonly surfaces as a deployment gate constraint
Monolithic Work Items - an anti-pattern that increases lead time by inflating batch size
Value Stream Mapping - the prerequisite exercise that produces the data this guide analyzes

1.4 - Current State Checklist

Self-assess your team against MinimumCD practices to understand your starting point and determine where to begin your migration.

Phase 0 - Assess

This checklist translates the practices defined by MinimumCD.org into concrete yes-or-no questions you can answer about your team today. It is not a test to pass. It is a diagnostic tool that shows you which practices are already in place and which ones your migration needs to establish.

Work through each category with your team. Be honest - checking a box you have not earned gives you a migration plan that skips steps you actually need.

How to Use This Checklist

For each item, mark it with an [x] if your team consistently does this today - not occasionally, not aspirationally, but as a default practice. If you do it sometimes but not reliably, leave it unchecked.

Trunk-Based Development

All developers integrate their work to the trunk (main branch) at least once every 24 hours
No branch lives longer than 24 hours before being integrated
The team does not use code freeze periods to stabilize for release
There are fewer than 3 active branches at any given time
Merge conflicts are rare and small when they occur

Why it matters: Long-lived branches are the single biggest source of integration risk. Every hour a branch lives is an hour where it diverges from what everyone else is doing. Trunk-based development eliminates integration as a separate, painful event and makes it a continuous, trivial activity. Without this practice, continuous integration is impossible, and without continuous integration, continuous delivery is impossible.

Continuous Integration

Every commit to trunk triggers an automated build
The automated build includes running the full unit test suite
All tests must pass before any change is merged to trunk
A broken build is treated as the team’s top priority to fix (not left broken while other work continues)
The build and test cycle completes in less than 10 minutes

Why it matters: Continuous integration means that the team always knows whether the codebase is in a working state. If builds are not automated, if tests do not run on every commit, or if broken builds are tolerated, then the team is flying blind. Every change is a gamble that something else has not broken in the meantime.

Pipeline Practices

There is a single, defined path that every change follows to reach production (no side doors, no manual deployments, no exceptions)
The pipeline is deterministic: given the same input commit, it produces the same output every time
Build artifacts are created once and promoted through environments (not rebuilt for each environment)
The pipeline runs automatically on every commit to trunk without manual triggering
Pipeline failures provide clear, actionable feedback that developers can act on within minutes

Why it matters: A pipeline is the mechanism that turns code changes into production deployments. If the pipeline is inconsistent, manual, or bypassable, then you do not have a reliable path to production. You have a collection of scripts and hopes. Deterministic, automated pipelines are what make deployment a non-event rather than a high-risk ceremony.

Deployment

The team has at least one environment that closely mirrors production configuration (OS, middleware, networking, data shape)
Application configuration is externalized from the build artifact (config files, environment variables, or a config service - not baked into the binary)
The team can roll back a production deployment within minutes, not hours
Deployments to production do not require downtime
The deployment process is the same for every environment (dev, staging, production) with only configuration differences

Why it matters: If your test environment does not look like production, your tests are lying to you. If configuration is baked into your artifact, you are rebuilding for each environment, which means the thing you tested is not the thing you deploy. If you cannot roll back quickly, every deployment is a high-stakes bet. These practices ensure that what you test is what you ship, and that shipping is safe.

Quality

The team has automated tests at multiple levels (unit, integration, and at least some end-to-end)
A build that passes all automated checks is considered deployable without additional manual verification
There are no manual quality gates between a green build and production (no manual QA sign-off, no manual regression testing required)
Defects found in production are addressed by adding automated tests that would have caught them, not by adding manual inspection steps
The team monitors production health and can detect deployment-caused issues within minutes

Why it matters: Quality that depends on manual inspection does not scale and does not speed up. As your deployment frequency increases through the migration, manual quality gates become the bottleneck. The goal is to build quality in through automation so that a green build means a deployable build. This is the foundation of continuous delivery: if it passes the pipeline, it is ready for production.

Scoring Guide

Count the number of items you checked across all categories.

Score	Your Starting Point	Recommended Phase
0-5	You are early in your journey. Most foundational practices are not yet in place.	Start at the beginning of Phase 1 - Foundations. Focus on trunk-based development and basic test automation first.
6-12	You have some practices in place but significant gaps remain. This is the most common starting point.	Start with Phase 1 - Foundations but focus on the categories where you had the fewest checks. Your constraint analysis will tell you which gap to close first.
13-18	Your foundations are solid. The gaps are likely in pipeline automation and deployment practices.	You may be able to move quickly through Phase 1 and focus your effort on Phase 2 - Pipeline. Validate with your value stream map that your remaining constraints match.
19-22	You are well-practiced in most areas. Your migration is about closing specific gaps and optimizing flow.	Review your unchecked items - they point to specific topics in Phase 3 - Optimize or Phase 4 - Deliver on Demand.
23-25	You are already practicing most of what MinimumCD defines. Your focus should be on consistency and delivering on demand.	Jump to Phase 4 - Deliver on Demand and focus on the capability to deploy any change when needed.

A Score Is Not a Grade

This checklist exists to help your team find its starting point, not to judge your team’s competence. A score of 5 does not mean your team is failing - it means your team has a clear picture of what to work on. A score of 22 does not mean you are done - it means your remaining gaps are specific and targeted.

The only wrong answer is a dishonest one.

Putting It All Together

You now have four pieces of information from Phase 0:

A value stream map showing your end-to-end delivery process with wait times and rework loops
Baseline metrics for deployment frequency, lead time, change failure rate, and MTTR
An identified top constraint telling you where to focus first
This checklist confirming which practices are in place and which are missing

Together, these give you a clear, data-informed starting point for your migration. You know where you are, you know what is slowing you down, and you know which practices to establish first.

Next Step

You are ready to begin Phase 1 - Foundations. Start with the practice area that addresses your top constraint.

Painful Merges - a symptom indicating trunk-based development practices are missing
Fear of Deploying - a symptom that often correlates with unchecked deployment practices
Slow Test Suites - a symptom that surfaces when automated testing practices are immature
Manual Regression Testing Gates - an anti-pattern the Quality section of this checklist helps identify
Missing Deployment Pipeline - an anti-pattern the Pipeline Practices section helps detect
Phase 1: Foundations - where to begin after completing your assessment

2 - Phase 1: Foundations

Establish the essential practices for daily integration, testing, and small work decomposition.

Key question: “Can we integrate safely every day?”

This phase establishes the development practices that make continuous delivery possible. Without these foundations, pipeline automation just speeds up a broken process.

What You’ll Do

Adopt trunk-based development - Integrate to trunk at least daily
Build testing fundamentals - Create a fast, reliable test suite
Automate your build - One command to build, test, and package
Decompose work - Break features into small, deliverable increments
Streamline code review - Fast, effective review that doesn’t block flow
Establish working agreements - Shared definitions of done and ready
Everything as code - Infrastructure, pipelines, schemas, monitoring, and security policies in version control, delivered through pipelines

Why This Phase Matters

These practices are the prerequisites for everything that follows. Trunk-based development eliminates merge hell. Testing fundamentals give you the confidence to deploy frequently. Small work decomposition reduces risk per change. Together, they create the feedback loops that drive continuous improvement.

When You’re Ready to Move On

You’re ready for Phase 2: Pipeline when:

All developers integrate to trunk at least once per day
Your test suite catches real defects and runs in under 10 minutes
You can build and package your application with a single command
Most work items are completable within 2 days

Next: Phase 2 - Pipeline - build a single automated path from commit to production.

Phase 0: Assess - The assessment phase that precedes Foundations
Phase 2: Pipeline - The next phase after establishing foundations
DORA Recommended Practices - Research-backed capabilities that drive delivery performance
No Fast Feedback - Symptom that foundational practices address
Works on My Machine - Symptom eliminated by build automation and testing foundations
Deployment Frequency - Key metric that improves as foundations mature

2.1 - Trunk-Based Development

Integrate all work to the trunk at least once per day to enable continuous integration.

Phase 1 - Foundations

Trunk-based development is the first foundation to establish. Without daily integration to a shared trunk, the rest of the CD migration cannot succeed. This page covers the core practice, two migration paths, and a tactical guide for getting started.

What Is Trunk-Based Development?

Trunk-based development (TBD) is a branching strategy where all developers integrate their work into a single shared branch - the trunk - at least once per day. The trunk is always kept in a releasable state.

This is a non-negotiable prerequisite for continuous delivery. If your team is not integrating to trunk daily, you are not doing CI, and you cannot do CD. There is no workaround.

“If it hurts, do it more often, and bring the pain forward.”
Jez Humble, Continuous Delivery

What TBD Is Not

It is not “everyone commits directly to main with no guardrails.” You still test, review, and validate work - you just do it in small increments.
It is not incompatible with code review. It requires review to happen quickly.
It is not reckless. It is the opposite: small, frequent integrations are far safer than large, infrequent merges.

What Trunk-Based Development Improves

Problem	How TBD Helps
Merge conflicts	Small changes integrated frequently rarely conflict
Integration risk	Bugs are caught within hours, not weeks
Long-lived branches diverge from reality	The trunk always reflects the current state of the codebase
“Works on my branch” syndrome	Everyone shares the same integration point
Slow feedback	CI runs on every integration, giving immediate signal
Large batch deployments	Small changes are individually deployable
Fear of deployment	Each change is small enough to reason about

Two Migration Paths

There are two valid approaches to trunk-based development. Both satisfy the minimum CD requirement of daily integration. Choose the one that fits your team’s current maturity and constraints.

Path 1: Short-Lived Branches

Developers create branches that live for less than 24 hours. Work is done on the branch, reviewed quickly, and merged to trunk within a single day.

How it works:

Pull the latest trunk
Create a short-lived branch
Make small, focused changes
Open a pull request (or use pair programming as the review)
Merge to trunk before end of day
The branch is deleted after merge

Best for teams that:

Currently use long-lived feature branches and need a stepping stone
Have regulatory requirements for traceable review records
Use pull request workflows they want to keep (but make faster)
Are new to TBD and want a gradual transition

Key constraint: The branch must merge to trunk within 24 hours. If it does not, you have a long-lived branch and you have lost the benefit of TBD.

Path 2: Direct Trunk Commits

Developers commit directly to trunk. Quality is ensured through pre-commit checks, pair programming, and strong automated testing.

How it works:

Pull the latest trunk
Make a small, tested change locally
Run the local build and test suite
Push directly to trunk
CI validates the commit immediately

Best for teams that:

Have strong automated test coverage
Practice pair or mob programming (which provides real-time review)
Want maximum integration frequency
Have high trust and shared code ownership

Key constraint: This requires excellent test coverage and a culture where the team owns quality collectively. Without these, direct trunk commits become reckless.

How to Choose Your Path

Ask these questions:

Do you have automated tests that catch real defects? If no, start with Path 1 and invest in testing fundamentals in parallel.
Does your organization require documented review approvals? If yes, use Path 1 with rapid pull requests.
Does your team practice pair programming? If yes, Path 2 may work immediately - pairing is a continuous review process.
How large is your team? Teams of 2-4 can adopt Path 2 more easily. Larger teams may start with Path 1 and transition later.

Both paths are valid. The important thing is daily integration to trunk. Do not spend weeks debating which path to use. Pick one, start today, and adjust.

Essential Supporting Practices

Trunk-based development does not work in isolation. These supporting practices make daily integration safe and sustainable.

Feature Flags

When you integrate to trunk daily, incomplete features will exist on trunk. Feature flags let you merge code that is not yet ready for users.

Simple feature flag example

// Simple feature flag example
if (featureFlags.isEnabled('new-checkout-flow', user)) {
  return newCheckout(cart);
} else {
  return legacyCheckout(cart);
}

Rules for feature flags in TBD:

Use flags to decouple deployment from release
Remove flags within days or weeks - they are temporary by design
Keep flag logic simple; avoid nested or dependent flags
Test both flag states in your automated test suite

When NOT to use feature flags:

New features that can be built and connected in a final commit - use Connect Last instead
Behavior changes that replace existing logic - use Branch by Abstraction instead
New API routes - build the route, expose it as the last change
Bug fixes or hotfixes - deploy immediately without a flag
Simple changes where standard deployment is sufficient

Feature flags are covered in more depth in Phase 3: Optimize.

Evolutionary Coding Practices

The ability to make code changes that are not complete features and integrate them to trunk without breaking existing behavior is a core skill for trunk-based development. You never make big-bang changes. You make small changes that limit risk. Feature flags are one approach, but two other patterns are equally important.

Branch by Abstraction

Branch by abstraction lets you gradually replace existing behavior while continuously integrating to trunk. It works in four steps:

Branch by abstraction - four-step pattern

// Step 1: Create abstraction (integrate to trunk)
class PaymentProcessor {
  process(payment) {
    return this.implementation.process(payment)
  }
}

// Step 2: Add new implementation alongside old (integrate to trunk)
class StripePaymentProcessor {
  process(payment) {
    // New Stripe implementation
  }
}

// Step 3: Switch implementations (integrate to trunk)
const processor = useNewStripe
  ? new StripePaymentProcessor()
  : new LegacyProcessor()

// Step 4: Remove old implementation (integrate to trunk)

Each step is a separate commit that keeps trunk working. The old behavior runs until you explicitly switch, and you can remove the abstraction layer once the migration is complete.

Connect Last

Connect Last means you build all the components of a feature, each individually tested and integrated to trunk, and wire them into the user-visible path only in the final commit.

Connect Last pattern - build components then wire to UI

// Commits 1-10: Build new checkout components (all tested, all integrated)
function CheckoutStep1() { /* tested, working */ }
function CheckoutStep2() { /* tested, working */ }
function CheckoutStep3() { /* tested, working */ }

// Commit 11: Wire up to UI (final integration)
router.get('/checkout', CheckoutStep1);

Because nothing references the new code until the last commit, there is no risk of breaking existing behavior during development.

Which Pattern Should I Use?

Pattern	Best for	Example
Connect Last	New features that do not affect existing code	Building a new checkout flow, adding a new report page
Branch by Abstraction	Replacing or modifying existing behavior	Swapping a payment processor, migrating a data layer
Feature Flags	Gradual rollout, testing in production, or customer-specific features	Dark launches, A/B tests, beta programs

If your change does not touch existing code paths, Connect Last is the simplest option. If you are replacing something that already exists, Branch by Abstraction gives you a safe migration path. Reserve feature flags for cases where you need runtime control over who sees the change.

Commit Small, Commit Often

Each commit should be a small, coherent change that leaves trunk in a working state. If you are committing once a day in a large batch, you are not getting the benefit of TBD.

Guidelines:

Each commit should be independently deployable
A commit should represent a single logical change
If you cannot describe the change in one sentence, it is too big
Target multiple commits per day, not one large commit at end of day

Test-Driven Development (TDD) and ATDD

TDD provides the safety net that makes frequent integration sustainable. When every change is accompanied by tests, you can integrate confidently.

TDD: Write the test before the code. Red, green, refactor.
ATDD (Acceptance Test-Driven Development): Write acceptance criteria as executable tests before implementation.

Both practices ensure that your test suite grows with your code and that trunk remains releasable.

Getting Started: A Tactical Guide

Step 1: Shorten Your Branches

If your team currently uses long-lived feature branches, start by shortening their lifespan.

Current State	Target
Branches live for weeks	Branches live for < 1 week
Merge once per sprint	Merge multiple times per week
Large merge conflicts are normal	Conflicts are rare and small

Action: Set a team agreement that no branch lives longer than 2 days. Track branch age as a metric.

Step 2: Integrate Daily

Tighten the window from 2 days to 1 day.

Action:

Every developer merges to trunk at least once per day, every day they write code
If work is not complete, use a feature flag or other technique to merge safely
Track integration frequency as your primary metric

Step 3: Ensure Trunk Stays Green

Daily integration is only useful if trunk remains in a releasable state.

Action:

Run your test suite on every merge to trunk
If the build breaks, fixing it becomes the team’s top priority
Establish a working agreement: “broken build = stop the line” (see Working Agreements)

Step 4: Remove the Safety Net of Long Branches

Once the team is integrating daily with a green trunk, eliminate the option of long-lived branches.

Action:

Configure branch protection rules to warn or block branches older than 24 hours
Remove any workflow that depends on long-lived branches (e.g., “dev” or “release” branches)
Celebrate the transition - this is a significant shift in how the team works

Key Pitfalls

1. “We integrate daily, but we also keep our feature branches”

If you are merging to trunk daily but also maintaining a long-lived feature branch, you are not doing TBD. The feature branch will diverge, and merging it later will be painful. The integration to trunk must be the only integration point.

2. “Our builds are too slow for frequent integration”

If your CI pipeline takes 30 minutes, integrating multiple times a day feels impractical. This is a real constraint - address it by investing in build automation and parallelizing your test suite. Target a build time under 10 minutes.

3. “We can’t integrate incomplete features to trunk”

Yes, you can. Use feature flags to hide incomplete work from users. The code exists on trunk, but the feature is not active. This is a standard practice at every company that practices CD.

4. “Code review takes too long for daily integration”

If pull request reviews take 2 days, daily integration is impossible. The solution is to change how you review: pair programming provides continuous review, mob programming reviews in real time, and small changes can be reviewed asynchronously in minutes. See Code Review for specific techniques.

5. “What if someone pushes a bad commit to trunk?”

This is why you have automated tests, CI, and the “broken build = top priority” agreement. Bad commits will happen. The question is how fast you detect and fix them. With TBD and CI, the answer is minutes, not days.

Measuring Success

Track these metrics to verify your TBD adoption:

Metric	Target	Why It Matters
Integration frequency	At least 1 per developer per day	Confirms daily integration is happening
Branch age	< 24 hours	Catches long-lived branches
Build duration	< 10 minutes	Enables frequent integration without frustration
Merge conflict frequency	Decreasing over time	Confirms small changes reduce conflicts

Next Step

Once your team is integrating to trunk daily, build the test suite that makes that integration trustworthy. Continue to Testing Fundamentals.

Painful Merges - Symptom eliminated by integrating to trunk daily
Merge Freeze - Symptom caused by long-lived branches and infrequent integration
No Fast Feedback - Symptom that daily integration and CI address directly
Long-Lived Feature Branches - Anti-pattern that TBD replaces
Integration Deferred - Anti-pattern where integration is postponed until late in development
Integration Frequency - Key metric for tracking TBD adoption

2.1.1 - TBD Migration Guide

A tactical guide for migrating from GitFlow or long-lived branches to trunk-based development, covering regulated environments, multi-team coordination, and common pitfalls.

Phase 1 - Foundations

This is a detailed companion to the Trunk-Based Development overview. It covers specific migration paths, regulated environment guidance, multi-team strategies, and concrete scenarios.

Continuous delivery requires continuous integration and CI requires very frequent code integration, at least daily, to the trunk. Doing that either requires trunk-based development or worthless process overhead to do multiple merges to accomplish this. So, if you want CI, you’re not getting there without trunk-based development. However, standing up TBD is not as simple as “collapse all the branches.” CD is a quality process, not just automated code delivery. Trunk-based development is the first step in establishing that quality process and in uncovering the problems in the current process.

GitFlow, and other branching models that use long-lived branches, optimize for isolation to protect working code from untested or poorly tested code. They create the illusion of safety while silently increasing risk through long feedback delays. The result is predictable: painful merges, stale assumptions, and feedback that arrives too late to matter.

TBD reverses that. It optimizes for rapid feedback, smaller changes, and collaborative discovery, the ingredients required for CI and continuous delivery.

This article explains how to move from GitFlow (or any long-lived branch pattern) toward TBD, and what “good” actually looks like along the way.

Why Move to Trunk-Based Development?

Long-lived branches hide problems. TBD exposes them early, when they are cheap to fix.

Think of long-lived branches like storing food in a bunker: it feels safe until you open the door and discover half of it rotting. With TBD, teams check freshness every day.

To do CI, teams need:

Small changes integrated at least daily
Automated tests giving fast, deterministic feedback
A single source of truth: the trunk

If your branches live for more than a day or two, you aren’t doing continuous integration. You’re doing periodic integration at best. True CI requires at least daily integration to the trunk.

The First Step: Stop Letting Work Age

The biggest barrier isn’t tooling. It’s habits.

The first meaningful change is simple:

Stop letting branches live long enough to become problems.

Your first goal isn’t true TBD. It’s shorter-lived branches: changes that live for hours or a couple of days, not weeks.

That alone exposes dependency issues, unclear requirements, and missing tests, which is exactly the point. The pain tells you where improvement is needed.

Before You Start: What to Measure

You cannot improve what you don’t measure. Before changing anything, establish baseline metrics, so you can track actual progress.

Essential Metrics to Track Weekly

Branch Lifetime

Average time from branch creation to merge
Maximum branch age currently open
Target: Reduce average from weeks to days, then to hours

Integration Health

Number of merge conflicts per week
Time spent resolving conflicts
Target: Conflicts should decrease as integration frequency increases

Delivery Speed

Time from commit to production deployment
Number of commits per day reaching production
Target: Decrease time to production, increase deployment frequency

Quality Indicators

Build/test execution time
Test failure rate
Production incidents per deployment
Target: Fast, reliable tests; stable deployments

Work Decomposition

Average pull request size (lines changed)
Number of files changed per commit
Target: Smaller, more focused changes

Start with just two or three of these. Don’t let measurement become its own project.

The goal isn’t perfect data. It’s visibility into whether you’re actually moving in the right direction.

Path 1: Moving from Long-Lived Branches to Short-Lived Branches

When GitFlow habits are deeply ingrained, this is usually the least-threatening first step.

1. Collapse the Branching Model

Stop using:

develop
release branches that sit around for weeks
feature branches lasting a sprint or more

Move toward:

A single main (or trunk)
Temporary branches measured in hours or days

2. Integrate Every Few Days, Then Every Day

Set an explicit working agreement:

“Nothing lives longer than 48 hours.”

Once this feels normal, shorten it:

“Integrate at least once per day.”

If a change is too large to merge within a day or two, the problem isn’t the branching model. The problem is the decomposition of work.

3. Test Before You Code

Branch lifetime shortens when you stop guessing about expected behavior. Bring product, QA, and developers together before coding:

Write acceptance criteria collaboratively
Turn them into executable tests
Then write code to make those tests pass

You’ll discover misunderstandings upfront instead of after a week of coding.

This approach is called Behavior-Driven Development (BDD), a collaborative practice where teams define expected behavior in plain language before writing code. BDD bridges the gap between business requirements and technical implementation by using concrete examples that become executable tests.

Key BDD resources:

Behavior-Driven Development - Dojo Consortium - Comprehensive guide to BDD practices
“Specification by Example” by Gojko Adzic - Foundational text on collaborative specification

How to Run a Three Amigos Session

Participants: Product Owner, Developer, Tester (15-30 minutes per story)

Process:

Product describes the user need and expected outcome
Developer asks questions about edge cases and dependencies
Tester identifies scenarios that could fail
Together, write acceptance criteria as examples

Example:

BDD scenarios for password reset

Feature: User password reset

Scenario: Valid reset request
  Given a user with email "user@example.com" exists
  When they request a password reset
  Then they receive an email with a reset link
  And the link expires after 1 hour

Scenario: Invalid email
  Given no user with email "nobody@example.com" exists
  When they request a password reset
  Then they see "If the email exists, a reset link was sent"
  And no email is sent

Scenario: Expired link
  Given a user has a reset link older than 1 hour
  When they click the link
  Then they see "This reset link has expired"
  And they are prompted to request a new one

These scenarios become your automated acceptance tests before you write any implementation code.

From Acceptance Criteria to Tests

Turn those scenarios into executable tests in your framework of choice:

Acceptance tests for password reset scenarios

// Example using Jest and Supertest
describe('Password Reset', () => {
  it('sends reset email for valid user', async () => {
    await createUser({ email: 'user@example.com' });

    const response = await request(app)
      .post('/password-reset')
      .send({ email: 'user@example.com' });

    expect(response.status).toBe(200);
    expect(emailService.sentEmails).toHaveLength(1);
    expect(emailService.sentEmails[0].to).toBe('user@example.com');
  });

  it('does not reveal whether email exists', async () => {
    const response = await request(app)
      .post('/password-reset')
      .send({ email: 'nobody@example.com' });

    expect(response.status).toBe(200);
    expect(response.body.message).toBe('If the email exists, a reset link was sent');
    expect(emailService.sentEmails).toHaveLength(0);
  });
});

Now you can write the minimum code to make these tests pass. This drives smaller, more focused changes.

4. Invest in Contract Tests

Most merge pain isn’t from your code. It’s from the interfaces between services. Define interface changes early and codify them with provider/consumer contract tests.

This lets teams integrate frequently without surprises.

Path 2: Committing Directly to the Trunk

This is the cleanest and most powerful version of TBD. It requires discipline, but it produces the most stable delivery pipeline and the least drama.

If the idea of committing straight to main makes people panic, that’s a signal about your current testing process, not a problem with TBD.

Note on regulated environments

If you work in a regulated industry with compliance requirements (SOX, HIPAA, FedRAMP, etc.), **Path 1 with short-lived branches** is usually the better choice. Short-lived branches provide the audit trails, separation of duties, and documented approval workflows that regulators expect, while still enabling daily integration. See [TBD in Regulated Environments](#tbd-in-regulated-environments) for detailed guidance on meeting compliance requirements, and [Address Code Review Concerns](#address-code-review-concerns) for how to maintain fast review cycles with short-lived branches.

How to Choose Your Path

Use this rule of thumb:

If your team fears “breaking everything,” start with short-lived branches.
If your team collaborates well and writes tests first, go straight to trunk commits.

Both paths require the same skills:

Smaller work
Better requirements
Shared understanding
Automated tests
A reliable pipeline

The difference is pace.

Essential TBD Practices

These practices apply to both paths, whether you’re using short-lived branches or committing directly to trunk.

Use Feature Flags the Right Way

Feature flags are one of several evolutionary coding practices that allow you to integrate incomplete work safely. Other methods include branch by abstraction and connect-last patterns.

Feature flags are not a testing strategy. They are a release strategy.

Every commit to trunk must:

Build
Test
Deploy safely

Flags let you deploy incomplete work without exposing it prematurely. They don’t excuse poor test discipline.

Start Simple: Boolean Flags

You don’t need a sophisticated feature flag system to start. Begin with environment variables or simple config files.

Simple boolean flag example:

Simple boolean feature flags via environment variables

// config/features.js
module.exports = {
  newCheckoutFlow: process.env.FEATURE_NEW_CHECKOUT === 'true',
  enhancedSearch: process.env.FEATURE_ENHANCED_SEARCH === 'true',
};

// In your code
const features = require('./config/features');

app.get('/checkout', (req, res) => {
  if (features.newCheckoutFlow) {
    return renderNewCheckout(req, res);
  }
  return renderOldCheckout(req, res);
});

This is enough for most TBD use cases.

Testing Code Behind Flags

Critical: You must test both code paths, flag on and flag off.

Testing both flag states - enabled and disabled

describe('Checkout flow', () => {
  describe('with new checkout flow enabled', () => {
    beforeEach(() => {
      features.newCheckoutFlow = true;
    });

    it('shows new checkout UI', () => {
      // Test new flow
    });
  });

  describe('with new checkout flow disabled', () => {
    beforeEach(() => {
      features.newCheckoutFlow = false;
    });

    it('shows legacy checkout UI', () => {
      // Test old flow
    });
  });
});

If you only test with the flag on, you’ll break production when the flag is off.

Two Types of Feature Flags

Feature flags serve two fundamentally different purposes:

Temporary Release Flags (should be removed):

Control rollout of new features
Enable gradual deployment
Allow quick rollback of changes
Test in production before full release
Lifecycle: Created for a release, removed once stable (typically 1-4 weeks)

Permanent Configuration Flags (designed to stay):

User preferences and settings (dark mode, email notifications, etc.)
Customer-specific features (enterprise vs. free tier)
A/B testing and experimentation
Regional or regulatory variations
Operational controls (read-only mode, maintenance mode)
Lifecycle: Part of your product’s configuration system

The distinction matters: Temporary release flags create technical debt if not removed. Permanent configuration flags are part of your feature set and belong in your configuration management system.

Most of the feature flags you create for TBD migration will be temporary release flags that must be removed.

Release Flag Lifecycle Management

Temporary release flags are scaffolding, not permanent architecture.

Every temporary release flag should have:

A creation date
A purpose
An expected removal date
An owner responsible for removal

Track your flags:

Tracking flag metadata for lifecycle management

// flags.config.js
module.exports = {
  flags: [
    {
      name: 'newCheckoutFlow',
      created: '2024-01-15',
      owner: 'checkout-team',
      jiraTicket: 'SHOP-1234',
      removalTarget: '2024-02-15',
      purpose: 'Progressive rollout of redesigned checkout'
    }
  ]
};

Set reminders to remove flags. Permanent flags multiply complexity and slow you down.

When to Remove a Flag

Remove a flag when:

The feature is 100% rolled out and stable
You’re confident you won’t need to roll back
Usually 1-2 weeks after full deployment

Removal process:

Set flag to always-on in code
Deploy and monitor
If stable for 48 hours, delete the conditional logic entirely
Remove the flag from configuration

Common Anti-Patterns to Avoid

Don’t:

Let temporary release flags become permanent (if it’s truly permanent, it should be a configuration option)
Let release flags accumulate without removal
Skip testing both flag states
Use flags to hide broken code
Create flags for every tiny change

Do:

Use release flags for large or risky changes
Remove release flags as soon as the feature is stable
Clearly document whether each flag is temporary (release) or permanent (configuration)
Test both enabled and disabled states
Move permanent feature toggles to your configuration management system

Commit Small and Commit Often

If a change is too large to commit today, split it.

Large commits are failed design upstream, not failed integration downstream.

Use TDD and ATDD to Keep Refactors Safe

Refactoring must not break tests. If it does, you’re testing implementation, not behavior. Behavioral tests are what keep trunk commits safe.

Prioritize Interfaces First

Always start by defining and codifying the contract:

What is the shape of the request?
What is the response?
What error states must be handled?

Interfaces are the highest-risk area. Drive them with tests first. Then work inward.

Getting Started: A Tactical Guide

The initial phase sets the tone. Focus on establishing new habits, not perfection.

Step 1: Team Agreement and Baseline

Hold a team meeting to discuss the migration
Agree on initial branch lifetime limit (start with 48 hours if unsure)
Document current baseline metrics (branch age, merge frequency, build time)
Identify your slowest-running tests
Create a list of known integration pain points
Set up a visible tracker (physical board or digital dashboard) for metrics

Step 2: Test Infrastructure Audit

Focus: Find and fix what will slow you down.

Run your test suite and time each major section
Identify slow tests
Look for:
- Tests with sleeps or arbitrary waits
- Tests hitting external services unnecessarily
- Integration tests that could be contract tests
- Flaky tests masking real issues

Fix or isolate the worst offenders. You don’t need a perfect test suite to start, just one fast enough to not punish frequent integration.

Step 3: First Integrated Change

Pick the smallest possible change:

A bug fix
A refactoring with existing test coverage
A configuration update
Documentation improvement

The goal is to validate your process, not to deliver a feature.

Execute:

Create a branch (if using Path 1) or commit directly (if using Path 2)
Make the change
Run tests locally
Integrate to trunk
Deploy through your pipeline
Observe what breaks or slows you down

Step 4: Retrospective

Gather the team:

What went well:

Did anyone integrate faster than before?
Did you discover useful information about your tests or pipeline?

What hurt:

What took longer than expected?
What manual steps could be automated?
What dependencies blocked integration?

Ongoing commitment:

Adjust branch lifetime limit if needed
Assign owners to top 3 blockers
Commit to integrating at least one change per person

The initial phase won’t feel smooth. That’s expected. You’re learning what needs fixing.

Getting Your Team On Board

Technical changes are easy compared to changing habits and mindsets. Here’s how to build buy-in.

Acknowledge the Fear

When you propose TBD, you’ll hear:

“We’ll break production constantly”
“Our code isn’t good enough for that”
“We need code review on branches”
“This won’t work with our compliance requirements”

These concerns are valid signals about your current system. Don’t dismiss them.

Instead: “You’re right that committing directly to trunk with our current test coverage would be risky. That’s why we need to improve our tests first.”

Start with an Experiment

Don’t mandate TBD for the whole team immediately. Propose a time-boxed experiment:

The Proposal:

“Let’s try this for two weeks with a single small feature. We’ll track what goes well and what hurts. After two weeks, we’ll decide whether to continue, adjust, or stop.”

What to measure during the experiment:

How many times did we integrate?
How long did merges take?
Did we catch issues earlier or later than usual?
How did it feel compared to our normal process?

After two weeks: Hold a retrospective. Let the data and experience guide the decision.

Pair on the First Changes

Don’t expect everyone to adopt TBD simultaneously. Instead:

Identify one advocate who wants to try it
Pair with them on the first trunk-based changes
Let them experience the process firsthand
Have them pair with the next person

Knowledge transfer through pairing works better than documentation.

Address Code Review Concerns

“But we need code review!” Yes. TBD doesn’t eliminate code review.

Options that work:

Pair or mob programming (review happens in real-time)
Commit to trunk, review immediately after, fix forward if issues found
Very short-lived branches (hours, not days) with rapid review SLA
Pairing on code review and review change

The goal is fast feedback, not zero review.

Important

If you're using short-lived branches that must merge within a day or two, asynchronous code review becomes a bottleneck. Even "fast" async reviews with 2-4 hour turnaround create delays: the reviewer reads code, leaves comments, the author reads comments later, makes changes, and the cycle repeats. Each round trip adds hours or days. Instead, use **synchronous code reviews** where the reviewer and author work together in real-time (screen share, pair at a workstation, or mob). This eliminates communication delays through review comments. Questions get answered immediately, changes happen on the spot, and the code merges the same day. If your team can't commit to synchronous reviews or pair/mob programming, you'll struggle to maintain short branch lifetimes.

Handle Skeptics and Blockers

You’ll encounter people who don’t want to change. Don’t force it.

Instead:

Let them observe the experiment from the outside
Share metrics and outcomes transparently
Invite them to pair for one change
Let success speak louder than arguments

Some people need to see it working before they believe it.

Get Management Support

Managers often worry about:

Reduced control
Quality risks
Slower delivery (ironically)

Address these with data:

Show branch age metrics before/after
Track cycle time improvements
Demonstrate faster feedback on defects
Highlight reduced merge conflicts

Frame TBD as a risk reduction strategy, not a risky experiment.

Working in a Multi-Team Environment

Migrating to TBD gets complicated when you depend on teams still using long-lived branches. Here’s how to handle it.

The Core Problem

You want to integrate daily. Your dependency team integrates weekly or monthly. Their API changes surprise you during their big-bang merge.

You can’t force other teams to change. But you can protect yourself.

Strategy 1: Consumer-Driven Contract Tests

Define the contract you need from the upstream service and codify it in tests that run in your pipeline.

Example using Pact:

Consumer-driven contract test using Pact

// Your consumer test
const { pact } = require('@pact-foundation/pact');

describe('User Service Contract', () => {
  it('returns user profile by ID', async () => {
    await provider.addInteraction({
      state: 'user 123 exists',
      uponReceiving: 'a request for user 123',
      withRequest: {
        method: 'GET',
        path: '/users/123',
      },
      willRespondWith: {
        status: 200,
        body: {
          id: 123,
          name: 'Jane Doe',
          email: 'jane@example.com',
        },
      },
    });

    const user = await userService.getUser(123);
    expect(user.name).toBe('Jane Doe');
  });
});

This test runs against your expectations of the API, not the actual service. When the upstream team changes their API, your contract test fails before you integrate their changes.

Share the contract:

Publish your contract to a shared repository
Upstream team runs provider verification against your contract
If they break your contract, they know before merging

Strategy 2: API Versioning with Backwards Compatibility

If you control the shared service:

API versioning for backwards-compatible multi-team integration

// Support both old and new API versions
app.get('/api/v1/users/:id', handleV1Users);
app.get('/api/v2/users/:id', handleV2Users);

// Or use content negotiation
app.get('/api/users/:id', (req, res) => {
  const version = req.headers['api-version'] || 'v1';
  if (version === 'v2') {
    return handleV2Users(req, res);
  }
  return handleV1Users(req, res);
});

Migration path:

Deploy new version alongside old version
Update consumers one by one
After all consumers migrated, deprecate old version
Remove old version after deprecation period

Strategy 3: Strangler Fig Pattern

When you depend on a team that won’t change:

Create an anti-corruption layer between your code and theirs
Define your ideal interface in the adapter
Let the adapter handle their messy API

Strangler fig adapter to isolate a legacy dependency

// Your ideal interface
class UserRepository {
  async getUser(id) {
    // Your clean, typed interface
  }
}

// Adapter that deals with their mess
class LegacyUserServiceAdapter extends UserRepository {
  async getUser(id) {
    const response = await fetch(`https://legacy-service/users/${id}`);
    const messyData = await response.json();

    // Transform their format to yours
    return {
      id: messyData.user_id,
      name: `${messyData.first_name} ${messyData.last_name}`,
      email: messyData.email_address,
    };
  }
}

Now your code depends on your interface, not theirs. When they change, you only update the adapter.

Strategy 4: Feature Toggles for Cross-Team Coordination

When multiple teams need to coordinate a release:

Each team develops behind feature flags
Each team integrates to trunk continuously
Features remain disabled until coordination point
Enable flags in coordinated sequence

This decouples development velocity from release coordination.

When You Can’t Integrate with Dependencies

If upstream dependencies block you from integrating daily:

Short term:

Use contract tests to detect breaking changes early
Create adapters to isolate their changes
Document the integration pain as a business cost

Long term:

Advocate for those teams to adopt TBD
Share your success metrics
Offer to help them migrate

You can’t force other teams to change. But you can demonstrate a better way and make it easier for them to follow.

TBD in Regulated Environments

Regulated industries face legitimate compliance requirements: audit trails, change traceability, separation of duties, and documented approval processes. These requirements often lead teams to believe trunk-based development is incompatible with compliance. This is a misconception.

TBD is about integration frequency, not about eliminating controls. You can meet compliance requirements while still integrating at least daily.

The Compliance Concerns

Common regulatory requirements that seem to conflict with TBD:

Audit Trail and Traceability

Every change must be traceable to a requirement, ticket, or change request
Changes must be attributable to specific individuals
History of what changed, when, and why must be preserved

Separation of Duties

The person who writes code shouldn’t be the person who approves it
Changes must be reviewed before reaching production
No single person should have unchecked commit access

Change Control Process

Changes must follow a documented approval workflow
Risk assessment before deployment
Rollback capability for failed changes

Documentation Requirements

Changes must be documented before implementation
Testing evidence must be retained
Deployment procedures must be repeatable and auditable

Short-Lived Branches: The Compliant Path to TBD

Path 1 from this guide (short-lived branches) directly addresses compliance concerns while maintaining the benefits of TBD.

Short-lived branches mean:

Branches live for hours to 2 days maximum, not weeks or months
Integration happens at least daily
Pull requests are small, focused, and fast to review
Review and approval happen within the branch lifetime

This approach satisfies both regulatory requirements and continuous integration principles.

How Short-Lived Branches Meet Compliance Requirements

Audit Trail:

Every commit references the change ticket:

Commit message referencing compliance ticket

git commit -m "JIRA-1234: Add validation for SSN input

Implements requirement REQ-445 from Q4 compliance review.
Changes limited to user input validation layer."

Modern Git hosting platforms (GitHub, GitLab, Bitbucket) automatically track:

Who created the branch
Who committed each change
Who reviewed and approved
When it merged
Complete diff history

Separation of Duties:

Use pull request workflows:

Developer creates branch from trunk
Developer commits changes (same day)
Second person reviews and approves (within 24 hours)
Automated checks validate (tests, security scans, compliance checks)
Merge to trunk after approval
Automated deployment with gates

This provides stronger separation of duties than long-lived branches because:

Reviews happen while context is fresh
Reviewers can actually understand the small changeset
Automated checks enforce policies consistently

Change Control Process:

Branch protection rules enforce your process:

Example GitHub branch protection rules for trunk

# Example GitHub branch protection for trunk
required_reviews: 1
required_checks:
  - unit-tests
  - security-scan
  - compliance-validation
dismiss_stale_reviews: true
require_code_owner_review: true

This ensures:

No direct commits to trunk (except in documented break-glass scenarios)
Required approvals before merge
Automated validation gates
Audit log of every merge decision

Documentation Requirements:

Pull request templates enforce documentation:

Pull request template for compliance documentation

## Change Description
[Link to Jira ticket]

## Risk Assessment
- [ ] Low risk: Configuration only
- [ ] Medium risk: New functionality, backward compatible
- [ ] High risk: Database migration, breaking change

## Testing Evidence
- [ ] Unit tests added/updated
- [ ] Integration tests pass
- [ ] Manual testing completed (attach screenshots if UI change)
- [ ] Security scan passed

## Rollback Plan
[How to rollback if this causes issues in production]

What “Short-Lived” Means in Practice

Hours, not days:

Simple bug fixes: 2-4 hours
Small feature additions: 4-8 hours
Refactoring: 1-2 days

Maximum 2 days: If a branch can’t merge within 2 days, the work is too large. Decompose it further or use feature flags to integrate incomplete work safely.

Daily integration requirement: Even if the feature isn’t complete, integrate what you have:

Behind a feature flag if needed
As internal APIs not yet exposed
As tests and interfaces before implementation

Compliance-Friendly Tooling

Modern platforms provide compliance features built-in:

Git Hosting (GitHub, GitLab, Bitbucket):

Immutable audit logs
Branch protection rules
Required approvals
Status check enforcement
Signed commits for authenticity

Pipeline Platforms:

Deployment approval gates
Audit trails of every deployment
Environment-specific controls
Automated compliance checks

Feature Flag Systems:

Change deployment without code deployment
Gradual rollout controls
Instant rollback capability
Audit log of flag changes

Secrets Management:

Vault, AWS Secrets Manager, Azure Key Vault
Audit log of secret access
Rotation policies
Environment isolation

Example: Compliant Short-Lived Branch Workflow

Monday 9 AM: Developer creates branch feature/JIRA-1234-add-audit-logging from trunk.

Monday 9 AM - 2 PM: Developer implements audit logging for user authentication events. Commits reference JIRA-1234. Automated tests run on each commit.

Monday 2 PM: Developer opens pull request:

Title: “JIRA-1234: Add audit logging for authentication events”
Description includes risk assessment, testing evidence, rollback plan
Automated checks run: tests, security scan, compliance validation
Code owner automatically assigned for review

Monday 3 PM: Code owner reviews (5-10 minutes; change is small and focused). Suggests minor improvement.

Monday 3:30 PM: Developer addresses feedback, pushes update.

Monday 4 PM: Code owner approves. All automated checks pass. Developer merges to trunk.

Monday 4:05 PM: Pipeline deploys to staging automatically. Automated smoke tests pass.

Monday 4:30 PM: Deployment gate requires manual approval for production. Tech lead approves based on risk assessment.

Monday 4:35 PM: Automated deployment to production. Audit log captures: what deployed, who approved, when, what checks passed.

Total time: 7.5 hours from branch creation to production.

Full compliance maintained. Full audit trail captured. Daily integration achieved.

When Long-Lived Branches Hide Compliance Problems

Ironically, long-lived branches often create compliance risks:

Stale Reviews: Reviewing a 3-week-old, 2000-line pull request is performative, not effective. Reviewers rubber-stamp because they can’t actually understand the changes.

Integration Risk: Big-bang merges after weeks introduce unexpected behavior. The change that was reviewed isn’t the change that actually deployed (due to merge conflicts and integration issues).

Delayed Feedback: Problems discovered weeks after code was written are expensive to fix and hard to trace to requirements.

Audit Trail Gaps: Long-lived branches often have messy commit history, force pushes, and unclear attribution. The audit trail is polluted.

Regulatory Examples Where Short-Lived Branches Work

Financial Services (SOX, PCI-DSS):

Short-lived branches with required approvals
Automated security scanning on every PR
Separation of duties via required reviewers
Immutable audit logs in Git hosting platform
Feature flags for gradual rollout and instant rollback

Healthcare (HIPAA):

Pull request templates documenting PHI handling
Automated compliance checks for data access patterns
Required security review for any PHI-touching code
Audit logs of deployments
Environment isolation enforced by the pipeline

Government (FedRAMP, FISMA):

Branch protection requiring government code owner approval
Automated STIG compliance validation
Signed commits for authenticity
Deployment gates requiring authority to operate
Complete audit trail from commit to production

The Real Choice

The question isn’t “TBD or compliance.”

The real choice is: compliance theater with long-lived branches and risky big-bang merges, or actual compliance with short-lived branches and safe daily integration.

Short-lived branches provide:

Better audit trails (small, traceable changes)
Better separation of duties (reviewable changes)
Better change control (automated enforcement)
Lower risk (small, reversible changes)
Faster feedback (problems caught early)

That’s not just compatible with compliance. That’s better compliance.

What Will Hurt (At First)

When you migrate to TBD, you’ll expose every weakness you’ve been avoiding:

Slow tests
Unclear requirements
Fragile integration points
Architecture that resists small changes
Gaps in automated validation
Long manual processes in the value stream

This is not a regression. This is the point.

Problems you discover early are problems you can fix cheaply.

Common Pitfalls to Avoid

Teams migrating to TBD often make predictable mistakes. Here’s how to avoid them.

Pitfall 1: Treating TBD as Just a Branch Renaming Exercise

The mistake: Renaming develop to main and calling it TBD.

Why it fails: You’re still doing long-lived feature branches, just with different names. The fundamental integration problems remain.

What to do instead: Focus on integration frequency, not branch names. Measure time-to-merge, not what you call your branches.

Pitfall 2: Merging Daily Without Actually Integrating

The mistake: Committing to trunk every day, but your code doesn’t interact with anyone else’s work. Your tests don’t cover integration points.

Why it fails: You’re batching integration for later. When you finally connect your component to the rest of the system, you discover incompatibilities.

What to do instead: Ensure your tests exercise the boundaries between components. Use contract tests for service interfaces. Integrate at the interface level, not just at the source control level.

Pitfall 3: Skipping Test Investment

The mistake: “We’ll adopt TBD first, then improve our tests later.”

Why it fails: Without fast, reliable tests, frequent integration is terrifying. You’ll revert to long-lived branches because trunk feels unsafe.

What to do instead: Invest in test infrastructure first. Make your slowest tests faster. Fix flaky tests. Only then increase integration frequency.

Pitfall 4: Using Feature Flags as a Testing Escape Hatch

The mistake: “It’s fine to commit broken code as long as it’s behind a flag.”

Why it fails: Untested code is still untested, flag or no flag. When you enable the flag, you’ll discover the bugs you should have caught earlier.

What to do instead: Test both flag states. Flags hide features from users, not from your test suite.

Pitfall 5: Keeping Flags Forever

The mistake: Creating feature flags and never removing them. Your codebase becomes a maze of conditionals.

Why it fails: Every permanent flag doubles your testing surface area and increases complexity. Eventually, no one knows which flags do what.

What to do instead: Set a removal date when creating each flag. Track flags like technical debt. Remove them aggressively once features are stable.

Pitfall 6: Forcing TBD on an Unprepared Team

The mistake: Mandating TBD before the team understands why or how it works.

Why it fails: People resist changes they don’t understand or didn’t choose. They’ll find ways to work around it or sabotage it.

What to do instead: Start with volunteers. Run experiments. Share results. Let success create pull, not push.

Pitfall 7: Ignoring the Need for Small Changes

The mistake: Trying to do TBD while still working on features that take weeks to complete.

Why it fails: If your work naturally takes weeks, you can’t integrate daily. You’ll create work-in-progress commits that don’t add value.

What to do instead: Learn to decompose work into smaller, independently valuable increments. This is a skill that must be developed.

Pitfall 8: No Clear Definition of “Done”

The mistake: Integrating code that “works on my machine” without validating it in a production-like environment.

Why it fails: Integration bugs don’t surface until deployment. By then, you’ve integrated many other changes, making root cause analysis harder.

What to do instead: Define “integrated” as “deployed to a staging environment and validated.” Your pipeline should do this automatically.

Pitfall 9: Treating Trunk as Unstable

The mistake: “Trunk is where we experiment. Stable code goes in release branches.”

Why it fails: If trunk can’t be released at any time, you don’t have CI. You’ve just moved your integration problems to a different branch.

What to do instead: Trunk must always be production-ready. Use feature flags for incomplete work. Fix broken builds immediately.

Pitfall 10: Forgetting That TBD is a Means, Not an End

The mistake: Optimizing for trunk commits without improving cycle time, quality, or delivery speed.

Why it fails: TBD is valuable because it enables fast feedback and low-cost changes. If those aren’t improving, TBD isn’t working.

What to do instead: Measure outcomes, not activities. Track cycle time, defect rates, deployment frequency, and time to restore service.

When to Pause or Pivot

Sometimes TBD migration stalls or causes more problems than it solves. Here’s how to tell if you need to pause and what to do about it.

Signs You’re Not Ready Yet

Red flag 1: Your test suite takes hours to run If developers can’t get feedback in minutes, they can’t integrate frequently. Forcing TBD now will just slow everyone down.

What to do: Pause the TBD migration. Invest 2-4 weeks in making tests faster. Parallelize test execution. Remove or optimize the slowest tests. Resume TBD when feedback takes less than 10 minutes.

Red flag 2: More than half your tests are flaky If tests fail randomly, developers will ignore failures. You’ll integrate broken code without realizing it.

What to do: Stop adding new features. Spend one sprint fixing or deleting flaky tests. Track flakiness metrics. Only resume TBD when you trust your test results.

Red flag 3: Production incidents increased significantly If TBD caused a spike in production issues, something is wrong with your safety net.

What to do: Revert to short-lived branches (48-72 hours) temporarily. Analyze what’s escaping to production. Add tests or checks to catch those issues. Resume direct-to-trunk when the safety net is stronger.

Red flag 4: The team is in constant conflict If people are fighting about the process, frustrated daily, or actively working around it, you’ve lost the team.

What to do: Hold a retrospective. Listen to concerns without defending TBD. Identify the top 3 pain points. Address those first. Resume TBD migration when the team agrees to try again.

Signs You’re Doing It Wrong (But Can Fix It)

Yellow flag 1: Daily commits, but monthly integration You’re committing to trunk, but your code doesn’t connect to the rest of the system until the end.

What to fix: Focus on interface-level integration. Ensure your tests exercise boundaries between components. Use contract tests.

Yellow flag 2: Trunk is broken often If trunk is red more than 5% of the time, something’s wrong with your testing or commit discipline.

What to fix: Make “fix trunk immediately” the top priority. Consider requiring local tests to pass before pushing. Add pre-commit hooks if needed.

Yellow flag 3: Feature flags piling up If you have more than 5 active flags, you’re not cleaning up after yourself.

What to fix: Set a team rule: “For every new flag created, remove an old one.” Dedicate time each sprint to flag cleanup.

How to Pause Gracefully

If you need to pause:

Communicate clearly: “We’re pausing TBD migration for two weeks to fix our test infrastructure. This isn’t abandoning the goal.”
Set a specific resumption date: Don’t let “pause” become “quit.” Schedule a date to revisit.
Fix the blockers: Use the pause to address the specific problems preventing success.
Retrospect and adjust: When you resume, what will you do differently?

Pausing isn’t failure. Pausing to fix the foundation is smart.

What “Good” Looks Like

You know TBD is working when:

Branches live for hours, not days
Developers collaborate early instead of merging late
Product participates in defining behaviors, not just writing stories
Tests run fast enough to integrate frequently
Deployments are boring
You can fix production issues with the same process you use for normal work

When your deployment process enables emergency fixes without special exceptions, you’ve reached the real payoff: lower cost of change, which makes everything else faster, safer, and more sustainable.

Concrete Examples and Scenarios

Theory is useful. Examples make it real. Here are practical scenarios showing how to apply TBD principles.

Scenario 1: Breaking Down a Large Feature

Problem: You need to build a user notification system with email, SMS, and in-app notifications. Estimated: 3 weeks of work.

Old approach (GitFlow): Create a feature/notifications branch. Work for three weeks. Submit a massive pull request. Spend days in code review and merge conflicts.

TBD approach:

First commit: Define notification interface, commit to trunk

Day 1: NotificationService contract

// notifications/NotificationService.js
// Contract: all implementations must provide send(userId, message)
// message shape: { title, body, priority } where priority is 'low', 'normal', or 'high'

class NotificationService {
  async send(userId, message) {
    throw new Error('Not implemented');
  }
}

This compiles but doesn’t do anything yet. That’s fine.

Next commit: Add in-memory implementation for testing

Day 2: InMemoryNotificationService

class InMemoryNotificationService extends NotificationService {
  constructor() {
    super();
    this.notifications = [];
  }

  async send(userId, message) {
    this.notifications.push(message);
  }
}

Now other teams can use the interface in their code and tests.

Then: Implement email notifications behind a feature flag

Days 3-5: EmailNotificationService behind a flag

class EmailNotificationService extends NotificationService {
  async send(userId, message) {
    if (!features.emailNotifications) {
      return; // No-op when disabled
    }
    // Real email sending implementation
  }
}

Commit daily. Deploy. Flag is off in production.

Continue iterating:

Add SMS notifications (same pattern: interface, implementation, feature flag)
Enable email notifications for internal users only
Add in-app notifications
Roll out email and SMS to all users
Remove flags for email once stable

Result: Integrated 12-15 times instead of once. Each integration was small and low-risk.

Scenario 2: Database Schema Change

Problem: You need to split the users.name column into first_name and last_name.

Old approach: Update schema, update all code, deploy everything at once. Hope nothing breaks.

TBD approach (expand-contract pattern):

Step 1: Expand Add new columns without removing the old one:

Step 1: add new columns alongside the old one

ALTER TABLE users ADD COLUMN first_name VARCHAR(255);
ALTER TABLE users ADD COLUMN last_name VARCHAR(255);

Commit and deploy. Application still uses name column. No breaking change.

Step 2: Dual writes Update write path to populate both old and new columns:

Step 2: write to both old and new columns

async function createUser(name) {
  const [firstName, lastName] = name.split(' ');
  await db.query(
    'INSERT INTO users (name, first_name, last_name) VALUES (?, ?, ?)',
    [name, firstName, lastName]
  );
}

Commit and deploy. Now new data populates both formats.

Step 3: Backfill Migrate existing data in the background:

Step 3: backfill existing rows

async function backfillNames() {
  const users = await db.query('SELECT id, name FROM users WHERE first_name IS NULL');
  for (const user of users) {
    const [firstName, lastName] = user.name.split(' ');
    await db.query(
      'UPDATE users SET first_name = ?, last_name = ? WHERE id = ?',
      [firstName, lastName, user.id]
    );
  }
}

Run this as a background job. Commit and deploy.

Step 4: Read from new columns Update read path behind a feature flag:

Step 4: read from new columns behind a flag

async function getUser(id) {
  const user = await db.query('SELECT * FROM users WHERE id = ?', [id]);
  if (features.useNewNameColumns) {
    return {
      firstName: user.first_name,
      lastName: user.last_name,
    };
  }
  return { name: user.name };
}

Deploy and gradually enable the flag.

Step 5: Contract Once all reads use new columns and flag is removed:

Step 5: drop the old column

ALTER TABLE users DROP COLUMN name;

Result: Five deployments instead of one big-bang change. Each step was reversible. Zero downtime.

Scenario 3: Refactoring Without Breaking the World

Problem: Your authentication code is a mess. You want to refactor it without breaking production.

TBD approach:

Characterization tests Write tests that capture current behavior (warts and all):

Characterization tests for existing auth behavior

describe('Current auth behavior', () => {
  it('accepts password with special characters', () => {
    // Document what currently happens
  });

  it('handles malformed tokens by returning 401', () => {
    // Capture edge case behavior
  });
});

These tests document how the system actually works. Commit.

Strangler fig pattern Create new implementation alongside old one:

Strangler fig - new implementation alongside old

class LegacyAuthService {
  // Existing messy code (don't touch it)
}

class ModernAuthService {
  // Clean implementation
}

class AuthServiceRouter {
  constructor(legacy, modern) {
    this.legacy = legacy;
    this.modern = modern;
  }

  async authenticate(credentials) {
    if (features.modernAuth) {
      return this.modern.authenticate(credentials);
    }
    return this.legacy.authenticate(credentials);
  }
}

Commit with flag off. Old behavior unchanged.

Migrate piece by piece Enable modern auth for one endpoint at a time:

Enable modern auth per endpoint

if (features.modernAuth && endpoint === '/api/users') {
  return modernAuth.authenticate(credentials);
}

Commit daily. Monitor each endpoint.

Remove old code Once all endpoints use modern auth and it has been stable:

Remove the legacy implementation

class AuthService {
  async authenticate(credentials) {
    // Just the modern implementation
  }
}

Delete the legacy code entirely.

Result: Continuous refactoring without a “big rewrite” branch. Production was never at risk.

Scenario 4: Working with External API Changes

Problem: A third-party API you depend on is changing their response format next month.

TBD approach:

Adapter pattern Create an adapter that normalizes both old and new formats:

Adapter handling both old and new API formats

class PaymentAPIAdapter {
  async getPaymentStatus(orderId) {
    const response = await fetch(`https://api.payments.com/orders/${orderId}`);
    const data = await response.json();

    // Handle both old and new format
    if (data.payment_status) {
      // Old format
      return {
        status: data.payment_status,
        amount: data.total_amount,
      };
    } else {
      // New format
      return {
        status: data.status.payment,
        amount: data.amounts.total,
      };
    }
  }
}

Commit. Your code now works with both formats.

After the API migration: Simplify adapter to only handle new format:

Simplified adapter for new format only

async getPaymentStatus(orderId) {
  const response = await fetch(`https://api.payments.com/orders/${orderId}`);
  const data = await response.json();
  return {
    status: data.status.payment,
    amount: data.amounts.total,
  };
}

Result: No coupling between your deployment schedule and the external API migration. Zero downtime.

References and Further Reading

Trunk-Based Development

Core Resources:

trunkbaseddevelopment.com - Comprehensive guide by Paul Hammant
“Continuous Delivery” by Jez Humble and David Farley - Foundational text on CD practices
Martin Fowler on Feature Toggles - Deep dive into feature flag patterns

Testing Practices

ATDD and BDD:

“Specification by Example” by Gojko Adzic - Collaborative test writing
“The Cucumber Book” by Matt Wynne and Aslak Hellesoy - Practical BDD guide
Three Amigos sessions - Collaborative requirements discovery

Test-Driven Development:

“Test-Driven Development: By Example” by Kent Beck - TDD fundamentals
“Growing Object-Oriented Software, Guided by Tests” by Steve Freeman and Nat Pryce - TDD at scale

Contract Testing:

Pact Documentation - Consumer-driven contract testing
Spring Cloud Contract - For JVM ecosystems

Patterns for Incremental Change

Database Migrations:

“Refactoring Databases” by Scott Ambler and Pramod Sadalage - Expand-contract pattern
Evolutionary Database Design - Martin Fowler

Legacy Code:

“Working Effectively with Legacy Code” by Michael Feathers - Characterization tests and strangler patterns
Strangler Fig Application - Incremental rewrites

Team Dynamics and Change Management

“Accelerate” by Nicole Forsgren, Jez Humble, and Gene Kim - Data on what drives software delivery performance
“Team Topologies” by Matthew Skelton and Manuel Pais - Organizing teams for fast flow
State of DevOps Reports - Annual research on delivery practices

Continuous Integration

“Continuous Integration: Improving Software Quality and Reducing Risk” by Paul Duvall
ThoughtWorks on CI - Foundational practices
Continuous Delivery Foundation - Community and standards

Communities and Discussions

DevOps subreddit - Practitioner discussions
Continuous Delivery Slack - Active community
Software Engineering Stack Exchange - Q&A on practices

Final Thought

Migrating from GitFlow to TBD isn’t a matter of changing your branching strategy. It’s a matter of changing your thinking.

Stop optimizing for isolation. Start optimizing for feedback.

Small, tested, integrated changes, delivered continuously, will always outperform big batches delivered occasionally.

That’s why teams migrate to TBD. Not because it’s trendy, but because it’s the only path to real continuous integration and continuous delivery.

2.2 - Testing Fundamentals

Build a test architecture that gives your pipeline the confidence to deploy any change, even when dependencies outside your control are unavailable.

Phase 1 - Foundations

Before you can trust your pipeline, you need a test suite that is fast, deterministic, and catches real defects. But a collection of tests is not enough. You need a test architecture - a deliberate structure where different types of tests work together to give you the confidence to deploy every change, regardless of whether external systems are up, slow, or behaving unexpectedly.

Why Testing Is a Foundation

Continuous delivery requires that trunk always be releasable. The only way to know trunk is releasable is to test it - automatically, on every change. Without a reliable test suite, daily integration is just daily risk.

In many organizations, testing is the single biggest obstacle to CD adoption. Not because teams lack tests, but because the tests they have are slow, flaky, poorly structured, and - most critically - unable to give the pipeline a reliable answer to the question: is this change safe to deploy?

Testing Goals for CD

Your test suite must meet these criteria before it can support continuous delivery:

Goal	Target	Why
Fast	Full suite completes in under 10 minutes	Developers need feedback before context-switching
Deterministic	Same code always produces the same test result	Flaky tests destroy trust and get ignored
Catches real bugs	Tests fail when behavior is wrong, not when implementation changes	Brittle tests create noise, not signal
Independent of external systems	Pipeline can determine deployability without any dependency being available	Your ability to deploy cannot be held hostage by someone else’s outage

If your test suite does not meet these criteria today, improving it is your highest-priority foundation work.

Beyond the Test Pyramid

The test pyramid’s core insight is sound: push testing as low as possible. But for CD, the question is not “do we have the right pyramid shape?” The question is: can our pipeline determine that a change is safe to deploy without depending on any system we do not control?

Teams that answer “yes” design a test architecture where fast, deterministic tests catch the vast majority of defects, contract tests verify that test doubles match reality, and a small number of non-deterministic tests run post-deployment as monitoring. For the full breakdown of this architecture, see the Testing section.

The anti-pattern: the ice cream cone

Most teams that struggle with CD have an inverted test distribution - too many slow, expensive end-to-end tests and too few fast, focused tests.

The ice cream cone anti-pattern: an inverted test distribution where most testing effort goes to manual and end-to-end tests at the top, with too few fast unit tests at the bottom

The ice cream cone makes CD impossible. Manual testing gates block every release. End-to-end tests take hours, fail randomly, and depend on external systems being healthy. The pipeline cannot give a fast, reliable answer about deployability, so deployments become high-ceremony events.

What to Test - and What Not To

Before diving into the architecture, internalize the mindset that makes it work. The test architecture below is not just a structure to follow - it flows from a few principles about what testing should focus on and what it should ignore.

Interfaces are the most important thing to test

Most integration failures originate at interfaces - the boundaries where your system talks to other systems. These boundaries are the highest-risk areas in your codebase, and they deserve the most testing attention. But testing interfaces does not require integrating with the real system on the other side.

When you test an interface you consume, the question is: “Can I understand the response and act accordingly?” If you send a request for a user’s information, you do not test that you get that specific user back. You test that you receive and understand the properties you need - that your code can parse the response structure and make correct decisions based on it. This distinction matters because it keeps your tests deterministic and focused on what you control.

Use contract mocks, virtual services, or any test double that faithfully represents the interface contract. The test validates your side of the conversation, not theirs.

Frontend and backend follow the same pattern

Both frontend and backend applications provide interfaces to consumers and consume interfaces from providers. The only difference is the consumer: a frontend provides an interface for humans, while a backend provides one for machines. The testing strategy is the same.

For a frontend:

Validate the interface you provide. The UI contains the components it should and they appear correctly. This is the equivalent of verifying your API returns the right response structure.
Test behavior isolated from presentation. Use your unit test framework to test the logic that UI controls trigger, separated from the rendering layer. This gives you the same speed and control you get from testing backend logic in isolation.
Verify that controls trigger the right logic. Confirm that user actions invoke the correct behavior, without needing a running backend or browser-based E2E test.

This approach gives you targeted testing with far more control. Testing exception flows - what happens when a service returns an error, when a network request times out, when data is malformed - becomes straightforward instead of requiring elaborate E2E setups that are hard to make fail on demand.

If you cannot fix it, do not test for it

This is the principle that most teams get wrong. You should never test the behavior of services you consume. Testing their behavior is the responsibility of the team that builds them. If their service returns incorrect data, you cannot fix that - so testing for it is waste.

What you should test is how your system responds when a consumed service is unstable or unavailable. Can you degrade gracefully? Do you return a meaningful error? Do you retry appropriately? These are behaviors you own and can fix, so they belong in your test suite.

This principle directly enables the test architecture below. When you stop testing things you cannot fix, you stop depending on external systems in your pipeline. Your tests become faster, more deterministic, and more focused on the code your team actually ships.

Test Architecture for the CD Pipeline

A test architecture is the deliberate structure of how different test types work together across your pipeline to give you deployment confidence. The Testing section provides the full architecture reference, including five layers of tests (unit, integration, functional, contract, and end-to-end), how they map to pipeline stages, pre-merge vs post-merge strategies, a decision matrix for choosing test types, and best practices.

The key principle: everything that blocks deployment must be deterministic and under your control. Everything that involves external systems runs asynchronously or post-deployment. This gives you the independence to deploy any time, regardless of the state of the world around you.

Starting Without Full Coverage

Teams often delay adopting CI because their existing code lacks tests. This is backwards. You do not need tests for existing code to begin. You need one rule applied without exception:

Every new change gets a test. We will not go lower than the current level of code coverage.

Record your current coverage percentage as a baseline. Configure CI to fail if coverage drops below that number. This does not mean the baseline is good enough - it means the trend only moves in one direction. Every bug fix, every new feature, and every refactoring adds tests. Over time, coverage grows organically in the areas that matter most: the code that is actively changing.

Do not attempt to retrofit tests across the entire codebase before starting CI. That approach takes months, delivers no incremental value, and often produces low-quality tests written by developers who are testing code they did not write and do not fully understand.

Test Quality Over Coverage Percentage

Code coverage tells you which lines executed during tests. It does not tell you whether the tests verified anything meaningful. A test suite with 90% coverage and no assertions has high coverage and zero value.

Better questions than “what is our coverage percentage?”:

When a test fails, does it point directly to the defect?
When we refactor, do tests break because behavior changed or because implementation details shifted?
Do our tests catch the bugs that actually reach production?
Can a developer trust a green build enough to deploy immediately?

Why coverage mandates are harmful. When teams are required to hit a coverage target, they write tests to satisfy the metric rather than to verify behavior. This produces tests that exercise code paths without asserting outcomes, tests that mirror implementation rather than specify behavior, and tests that inflate the number without improving confidence. The metric goes up while the defect escape rate stays the same. Worse, meaningless tests add maintenance cost and slow down the suite.

Instead of mandating a coverage number, set a floor (as described above) and focus team attention on test quality: mutation testing scores, defect escape rates, and whether developers actually trust the suite enough to deploy on green.

Quick-Start Action Plan

If your test suite is not yet ready to support CD, use this focused action plan to make immediate progress.

Audit your current test suite

Assess where you stand before making changes.

Actions:

Run your full test suite 3 times. Note total duration and any tests that pass intermittently (flaky tests).
Count tests by type: unit, integration, functional, end-to-end.
Identify tests that require external dependencies (databases, APIs, file systems) to run.
Record your baseline: total test count, pass rate, duration, flaky test count.
Map each test type to a pipeline stage. Which tests gate deployment? Which run asynchronously? Which tests couple your deployment to external systems?

Output: A clear picture of your test distribution and the specific problems to address.

Fix or remove flaky tests

Flaky tests are worse than no tests. They train developers to ignore failures, which means real failures also get ignored.

Actions:

Quarantine all flaky tests immediately. Move them to a separate suite that does not block the build.
For each quarantined test, decide: fix it (if the behavior it tests matters) or delete it (if it does not).
Common causes of flakiness: timing dependencies, shared mutable state, reliance on external services, test order dependencies.
Target: zero flaky tests in your main test suite.

Decouple your pipeline from external dependencies

This is the highest-leverage change for CD. Identify every test that calls a real external service and replace that dependency with a test double.

Actions:

List every external service your tests depend on: databases, APIs, message queues, file storage, third-party services.
For each dependency, decide the right test double approach:
- In-memory fakes for databases (e.g., SQLite, H2, testcontainers with local instances).
- HTTP stubs for external APIs (e.g., WireMock, nock, MSW).
- Fakes for message queues, email services, and other infrastructure.
Replace the dependencies in your unit, integration, and functional tests.
Move the original tests that hit real services into a separate suite - these become your starting contract tests or E2E smoke tests.

Output: A test suite where everything that blocks the build is deterministic and runs without network access to external systems.

Add functional tests for critical paths

If you don’t have functional tests (component tests) that exercise your whole service in isolation, start with the most critical paths.

Actions:

Identify the 3-5 most critical user journeys or API endpoints in your application.
Write a functional test for each: boot the application, stub external dependencies, send a real request or simulate a real user action, verify the response.
Each functional test should prove that the feature works correctly assuming external dependencies behave as expected (which your test doubles encode).
Run these in CI on every commit.

Set up contract tests for your most important dependency

Pick the external dependency that changes most frequently or has caused the most production issues. Set up a contract test for it.

Actions:

Write a contract test that validates the response structure (types, required fields, status codes) of the dependency’s API.
Run it on a schedule (e.g., every hour or daily), not on every commit.
When it fails, update your test doubles to match the new reality and re-verify your functional tests.
If the dependency is owned by another team in your organization, explore consumer-driven contracts with a tool like Pact.

Test-Driven Development (TDD)

TDD is the practice of writing the test before the code. It is the most effective way to build a reliable test suite because it ensures every piece of behavior has a corresponding test.

The TDD cycle:

Red: Write a failing test that describes the behavior you want.
Green: Write the minimum code to make the test pass.
Refactor: Improve the code without changing the behavior. The test ensures you do not break anything.

Why TDD supports CD:

Every change is automatically covered by a test
The test suite grows proportionally with the codebase
Tests describe behavior, not implementation, making them more resilient to refactoring
Developers get immediate feedback on whether their change works

TDD is not mandatory for CD, but teams that practice TDD consistently have significantly faster and more reliable test suites.

Getting started with TDD

If your team is new to TDD, start small:

Pick one new feature or bug fix this week.
Write the test first, watch it fail.
Write the code to make it pass.
Refactor.
Repeat for the next change.

Do not try to retroactively TDD your entire codebase. Apply TDD to new code and to any code you modify.

Using Tests to Find and Eliminate Defect Sources

A test suite that catches bugs is good. A test suite that helps you stop producing those bugs is transformational. Every test failure is evidence of a defect, and every defect has a source. If you treat test failures only as things to fix, you are doing rework. If you treat them as diagnostic data about where your process breaks down, you can make systemic changes that prevent entire categories of defects from occurring.

This is the difference between a team that writes more tests to catch more bugs and a team that changes how it works so that fewer bugs are created in the first place.

Two questions sharpen this thinking:

What is the earliest point we can detect this defect? The later a defect is found, the more expensive it is to fix. A requirements defect caught during example mapping costs minutes. The same defect caught in production costs days of incident response, rollback, and rework.
Can AI help us detect it earlier? AI-assisted tools can now surface defects at stages where only human review was previously possible, shifting detection left without adding manual effort.

Trace every defect to its origin

When a test catches a defect - or worse, when a defect escapes to production - ask: where was this defect introduced, and what would have prevented it from being created?

Defects do not originate randomly. They cluster around specific causes. The CD Defect Detection and Remediation Catalog documents over 30 defect types across eight categories, with detection methods, AI opportunities, and systemic fixes for each. The examples below illustrate the pattern for the defect sources most commonly encountered during a CD migration.

Requirements


Example defects	Building the right thing wrong, or the wrong thing right
Earliest detection	Discovery - before coding begins, during story refinement or example mapping
Traditional detection	UX analytics, task completion tracking, A/B testing (all post-deployment)
AI-assisted detection	LLM review of acceptance criteria to flag ambiguity, missing edge cases, or contradictions before development begins. AI-generated test scenarios from user stories to validate completeness.
Systemic fix	Acceptance criteria as user outcomes, not implementation tasks. Three Amigos sessions before work starts. Example mapping to surface edge cases before coding begins.

Missing domain knowledge


Example defects	Business rules encoded incorrectly, implicit assumptions, tribal knowledge loss
Earliest detection	During coding - when the developer writes the logic
Traditional detection	Magic number detection, knowledge-concentration metrics, bus factor analysis from git history
AI-assisted detection	Identify undocumented business rules, missing context that a new developer would hit, and knowledge gaps. Compare implementation against domain documentation or specification files.
Systemic fix	Embed domain rules in code using ubiquitous language (DDD). Pair programming to spread knowledge. Living documentation generated from code. Rotate ownership regularly.

Integration boundaries


Example defects	Interface mismatches, wrong assumptions about upstream behavior, race conditions at service boundaries
Earliest detection	During design - when defining the interface contract
Traditional detection	Consumer-driven contract tests, schema validation, chaos engineering, fault injection
AI-assisted detection	Review code and documentation to identify undocumented behavioral assumptions (timeouts, retries, error semantics). Predict which consumers break from API changes based on usage patterns when formal contracts do not exist.
Systemic fix	Contract tests mandatory per boundary. API-first design. Document behavioral contracts, not just data schemas. Circuit breakers as default at every external boundary.

Untested edge cases


Example defects	Null handling, boundary values, error paths
Earliest detection	Pre-commit - through null-safe type systems and static analysis in the IDE
Traditional detection	Mutation testing, branch coverage thresholds, property-based testing
AI-assisted detection	Analyze code paths and generate tests for untested boundaries, null paths, and error conditions the developer did not consider. Triage surviving mutants by risk.
Systemic fix	Require a test for every bug fix. Adopt property-based testing for logic with many input permutations. Boundary value analysis as a standard practice. Enforce null-safe type systems.

Unintended side effects


Example defects	Change to module A breaks module B, unexpected feature interactions
Earliest detection	At commit time - when CI runs the full test suite
Traditional detection	Mutation testing, change impact analysis, feature flag interaction matrix
AI-assisted detection	Reason about semantic change impact beyond syntactic dependencies. Map a diff to affected modules and flag untested downstream paths before the commit reaches CI.
Systemic fix	Small focused commits. Trunk-based development (integrate daily so side effects surface immediately). Feature flags with controlled rollout. Modular design with clear boundaries.

Accumulated complexity


Example defects	Defects cluster in the most complex, most-changed files
Earliest detection	Continuously - through static analysis in the IDE and CI
Traditional detection	Complexity trends, duplication scoring, dependency cycle detection
AI-assisted detection	Identify architectural drift, abstraction decay, and calcified workarounds that static analysis misses. Cross-reference change frequency with defect history to prioritize refactoring.
Systemic fix	Refactoring as part of every story, not deferred to a “tech debt sprint.” Dedicated complexity budget. Treat rising complexity as a leading indicator.

Process and deployment


Example defects	Long-lived branches causing merge conflicts, manual pipeline steps introducing human error, excessive batching increasing blast radius, weak rollback causing extended outages
Earliest detection	Pre-commit for branch age; CI for pipeline and batching issues
Traditional detection	Branch age alerts, merge conflict frequency, pipeline audit for manual gates, changes-per-deploy metrics, rollback testing
AI-assisted detection	Automated risk scoring from change diffs and deployment history. Blast radius analysis. Auto-approve low-risk changes and flag high-risk with evidence, replacing manual change advisory boards.
Systemic fix	Trunk-based development. Automate every step from commit to production. Single-piece flow with feature flags. Blue/green or canary as default deployment strategy.

Data and state


Example defects	Null pointer exceptions, schema migration failures, cache invalidation errors, concurrency issues
Earliest detection	Pre-commit for null safety; CI for schema compatibility
Traditional detection	Null safety static analysis, schema compatibility checks, migration dry-runs, thread sanitizers
AI-assisted detection	Predict downstream impact of schema changes by understanding how consumers actually use data. Flag code where optional fields are used without null checks, even in non-strict languages.
Systemic fix	Enforce null-safe types. Expand-then-contract for all schema changes. Design for idempotency. Short TTLs over complex cache invalidation.

For the complete catalog covering all defect categories - including product and discovery, dependency and infrastructure, testing and observability gaps, and more - see the CD Defect Detection and Remediation Catalog.

Build a defect feedback loop

Knowing the categories is not enough. You need a process that systematically connects test failures to root causes and root causes to systemic fixes.

Step 1: Classify every defect. When a test fails or a bug is reported, tag it with its origin category from the table above. This takes seconds and builds a dataset over time.

Step 2: Look for patterns. Monthly (or during retrospectives), review the defect classifications. Which categories appear most often? That is where your process is weakest.

Step 3: Apply the systemic fix, not just the local fix. When you fix a bug, also ask: what systemic change would prevent this entire category of bug? If most defects come from integration boundaries, the fix is not “write more integration tests” - it is “make contract tests mandatory for every new boundary.” If most defects come from untested edge cases, the fix is not “increase code coverage” - it is “adopt property-based testing as a standard practice.”

Step 4: Measure whether the fix works. Track defect counts by category over time. If you applied a systemic fix for integration boundary defects and the count does not drop, the fix is not working and you need a different approach.

The test-for-every-bug-fix rule

One of the most effective systemic practices: every bug fix must include a test that reproduces the bug before the fix and passes after. This is non-negotiable for CD because:

It proves the fix actually addresses the defect (not just the symptom).
It prevents the same defect from recurring.
It builds test coverage exactly where the codebase is weakest - the places where bugs actually occur.
Over time, it shifts your test suite from “tests we thought to write” to “tests that cover real failure modes.”

Advanced detection techniques

As your test architecture matures, add techniques that find defects humans overlook:

Technique	What It Finds	When to Adopt
Mutation testing (Stryker, PIT)	Tests that pass but do not actually verify behavior - your test suite’s blind spots	When basic coverage is in place but defect escape rate is not dropping
Property-based testing	Edge cases and boundary conditions across large input spaces that example-based tests miss	When defects cluster around unexpected input combinations
Chaos engineering	Failure modes in distributed systems - what happens when a dependency is slow, returns errors, or disappears	When you have functional tests and contract tests in place and need confidence in failure handling
Static analysis and linting	Null safety violations, type errors, security vulnerabilities, dead code	From day one - these are cheap and fast

For more examples of mapping defect origins to detection methods and systemic corrections, see the CD Defect Detection and Remediation Catalog.

Measuring Success

Metric	Target	Why It Matters
Deterministic suite duration	< 10 minutes	Enables fast feedback loops
Flaky test count	0 in pipeline-gating suite	Maintains trust in test results
External dependencies in gating tests	0	Ensures deployment independence
Test coverage trend	Increasing	Confirms new code is being tested
Defect escape rate	Decreasing	Confirms tests catch real bugs
Contract test freshness	All passing within last 24 hours	Confirms test doubles are current

Next Step

With a reliable test suite in place, automate your build process so that building, testing, and packaging happens with a single command. Continue to Build Automation.

Content contributed by Dojo Consortium, licensed under CC BY 4.0. Additional concepts drawn from Ham Vocke, The Practical Test Pyramid, and Toby Clemson, Testing Strategies in a Microservice Architecture.

Flaky Tests - Symptom of non-deterministic tests that destroy pipeline trust
High Coverage, Ineffective Tests - Symptom where coverage metrics mask poor test quality
Refactoring Breaks Tests - Symptom of white-box tests that assert on implementation details
Slow Test Suites - Symptom caused by an inverted test pyramid or missing test doubles
Environment-Dependent Failures - Symptom of tests coupled to external systems
Inverted Test Pyramid - Anti-pattern where too many slow E2E tests replace fast unit tests
Pressure to Skip Testing - Anti-pattern where testing is treated as optional under deadline pressure

2.3 - Build Automation

Automate your build process so a single command builds, tests, and packages your application.

Phase 1 - Foundations

Build automation is the mechanism that turns trunk-based development and testing into a continuous integration loop. If you cannot build, test, and package your application with a single command, you cannot automate your pipeline. This page covers the practices that make your build reproducible, fast, and trustworthy.

What Build Automation Means

Build automation is the practice of scripting every step required to go from source code to a deployable artifact. A single command - or a single CI trigger - should execute the entire sequence:

Compile the source code (if applicable)
Run all automated tests
Package the application into a deployable artifact (container image, binary, archive)
Report the result (pass or fail, with details)

No manual steps. No “run this script, then do that.” No tribal knowledge about which flags to set or which order to run things. One command, every time, same result.

The Litmus Test

Ask yourself: “Can a new team member clone the repository and produce a deployable artifact with a single command within 15 minutes?”

If the answer is no, your build is not fully automated.

Why Build Automation Matters for CD

CD Requirement	How Build Automation Supports It
Reproducibility	The same commit always produces the same artifact, on any machine
Speed	Automated builds can be optimized, cached, and parallelized
Confidence	If the build passes, the artifact is trustworthy
Developer experience	Developers run the same build locally that CI runs, eliminating “works on my machine”
Pipeline foundation	The CD pipeline is just the build running automatically on every commit

Without build automation, every other practice in this guide breaks down. You cannot have continuous integration if the build requires manual intervention. You cannot have a deterministic pipeline if the build produces different results depending on who runs it.

Key Practices

1. Version-Controlled Build Scripts

Your build configuration lives in the same repository as your code. It is versioned, reviewed, and tested alongside the application.

What belongs in version control:

Build scripts (Makefile, build.gradle, package.json scripts, Dockerfile)
Dependency manifests (requirements.txt, go.mod, pom.xml, package-lock.json)
Pipeline definitions (.github/workflows, .gitlab-ci.yml, Jenkinsfile)
Environment setup scripts (docker-compose.yml for local development)

What does not belong in version control:

Secrets and credentials (use secret management tools)
Environment-specific configuration values (use environment variables or config management)
Generated artifacts (build outputs, compiled binaries)

Anti-pattern: Build instructions that exist only in a wiki, a Confluence page, or one developer’s head. If the build steps are not in the repository, they will drift from reality.

2. Dependency Management

All dependencies must be declared explicitly and resolved deterministically.

Practices:

Lock files: Use lock files (package-lock.json, Pipfile.lock, go.sum) to pin exact dependency versions. Check lock files into version control.
Reproducible resolution: Running the dependency install twice should produce identical results.
No undeclared dependencies: Your build should not rely on tools or libraries that happen to be installed on the build machine. If you need it, declare it.
Dependency scanning: Automate vulnerability scanning of dependencies as part of the build. Do not wait for a separate security review.

Anti-pattern: “It builds on Jenkins because Jenkins has Java 11 installed, but the Dockerfile uses Java 17.” The build must declare and control its own runtime.

3. Build Caching

Fast builds keep developers in flow. Caching is the primary mechanism for build speed.

What to cache:

Dependencies: Download once, reuse across builds. Most build tools (npm, Maven, Gradle, pip) support a local cache.
Compilation outputs: Incremental compilation avoids rebuilding unchanged modules.
Docker layers: Structure your Dockerfile so that rarely-changing layers (OS, dependencies) are cached and only the application code layer is rebuilt.
Test fixtures: Prebuilt test data or container images used by tests.

Guidelines:

Cache aggressively for local development and CI
Invalidate caches when dependencies or build configuration change
Do not cache test results - tests must always run

4. Single Build Script Entry Point

Developers, CI, and CD should all use the same entry point.

Makefile as single build entry point

# Example: Makefile as the single entry point

.PHONY: build test package all

all: build test package

build:
	./gradlew compileJava

test:
	./gradlew test

package:
	docker build -t myapp:$(GIT_SHA) .

clean:
	./gradlew clean
	docker rmi myapp:$(GIT_SHA) || true

The CI server runs make all. A developer runs make all. The result is the same. There is no separate “CI build script” that diverges from what developers run locally.

5. Artifact Versioning

Every build artifact must be traceable to the exact commit that produced it.

Practices:

Tag artifacts with the Git commit SHA or a build number derived from it
Store build metadata (commit, branch, timestamp, builder) in the artifact or alongside it
Never overwrite an existing artifact - if the version exists, the artifact is immutable

This becomes critical in Phase 2 when you establish immutable artifact practices.

CI Server Setup Basics

The CI server is the mechanism that runs your build automatically. In Phase 1, the setup is straightforward:

What the CI Server Does

Watches the trunk for new commits
Runs the build (the same command a developer would run locally)
Reports the result (pass/fail, test results, build duration)
Notifies the team if the build fails

Minimum CI Configuration

Regardless of which CI tool you use (GitHub Actions, GitLab CI, Jenkins, CircleCI), the configuration follows the same pattern:

Conceptual minimum CI configuration

# Conceptual CI configuration (adapt to your tool)
trigger:
  branch: main  # Run on every commit to trunk

steps:
  - checkout: source code
  - install: dependencies
  - run: build
  - run: tests
  - run: package
  - report: test results and build status

CI Principles for Phase 1

Run on every commit. Not nightly, not weekly, not “when someone remembers.” Every commit to trunk triggers a build.
Keep the build green. A failing build is the team’s top priority. Work stops until trunk is green again. (See Working Agreements.)
Run the same build everywhere. The CI server runs the same script as local development. No CI-only steps that developers cannot reproduce.
Fail fast. Run the fastest checks first (compilation, unit tests) before the slower ones (integration tests, packaging).

Build Time Targets

Build speed directly affects developer productivity and integration frequency. If the build takes 30 minutes, developers will not integrate multiple times per day.

Build Phase	Target	Rationale
Compilation	< 1 minute	Developers need instant feedback on syntax and type errors
Unit tests	< 3 minutes	Fast enough to run before every commit
Integration tests	< 5 minutes	Must complete before the developer context-switches
Full build (compile + test + package)	< 10 minutes	The outer bound for fast feedback

If Your Build Is Too Slow

Slow builds are a common constraint that blocks CD adoption. Address them systematically:

Profile the build. Identify which steps take the most time. Optimize the bottleneck, not everything.
Parallelize tests. Most test frameworks support parallel execution. Run independent test suites concurrently.
Use build caching. Avoid recompiling or re-downloading unchanged dependencies.
Split the build. Run fast checks (lint, compile, unit tests) as a “fast feedback” stage. Run slower checks (integration tests, security scans) as a second stage.
Upgrade build hardware. Sometimes the fastest optimization is more CPU and RAM.

The target is under 10 minutes for the feedback loop that developers use on every commit. Longer-running validation (E2E tests, performance tests) can run in a separate stage.

Common Anti-Patterns

Manual Build Steps

Symptom: The build process includes steps like “open this tool and click Run” or “SSH into the build server and execute this script.”

Problem: Manual steps are error-prone, slow, and cannot be parallelized or cached. They are the single biggest obstacle to build automation.

Fix: Script every step. If a human must perform the step today, write a script that performs it tomorrow.

Environment-Specific Builds

Symptom: The build produces different artifacts for different environments (dev, staging, production). Or the build only works on specific machines because of pre-installed tools.

Problem: Environment-specific builds mean you are not testing the same artifact you deploy. Bugs that appear in production but not in staging become impossible to diagnose.

Fix: Build one artifact and configure it per environment at deployment time. The artifact is immutable; the configuration is external. (See Application Config in Phase 2.)

Build Scripts That Only Run in CI

Symptom: The CI pipeline has build steps that developers cannot run locally. Local development uses a different build process.

Problem: Developers cannot reproduce CI failures locally, leading to slow debugging cycles and “push and pray” development.

Fix: Use a single build entry point (Makefile, build script) that both CI and developers use. CI configuration should only add triggers and notifications, not build logic.

Missing Dependency Pinning

Symptom: Builds break randomly because a dependency released a new version overnight.

Problem: Without pinned dependencies, the build is non-deterministic. The same code can produce different results on different days.

Fix: Use lock files. Pin all dependency versions. Update dependencies intentionally, not accidentally.

Long Build Queues

Symptom: Developers commit to trunk, but the build does not run for 20 minutes because the CI server is processing a queue.

Problem: Delayed feedback defeats the purpose of CI. If developers do not see the result of their commit for 30 minutes, they have already moved on.

Fix: Ensure your CI infrastructure can handle your team’s commit frequency. Use parallel build agents. Prioritize builds on the main branch.

Measuring Success

Metric	Target	Why It Matters
Build duration	< 10 minutes	Enables fast feedback and frequent integration
Build success rate	> 95%	Indicates reliable, reproducible builds
Time from commit to build result	< 15 minutes (including queue time)	Measures the full feedback loop
Developer ability to build locally	100% of team	Confirms the build is portable and documented

Next Step

With build automation in place, you can build, test, and package your application reliably. The next foundation is ensuring that the work you integrate daily is small enough to be safe. Continue to Work Decomposition.

Content contributed by Dojo Consortium, licensed under CC BY 4.0.

Slow Pipelines - Symptom caused by unoptimized or missing build automation
Works on My Machine - Symptom eliminated when the build runs the same everywhere
Missing Deployment Pipeline - Anti-pattern where no automated path from commit to production exists
Snowflake Environments - Anti-pattern caused by environment-specific builds
Everything as Code - Companion guide for versioning build scripts, pipelines, and infrastructure
Build Duration - Metric for tracking build speed improvements

2.4 - Work Decomposition

Break features into small, deliverable increments that can be completed in 2 days or less.

Phase 1 - Foundations

Trunk-based development requires daily integration, and daily integration requires small work. If a feature takes two weeks to build, you cannot integrate it daily without decomposing it first. This page covers the techniques for breaking work into small, deliverable increments that flow through your pipeline continuously.

Why Small Work Matters for CD

Continuous delivery depends on a simple equation: small changes, integrated frequently, are safer than large changes integrated rarely.

Every practice in Phase 1 reinforces this:

Trunk-based development requires that you integrate at least daily. You cannot integrate a two-week feature daily unless you decompose it.
Testing fundamentals work best when each change is small enough to test thoroughly.
Code review is fast when the change is small. A 50-line change can be reviewed in minutes. A 2,000-line change takes hours - if it gets reviewed at all.

The data supports this. The DORA research consistently shows that smaller batch sizes correlate with higher delivery performance. Small changes have:

Lower risk: If a small change breaks something, the blast radius is limited, and the cause is obvious.
Faster feedback: A small change gets through the pipeline quickly. You learn whether it works today, not next week.
Easier rollback: Rolling back a 50-line change is straightforward. Rolling back a 2,000-line change often requires a new deployment.
Better flow: Small work items move through the system predictably. Large work items block queues and create bottlenecks.

The 2-Day Rule

If a work item takes longer than 2 days to complete, it is too big.

This is not arbitrary. Two days gives you at least one integration to trunk per day (the minimum for TBD) and allows for the natural rhythm of development: plan, implement, test, integrate, move on.

When a developer says “this will take a week,” the answer is not “go faster.” The answer is “break it into smaller pieces.”

What “Complete” Means

A work item is complete when it is:

Integrated to trunk
All tests pass
The change is deployable (even if the feature is not yet user-visible)
It meets the Definition of Done

If a story requires a feature flag to hide incomplete user-facing behavior, that is fine. The code is still integrated, tested, and deployable.

Story Slicing Techniques

Story slicing is the practice of breaking user stories into the smallest possible increments that still deliver value or make progress toward delivering value.

The INVEST Criteria

Good stories follow INVEST:

Criterion	Meaning	Why It Matters for CD
Independent	Can be developed and deployed without waiting for other stories	Enables parallel work and avoids blocking
Negotiable	Details can be discussed and adjusted	Allows the team to find the smallest valuable slice
Valuable	Delivers something meaningful to the user or the system	Prevents “technical stories” that do not move the product forward
Estimable	Small enough that the team can reasonably estimate it	Large stories are unestimable because they hide unknowns
Small	Completable within 2 days	Enables daily integration and fast feedback
Testable	Has clear acceptance criteria that can be automated	Supports the testing foundation

Vertical Slicing

The most important slicing technique for CD is vertical slicing: cutting through all layers of the application to deliver a thin but complete slice of functionality.

Vertical slice (correct):

“As a user, I can log in with my email and password.”
This slice touches the UI (login form), the API (authentication endpoint), and the database (user lookup). It is deployable and testable end-to-end.

Horizontal slice (anti-pattern):

“Build the database schema for user accounts.” “Build the authentication API.” “Build the login form UI.”
Each horizontal slice is incomplete on its own. None is deployable. None is testable end-to-end. They create dependencies between work items and block flow.

Vertical slicing in distributed systems

The example above assumes a team that owns every layer from the UI to the database. In large distributed systems, most teams own a subdomain. They are full-stack within that subdomain but may not own any user-facing surface.

The principle does not change. A vertical slice still cuts through all layers end-to-end. “End-to-end” means different things in each context.

Full-stack product team - owns everything from UI to database; their consumer is a human:

graph TD
    User([Human User]) --> UI["UI Layer\n(your team)"]
    UI --> API["API Layer\n(your team)"]
    API --> DB[("Database\n(your team)")]

A vertical slice: one behavior delivered through the UI, the API, and the database in a single deployable change.

Subdomain product team - full-stack within their service; their consumer is another service or team:

graph TD
    User2([Human User]) --> FE["Frontend Service\n(other team)"]
    FE --> API2["Your Service API\n(your team)"]
    API2 --> DB2[("Your Database\n(your team)")]
    API2 --> DS["Downstream Service\n(other team)"]

A vertical slice: one behavior delivered through the service boundary (the API contract), the business logic, and the data store. The team does not own or coordinate with any consumer - whether a UI or another service - except through the API contract. They define a stable contract and deploy behind it independently.

The real difference between these two contexts is whether the public interface is designed for humans or machines. A full-stack product team owns a human-facing surface: the slice is done when a user can observe the behavior through that interface. A subdomain product team owns a machine-facing surface: the slice is done when the API contract satisfies the agreed behavior for its service consumers. In both cases, the question is the same - does this change deliver complete, observable behavior through the interface your team owns? If it only touches one layer beneath that interface, it is a horizontal slice regardless of how you label it.

When teams in a distributed system split work by layer - schema changes in one story, business logic in another, contract changes in a third - nothing is deployable until all layers converge. Slicing vertically within the domain means each story is independently deployable behind a stable contract. See Horizontal Slicing for the full treatment of this failure mode in distributed systems.

Slicing Strategies

When a story feels too big, apply one of these strategies:

Strategy	How It Works	Example
By workflow step	Implement one step of a multi-step process	“User can add items to cart” (before “user can checkout”)
By business rule	Implement one rule at a time	“Orders over $100 get free shipping” (before “orders ship to international addresses”)
By data variation	Handle one data type first	“Support credit card payments” (before “support PayPal”)
By operation	Implement CRUD operations separately	“Create a new customer” (before “edit customer” or “delete customer”)
By performance	Get it working first, optimize later	“Search returns results” (before “search returns results in under 200ms”)
By platform	Support one platform first	“Works on desktop web” (before “works on mobile”)
Happy path first	Implement the success case first	“User completes checkout” (before “user sees error when payment fails”)

Example: Decomposing a Feature

Original story (too big):

“As a user, I can manage my profile including name, email, avatar, password, notification preferences, and two-factor authentication.”

Decomposed into vertical slices:

“User can view their current profile information” (read-only display)
“User can update their name” (simplest edit)
“User can update their email with verification” (adds email flow)
“User can upload an avatar image” (adds file handling)
“User can change their password” (adds security validation)
“User can configure notification preferences” (adds preferences)
“User can enable two-factor authentication” (adds 2FA flow)

Each slice is independently deployable, testable, and completable within 2 days. Each delivers incremental value. The feature is built up over a series of small deliveries rather than one large batch.

BDD as a Decomposition Tool

Behavior-Driven Development (BDD) is not just a testing practice - it is a powerful tool for decomposing work into small, clear increments.

Three Amigos

Before work begins, hold a brief “Three Amigos” session with three perspectives:

Business/Product: What should this feature do? What is the expected behavior?
Development: How will we build it? What are the technical considerations?
Testing: How will we verify it? What are the edge cases?

This 15-30 minute conversation accomplishes two things:

Shared understanding: Everyone agrees on what “done” looks like before work begins.
Natural decomposition: Discussing specific scenarios reveals natural slice boundaries.

Specification by Example

Write acceptance criteria as concrete examples, not abstract requirements.

Abstract (hard to slice):

“The system should validate user input.”

Concrete (easy to slice):

Given an email field, when the user enters “not-an-email”, then the form shows “Please enter a valid email address.”
Given a password field, when the user enters fewer than 8 characters, then the form shows “Password must be at least 8 characters.”
Given a name field, when the user leaves it blank, then the form shows “Name is required.”

Each concrete example can become its own story or task. The scope is clear, the acceptance criteria are testable, and the work is small.

Given-When-Then Format

Structure acceptance criteria in Given-When-Then format to make them executable:

Given-When-Then: user login scenarios

Feature: User login

  Scenario: Successful login with valid credentials
    Given a registered user with email "user@example.com"
    When they enter their correct password and click "Log in"
    Then they are redirected to the dashboard

  Scenario: Failed login with wrong password
    Given a registered user with email "user@example.com"
    When they enter an incorrect password and click "Log in"
    Then they see the message "Invalid email or password"
    And they remain on the login page

Each scenario is a natural unit of work. Implement one scenario at a time, integrate to trunk after each one.

Task Decomposition Within Stories

Even well-sliced stories may contain multiple tasks. Decompose stories into tasks that can be completed and integrated independently.

Example story: “User can update their name”

Tasks:

Add the name field to the profile API endpoint (backend change, integration test)
Add the name field to the profile form (frontend change, unit test)
Connect the form to the API endpoint (integration, E2E test)

Each task results in a commit to trunk. The story is completed through a series of small integrations, not one large merge.

Guidelines for task decomposition:

Each task should take hours, not days
Each task should leave trunk in a working state after integration
Tasks should be ordered so that the simplest changes come first
If a task requires a feature flag or stub to be integrated safely, that is fine

Common Anti-Patterns

Horizontal Slicing

Symptom: Stories are organized by architectural layer: “build the database schema,” “build the API,” “build the UI.”

Problem: No individual slice is deployable or testable end-to-end. Integration happens at the end, which is where bugs are found and schedules slip.

Fix: Slice vertically. Every story should touch all the layers needed to deliver a thin slice of complete functionality.

Technical Stories

Symptom: The backlog contains stories like “refactor the database access layer” or “upgrade to React 18” that do not deliver user-visible value.

Problem: Technical work is important, but when it is separated from feature work, it becomes hard to prioritize and easy to defer. It also creates large, risky changes.

Fix: Embed technical improvements in feature stories. Refactor as you go. If a technical change is necessary, tie it to a specific business outcome and keep it small enough to complete in 2 days.

Stories That Are Really Epics

Symptom: A story has 10+ acceptance criteria, or the estimate is “8 points” or “2 weeks.”

Problem: Large stories hide unknowns, resist estimation, and cannot be integrated daily.

Fix: If a story has more than 3-5 acceptance criteria, it is an epic. Break it into smaller stories using the slicing strategies above.

Splitting by Role Instead of by Behavior

Symptom: Separate stories for “frontend developer builds the UI” and “backend developer builds the API.”

Problem: This creates handoff dependencies and delays integration. The feature is not testable until both stories are complete.

Fix: Write stories from the user’s perspective. The same developer (or pair) implements the full vertical slice.

Deferring “Edge Cases” Indefinitely

Symptom: The team builds the happy path and creates a backlog of “handle error case X” stories that never get prioritized.

Problem: Error handling is not optional. Unhandled edge cases become production incidents.

Fix: Include the most important error cases in the initial story decomposition. Use the “happy path first” slicing strategy, but schedule edge case stories immediately after, not “someday.”

Measuring Success

Metric	Target	Why It Matters
Story cycle time	< 2 days from start to trunk	Confirms stories are small enough
Development cycle time	Decreasing	Shows improved flow from smaller work
Stories completed per week	Increasing (with same team size)	Indicates better decomposition and less rework
Work in progress	Decreasing	Fewer large stories blocking the pipeline

Next Step

Small, well-decomposed work flows through the system quickly - but only if code review does not become a bottleneck. Continue to Code Review to learn how to keep review fast and effective.

Content contributed by Dojo Consortium, licensed under CC BY 4.0.

Too Much WIP - Symptom caused by large work items that block the pipeline
Work Items Take Too Long - Symptom that smaller decomposition directly addresses
Monolithic Work Items - Anti-pattern where stories are too large to integrate daily
Horizontal Slicing - Anti-pattern where work is split by layer instead of by user value
Development Cycle Time - Metric that improves with smaller work items
Work in Progress - Metric for tracking WIP limits and flow

2.5 - Code Review

Streamline code review to provide fast feedback without blocking flow.

Phase 1 - Foundations

Code review is essential for quality, but it is also the most common bottleneck in teams adopting trunk-based development. If reviews take days, daily integration is impossible. This page covers review techniques that maintain quality while enabling the flow that CD requires.

Why Code Review Matters for CD

Code review serves multiple purposes:

Defect detection: A second pair of eyes catches bugs that the author missed.
Knowledge sharing: Reviews spread understanding of the codebase across the team.
Consistency: Reviews enforce coding standards and architectural patterns.
Mentoring: Junior developers learn by having their code reviewed and by reviewing others’ code.

These are real benefits. The challenge is that traditional code review - open a pull request, wait for someone to review it, address comments, wait again - is too slow for CD.

In a CD workflow, code review must happen within minutes or hours, not days. The review is still rigorous, but the process is designed for speed.

The Core Tension: Quality vs. Flow

Traditional teams optimize review for thoroughness: detailed comments, multiple reviewers, extensive back-and-forth. This produces high-quality reviews but blocks flow.

CD teams optimize review for speed without sacrificing the quality that matters. The key insight is that most of the quality benefit of code review comes from small, focused reviews done quickly, not from exhaustive reviews done slowly.

Traditional Review	CD-Compatible Review
Review happens after the feature is complete	Review happens continuously throughout development
Large diffs (hundreds or thousands of lines)	Small diffs (< 200 lines, ideally < 50)
Multiple rounds of feedback and revision	One round, or real-time feedback during pairing
Review takes 1-3 days	Review takes minutes to a few hours
Review is asynchronous by default	Review is synchronous by preference
2+ reviewers required	1 reviewer (or pairing as the review)

Synchronous vs. Asynchronous Review

Synchronous Review (Preferred for CD)

In synchronous review, the reviewer and author are engaged at the same time. Feedback is immediate. Questions are answered in real time. The review is done when the conversation ends.

Methods:

Pair programming: Two developers work on the same code at the same time. Review is continuous. There is no separate review step because the code was reviewed as it was written.
Mob programming: The entire team (or a subset) works on the same code together. Everyone reviews in real time.
Over-the-shoulder review: The author walks the reviewer through the change in person or on a video call. The reviewer asks questions and provides feedback immediately.

Advantages for CD:

Zero wait time between “ready for review” and “review complete”
Higher bandwidth communication (tone, context, visual cues) catches more issues
Immediate resolution of questions - no async back-and-forth
Knowledge transfer happens naturally through the shared work

Asynchronous Review (When Necessary)

Sometimes synchronous review is not possible - time zones, schedules, or team preferences may require asynchronous review. This is fine, but it must be fast.

Rules for async review in a CD workflow:

Review within 2 hours. If a pull request sits for a day, it blocks integration. Set a team working agreement: “pull requests are reviewed within 2 hours during working hours.”
Keep changes small. A 50-line change can be reviewed in 5 minutes. A 500-line change takes an hour and reviewers procrastinate on it.
Use draft PRs for early feedback. If you want feedback on an approach before the code is complete, open a draft PR. Do not wait until the change is “perfect.”
Avoid back-and-forth. If a comment requires discussion, move to a synchronous channel (call, chat). Async comment threads that go 5 rounds deep are a sign the change is too large or the design was not discussed upfront.

Review Techniques Compatible with TBD

Pair Programming as Review

When two developers pair on a change, the code is reviewed as it is written. There is no separate review step, no pull request waiting for approval, and no delay to integration.

How it works with TBD:

Two developers sit together (physically or via screen share)
They discuss the approach, write the code, and review each other’s decisions in real time
When the change is ready, they commit to trunk together
Both developers are accountable for the quality of the code

When to pair:

New or unfamiliar areas of the codebase
Changes that affect critical paths
When a junior developer is working on a change (pairing doubles as mentoring)
Any time the change involves design decisions that benefit from discussion

Pair programming satisfies most organizations’ code review requirements because two developers have actively reviewed and approved the code.

Mob Programming as Review

Mob programming extends pairing to the whole team. One person drives (types), one person navigates (directs), and the rest observe and contribute.

When to mob:

Establishing new patterns or architectural decisions
Complex changes that benefit from multiple perspectives
Onboarding new team members to the codebase
Working through particularly difficult problems

Mob programming is intensive but highly effective. Every team member understands the code, the design decisions, and the trade-offs.

Rapid Async Review

For teams that use pull requests, rapid async review adapts the pull request workflow for CD speed.

Practices:

Auto-assign reviewers. Do not wait for someone to volunteer. Use tools to automatically assign a reviewer when a PR is opened.
Keep PRs small. Target < 200 lines of changed code. Smaller PRs get reviewed faster and more thoroughly.
Provide context. Write a clear PR description that explains what the change does, why it is needed, and how to verify it. A good description reduces review time dramatically.
Use automated checks. Run linting, formatting, and tests before the human review. The reviewer should focus on logic and design, not style.
Approve and merge quickly. If the change looks correct, approve it. Do not hold it for nitpicks. Nitpicks can be addressed in a follow-up commit.

What to Review

Not everything in a code change deserves the same level of scrutiny. Focus reviewer attention where it matters most.

High Priority (Reviewer Should Focus Here)

Behavior correctness: Does the code do what it is supposed to do? Are edge cases handled?
Security: Does the change introduce vulnerabilities? Are inputs validated? Are secrets handled properly?
Clarity: Can another developer understand this code in 6 months? Are names clear? Is the logic straightforward?
Test coverage: Are the new behaviors tested? Do the tests verify the right things?
API contracts: Do changes to public interfaces maintain backward compatibility? Are they documented?
Error handling: What happens when things go wrong? Are errors caught, logged, and surfaced appropriately?

Low Priority (Automate Instead of Reviewing)

Code style and formatting: Use automated formatters (Prettier, Black, gofmt). Do not waste reviewer time on indentation and bracket placement.
Import ordering: Automate with linting rules.
Naming conventions: Enforce with lint rules where possible. Only flag naming in review if it genuinely harms readability.
Unused variables or imports: Static analysis tools catch these instantly.
Consistent patterns: Where possible, encode patterns in architecture decision records and lint rules rather than relying on reviewers to catch deviations.

Rule of thumb: If a style or convention issue can be caught by a machine, do not ask a human to catch it. Reserve human attention for the things machines cannot evaluate: correctness, design, clarity, and security.

Review Scope for Small Changes

In a CD workflow, most changes are small - tens of lines, not hundreds. This changes the economics of review.

Change Size	Expected Review Time	Review Depth
< 20 lines	2-5 minutes	Quick scan: is it correct? Any security issues?
20-100 lines	5-15 minutes	Full review: behavior, tests, clarity
100-200 lines	15-30 minutes	Detailed review: design, contracts, edge cases
> 200 lines	Consider splitting the change	Large changes get superficial reviews

Research consistently shows that reviewer effectiveness drops sharply after 200-400 lines. If you are regularly reviewing changes larger than 200 lines, the problem is not the review process - it is the work decomposition.

Working Agreements for Review SLAs

Establish clear team agreements about review expectations. Without explicit agreements, review latency will drift based on individual habits.

Recommended Review Agreements

Agreement	Target
Response time	Review within 2 hours during working hours
Reviewer count	1 reviewer (or pairing as the review)
PR size	< 200 lines of changed code
Blocking issues only	Only block a merge for correctness, security, or significant design issues
Nitpicks	Use a “nit:” prefix. Nitpicks are suggestions, not merge blockers
Stale PRs	PRs open for > 24 hours are escalated to the team
Self-review	Author reviews their own diff before requesting review

How to Enforce Review SLAs

Track review turnaround time. If it consistently exceeds 2 hours, discuss it in retrospectives.
Make review a first-class responsibility, not something developers do “when they have time.”
If a reviewer is unavailable, any other team member can review. Do not create single-reviewer dependencies.
Consider pairing as the default and async review as the exception. This eliminates the review bottleneck entirely.

Code Review and Trunk-Based Development

Code review and TBD work together, but only if review does not block integration. Here is how to reconcile them:

TBD Requirement	How Review Adapts
Integrate to trunk at least daily	Reviews must complete within hours, not days
Branches live < 24 hours	PRs are opened and merged within the same day
Trunk is always releasable	Reviewers focus on correctness, not perfection
Small, frequent changes	Small changes are reviewed quickly and thoroughly

If your team finds that review is the bottleneck preventing daily integration, the most effective solution is to adopt pair programming. It eliminates the review step entirely by making review continuous.

Measuring Success

Metric	Target	Why It Matters
Review turnaround time	< 2 hours	Prevents review from blocking integration
PR size (lines changed)	< 200 lines	Smaller PRs get faster, more thorough reviews
PR age at merge	< 24 hours	Aligns with TBD branch age constraint
Review rework cycles	< 2 rounds	Multiple rounds indicate the change is too large or design was not discussed upfront

Next Step

Code review practices need to be codified in team agreements alongside other shared commitments. Continue to Working Agreements to establish your team’s definitions of done, ready, and CI practice.

Content contributed by Dojo Consortium, licensed under CC BY 4.0.

PRs Waiting for Review - Symptom that slow review practices cause
Work Items Take Too Long - Symptom worsened when review blocks flow
Knowledge Silos - Anti-pattern that pairing and mob programming help break down
Trunk-Based Development - Branching strategy that requires fast review to work
Work Decomposition - Smaller work items make review faster and more effective
Working Agreements - Where review SLAs are codified as team commitments

2.6 - Working Agreements

Establish shared definitions of done and ready to align the team on quality and process.

Phase 1 - Foundations

The practices in Phase 1 - trunk-based development, testing, small work, and fast review - only work when the whole team commits to them. Working agreements make that commitment explicit. This page covers the key agreements a team needs before moving to pipeline automation in Phase 2.

Why Working Agreements Matter

A working agreement is a shared commitment that the team creates, owns, and enforces together. It is not a policy imposed from outside. It is the team’s own answer to the question: “How do we work together?”

Without working agreements, CD practices drift. One developer integrates daily; another keeps a branch for a week. One developer fixes a broken build immediately; another waits until after lunch. These inconsistencies compound. Within weeks, the team is no longer practicing CD - they are practicing individual preferences.

Working agreements prevent this drift by making expectations explicit. When everyone agrees on what “done” means, what “ready” means, and how CI works, the team can hold each other accountable without conflict.

Definition of Done

The Definition of Done (DoD) is the team’s shared standard for when a work item is complete. For CD, the Definition of Done must include deployment.

Minimum Definition of Done for CD

A work item is done when all of the following are true:

Code is integrated to trunk
All automated tests pass
Code has been reviewed (via pairing, mob, or pull request)
The change is deployable to production
No known defects are introduced
Relevant documentation is updated (API docs, runbooks, etc.)
Feature flags are in place for incomplete user-facing features

Why “Deployed to Production” Matters

Many teams define “done” as “code is merged.” This creates a gap between “done” and “delivered.” Work accumulates in a staging environment, waiting for a release. Risk grows with each unreleased change.

In a CD organization, “done” means the change is in production (or ready to be deployed to production at any time). This is the ultimate test of completeness: the change works in the real environment, with real data, under real load.

In Phase 1, you may not yet have the pipeline to deploy every change to production automatically. That is fine - your DoD should still include “deployable to production” as the standard, even if the deployment step is not yet automated. The pipeline work in Phase 2 will close that gap.

Extending Your Definition of Done

As your CD maturity grows, extend the DoD:

Phase	Addition to DoD
Phase 1 (Foundations)	Code integrated to trunk, tests pass, reviewed, deployable
Phase 2 (Pipeline)	Artifact built and validated by the pipeline
Phase 3 (Optimize)	Change deployed to production behind a feature flag
Phase 4 (Deliver on Demand)	Change deployed to production and monitored

Definition of Ready

The Definition of Ready (DoR) answers: “When is a work item ready to be worked on?” Pulling unready work into development creates waste - unclear requirements lead to rework, missing acceptance criteria lead to untestable changes, and oversized stories lead to long-lived branches.

Minimum Definition of Ready for CD

A work item is ready when all of the following are true:

Acceptance criteria are defined and specific (using Given-When-Then or equivalent)
The work item is small enough to complete in 2 days or less
The work item is testable - the team knows how to verify it works
Dependencies are identified and resolved (or the work item is independent)
The team has discussed the work item (Three Amigos or equivalent)
The work item is estimated (or the team has agreed estimation is unnecessary for items this small)

Common Mistakes with Definition of Ready

Making it too rigid. The DoR is a guideline, not a gate. If the team agrees a work item is understood well enough, it is ready. Do not use the DoR to avoid starting work.
Requiring design documents. For small work items (< 2 days), a conversation and acceptance criteria are sufficient. Formal design documents are for larger initiatives.
Skipping the conversation. The DoR is most valuable as a prompt for discussion, not as a checklist. The Three Amigos conversation matters more than the checkboxes.

CI Working Agreement

The CI working agreement codifies how the team practices continuous integration. This is the most operationally critical working agreement for CD.

The CI Agreement

The team agrees to the following practices:

Integration:

Every developer integrates to trunk at least once per day
Branches (if used) live for less than 24 hours
No long-lived feature, development, or release branches

Build:

All tests must pass before merging to trunk
The build runs on every commit to trunk
Build results are visible to the entire team

Broken builds:

A broken build is the team’s top priority - it is fixed before any new work begins
The developer(s) who broke the build are responsible for fixing it immediately
If the fix will take more than 10 minutes, revert the change and fix it offline
No one commits to a broken trunk (except to fix the break)

Work in progress:

Finishing existing work takes priority over starting new work
The team limits work in progress to maintain flow
If a developer is blocked, they help a teammate before starting a new story

Why “Broken Build = Top Priority”

This is the single most important CI agreement. When the build is broken:

No one can integrate safely. Changes are stacking up.
Trunk is not releasable. The team has lost its safety net.
Every minute the build stays broken, the team accumulates risk.

“Fix the build” is not a suggestion. It is an agreement that the team enforces collectively. If the build is broken and someone starts a new feature instead of fixing it, the team should call that out. This is not punitive - it is the team protecting its own ability to deliver.

Stop the Line - Why All Work Stops

Some teams interpret “fix the build” as “stop merging until it is green.” That is not enough. When the build is red, all feature work stops - not just merges. Every developer on the team shifts attention to restoring green.

This sounds extreme, but the reasoning is straightforward:

Work closer to production is more valuable than work further away. A broken trunk means nothing in progress can ship. Fixing the build is the highest-leverage activity anyone on the team can do.
Continuing feature work creates a false sense of progress. Code written against a broken trunk is untested against the real baseline. It may compile, but it has not been validated. That is not progress - it is inventory.
The team mindset matters more than the individual fix. When everyone stops, the message is clear: the build belongs to the whole team, not just the person who broke it. This shared ownership is what separates teams that practice CI from teams that merely have a CI server.

Two Timelines: Stop vs. Do Not Stop

Consider two teams that encounter the same broken build at 10:00 AM.

Team A stops all feature work:

10:00 - Build breaks. The team sees the alert and stops.
10:05 - Two developers pair on the fix while a third reviews the failing test.
10:20 - Fix is pushed. Build goes green.
10:25 - The team resumes feature work. Total disruption: roughly 30 minutes.

Team B treats it as one person’s problem:

10:00 - Build breaks. The developer who caused it starts investigating alone.
10:30 - Other developers commit new changes on top of the broken trunk. Some changes conflict with the fix in progress.
11:30 - The original developer’s fix does not work because the codebase has shifted underneath them.
14:00 - After multiple failed attempts, the team reverts three commits (the original break plus two that depended on the broken state).
15:00 - Trunk is finally green. The team has lost most of the day, and three developers need to redo work. Total disruption: 5+ hours.

The team that stops immediately pays a small, predictable cost. The team that does not stop pays a large, unpredictable one.

The Revert Rule

If a broken build cannot be fixed within 10 minutes, revert the offending commit and fix the issue on a branch. This keeps trunk green and unblocks the rest of the team. The developer who made the change is not being punished - they are protecting the team’s flow.

Reverting feels uncomfortable at first. Teams worry about “losing work.” But a reverted commit is not lost - the code is still in the Git history. The developer can re-apply their change after fixing the issue. The alternative - a broken trunk for hours while someone debugs - is far more costly.

When to Forward Fix vs. Revert

Not every broken build requires a revert. If the developer who broke it can identify the cause quickly, a forward fix is faster and simpler. The key is a strict time limit:

Start a 15-minute timer the moment the build goes red.
If the developer has a fix ready and pushed within 15 minutes, ship the forward fix.
If the timer expires and the fix is not in trunk, revert immediately - no extensions, no “I’m almost done.”

The timer prevents the most common failure mode: a developer who is “five minutes away” from a fix for an hour. After 15 minutes without a fix, the probability of a quick resolution drops sharply, and the cost to the rest of the team climbs. Revert, restore green, and fix the problem offline without time pressure.

Common Objections to Stop-the-Line

Teams adopting stop-the-line discipline encounter predictable pushback. These responses can help.

Objection	Response
“We can’t afford to stop - we have a deadline.”	You cannot afford not to stop. Every minute the build is red, you accumulate changes that are untested against the real baseline. Stopping for 20 minutes now prevents losing half a day later. The fastest path to your deadline runs through a green build.
“Stopping kills our velocity.”	Velocity that includes work built on a broken trunk is an illusion. Those story points will come back as rework, failed deployments, or production incidents. Real velocity requires a releasable trunk.
“We already stop all the time - it’s not working.”	Frequent stops indicate a different problem: the team is merging changes that break the build too often. Address that root cause with better pre-merge testing, smaller commits, and pair programming on risky changes. Stop-the-line is the safety net, not the solution for chronic build instability.
“It’s a known flaky test - we can ignore it.”	A flaky test you ignore trains the team to ignore all red builds. Fix the flaky test or remove it. There is no middle ground. A red build must always mean “something is wrong” or the signal loses all value.
“Management won’t support stopping feature work.”	Frame it in terms management cares about: lead time and rework cost. Show the two-timeline comparison above. Teams that stop immediately have shorter cycle times and less unplanned rework. This is not about being cautious - it is about being fast.

How Working Agreements Support the CD Migration

Each working agreement maps directly to a Phase 1 practice:

Practice	Supporting Agreement
Trunk-based development	CI agreement: daily integration, branch age < 24h
Testing fundamentals	DoD: all tests pass. CI: tests pass before merge
Build automation	CI: build runs on every commit. Broken build = top priority
Work decomposition	DoR: work items < 2 days. WIP limits
Code review	CI: review within 2 hours. DoD: code reviewed

Without these agreements, individual practices exist in isolation. Working agreements connect them into a coherent way of working.

Template: Create Your Own Working Agreements

Use this template as a starting point. Customize it for your team’s context. The specific targets may differ, but the structure should remain.

Team Working Agreement Template

Team Working Agreement Template

# [Team Name] Working Agreement
Date: [Date]
Participants: [All team members]

## Definition of Done
A work item is done when:
- [ ] Code is integrated to trunk
- [ ] All automated tests pass
- [ ] Code has been reviewed (method: [pair / mob / PR])
- [ ] The change is deployable to production
- [ ] No known defects are introduced
- [ ] [Add team-specific criteria]

## Definition of Ready
A work item is ready when:
- [ ] Acceptance criteria are defined (Given-When-Then)
- [ ] The item can be completed in [X] days or less
- [ ] The item is testable
- [ ] Dependencies are identified
- [ ] The team has discussed the item
- [ ] [Add team-specific criteria]

## CI Practices
- Integration frequency: at least [X] per developer per day
- Maximum branch age: [X] hours
- Review turnaround: within [X] hours
- Broken build response: fix within [X] minutes or revert
- WIP limit: [X] items per developer

## Review Practices
- Default review method: [pair / mob / async PR]
- PR size limit: [X] lines
- Review focus: [correctness, security, clarity]
- Style enforcement: [automated via linting]

## Meeting Cadence
- Standup: [time, frequency]
- Retrospective: [frequency]
- Working agreement review: [frequency, e.g., monthly]

## Agreement Review
This agreement is reviewed and updated [monthly / quarterly].
Any team member can propose changes at any time.
All changes require team consensus.

Tips for Creating Working Agreements

Include everyone. Every team member should participate in creating the agreement. Agreements imposed by a manager or tech lead are policies, not agreements.
Start simple. Do not try to cover every scenario. Start with the essentials (DoD, DoR, CI) and add specifics as the team identifies gaps.
Make them visible. Post the agreements where the team sees them daily - on a team wiki, in the team channel, or on a physical board.
Review regularly. Agreements should evolve as the team matures. Review them monthly. Remove agreements that are second nature. Add agreements for new challenges.
Enforce collectively. Working agreements are only effective if the team holds each other accountable. This is a team responsibility, not a manager responsibility.
Start with agreements you can keep. If the team is currently integrating once a week, do not agree to integrate three times daily. Agree to integrate daily, practice for a month, then tighten.

Measuring Success

Metric	Target	Why It Matters
Agreement adherence	Team self-reports > 80% adherence	Indicates agreements are realistic and followed
Agreement review frequency	Monthly	Ensures agreements stay relevant
Integration frequency	Meets CI agreement target	Validates the CI working agreement
Broken build fix time	Meets CI agreement target	Validates the broken build response agreement

Next Step

With working agreements in place, your team has established the foundations for continuous delivery: daily integration, reliable testing, automated builds, small work, fast review, and shared commitments.

You are ready to move to Phase 2: Pipeline, where you will build the automated path from commit to production.

Content contributed by Dojo Consortium, licensed under CC BY 4.0.

Team Burnout - Symptom that clear agreements and sustainable practices help prevent
Unbounded WIP - Anti-pattern addressed by WIP limit agreements
Undone Work - Anti-pattern prevented by a strong Definition of Done
Deadline-Driven Development - Anti-pattern where pressure overrides team agreements
Velocity as Individual Metric - Anti-pattern that undermines collaborative working agreements
DORA Recommended Practices - Research-backed capabilities that working agreements support

2.7 - Everything as Code

Every artifact that defines your system - infrastructure, pipelines, configuration, database schemas, monitoring - belongs in version control and is delivered through pipelines.

Phase 1 - Foundations

If it is not in version control, it does not exist. If it is not delivered through a pipeline, it is a manual step. Manual steps block continuous delivery. This page establishes the principle that everything required to build, deploy, and operate your system is defined as code, version controlled, reviewed, and delivered through the same automated pipelines as your application.

The Principle

Continuous delivery requires that any change to your system - application code, infrastructure, pipeline configuration, database schema, monitoring rules, security policies - can be made through a single, consistent process: change the code, commit, let the pipeline deliver it.

When something is defined as code:

It is version controlled. You can see who changed what, when, and why. You can revert any change. You can trace any production state to a specific commit.
It is reviewed. Changes go through the same review process as application code. A second pair of eyes catches mistakes before they reach production.
It is tested. Automated validation catches errors before deployment. Linting, dry-runs, and policy checks apply to infrastructure the same way unit tests apply to application code.
It is reproducible. You can recreate any environment from scratch. Disaster recovery is “re-run the pipeline,” not “find the person who knows how to configure the server.”
It is delivered through a pipeline. No SSH, no clicking through UIs, no manual steps. The pipeline is the only path to production for everything, not just application code.

When something is not defined as code, it is a liability. It cannot be reviewed, tested, or reproduced. It exists only in someone’s head, a wiki page that is already outdated, or a configuration that was applied manually and has drifted from any documented state.

What “Everything” Means

Application code

This is where most teams start, and it is the least controversial. Your application source code is in version control, built and tested by a pipeline, and deployed as an immutable artifact.

If your application code is not in version control, start here. Nothing else in this page matters until this is in place.

Infrastructure

Every server, network, database instance, load balancer, DNS record, and cloud resource should be defined in code and provisioned through automation.

What this looks like:

Cloud resources defined in Terraform, Pulumi, CloudFormation, or similar tools
Server configuration managed by Ansible, Chef, Puppet, or container images
Network topology, firewall rules, and security groups defined declaratively
Environment creation is a pipeline run, not a ticket to another team

What this replaces:

Clicking through cloud provider consoles to create resources
SSH-ing into servers to install packages or change configuration
Filing tickets for another team to provision an environment
“Snowflake” servers that were configured by hand and nobody knows how to recreate

Why it matters for CD: If creating or modifying an environment requires manual steps, your deployment frequency is limited by the availability and speed of the person who performs those steps. If a production server fails and you cannot recreate it from code, your mean time to recovery is measured in hours or days instead of minutes.

Pipeline definitions

Your pipeline configuration belongs in the same repository as the code it builds and deploys. The pipeline is code, not a configuration applied through a UI.

What this looks like:

Pipeline definitions in .github/workflows/, .gitlab-ci.yml, Jenkinsfile, or equivalent
Pipeline changes go through the same review process as application code
Pipeline behavior is deterministic - the same commit always produces the same pipeline behavior
Teams can modify their own pipelines without filing tickets

What this replaces:

Pipeline configuration maintained through a Jenkins UI that nobody is allowed to touch
A “platform team” that owns all pipeline definitions and queues change requests
Pipeline behavior that varies depending on server state or installed plugins

Why it matters for CD: The pipeline is the path to production. If the pipeline itself cannot be changed through a reviewed, automated process, it becomes a bottleneck and a risk. Pipeline changes should flow with the same speed and safety as application changes.

Database schemas and migrations

Database schema changes should be defined as versioned migration scripts, stored in version control, and applied through the pipeline.

What this looks like:

Migration scripts in the repository (using tools like Flyway, Liquibase, Alembic, or ActiveRecord migrations)
Every schema change is a numbered, ordered migration that can be applied and rolled back
Migrations run as part of the deployment pipeline, not as a manual step
Schema changes follow the expand-then-contract pattern: add the new column, deploy code that uses it, then remove the old column in a later migration

What this replaces:

A DBA manually applying SQL scripts during a maintenance window
Schema changes that are “just done in production” and not tracked anywhere
Database state that has drifted from what is defined in any migration script

Why it matters for CD: Database changes are one of the most common reasons teams cannot deploy continuously. If schema changes require manual intervention, coordinated downtime, or a separate approval process, they become a bottleneck that forces batching. Treating schemas as code with automated migrations removes this bottleneck.

Application configuration

Environment-specific configuration - database connection strings, API endpoints, feature flag states, logging levels - should be defined as code and managed through version control.

What this looks like:

Configuration values stored in a config management system (Consul, AWS Parameter Store, environment variable definitions in infrastructure code)
Configuration changes are committed, reviewed, and deployed through a pipeline
The same application artifact is deployed to every environment; only the configuration differs

What this replaces:

Configuration files edited manually on servers
Environment variables set by hand and forgotten
Configuration that exists only in a deployment runbook

See Application Config for detailed guidance on externalizing configuration.

Monitoring, alerting, and observability

Dashboards, alert rules, SLO definitions, and logging configuration should be defined as code.

What this looks like:

Alert rules defined in Terraform, Prometheus rules files, or Datadog monitors-as-code
Dashboards defined as JSON or YAML, not built by hand in a UI
SLO definitions tracked in version control alongside the services they measure
Logging configuration (what to log, where to send it, retention policies) in code

What this replaces:

Dashboards built manually in a monitoring UI that nobody knows how to recreate
Alert rules that were configured by hand during an incident and never documented
Monitoring configuration that exists only on the monitoring server

Why it matters for CD: If you deploy ten times a day, you need to know instantly whether each deployment is healthy. If your monitoring and alerting configuration is manual, it will drift, break, or be incomplete. Monitoring-as-code ensures that every service has consistent, reviewed, reproducible observability.

Security policies

Security controls - access policies, network rules, secret rotation schedules, compliance checks - should be defined as code and enforced automatically.

What this looks like:

IAM policies and RBAC rules defined in Terraform or policy-as-code tools (OPA, Sentinel)
Security scanning integrated into the pipeline (SAST, dependency scanning, container image scanning)
Secret rotation automated and defined in code
Compliance checks that run on every commit, not once a quarter

What this replaces:

Security reviews that happen at the end of the development cycle
Access policies configured through UIs and never audited
Compliance as a manual checklist performed before each release

Why it matters for CD: Security and compliance requirements are the most common organizational blockers for CD. When security controls are defined as code and enforced by the pipeline, you can prove to auditors that every change passed security checks automatically. This is stronger evidence than a manual review, and it does not slow down delivery.

The “One Change, One Process” Test

For every type of artifact in your system, ask:

If I need to change this, do I commit a code change and let the pipeline deliver it?

If the answer is yes, the artifact is managed as code. If the answer involves SSH, a UI, a ticket to another team, or a manual step, it is not.

Artifact	Managed as code?	If not, the risk is…
Application source code	Usually yes	-
Infrastructure (servers, networks, cloud resources)	Often no	Snowflake environments, slow provisioning, unreproducible disasters
Pipeline definitions	Sometimes	Pipeline changes are slow, unreviewed, and risky
Database schemas	Sometimes	Schema changes require manual coordination and downtime
Application configuration	Sometimes	Config drift between environments, “works in staging” failures
Monitoring and alerting	Rarely	Monitoring gaps, unreproducible dashboards, alert fatigue
Security policies	Rarely	Security as a gate instead of a guardrail, audit failures

The goal is for every row in this table to be “yes.” You will not get there overnight, but every artifact you move from manual to code-managed removes a bottleneck and a risk.

How to Get There

Start with what blocks you most

Do not try to move everything to code at once. Identify the artifact type that causes the most pain or blocks deployments most frequently:

If environment provisioning takes days, start with infrastructure as code.
If database changes are the reason you cannot deploy more than once a week, start with schema migrations as code.
If pipeline changes require tickets to a platform team, start with pipeline as code.
If configuration drift causes production incidents, start with configuration as code.

Apply the same practices as application code

Once an artifact is defined as code, treat it with the same rigor as application code:

Store it in version control (ideally in the same repository as the application it supports)
Review changes before they are applied
Test changes automatically (linting, dry-runs, policy checks)
Deliver changes through a pipeline
Never modify the artifact outside of this process

Eliminate manual pathways

The hardest part is closing the manual back doors. As long as someone can SSH into a server and make a change, or click through a UI to modify infrastructure, the code-defined state will drift from reality.

The principle is the same as Single Path to Production for application code: the pipeline is the only way any change reaches production. This applies to infrastructure, configuration, schemas, monitoring, and policies just as much as it applies to application code.

Measuring Progress

Metric	What to look for
Artifact types managed as code	Track how many of the categories above are fully code-managed. The number should increase over time.
Manual changes to production	Count any change made outside of a pipeline (SSH, UI clicks, manual scripts). Target: zero.
Environment recreation time	How long does it take to recreate a production-like environment from scratch? Should decrease as more infrastructure moves to code.
Mean time to recovery	When infrastructure-as-code is in place, recovery from failures is “re-run the pipeline.” MTTR drops dramatically.

Build Automation - The build itself must be a single, version-controlled command
Single Path to Production - The pipeline is the only way changes reach production
Application Config - Externalize configuration from artifacts
Deterministic Pipeline - Same inputs, same outputs, every time
Production-Like Environments - Infrastructure-as-code enables environment parity

3 - Phase 2: Pipeline

Build the automated path from commit to production: a single, deterministic pipeline that deploys immutable artifacts.

Key question: “Can we deploy any commit automatically?”

This phase creates the delivery pipeline - the automated path that takes every commit through build, test, and deployment stages. When done right, the pipeline is the only way changes reach production.

What You’ll Do

Establish a single path to production - One pipeline for all changes
Make the pipeline deterministic - Same inputs always produce same outputs
Define “deployable” - Clear criteria for what’s ready to ship
Use immutable artifacts - Build once, deploy everywhere
Externalize application config - Separate config from code
Use production-like environments - Test in environments that match production
Design your pipeline architecture - Efficient quality gates for your context
Enable rollback - Fast recovery from any deployment

Why This Phase Matters

The pipeline is the backbone of continuous delivery. It replaces manual handoffs with automated quality gates, ensures every change goes through the same validation process, and makes deployment a routine, low-risk event.

When You’re Ready to Move On

You’re ready for Phase 3: Optimize when:

Every change reaches production through the same automated pipeline
The pipeline produces the same result for the same inputs
You can deploy any green build to production with confidence
Rollback takes minutes, not hours

Next: Phase 3 - Optimize - reduce batch size, improve flow, and make deployment routine.

Phase 1: Foundations - prerequisites to complete before starting the Pipeline phase
Phase 3: Optimize - the next phase after Pipeline is established
Slow Pipelines - a common symptom that pipeline architecture improvements address
Fear of Deploying - a cultural symptom that reliable rollback and automated pipelines help resolve
Missing Deployment Pipeline - the anti-pattern this entire phase eliminates
DORA Recommended Practices - industry-recognized capabilities that pipeline practices support
Pipeline Reference Architecture - concrete quality gate patterns organized by defect detection priority.

3.1 - Single Path to Production

All changes reach production through the same automated pipeline - no exceptions.

Phase 2 - Pipeline

Definition

A single path to production means that every change - whether it is a feature, a bug fix, a configuration update, or an infrastructure change - follows the same automated pipeline to reach production. There is exactly one route from a developer’s commit to a running production system. No side doors. No emergency shortcuts. No “just this once” manual deployments.

This is the most fundamental constraint of a continuous delivery pipeline. If you allow multiple paths, you cannot reason about the state of production. You lose the ability to guarantee that every change has been validated, and you undermine every other practice in this phase.

Why It Matters for CD Migration

Teams migrating to continuous delivery often carry legacy deployment processes - a manual runbook for “emergency” fixes, a separate path for database changes, or a distinct workflow for infrastructure updates. Each additional path is a source of unvalidated risk.

Establishing a single path to production is the first pipeline practice because every subsequent practice depends on it. A deterministic pipeline only works if all changes flow through it. Immutable artifacts are only trustworthy if no other mechanism can alter what reaches production. Your deployable definition is meaningless if changes can bypass the gates.

Key Principles

One pipeline for all changes

Every type of change uses the same pipeline:

Application code - features, fixes, refactors
Infrastructure as Code - Terraform, CloudFormation, Pulumi, Ansible
Pipeline definitions - the pipeline itself is versioned and deployed through the pipeline
Configuration changes - environment variables, feature flags, routing rules
Database migrations - schema changes, data migrations

Same pipeline for all environments

The pipeline that deploys to development is the same pipeline that deploys to staging and production. The only difference between environments is the configuration injected at deployment time. If your staging deployment uses a different mechanism than your production deployment, you are not testing the deployment process itself.

No manual deployments

If a human can bypass the pipeline and push a change directly to production, the single path is broken. This includes:

SSH access to production servers for ad-hoc changes
Direct container image pushes outside the pipeline
Console-based configuration changes that are not captured in version control
“Break glass” procedures that skip validation stages

Anti-Patterns

Integration branches and multi-branch deployment paths

Using separate branches (such as develop, release, hotfix) that each have their own deployment workflow creates multiple paths. GitFlow is a common source of this anti-pattern. When a hotfix branch deploys through a different pipeline than the develop branch, you cannot be confident that the hotfix has undergone the same validation.

Integration Branch:

Integration branch: parallel merge structure alongside trunk

trunk -> integration <- features

This creates two merge structures instead of one. When trunk changes, you merge to the integration branch immediately. When features change, you merge to integration at least daily. The integration branch lives a parallel life to trunk, acting as a temporary container for partially finished features. This attempts to mimic feature flags to keep inactive features out of production but adds complexity and accumulates abandoned features that stay unfinished forever.

GitFlow (multiple long-lived branches):

GitFlow: multiple long-lived branches with different merge paths per change type

master (production)
  |
develop (integration)
  |
feature branches -> develop
  |
release branches -> master
  |
hotfix branches -> master -> develop

GitFlow creates multiple merge patterns depending on change type:

Features: feature -> develop -> release -> master
Hotfixes: hotfix -> master AND hotfix -> develop
Releases: develop -> release -> master

Different types of changes follow different paths to production. Multiple long-lived branches (master, develop, release) create merge complexity. Hotfixes have a different path than features, release branches delay integration and create batch deployments, and merge conflicts multiply across integration points.

The correct approach is direct trunk integration - all work integrates directly to trunk using the same process:

Direct trunk integration: all changes follow the same path

trunk <- features
trunk <- bugfixes
trunk <- hotfixes

Environment-specific pipelines

Building a separate pipeline for staging versus production - or worse, manually deploying to staging and only using automation for production - means you are not testing your deployment process in lower environments.

“Emergency” manual deployments

The most dangerous anti-pattern is the manual deployment reserved for emergencies. Under pressure, teams bypass the pipeline “just this once,” introducing an unvalidated change into production. The fix for this is not to allow exceptions - it is to make the pipeline fast enough that it is always the fastest path to production.

Separate pipelines for different change types

Having one pipeline for application code, another for infrastructure, and yet another for database changes means that coordinated changes across these layers are never validated together.

Good Patterns

Feature flags

Use feature flags to decouple deployment from release. Code can be merged and deployed through the pipeline while the feature remains hidden behind a flag. This eliminates the need for long-lived branches and separate deployment paths for “not-ready” features.

Feature flag: deploy code to trunk while hiding it from users

// Feature code lives in trunk, controlled by flags
if (featureFlags.newCheckout) {
  return renderNewCheckout()
}
return renderOldCheckout()

Branch by abstraction

For large-scale refactors or technology migrations, use branch by abstraction to make incremental changes that can be deployed through the standard pipeline at every step. Create an abstraction layer, build the new implementation behind it, switch over incrementally, and remove the old implementation - all through the same pipeline.

Branch by abstraction: replace implementation behind a stable interface

// Old behavior behind abstraction
class PaymentProcessor {
  process() {
    // Gradually replace implementation while maintaining interface
  }
}

Dark launching

Deploy new functionality to production without exposing it to users. The code runs in production, processes real data, and generates real metrics - but its output is not shown to users. This validates the change under production conditions while managing risk.

Dark launching: deploy new API route without exposing it to users

// New API route exists but isn't exposed to users
router.post('/api/v2/checkout', newCheckoutHandler)

// Final commit: update client to use new route

Connect tests last

When building a new integration, start by deploying the code without connecting it to the live dependency. Validate the deployment, the configuration, and the basic behavior first. Connect to the real dependency as the final step. This keeps the change deployable through the pipeline at every stage of development.

Connect tests last: build and validate before wiring to UI

// Build new feature code, integrate to trunk
// Connect to UI/API only in final commit
function newCheckoutFlow() {
  // Complete implementation ready
}

// Final commit: wire it up
<button onClick={newCheckoutFlow}>Checkout</button>

How to Get Started

Step 1: Map your current deployment paths

Document every way that changes currently reach production. Include manual processes, scripts, pipelines, direct deployments, and any emergency procedures. You will likely find more paths than you expected.

Step 2: Identify the primary path

Choose or build one pipeline that will become the single path. This pipeline should be the most automated and well-tested path you have. All other paths will converge into it.

Step 3: Eliminate the easiest alternate paths first

Start by removing the deployment paths that are used least frequently or are easiest to replace. For each path you eliminate, migrate its changes into the primary pipeline.

Step 4: Make the pipeline fast enough for emergencies

The most common reason teams maintain manual deployment shortcuts is that the pipeline is too slow for urgent fixes. If your pipeline takes 45 minutes and an incident requires a fix in 10, the team will bypass the pipeline. Invest in pipeline speed so that the automated path is always the fastest option.

Step 5: Remove break-glass access

Once the pipeline is fast and reliable, remove the ability to deploy outside of it. Revoke direct production access. Disable manual deployment scripts. Make the pipeline the only way.

Example Implementation

Single Pipeline for Everything

Single pipeline for everything: GitHub Actions workflow from validate to production

# .github/workflows/deploy.yml
name: Deployment Pipeline

on:
  push:
    branches: [main]
  workflow_dispatch: # Manual trigger for rollbacks

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm ci
      - run: npm test
      - run: npm run lint
      - run: npm run security-scan

  build:
    needs: validate
    runs-on: ubuntu-latest
    steps:
      - run: npm run build
      - run: docker build -t app:${{ github.sha }} .
      - run: docker push app:${{ github.sha }}

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: kubectl set image deployment/app app=app:${{ github.sha }}
      - run: kubectl rollout status deployment/app

  smoke-test:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - run: npm run smoke-test:staging

  deploy-production:
    needs: smoke-test
    runs-on: ubuntu-latest
    steps:
      - run: kubectl set image deployment/app app=app:${{ github.sha }}
      - run: kubectl rollout status deployment/app

Every deployment - normal, hotfix, or rollback - uses this pipeline. Consistent, validated, traceable.

FAQ

What if the pipeline is broken and we need to deploy a critical fix?

Fix the pipeline first. If your pipeline is so fragile that it cannot deploy critical fixes, that is a pipeline problem, not a process problem. Invest in pipeline reliability.

What about emergency hotfixes that cannot wait for the full pipeline?

The pipeline should be fast enough to handle emergencies. If it is not, optimize the pipeline. A “fast-track” mode that skips some tests is acceptable, but it must still be the same pipeline, not a separate manual process.

Can we manually patch production “just this once”?

No. “Just this once” becomes “just this once again.” Manual production changes always create problems. Commit the fix, push through the pipeline, deploy.

What if deploying through the pipeline takes too long?

Optimize your pipeline:

Parallelize tests
Use faster test environments
Implement progressive deployment (canary, blue-green)
Cache dependencies
Optimize build times

A well-optimized pipeline should deploy to production in under 30 minutes.

Can operators make manual changes for maintenance?

Infrastructure maintenance (patching servers, scaling resources) is separate from application deployment. However, application deployment must still only happen through the pipeline.

Health Metrics

Pipeline deployment rate: Should be 100% (all deployments go through pipeline)
Manual override rate: Should be 0%
Hotfix pipeline time: Should be less than 30 minutes
Rollback success rate: Should be greater than 99%
Deployment frequency: Should increase over time as confidence grows

Connection to the Pipeline Phase

Single path to production is the foundation of Phase 2. Without it, every other pipeline practice is compromised:

Deterministic pipeline requires all changes to flow through it to provide guarantees
Deployable definition must be enforced by a single set of gates
Immutable artifacts are only trustworthy when produced by a known, consistent process
Rollback relies on the pipeline to deploy the previous version through the same path

Establishing this practice first creates the constraint that makes the rest of the pipeline meaningful.

Coordinated Deployments - a symptom that emerges when multiple deployment paths exist
Merge Freeze - a symptom of deployment processes that lack a single, trusted automated path
Manual Deployments - the anti-pattern that a single path to production eliminates
Missing Deployment Pipeline - the anti-pattern of having no automated delivery path at all
Deterministic Pipeline - the Pipeline practice that makes the single path reliable and trustworthy
Lead Time - a key metric that improves when all changes follow one automated path

3.2 - Deterministic Pipeline

The same inputs to the pipeline always produce the same outputs.

Phase 2 - Pipeline

Definition

A deterministic pipeline produces consistent, repeatable results. Given the same commit, the same environment definition, and the same configuration, the pipeline will build the same artifact, run the same tests, and produce the same outcome - every time. There is no variance introduced by uncontrolled dependencies, environmental drift, manual intervention, or non-deterministic test behavior.

Determinism is what transforms a pipeline from “a script that usually works” into a reliable delivery system. When the pipeline is deterministic, a green build means something. A failed build points to a real problem. Teams can trust the signal.

Why It Matters for CD Migration

Non-deterministic pipelines are the single largest source of wasted time in delivery organizations. When builds fail randomly, teams learn to ignore failures. When the same commit passes on retry, teams stop investigating root causes. When different environments produce different results, teams lose confidence in pre-production validation.

During a CD migration, teams are building trust in automation. Every flaky test, every “works on my machine” failure, and every environment-specific inconsistency erodes that trust. A deterministic pipeline is what earns the team’s confidence that automation can replace manual verification.

Key Principles

Version control everything

Every input to the pipeline must be version controlled:

Application source code - the obvious one
Infrastructure as Code - the environment definitions themselves
Pipeline definitions - the pipeline configuration files
Test data and fixtures - the data used by automated tests
Dependency lockfiles - exact versions of every dependency (e.g., package-lock.json, Pipfile.lock, go.sum)
Tool versions - the versions of compilers, runtimes, linters, and build tools

If an input to the pipeline is not version controlled, it can change without notice, and the pipeline is no longer deterministic.

Lock dependency versions

Floating dependency versions (version ranges, “latest” tags) are a common source of non-determinism. A build that worked yesterday can break today because a transitive dependency released a new version overnight.

Use lockfiles to pin exact versions of every dependency. Commit lockfiles to version control. Update dependencies intentionally through pull requests, not implicitly through builds.

Eliminate environmental variance

The pipeline should run in a controlled, reproducible environment. Containerize build steps so that the build environment is defined in code and does not drift over time. Use the same base images in CI as in production. Pin tool versions explicitly rather than relying on whatever is installed on the build agent.

Remove human intervention

Any manual step in the pipeline is a source of variance. A human choosing which tests to run, deciding whether to skip a stage, or manually approving a step introduces non-determinism. The pipeline should run from commit to deployment without human decisions.

This does not mean humans have no role - it means the pipeline’s behavior is fully determined by its inputs, not by who is watching it run.

Fix flaky tests immediately

A flaky test is a test that sometimes passes and sometimes fails for the same code. Flaky tests are the most insidious form of non-determinism because they train teams to distrust the test suite.

When a flaky test is detected, the response must be immediate:

Quarantine the test - remove it from the pipeline so it does not block other changes
Fix it or delete it - flaky tests provide negative value; they are worse than no test
Investigate the root cause - flakiness often indicates a real problem (race conditions, shared state, time dependencies, external service reliance)

Never allow a culture of “just re-run it” to take hold. Every re-run masks a real problem.

Example: Non-Deterministic vs Deterministic Pipeline

Seeing anti-patterns and good patterns side by side makes the difference concrete.

Anti-Pattern: Non-Deterministic Pipeline

Anti-pattern: non-deterministic pipeline with floating versions and manual steps

# Bad: Uses floating versions
dependencies:
  nodejs: "latest"
  postgres: "14"  # No minor/patch version

# Bad: Relies on external state
test:
  - curl https://api.example.com/test-data
  - run_tests --use-production-data

# Bad: Time-dependent tests
test('shows current date', () => {
  expect(getDate()).toBe(new Date())  # Fails at midnight!
})

# Bad: Manual steps
deploy:
  - echo "Manually verify staging before approving"
  - wait_for_approval

Results vary based on when the pipeline runs, what is in production, which dependency versions are “latest,” and human availability.

Good Pattern: Deterministic Pipeline

Good pattern: deterministic pipeline with pinned versions and automated verification

# Good: Pinned versions
dependencies:
  nodejs: "18.17.1"
  postgres: "14.9"

# Good: Version-controlled test data
test:
  - docker-compose up -d
  - ./scripts/seed-test-data.sh  # From version control
  - npm run test

# Good: Deterministic time handling
test('shows date', () => {
  const mockDate = new Date('2024-01-15')
  jest.useFakeTimers().setSystemTime(mockDate)
  expect(getDate()).toBe(mockDate)
})

# Good: Automated verification
deploy:
  - deploy_to_staging
  - run_smoke_tests
  - if: smoke_tests_pass
    deploy_to_production

Same inputs always produce same outputs. Pipeline results are trustworthy and reproducible.

Anti-Patterns

Unpinned dependencies

Using version ranges like ^1.2.0 or >=2.0 in dependency declarations without a lockfile means the build resolves different versions on different days. This applies to application dependencies, build plugins, CI tool versions, and base container images.

Shared, mutable build environments

Build agents that accumulate state between builds (cached files, installed packages, leftover containers) produce different results depending on what ran previously. Each build should start from a clean, known state.

Tests that depend on external services

Tests that call live external APIs, depend on shared databases, or rely on network resources introduce uncontrolled variance. External services change, experience outages, and respond with different latency - all of which make the pipeline non-deterministic.

Time-dependent tests

Tests that depend on the current time, current date, or elapsed time are inherently non-deterministic. A test that passes at 2:00 PM and fails at midnight is not testing your application - it is testing the clock.

Manual retry culture

Teams that routinely re-run failed pipelines without investigating the failure have accepted non-determinism as normal. This is a cultural anti-pattern that must be addressed alongside the technical ones.

Good Patterns

Containerized build environments

Define your build environment as a container image. Pin the base image version. Install exact versions of all tools. Run every build in a fresh instance of this container. This eliminates variance from the build environment.

Hermetic builds

A hermetic build is one that does not access the network during the build process. All dependencies are pre-fetched and cached. The build can run identically on any machine, at any time, with or without network access.

Contract tests for external dependencies

Replace live calls to external services with contract tests. These tests verify that your code interacts correctly with an external service’s API contract without actually calling the service. Combine with service virtualization or test doubles for integration tests.

Deterministic test ordering

Run tests in a fixed, deterministic order - or better, ensure every test is independent and can run in any order. Many test frameworks default to random ordering to detect inter-test dependencies; use this during development but ensure no ordering dependencies exist.

Immutable CI infrastructure

Treat CI build agents as cattle, not pets. Provision them from images. Replace them rather than updating them. Never allow state to accumulate on a build agent between pipeline runs.

Tactical Patterns

Immutable Build Containers

Define your build environment as a versioned container image with every dependency pinned:

Immutable build container: Dockerfile with pinned base image and tools

# Dockerfile.build - version controlled
FROM node:18.17.1-alpine3.18

RUN apk add --no-cache \
    python3=3.11.5-r0 \
    make=4.4.1-r1

WORKDIR /app
COPY package-lock.json .
RUN npm ci --frozen-lockfile

Every build runs inside a fresh instance of this image. No drift, no accumulated state.

Dependency Lockfiles

Always use dependency lockfiles. This is essential for deterministic builds:

Dependency lockfile: package-lock.json with pinned exact versions

// package-lock.json (ALWAYS commit to version control)
{
  "dependencies": {
    "express": {
      "version": "4.18.2",
      "resolved": "https://registry.npmjs.org/express/-/express-4.18.2.tgz",
      "integrity": "sha512-5/PsL6iGPdfQ/..."
    }
  }
}

Rules for lockfiles:

Use npm ci in CI (not npm install) - npm ci installs exactly what the lockfile specifies
Never add lockfiles to .gitignore - they must be committed
Avoid version ranges in production dependencies - no ^, ~, or >= without a lockfile enforcing exact resolution
Never rely on “latest” tags for any dependency, base image, or tool

Quarantine Pattern for Flaky Tests

When a flaky test is detected, move it to quarantine immediately. Do not leave it in the main suite where it erodes trust in the pipeline:

Quarantine pattern: skip and annotate flaky tests with tracking info

// tests/quarantine/flaky-test.spec.js
describe.skip('Quarantined: Flaky integration test', () => {
  // Quarantined due to intermittent timeout
  // Tracking issue: #1234
  // Fix deadline: 2024-02-01
  it('should respond within timeout', () => {
    // Test code
  })
})

Quarantine is not a permanent home. Every quarantined test must have:

A tracking issue linked in the test file
A deadline for resolution (no more than one sprint)
A clear root cause investigation plan

If a quarantined test cannot be fixed by the deadline, delete it and write a better test.

Hermetic Test Environments

Give each pipeline run a fresh, isolated environment with no shared state:

Hermetic test environment: GitHub Actions with fresh isolated database per run

# GitHub Actions example
jobs:
  test:
    runs-on: ubuntu-22.04
    services:
      postgres:
        image: postgres:14.9
        env:
          POSTGRES_DB: testdb
          POSTGRES_PASSWORD: testpass
    steps:
      - uses: actions/checkout@v3
      - run: npm ci
      - run: npm test
      # Each workflow run gets a fresh database

How to Get Started

Step 1: Audit your pipeline inputs

List every input to your pipeline that is not version controlled. This includes dependency versions, tool versions, environment configurations, test data, and pipeline definitions themselves.

Step 2: Add lockfiles and pin versions

For every dependency manager in your project, ensure a lockfile is committed to version control. Pin CI tool versions explicitly. Pin base image versions in Dockerfiles.

Step 3: Containerize the build

Move your build steps into containers with explicitly defined environments. This is often the highest-leverage change for improving determinism.

Step 4: Identify and fix flaky tests

Review your test history for tests that have both passed and failed for the same commit. Quarantine them immediately and fix or remove them within a defined time window (such as one sprint).

Step 5: Monitor pipeline determinism

Track the rate of pipeline failures that are resolved by re-running without code changes. This metric (sometimes called the “re-run rate”) directly measures non-determinism. Drive it to zero.

FAQ

What if a test is occasionally flaky but hard to reproduce?

This is still a problem. Flaky tests indicate either a real bug in your code (race conditions, shared state) or a problem with your test (dependency on external state, timing sensitivity). Both need to be fixed. Quarantine the test, investigate thoroughly, and fix the root cause.

Can we use retries to handle flaky tests?

Retries mask problems rather than fixing them. A test that passes on retry is hiding a failure, not succeeding. Fix the flakiness instead of retrying.

How do we handle tests that involve randomness?

Seed your random number generators with a fixed seed in tests:

Deterministic randomness: fixed seed for predictable test results

// Deterministic randomness
const rng = new Random(12345) // Fixed seed
const result = shuffle(array, rng)
expect(result).toEqual([3, 1, 4, 2]) // Predictable

What if our deployment requires manual verification?

Manual verification can happen after deployment, not before. Deploy automatically based on pipeline results, then verify in production using automated smoke tests or observability tooling. If verification fails, roll back automatically.

Should the pipeline ever be non-deterministic?

There are rare cases where controlled non-determinism is useful (chaos engineering, fuzz testing), but these should be:

Explicitly designed and documented
Separate from the core deployment pipeline
Reproducible via saved seeds or recorded inputs

Health Metrics

Track these metrics to measure your pipeline’s determinism:

Test flakiness rate - percentage of test runs that produce different results for the same commit. Target less than 1%, ideally zero.
Pipeline re-run rate - percentage of pipeline failures resolved by re-running without code changes. This directly measures non-determinism. Target zero.
Time to fix flaky tests - elapsed time from detection to resolution. Target less than one day.
Manual override rate - how often someone manually approves, skips, or re-runs a stage. Target near zero.

Connection to the Pipeline Phase

Determinism is what gives the single path to production its authority. If the pipeline produces inconsistent results, teams will work around it. A deterministic pipeline is also the prerequisite for a meaningful deployable definition - your quality gates are only as reliable as the pipeline that enforces them.

When the pipeline is deterministic, immutable artifacts become trustworthy: you know that the artifact was built by a consistent, repeatable process, and its validation results are real.

Flaky Tests - the most common source of non-determinism in pipelines
Environment-Dependent Failures - failures caused by uncontrolled environmental variance
Slow Pipelines - often worsened by re-runs of non-deterministic failures
Snowflake Environments - an anti-pattern that introduces environmental variance into the pipeline
Immutable Artifacts - the Pipeline practice that depends on deterministic builds to be trustworthy
Build Duration - a metric directly affected by pipeline determinism and re-run rates

3.3 - Deployable Definition

Clear, automated criteria that determine when a change is ready for production.

Phase 2 - Pipeline

Definition

A deployable definition is the set of automated quality criteria that every artifact must satisfy before it is considered ready for production. It is the pipeline’s answer to the question: “How do we know this is safe to deploy?”

This is not a checklist that a human reviews. It is a set of automated gates - executable validations built into the pipeline - that every change must pass. If the pipeline is green, the artifact is deployable. If the pipeline is red, it is not. There is no ambiguity, no judgment call, and no “looks good enough.”

Why It Matters for CD Migration

Without a clear, automated deployable definition, teams rely on human judgment to decide when something is ready to ship. This creates bottlenecks (waiting for approval), variance (different people apply different standards), and fear (nobody is confident the change is safe). All three are enemies of continuous delivery.

During a CD migration, the deployable definition replaces manual approval processes with automated confidence. It is what allows a team to say “any green build can go to production” - which is the prerequisite for continuous deployment.

Key Principles

The definition must be automated

Every criterion in the deployable definition is enforced by an automated check in the pipeline. If a requirement cannot be automated, either find a way to automate it or question whether it belongs in the deployment path.

The definition must be comprehensive

The deployable definition should cover all dimensions of quality that matter for production readiness:

Security

Static Application Security Testing (SAST) - scan source code for known vulnerability patterns
Dependency vulnerability scanning - check all dependencies against known vulnerability databases (CVE lists)
Secret detection - verify that no credentials, API keys, or tokens are present in the codebase
Container image scanning - if deploying containers, scan images for known vulnerabilities
License compliance - verify that dependency licenses are compatible with your distribution requirements

Functionality

Unit tests - fast, isolated tests that verify individual components behave correctly
Integration tests - tests that verify components work together correctly
End-to-end tests - tests that verify the system works from the user’s perspective
Regression tests - tests that verify previously fixed defects have not reappeared
Contract tests - tests that verify APIs conform to their published contracts

Compliance

Audit trail - the pipeline itself produces the compliance artifact: who changed what, when, and what validations it passed
Policy as code - organizational policies (e.g., “no deployments on Friday”) encoded as pipeline logic
Change documentation - automatically generated from commit metadata and pipeline results

Performance

Performance benchmarks - verify that key operations complete within acceptable thresholds
Load test baselines - verify that the system handles expected load without degradation
Resource utilization checks - verify that the change does not introduce memory leaks or excessive CPU usage

Reliability

Health check validation - verify that the application starts up correctly and responds to health checks
Graceful degradation tests - verify that the system behaves acceptably when dependencies fail
Rollback verification - verify that the deployment can be rolled back (see Rollback)

Code Quality

Linting and static analysis - enforce code style and detect common errors
Code coverage thresholds - not as a target, but as a safety net to detect large untested areas
Complexity metrics - flag code that exceeds complexity thresholds for review

The definition must be fast

A deployable definition that takes hours to evaluate will not support continuous delivery. The entire pipeline - including all deployable definition checks - should complete in minutes, not hours. This often requires running checks in parallel, investing in test infrastructure, and making hard choices about which slow checks provide enough value to keep.

The definition must be maintained

The deployable definition is a living document. As the system evolves, new failure modes emerge, and the definition should be updated to catch them. When a production incident occurs, the team should ask: “What automated check could have caught this?” and add it to the definition.

Anti-Patterns

Manual approval gates

Requiring a human to review and approve a deployment after the pipeline has passed all automated checks is an anti-pattern. It adds latency, creates bottlenecks, and implies that the automated checks are not sufficient. If a human must approve, it means your automated definition is incomplete - fix the definition rather than adding a manual gate.

“Good enough” tolerance

Allowing deployments when some checks fail because “that test always fails” or “it is only a warning” degrades the deployable definition to meaninglessness. Either the check matters and must pass, or it does not matter and should be removed.

Post-deployment validation only

Running validation only after deployment to production (production smoke tests, manual QA in production) means you are using production users to find problems. Pre-deployment validation must be comprehensive enough that post-deployment checks are a safety net, not the primary quality gate.

Inconsistent definitions across teams

When different teams have different deployable definitions, organizational confidence in deployment varies. While the specific checks may differ by service, the categories of validation (security, functionality, performance, compliance) should be consistent.

Good Patterns

Pipeline gates as policy

Encode the deployable definition as pipeline stages that block progression. A change cannot move from build to test, or from test to deployment, unless the preceding stage passes completely. The pipeline enforces the definition; no human override is possible.

Shift-left validation

Run the fastest, most frequently failing checks first. Unit tests and linting run before integration tests. Integration tests run before end-to-end tests. Security scans run in parallel with test stages. This gives developers the fastest possible feedback.

Continuous definition improvement

After every production incident, add or improve a check in the deployable definition that would have caught the issue. Over time, the definition becomes a comprehensive record of everything the team has learned about quality.

Progressive quality gates

Structure the pipeline to fail fast on quick checks, then run progressively more expensive validations. This gives developers the fastest possible feedback while still running comprehensive checks:

Progressive quality gates: three pipeline stages by speed

Stage 1: Fast Feedback (< 5 min)
  - Linting
  - Unit tests
  - Security scan

Stage 2: Integration (< 15 min)
  - Integration tests
  - Database migrations
  - API contract tests

Stage 3: Comprehensive (< 30 min)
  - E2E tests
  - Performance tests
  - Compliance checks

Each stage acts as a gate. If Stage 1 fails, the pipeline stops immediately rather than wasting time on slower checks that will not matter.

Context-specific definitions

While the categories of validation should be consistent across the organization, the specific checks may vary by deployment target. Define a base set of checks that always apply, then layer additional checks for higher-risk environments:

Context-specific deployable definitions: base, production, and feature branch

# Base definition (always required)
base_deployable:
  - unit_tests: pass
  - security_scan: pass
  - code_coverage: >= 80%

# Production-specific (additional requirements)
production_deployable:
  - load_tests: pass
  - disaster_recovery_tested: true
  - runbook_updated: true

# Feature branch (relaxed for experimentation)
feature_deployable:
  - unit_tests: pass
  - security_scan: no_critical

This approach lets teams move fast during development while maintaining rigorous standards for production deployments.

Error budget approach

Use error budgets to connect the deployable definition to production reliability. When the service is within its error budget, the pipeline allows normal deployment. When the error budget is exhausted, the pipeline shifts focus to reliability work:

Error budget approach: deployment criteria tied to reliability

definition_of_deployable:
  error_budget_remaining: > 0
  slo_compliance: >= 99.9%
  recent_incidents: < 2 per week

This creates a self-correcting system. Teams that ship changes causing incidents consume their error budget, which automatically tightens the deployment criteria until reliability improves.

Visible, shared definitions

Make the deployable definition visible to all team members. Display the current pipeline status on dashboards. When a check fails, provide clear, actionable feedback about what failed and why. The definition should be understood by everyone, not hidden in pipeline configuration.

How to Get Started

Step 1: Document your current “definition of done”

Write down every check that currently happens before a deployment - automated or manual. Include formal checks (tests, scans) and informal ones (someone eyeballs the logs, someone clicks through the UI).

Step 2: Classify each check

For each check, determine: Is it automated? Is it fast? Is it reliable? Is it actually catching real problems? This reveals which checks are already pipeline-ready and which need work.

Step 3: Automate the manual checks

For every manual check, determine how to automate it. A human clicking through the UI becomes an end-to-end test. A human reviewing logs becomes an automated log analysis step. A manager approving a deployment becomes a set of automated policy checks.

Step 4: Build the pipeline gates

Organize your automated checks into pipeline stages. Fast checks first, slower checks later. All checks must pass for the artifact to be considered deployable.

Step 5: Remove manual approvals

Once the automated definition is comprehensive enough that a green build genuinely means “safe to deploy,” remove manual approval gates. This is often the most culturally challenging step.

Connection to the Pipeline Phase

The deployable definition is the contract between the pipeline and the organization. It is what makes the single path to production trustworthy - because every change that passes through the path has been validated against a clear, comprehensive standard.

Combined with a deterministic pipeline, the deployable definition ensures that green means green and red means red. Combined with immutable artifacts, it ensures that the artifact you validated is the artifact you deploy. It is the bridge between automated process and organizational confidence.

Health Metrics

Track these metrics to evaluate whether your deployable definition is well-calibrated:

Pipeline pass rate - should be 70-90%. Too high suggests tests are too lax and not catching real problems. Too low suggests tests are too strict or too flaky, causing unnecessary rework.
Pipeline execution time - should be under 30 minutes for full validation. Longer pipelines slow feedback and discourage frequent commits.
Production incident rate - should decrease over time as the definition improves and catches more failure modes before deployment.
Manual override rate - should be near zero. Frequent manual overrides indicate the automated definition is incomplete or that the team does not trust it.

FAQ

Who decides what goes in the deployable definition?

The entire team - developers, QA, operations, security, and product - should collaboratively define these standards. The definition should reflect genuine risks and requirements, not arbitrary bureaucracy. If a check does not prevent a real production problem, question whether it belongs.

What if the pipeline passes but a bug reaches production?

This indicates a gap in the deployable definition. Add a test that catches that class of failure in the future. Over time, every production incident should result in a stronger definition. This is how the definition becomes a comprehensive record of everything the team has learned about quality.

Can we skip pipeline checks for urgent hotfixes?

No. If the pipeline cannot validate a hotfix quickly enough, the problem is with the pipeline, not the process. Fix the pipeline speed rather than bypassing quality checks. Bypassing checks for “urgent” changes is how critical bugs compound in production.

How strict should the definition be?

Strict enough to prevent production incidents, but not so strict that it becomes a bottleneck. If the pipeline rejects 90% of commits, standards may be too rigid or tests may be too flaky. If production incidents are frequent, standards are too lax. Use the health metrics above to calibrate.

Should manual testing be part of the definition?

Manual exploratory testing is valuable for discovering edge cases, but it should inform the definition, not be the definition. When manual testing discovers a defect, automate a test for that failure mode. Over time, manual testing shifts from gatekeeping to exploration.

What about requirements that cannot be tested automatically?

Some requirements - like UX quality or nuanced accessibility - are harder to automate fully. For these:

Automate what you can (accessibility scanners, visual regression tests)
Make remaining manual checks lightweight and concurrent, not deployment blockers
Continuously work to automate more as tooling improves

Hardening Sprints - a symptom indicating the deployable definition is incomplete, forcing manual quality efforts before release
Infrequent Releases - often caused by unclear or manual criteria for what is ready to ship
Manual Deployments - an anti-pattern that automated quality gates in the deployable definition replace
Deterministic Pipeline - the Pipeline practice that ensures deployable definition checks produce reliable results
Change Fail Rate - a key metric that improves as the deployable definition becomes more comprehensive
Testing Fundamentals - the Foundations practice that provides the test suite enforced by the deployable definition

3.4 - Immutable Artifacts

Build once, deploy everywhere. The same artifact is used in every environment.

Phase 2 - Pipeline

Definition

An immutable artifact is a build output that is created exactly once and deployed to every environment without modification. The binary, container image, or package that runs in production is byte-for-byte identical to the one that passed through testing. Nothing is recompiled, repackaged, or altered between environments.

“Build once, deploy everywhere” is the core principle. The artifact is sealed at build time. Configuration is injected at deployment time (see Application Configuration), but the artifact itself never changes.

Why It Matters for CD Migration

If you build a separate artifact for each environment - or worse, make manual adjustments to artifacts at deployment time - you can never be certain that what you tested is what you deployed. Every rebuild introduces the possibility of variance: a different dependency resolved, a different compiler flag applied, a different snapshot of the source.

Immutable artifacts eliminate an entire class of “works in staging, fails in production” problems. They provide confidence that the pipeline results are real: the artifact that passed every quality gate is the exact artifact running in production.

For teams migrating to CD, this practice is a concrete, mechanical step that delivers immediate trust. Once the team sees that the same container image flows from CI to staging to production, the deployment process becomes verifiable instead of hopeful.

Key Principles

Build once

The artifact is produced exactly once, during the build stage of the pipeline. It is stored in an artifact repository (such as a container registry, Maven repository, npm registry, or object store) and every subsequent stage of the pipeline - and every environment - pulls and deploys that same artifact.

No manual adjustments

Artifacts are never modified after creation. This means:

No recompilation for different environments
No patching binaries in staging to fix a test failure
No adding environment-specific files into a container image after the build
No editing properties files inside a deployed artifact

Version everything that goes into the build

Because the artifact is built once and cannot be changed, every input must be correct at build time:

Source code - committed to version control at a specific commit hash
Dependencies - locked to exact versions via lockfiles
Build tools - pinned to specific versions
Build configuration - stored in version control alongside the source

Tag and trace

Every artifact must be traceable back to the exact commit, pipeline run, and set of inputs that produced it. Use content-addressable identifiers (such as container image digests), semantic version tags, or build metadata that links the artifact to its source.

Anti-Patterns

Rebuilding per environment

Building the artifact separately for development, staging, and production - even from the same source - means each artifact is a different build. Different builds can produce different results due to non-deterministic build processes, updated dependencies, or changed build environments.

SNAPSHOT or mutable versions

Using version identifiers like -SNAPSHOT (Maven), latest (container images), or unversioned “current” references means the same version label can point to different artifacts at different times. This makes it impossible to know exactly what is deployed. This applies to both the artifacts you produce and the dependencies you consume. A dependency pinned to a -SNAPSHOT version can change underneath you between builds, silently altering your artifact’s behavior without any version change. Version numbers are cheap - assign a new one for every meaningful change rather than reusing a mutable label.

Manual intervention at failure points

When a deployment fails, the fix must go through the pipeline. Manually patching the artifact, restarting with modified configuration, or applying a hotfix directly to the running system breaks immutability and bypasses the quality gates.

Environment-specific builds

Build scripts that use conditionals like “if production, include X” create environment-coupled artifacts. The artifact should be environment-agnostic; environment configuration handles the differences.

Artifacts that self-modify

Applications that write to their own deployment directory, modify their own configuration files at runtime, or store state alongside the application binary are not truly immutable. Runtime state must be stored externally.

Good Patterns

Container images as immutable artifacts

Container images are an excellent vehicle for immutable artifacts. A container image built in CI, pushed to a registry with a content-addressable digest, and pulled into each environment is inherently immutable. The image that ran in staging is provably identical to the image running in production.

Artifact promotion

Instead of rebuilding for each environment, promote the same artifact through environments. The pipeline builds the artifact once, deploys it to a test environment, validates it, then promotes it (deploys the same artifact) to staging, then production. The artifact never changes; only the environment it runs in changes.

Content-addressable storage

Use content-addressable identifiers (SHA-256 digests, content hashes) rather than mutable tags as the primary artifact reference. A content-addressed artifact is immutable by definition: changing any byte changes the address.

Signed artifacts

Digitally sign artifacts at build time and verify the signature before deployment. This guarantees that the artifact has not been tampered with between the build and the deployment. This is especially important for supply chain security.

Reproducible builds

Strive for builds where the same source input produces a bit-for-bit identical artifact. While not always achievable (timestamps, non-deterministic linkers), getting close makes it possible to verify that an artifact was produced from its claimed source.

How to Get Started

Step 1: Separate build from deployment

If your pipeline currently rebuilds for each environment, restructure it into two distinct phases: a build phase that produces a single artifact, and a deployment phase that takes that artifact and deploys it to a target environment with the appropriate configuration.

Step 2: Set up an artifact repository

Choose an artifact repository appropriate for your technology stack - a container registry for container images, a package registry for libraries, or an object store for compiled binaries. All downstream pipeline stages pull from this repository.

Step 3: Eliminate mutable version references

Replace latest tags, -SNAPSHOT versions, and any other mutable version identifier with immutable references. Use commit-hash-based tags, semantic versions, or content-addressable digests.

Step 4: Implement artifact promotion

Modify your pipeline to deploy the same artifact to each environment in sequence. The pipeline should pull the artifact from the repository by its immutable identifier and deploy it without modification.

Step 5: Add traceability

Ensure every deployed artifact can be traced back to its source commit, build log, and pipeline run. Label container images with build metadata. Store build provenance alongside the artifact in the repository.

Step 6: Verify immutability

Periodically verify that what is running in production matches what the pipeline built. Compare image digests, checksums, or signatures. This catches any manual modifications that may have bypassed the pipeline.

Connection to the Pipeline Phase

Immutable artifacts are the physical manifestation of trust in the pipeline. The single path to production ensures all changes flow through the pipeline. The deterministic pipeline ensures the build is repeatable. The deployable definition ensures the artifact meets quality criteria. Immutability ensures that the validated artifact - and only that artifact - reaches production.

This practice also directly supports rollback: because previous artifacts are stored unchanged in the artifact repository, rolling back is simply deploying a previous known-good artifact.

Staging Passes, Production Fails - a symptom eliminated when the same artifact is deployed to every environment
Snowflake Environments - an anti-pattern that undermines artifact immutability through environment-specific builds
Application Configuration - the Pipeline practice that enables immutability by externalizing environment-specific values
Deterministic Pipeline - the Pipeline practice that ensures the build process itself is repeatable
Rollback - the Pipeline practice that relies on stored immutable artifacts for fast recovery
Change Fail Rate - a metric that improves when validated artifacts are deployed without modification

3.5 - Application Configuration

Separate configuration from code so the same artifact works in every environment.

Phase 2 - Pipeline

Definition

Application configuration is the practice of correctly separating what varies between environments from what does not, so that a single immutable artifact can run in any environment. This distinction - drawn from the Twelve-Factor App methodology - is essential for continuous delivery.

There are two distinct types of configuration:

Application config - settings that define how the application behaves, are the same in every environment, and should be bundled with the artifact. Examples: routing rules, feature flag defaults, serialization formats, timeout policies, retry strategies.
Environment config - settings that vary by deployment target and must be injected at deployment time. Examples: database connection strings, API endpoint URLs, credentials, resource limits, logging levels for that environment.

Getting this distinction right is critical. Bundling environment config into the artifact breaks immutability. Externalizing application config that does not vary creates unnecessary complexity and fragility.

Why It Matters for CD Migration

Configuration is where many CD migrations stall. Teams that have been deploying manually often have configuration tangled with code - hardcoded URLs, environment-specific build profiles, configuration files that are manually edited during deployment. Untangling this is a prerequisite for immutable artifacts and automated deployments.

When configuration is handled correctly, the same artifact flows through every environment without modification, environment-specific values are injected at deployment time, and feature behavior can be changed without redeploying. This enables the deployment speed and safety that continuous delivery requires.

Key Principles

Bundle what does not vary

Application configuration that is identical across all environments belongs inside the artifact. This includes:

Default feature flag values - the static, compile-time defaults for feature flags
Application routing and mapping rules - URL patterns, API route definitions
Serialization and encoding settings - JSON configuration, character encoding
Internal timeout and retry policies - backoff strategies, circuit breaker thresholds
Validation rules - input validation constraints, business rule parameters

These values are part of the application’s behavior definition. They should be version controlled with the source code and deployed as part of the artifact.

Externalize what varies

Environment configuration that changes between deployment targets must be injected at deployment time:

Database connection strings - different databases for test, staging, production
External service URLs - different endpoints for downstream dependencies
Credentials and secrets - always injected, never bundled, never in version control
Resource limits - memory, CPU, connection pool sizes tuned per environment
Environment-specific logging levels - verbose in development, structured in production
Feature flag overrides - dynamic flag values managed by an external flag service

Feature flags: static vs. dynamic

Feature flags deserve special attention because they span both categories:

Static feature flags - compiled into the artifact as default values. They define the initial state of a feature when the application starts. Changing them requires a new build and deployment.
Dynamic feature flags - read from an external service at runtime. They can be toggled without deploying. Use these for operational toggles (kill switches, gradual rollouts) and experiment flags (A/B tests).

A well-designed feature flag system uses static defaults (bundled in the artifact) that can be overridden by a dynamic source (external flag service). If the flag service is unavailable, the application falls back to its static defaults - a safe, predictable behavior.

Anti-Patterns

Hardcoded environment-specific values

Database URLs, API endpoints, or credentials embedded directly in source code or configuration files that are baked into the artifact. This forces a different build per environment and makes secrets visible in version control.

Externalizing everything

Moving all configuration to an external service - including values that never change between environments - creates unnecessary runtime dependencies. If the configuration service is down and a value that is identical in every environment cannot be read, the application fails to start for no good reason.

Environment-specific build profiles

Build systems that use profiles like mvn package -P production or Webpack configurations that toggle behavior based on NODE_ENV at build time create environment-coupled artifacts. The artifact must be the same regardless of where it will run.

Configuration files edited during deployment

Manually editing application.properties, .env files, or YAML configurations on the server during or after deployment is error-prone, unrepeatable, and invisible to the pipeline. All configuration injection must be automated.

Secrets in version control

Credentials, API keys, certificates, and tokens must never be stored in version control - not even in “private” repositories, not even encrypted with simple mechanisms. Use a secrets manager (Vault, AWS Secrets Manager, Azure Key Vault) and inject secrets at deployment time.

Good Patterns

Environment variables for environment config

Following the Twelve-Factor App approach, inject environment-specific values as environment variables. This is universally supported across languages and platforms, works with containers and orchestrators, and keeps the artifact clean.

Layered configuration

Use a configuration framework that supports layering:

Defaults - bundled in the artifact (application config)
Environment overrides - injected via environment variables or mounted config files
Dynamic overrides - read from a feature flag service or configuration service at runtime

Each layer overrides the previous one. The application always has a working default, and environment-specific or dynamic values override only what needs to change.

Config maps and secrets in orchestrators

Kubernetes ConfigMaps and Secrets (or equivalent mechanisms in other orchestrators) provide a clean separation between the artifact (the container image) and the environment-specific configuration. The image is immutable; the configuration is injected at pod startup.

Secrets management with rotation

Use a dedicated secrets manager that supports automatic rotation, audit logging, and fine-grained access control. The application retrieves secrets at startup or on-demand, and the secrets manager handles rotation without requiring redeployment.

Configuration validation at startup

The application should validate its configuration at startup and fail fast with a clear error message if required configuration is missing or invalid. This catches configuration errors immediately rather than allowing the application to start in a broken state.

How to Get Started

Step 1: Inventory your configuration

List every configuration value your application uses. For each one, determine: Does this value change between environments? If yes, it is environment config. If no, it is application config.

Step 2: Move environment config out of the artifact

For every environment-specific value currently bundled in the artifact (hardcoded URLs, build profiles, environment-specific property files), extract it and inject it via environment variable, config map, or secrets manager.

Step 3: Bundle application config with the code

For every value that does not vary between environments, ensure it is committed to version control alongside the source code and included in the artifact at build time. Remove it from any external configuration system where it adds unnecessary complexity.

Step 4: Implement feature flags properly

Set up a feature flag framework with static defaults in the code and an external flag service for dynamic overrides. Ensure the application degrades gracefully if the flag service is unavailable.

Step 5: Remove environment-specific build profiles

Eliminate any build-time branching based on target environment. The build produces one artifact. Period.

Step 6: Automate configuration injection

Ensure that configuration injection is fully automated in the deployment pipeline. No human should manually set environment variables or edit configuration files during deployment.

Common Questions

How do I change application config for a specific environment?

You should not need to. If a value needs to vary by environment, it is environment configuration and should be injected via environment variables or a secrets manager. Application configuration is the same everywhere by definition.

What if I need to hotfix a config value in production?

If it is truly application configuration, make the change in code, commit it, let the pipeline validate it, and deploy the new artifact. Hotfixing config outside the pipeline defeats the purpose of immutable artifacts.

What about config that changes frequently?

If a value changes frequently enough that redeploying is impractical, it might be data, not configuration. Consider whether it belongs in a database or content management system instead. Configuration should be relatively stable - it defines how the application behaves, not what content it serves.

Measuring Progress

Track these metrics to confirm that configuration is being handled correctly:

Configuration drift incidents - should be zero when application config is immutable with the artifact
Config-related rollbacks - track how often configuration changes cause production rollbacks; this should decrease steadily
Time from config commit to production - should match your normal deployment cycle time, confirming that config changes flow through the same pipeline as code changes

Connection to the Pipeline Phase

Application configuration is the enabler that makes immutable artifacts practical. An artifact can only be truly immutable if it does not contain environment-specific values that would need to change between deployments.

Correct configuration separation also supports production-like environments - because the same artifact runs everywhere, the only difference between environments is the injected configuration, which is itself version controlled and automated.

When configuration is externalized correctly, rollback becomes straightforward: deploy the previous artifact with the appropriate configuration, and the system returns to its prior state.

“Works on My Machine” - a symptom caused by configuration that is not externalized or consistent across environments
Environment-Dependent Failures - failures often rooted in configuration differences between environments
Snowflake Environments - an anti-pattern that proper configuration separation helps prevent
Everything as Code - the Foundations practice that establishes version control for configuration
Immutable Artifacts - the Pipeline practice that depends on correct configuration separation
Production-Like Environments - environments where externalized configuration is injected at deployment time

3.6 - Production-Like Environments

Test in environments that match production to catch environment-specific issues early.

Phase 2 - Pipeline

Definition

Production-like environments are pre-production environments that mirror the infrastructure, configuration, and behavior of production closely enough that passing tests in these environments provides genuine confidence that the change will work in production.

“Production-like” does not mean “identical to production” in every dimension. It means that the aspects of the environment relevant to the tests being run match production sufficiently to produce a valid signal. A unit test environment needs the right runtime version. An integration test environment needs the right service topology. A staging environment needs the right infrastructure, networking, and data characteristics.

Why It Matters for CD Migration

The gap between pre-production environments and production is where deployment failures hide. Teams that test in environments that differ significantly from production - in operating system, database version, network topology, resource constraints, or configuration - routinely discover issues only after deployment.

For a CD migration, production-like environments are what transform pre-production testing from “we hope this works” to “we know this works.” They close the gap between the pipeline’s quality signal and the reality of production, making it safe to deploy automatically.

Key Principles

Staging reflects production infrastructure

Your staging environment should match production in the dimensions that affect application behavior:

Infrastructure platform - same cloud provider, same orchestrator, same service mesh
Network topology - same load balancer configuration, same DNS resolution patterns, same firewall rules
Database engine and version - same database type, same version, same configuration parameters
Operating system and runtime - same OS distribution, same runtime version, same system libraries
Service dependencies - same versions of downstream services, or accurate test doubles

Staging does not necessarily need the same scale as production (fewer replicas, smaller instances), but the architecture must be the same.

Environments are version controlled

Every aspect of the environment that can be defined in code must be:

Infrastructure definitions - Terraform, CloudFormation, Pulumi, or equivalent
Configuration - Kubernetes manifests, Helm charts, Ansible playbooks
Network policies - security groups, firewall rules, service mesh configuration
Monitoring and alerting - the same observability configuration in all environments

Version-controlled environments can be reproduced, compared, and audited. Manual environment configuration cannot.

Ephemeral environments

Ephemeral environments are full-stack, on-demand, short-lived environments spun up for a specific purpose - a pull request, a test run, a demo - and destroyed when that purpose is complete.

Key characteristics of ephemeral environments:

Full-stack - they include the application and all of its dependencies (databases, message queues, caches, downstream services), not just the application in isolation
On-demand - any developer or pipeline can spin one up at any time without waiting for a shared resource
Short-lived - they exist for hours or days, not weeks or months. This prevents configuration drift and stale state
Version controlled - the environment definition is in code, and the environment is created from a specific version of that code
Isolated - they do not share resources with other environments. No shared databases, no shared queues, no shared service instances

Ephemeral environments replace the long-lived “static” environments - “development,” “QA1,” “QA2,” “testing” - and the maintenance burden required to keep those stable. They eliminate the “shared staging” bottleneck where multiple teams compete for a single pre-production environment and block each other’s progress.

Data is representative

The data in pre-production environments must be representative of production data in structure, volume, and characteristics. This does not mean using production data directly (which raises security and privacy concerns). It means:

Schema matches production - same tables, same columns, same constraints
Volume is realistic - tests run against data sets large enough to reveal performance issues
Data characteristics are representative - edge cases, special characters, null values, and data distributions that match what the application will encounter
Data is anonymized - if production data is used as a seed, all personally identifiable information is removed or masked

Anti-Patterns

Shared, long-lived staging environments

A single staging environment shared by multiple teams becomes a bottleneck and a source of conflicts. Teams overwrite each other’s changes, queue up for access, and encounter failures caused by other teams’ work. Long-lived environments also drift from production as manual changes accumulate.

Environments that differ from production in critical ways

Running a different database version in staging than production, using a different operating system, or skipping the load balancer that exists in production creates blind spots where issues hide until they reach production.

“It works on my laptop” as validation

Developer laptops are the least production-like environment available. They have different operating systems, different resource constraints, different network characteristics, and different installed software. Local validation is valuable for fast feedback during development, but it does not replace testing in a production-like environment.

Manual environment provisioning

Environments created by manually clicking through cloud consoles, running ad-hoc scripts, or following runbooks are unreproducible and drift over time. If you cannot destroy and recreate the environment from code in minutes, it is not suitable for continuous delivery.

Synthetic-only test data

Using only hand-crafted test data with a few happy-path records misses the issues that emerge with production-scale data: slow queries, missing indexes, encoding problems, and edge cases that only appear in real-world data distributions.

Good Patterns

Infrastructure as Code for all environments

Define every environment - from local development to production - using the same Infrastructure as Code templates. The differences between environments are captured in configuration variables (instance sizes, replica counts, domain names), not in different templates.

Environment-per-pull-request

Automatically provision a full-stack ephemeral environment for every pull request. Run the full test suite against this environment. Tear it down when the pull request is merged or closed. This provides isolated, production-like validation for every change.

Production data sampling and anonymization

Build an automated pipeline that samples production data, anonymizes it (removing PII, masking sensitive fields), and loads it into pre-production environments. This provides realistic data without security or privacy risks.

Service virtualization for external dependencies

For external dependencies that cannot be replicated in pre-production (third-party APIs, partner systems), use service virtualization to create realistic test doubles that mimic the behavior, latency, and error modes of the real service.

Environment parity monitoring

Continuously compare pre-production environments against production to detect drift. Alert when the infrastructure, configuration, or service versions diverge. Tools that compare Terraform state, Kubernetes configurations, or cloud resource inventories can automate this comparison.

Namespaced environments in shared clusters

In Kubernetes or similar platforms, use namespaces to create isolated environments within a shared cluster. Each namespace gets its own set of services, databases, and configuration, providing isolation without the cost of separate clusters.

How to Get Started

Step 1: Audit environment parity

Compare your current pre-production environments against production across every relevant dimension: infrastructure, configuration, data, service versions, network topology. List every difference.

Step 2: Infrastructure-as-Code your environments

If your environments are not yet defined in code, start here. Define your production environment in Terraform, CloudFormation, or equivalent. Then create pre-production environments from the same definitions with different parameter values.

Step 3: Address the highest-risk parity gaps

From your audit, identify the differences most likely to cause production failures - typically database version mismatches, missing infrastructure components, or network configuration differences. Fix these first.

Step 4: Implement ephemeral environments

Build the tooling to spin up and tear down full-stack environments on demand. Start with a simplified version (perhaps without full data replication) and iterate toward full production parity.

Step 5: Automate data provisioning

Create an automated pipeline for generating or sampling representative test data. Include anonymization, schema validation, and data refresh on a regular schedule.

Step 6: Monitor and maintain parity

Set up automated checks that compare pre-production environments to production and alert on drift. Make parity a continuous concern, not a one-time setup.

Connection to the Pipeline Phase

Production-like environments are where the pipeline’s quality gates run. Without production-like environments, the deployable definition produces a false signal - tests pass in an environment that does not resemble production, and failures appear only after deployment.

Immutable artifacts flow through these environments unchanged, with only configuration varying. This combination - same artifact, production-like environment, environment-specific configuration - is what gives the pipeline its predictive power.

Production-like environments also support effective rollback testing: you can validate that a rollback works correctly in a staging environment before relying on it in production.

Staging Passes, Production Fails - the symptom that production-like environments directly address
“Works on My Machine” - a symptom caused by environments that differ from production
Environment-Dependent Failures - test failures rooted in environment parity gaps
Snowflake Environments - the anti-pattern of manually configured, irreproducible environments
Immutable Artifacts - the Pipeline practice that flows unchanged through production-like environments
Application Configuration - the Pipeline practice that handles the configuration differences between environments

3.7 - Pipeline Architecture

Design efficient quality gates for your delivery system’s context.

Phase 2 - Pipeline

Definition

Pipeline architecture is the structural design of your delivery pipeline - how stages are organized, how quality gates are sequenced, how feedback loops operate, and how the pipeline evolves over time. It encompasses both the technical design of the pipeline and the improvement journey that a team follows from an initial, fragile pipeline to a mature, resilient delivery system.

Good pipeline architecture is not achieved in a single step. Teams progress through recognizable states, applying the Theory of Constraints to systematically identify and resolve bottlenecks. The goal is a loosely coupled architecture where independent services can be built, tested, and deployed independently through their own pipelines.

Why It Matters for CD Migration

Most teams beginning a CD migration have a pipeline that is somewhere between “barely functional” and “works most of the time.” The pipeline may be slow, fragile, or tightly coupled to other systems. Improving it requires a deliberate architectural approach - not just adding more stages or more tests, but designing the pipeline for the flow characteristics that continuous delivery demands.

Understanding where your pipeline architecture currently stands, and what the next improvement looks like, prevents teams from either stalling at a “good enough” state or attempting to jump directly to a target state that their context cannot support.

Three Architecture States

Teams typically progress through three recognizable states on their journey to mature pipeline architecture. Understanding which state you are in determines what improvements to prioritize.

Entangled (Requires Remediation)

In the entangled state, the pipeline has significant structural problems that prevent reliable delivery:

Multiple applications share a single pipeline - a change to one application triggers builds and tests for all applications, causing unnecessary delays and false failures
Shared, mutable infrastructure - pipeline stages depend on shared databases, shared environments, or shared services that introduce coupling and contention
Manual stages interrupt automated flow - manual approval gates, manual test execution, or manual environment provisioning block the pipeline for hours or days
No clear ownership - the pipeline is maintained by a central team, and application teams cannot modify it without filing tickets and waiting
Build times measured in hours - the pipeline is so slow that developers batch changes and avoid running it
Flaky tests are accepted - the team routinely re-runs failed pipelines, and failures are assumed to be transient

Remediation priorities:

Separate pipelines for separate applications
Remove manual stages or parallelize them out of the critical path
Fix or remove flaky tests
Establish clear pipeline ownership with the application team

Tightly Coupled (Transitional)

In the tightly coupled state, each application has its own pipeline, but pipelines depend on each other or on shared resources:

Integration tests span multiple services - a pipeline for service A runs integration tests that require service B, C, and D to be deployed in a specific state
Shared test environments - multiple pipelines deploy to the same staging environment, creating contention and sequencing constraints
Coordinated deployments - deploying service A requires simultaneously deploying service B, which requires coordinating two pipelines
Shared build infrastructure - pipelines compete for limited build agent capacity, causing queuing delays
Pipeline definitions are centralized - a shared pipeline library controls the structure, and application teams cannot customize it for their needs

Improvement priorities:

Replace cross-service integration tests with contract tests
Implement ephemeral environments to eliminate shared environment contention
Decouple service deployments using backward-compatible changes and feature flags
Give teams ownership of their pipeline definitions
Scale build infrastructure to eliminate queuing

Loosely Coupled (Goal)

In the loosely coupled state, each service has an independent pipeline that can build, test, and deploy without depending on other services’ pipelines:

Independent deployability - any service can be deployed at any time without coordinating with other teams
Contract-based integration - services verify their interactions through contract tests, not cross-service integration tests
Ephemeral, isolated environments - each pipeline creates its own test environment and tears it down when done
Team-owned pipelines - each team controls their pipeline definition and can optimize it for their service’s needs
Fast feedback - the pipeline completes in minutes, providing rapid feedback to developers
Self-service infrastructure - teams provision their own pipeline infrastructure without waiting for a central team

Applying the Theory of Constraints

Pipeline improvement follows the Theory of Constraints: identify the single biggest bottleneck, resolve it, and repeat. The key steps:

Step 1: Identify the constraint

Measure where time is spent in the pipeline. Common constraints include:

Slow test suites - tests that take 30+ minutes dominate the pipeline duration
Queuing for shared resources - pipelines waiting for build agents, shared environments, or manual approvals
Flaky failures and re-runs - time lost to investigating and re-running non-deterministic failures
Large batch sizes - pipelines triggered by large, infrequent commits that take longer to build and are harder to debug when they fail

Step 2: Exploit the constraint

Get the maximum throughput from the current constraint without changing the architecture:

Parallelize test execution across multiple agents
Cache dependencies to speed up the build stage
Prioritize pipeline runs (trunk commits before branch builds)
Deduplicate unnecessary work (skip unchanged modules)

Step 3: Subordinate everything else to the constraint

Ensure that other parts of the system do not overwhelm the constraint:

If the test stage is the bottleneck, do not add more tests without first making existing tests faster
If the build stage is the bottleneck, do not add more build steps without first optimizing the build

Step 4: Elevate the constraint

If exploiting the constraint is not sufficient, invest in removing it:

Rewrite slow tests to be faster
Replace shared environments with ephemeral environments
Replace manual gates with automated checks
Split monolithic pipelines into independent service pipelines

Step 5: Repeat

Once a constraint is resolved, a new constraint will emerge. This is expected. The pipeline improves through continuous iteration, not through a single redesign.

Key Design Principles

Fast feedback first

Organize pipeline stages so that the fastest checks run first. A developer should know within minutes if their change has an obvious problem (compilation failure, linting error, unit test failure). Slower checks (integration tests, security scans, performance tests) run after the fast checks pass.

Fail fast, fail clearly

When the pipeline fails, it should fail as early as possible and produce a clear, actionable error message. A developer should be able to read the failure output and know exactly what to fix without digging through logs.

Parallelize where possible

Stages that do not depend on each other should run in parallel. Security scans can run alongside integration tests. Linting can run alongside compilation. Parallelization is the most effective way to reduce pipeline duration without removing checks.

Pipeline as code

The pipeline definition lives in the same repository as the application it builds and deploys. This gives the team full ownership and allows the pipeline to evolve alongside the application.

Observability

Instrument the pipeline itself with metrics and monitoring. Track:

Lead time - time from commit to production deployment
Pipeline duration - time from pipeline start to completion
Failure rate - percentage of pipeline runs that fail
Recovery time - time from failure detection to successful re-run
Queue time - time spent waiting before the pipeline starts

These metrics identify bottlenecks and measure improvement over time.

Anti-Patterns

The “grand redesign”

Attempting to redesign the entire pipeline at once, rather than iteratively improving the biggest constraint, is a common failure mode. Grand redesigns take too long, introduce too much risk, and often fail to address the actual problems.

Central pipeline teams that own all pipelines

A central team that controls all pipeline definitions creates a bottleneck. Application teams wait for changes, cannot customize pipelines for their context, and are disconnected from their own delivery process.

Optimizing non-constraints

Speeding up a pipeline stage that is not the bottleneck does not improve overall delivery time. Measure before optimizing.

Monolithic pipeline for microservices

Running all microservices through a single pipeline that builds and deploys everything together defeats the purpose of a microservice architecture. Each service should have its own independent pipeline.

How to Get Started

Step 1: Assess your current state

Determine which architecture state - entangled, tightly coupled, or loosely coupled - best describes your current pipeline. Be honest about where you are.

Step 2: Measure your pipeline

Instrument your pipeline to measure duration, failure rates, queue times, and bottlenecks. You cannot improve what you do not measure.

Step 3: Identify the top constraint

Using your measurements, identify the single biggest bottleneck in your pipeline. This is where you focus first.

Step 4: Apply the Theory of Constraints cycle

Exploit, subordinate, and if necessary elevate the constraint. Then measure again and identify the next constraint.

Step 5: Evolve toward loose coupling

With each improvement cycle, move toward independent, team-owned pipelines that can build, test, and deploy services independently. This is a journey of months or years, not days.

Connection to the Pipeline Phase

Pipeline architecture is where all the other practices in this phase come together. The single path to production defines the route. The deterministic pipeline ensures reliability. The deployable definition defines the quality gates. The architecture determines how these elements are organized, sequenced, and optimized for flow.

As teams mature their pipeline architecture toward loose coupling, they build the foundation for Phase 3: Optimize - where the focus shifts from building the pipeline to improving its speed and reliability.

Content contributed by Dojo Consortium, licensed under CC BY 4.0.

Slow Pipelines - a symptom directly addressed by applying the Theory of Constraints to pipeline architecture
Coordinated Deployments - a symptom of tightly coupled pipeline architecture
No Fast Feedback - a symptom that pipeline architecture improvements resolve through stage ordering and parallelization
Missing Deployment Pipeline - the anti-pattern that pipeline architecture replaces
Release Frequency - a key metric that improves as pipeline architecture matures toward loose coupling
Phase 3: Optimize - the next phase, which builds on mature pipeline architecture

3.8 - Rollback

Enable fast recovery from any deployment by maintaining the ability to roll back.

Phase 2 - Pipeline

Definition

Rollback is the ability to quickly and safely revert a production deployment to a previous known-good state. It is the safety net that makes continuous delivery possible: because you can always undo a deployment, deploying becomes a low-risk, routine operation.

Rollback is not a backup plan for when things go catastrophically wrong. It is a standard operational capability that should be exercised regularly and trusted completely. Every deployment to production should be accompanied by a tested, automated, fast rollback mechanism.

Why It Matters for CD Migration

Fear of deployment is the single biggest cultural barrier to continuous delivery. Teams that have experienced painful, irreversible deployments develop a natural aversion to deploying frequently. They batch changes, delay releases, and add manual approval gates - all of which slow delivery and increase risk.

Reliable, fast rollback breaks this cycle. When the team knows that any deployment can be reversed in minutes, the perceived risk of deployment drops dramatically. Smaller, more frequent deployments become possible. The feedback loop tightens. The entire delivery system improves.

Key Principles

Fast

Rollback must complete in minutes, not hours. A rollback that takes an hour to execute is not a rollback - it is a prolonged outage with a recovery plan. Target rollback times of 5 minutes or less for the deployment mechanism itself. If the previous artifact is already in the artifact repository and the deployment mechanism is automated, there is no reason rollback should take longer than a fresh deployment.

Automated

Rollback must be a single command or a single click - or better, fully automated based on health checks. It should not require:

SSH access to production servers
Manual editing of configuration files
Running scripts with environment-specific parameters from memory
Coordinating multiple teams to roll back multiple services simultaneously

Safe

Rollback must not make things worse. This means:

Rolling back must not lose data
Rolling back must not corrupt state
Rolling back must not break other services that depend on the rolled-back service
Rolling back must not require downtime beyond what the deployment mechanism itself imposes

Simple

The rollback procedure should be understandable by any team member, including those who did not perform the original deployment. It should not require specialized knowledge, deep system understanding, or heroic troubleshooting.

Tested

Rollback must be tested regularly, not just documented. A rollback procedure that has never been exercised is a rollback procedure that will fail when you need it most. Include rollback verification in your deployable definition and practice rollback as part of routine deployment validation.

Rollback Strategies

Blue-Green Deployment

Maintain two identical production environments - blue and green. At any time, one is live (serving traffic) and the other is idle. To deploy, deploy to the idle environment, verify it, and switch traffic. To roll back, switch traffic back to the previous environment.

Blue-green rollback: traffic switch to previous environment

Blue (current): v1.2.3
Green (idle):   v1.2.2

Issue detected in Blue
  |
Switch traffic to Green (v1.2.2)
  |
Instant rollback (< 30 seconds)

Advantages:

Rollback is instantaneous - just a traffic switch
The previous version remains running and warm
Zero-downtime deployment and rollback

Considerations:

Requires double the infrastructure (though the idle environment can be scaled down)
Database changes must be backward-compatible across both versions
Session state must be externalized so it survives the switch

Canary Deployment

Deploy the new version to a small subset of production infrastructure (the “canary”) and route a percentage of traffic to it. Monitor the canary for errors, latency, and business metrics. If the canary is healthy, gradually increase traffic. If problems appear, route all traffic back to the previous version.

Canary rollback: stop routing traffic to the canary on issue detection

Deploy v1.2.3 to 10% of servers
  |
Issue detected in monitoring
  |
Automatically roll back 10% to v1.2.2
  |
Issue contained, minimal user impact

Advantages:

Limits blast radius - problems affect only a fraction of users
Provides real production data for validation before full rollout
Rollback is fast - stop sending traffic to the canary

Considerations:

Requires traffic routing infrastructure (service mesh, load balancer configuration)
Both versions must be able to run simultaneously
Monitoring must be sophisticated enough to detect subtle problems in the canary

Feature Flag Rollback

When a deployment introduces new behavior behind a feature flag, rollback can be as simple as turning off the flag. The code remains deployed, but the new behavior is disabled. This is the fastest possible rollback - it requires no deployment at all.

Feature flag rollback: disable new behavior without redeploying

// Feature flag controls new behavior
if (featureFlags.isEnabled('new-checkout')) {
  return renderNewCheckout()
}
return renderOldCheckout()

// Rollback: Toggle flag off via configuration
// No deployment needed, instant effect

Advantages:

Instantaneous - no deployment, no traffic switch
Granular - roll back a single feature without affecting other changes
No infrastructure changes required

Considerations:

Requires a feature flag system with runtime toggle capability
Only works for changes that are behind flags
Feature flag debt (old flags that are never cleaned up) must be managed

Database-Safe Rollback with Expand-Contract

Database schema changes are the most common obstacle to rollback. If a deployment changes the database schema, rolling back the application code may fail if the old code is incompatible with the new schema.

The expand-contract pattern (also called parallel change) solves this:

Expand - add new columns, tables, or structures alongside the existing ones. The old application code continues to work. Deploy this change.
Migrate - update the application to write to both old and new structures, and read from the new structure. Deploy this change. Backfill historical data.
Contract - once all application versions using the old structure are retired, remove the old columns or tables. Deploy this change.

At every step, the previous application version remains compatible with the current database schema. Rollback is always safe.

Expand-contract pattern: safe additive schema changes vs. unsafe destructive changes

-- Safe: Additive change (expand)
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
-- Old code ignores the new column
-- New code uses the new column
-- Rolling back code does not break anything

-- Unsafe: Destructive change
ALTER TABLE users DROP COLUMN email;
-- Old code breaks because email column is gone
-- Rollback requires schema rollback (risky)

Anti-pattern: Destructive schema changes (dropping columns, renaming tables, changing types) deployed simultaneously with the application code change that requires them. This makes rollback impossible because the old code cannot work with the new schema.

Anti-Patterns

“We’ll fix forward”

Relying exclusively on fixing forward (deploying a new fix rather than rolling back) is dangerous when the system is actively degraded. Fix-forward should be an option when the issue is well-understood and the fix is quick. Rollback should be the default when the issue is unclear or the fix will take time. Both capabilities must exist.

Rollback as a documented procedure only

A rollback procedure that exists only in a runbook, wiki, or someone’s memory is not a reliable rollback capability. Procedures that are not automated and regularly tested will fail under the pressure of a production incident.

Coupled service rollbacks

When rolling back service A requires simultaneously rolling back services B and C, you do not have independent rollback capability. Design services to be backward-compatible so that each service can be rolled back independently.

Destructive database migrations

Schema changes that destroy data or break backward compatibility make rollback impossible. Always use the expand-contract pattern for schema changes.

Manual rollback requiring specialized knowledge

If only one person on the team knows how to perform a rollback, the team does not have a rollback capability - it has a single point of failure. Rollback must be simple enough for any team member to execute.

Good Patterns

Automated rollback on health check failure

Configure the deployment system to automatically roll back if the new version fails health checks within a defined window after deployment. This removes the need for a human to detect the problem and initiate the rollback.

Rollback testing in staging

As part of every deployment to staging, deploy the new version, verify it, then roll it back and verify the rollback. This ensures that rollback works for every release, not just in theory.

Artifact retention

Retain previous artifact versions in the artifact repository so that rollback is always possible. Define a retention policy (for example, keep the last 10 production-deployed versions) and ensure that rollback targets are always available.

Deployment log and audit trail

Maintain a clear record of what is currently deployed, what was previously deployed, and when changes occurred. This makes it easy to identify the correct rollback target and verify that the rollback was successful.

Rollback runbook exercises

Regularly practice rollback as a team exercise - not just as part of automated testing, but as a deliberate drill. This builds team confidence and identifies gaps in the process.

How to Get Started

Step 1: Document your current rollback capability

Can you roll back your current production deployment right now? How long would it take? Who would need to be involved? What could go wrong? Be honest about the answers.

Step 2: Implement a basic automated rollback

Start with the simplest mechanism available for your deployment platform - redeploying the previous container image, switching a load balancer target, or reverting a Kubernetes deployment. Automate this as a single command.

Step 3: Test the rollback

Deploy a change to staging, then roll it back. Verify that the system returns to its previous state. Make this a standard part of your deployment validation.

Step 4: Address database compatibility

Audit your database migration practices. If you are making destructive schema changes, shift to the expand-contract pattern. Ensure that the previous application version is always compatible with the current database schema.

Step 5: Reduce rollback time

Measure how long rollback takes. Identify and eliminate delays - slow artifact downloads, slow startup times, manual steps. Target rollback completion in under 5 minutes.

Step 6: Build team confidence

Practice rollback regularly. Demonstrate it during deployment reviews. Make it a normal part of operations, not an emergency procedure. When the team trusts rollback, they will trust deployment.

Connection to the Pipeline Phase

Rollback is the capstone of the Pipeline phase. It is what makes the rest of the phase safe:

The single path to production is how rollback is deployed - the same pipeline, the same path, in reverse
Immutable artifacts are what make rollback reliable - the previous artifact is unchanged in the artifact repository, ready to be redeployed
The deployable definition should include rollback verification as one of its quality gates
Application configuration separation ensures that rolling back the artifact does not require rolling back environment configuration
Production-like environments are where rollback is tested before it is needed in production

With rollback in place, the team has the confidence to deploy frequently, which is the foundation for Phase 3: Optimize and ultimately Phase 4: Deliver on Demand.

FAQ

How far back should we be able to roll back?

At minimum, keep the last 3 to 5 production releases available for rollback. Ideally, retain any production release from the past 30 to 90 days. Balance storage costs with rollback flexibility by defining a retention policy for your artifact repository.

What if the database schema changed?

Design schema changes to be backward-compatible:

Use the expand-contract pattern described above
Make schema changes in a separate deployment from the code changes that depend on them
Test that the old application code works with the new schema before deploying the code change

What if we need to roll back the database too?

Database rollbacks are inherently risky because they can destroy data. Instead of rolling back the database:

Design schema changes to support application rollback (backward compatibility)
Use feature flags to disable code that depends on the new schema
If absolutely necessary, maintain tested database rollback scripts - but treat this as a last resort

Should rollback require approval?

No. The on-call engineer should be empowered to roll back immediately without waiting for approval. Speed of recovery is critical during an incident. Post-rollback review is appropriate, but requiring approval before rollback adds delay when every minute counts.

How do we test rollback?

Practice regularly - perform rollback drills during low-traffic periods
Automate testing - include rollback verification in your pipeline
Use staging - test rollback in staging before every production deployment
Run chaos exercises - randomly trigger rollbacks to ensure they work under realistic conditions

What if rollback fails?

Have a contingency plan:

Roll forward to the next known-good version
Use feature flags to disable the problematic behavior
Have an out-of-band deployment method as a last resort

If rollback is regularly tested, failures should be extremely rare.

How long should rollback take?

Target under 5 minutes from the decision to roll back to service restored.

Typical breakdown:

Trigger rollback: under 30 seconds
Deploy previous artifact: 2 to 3 minutes
Verify with smoke tests: 1 to 2 minutes

What about configuration changes?

Configuration should be versioned and separated from the application artifact. Rolling back the artifact should not require separately rolling back environment configuration. See Application Configuration for how to achieve this.

Fear of Deploying - the symptom that reliable rollback capability directly resolves
Infrequent Releases - a symptom driven by deployment risk that rollback mitigates
Manual Deployments - an anti-pattern incompatible with fast, automated rollback
Immutable Artifacts - the Pipeline practice that makes rollback reliable by preserving previous artifacts
Mean Time to Repair - a key metric that rollback capability directly improves
Feature Flags - an Optimize practice that provides an alternative rollback mechanism at the feature level

4 - Phase 3: Optimize

Improve flow by reducing batch size, limiting work in progress, and using metrics to drive improvement.

Key question: “Can we deliver small changes quickly?”

With a working pipeline in place, this phase focuses on optimizing the flow of changes through it. Smaller batches, feature flags, and WIP limits reduce risk and increase delivery frequency.

What You’ll Do

Reduce batch size - Deliver smaller, more frequent changes
Use feature flags - Decouple deployment from release
Limit work in progress - Focus on finishing over starting
Drive improvement with metrics - Use DORA metrics and improvement kata
Run effective retrospectives - Continuously improve the delivery process
Decouple architecture - Enable independent deployment of components
Align teams to code - Match team ownership to code boundaries for independent deployment

Why This Phase Matters

Having a pipeline isn’t enough - you need to optimize the flow through it. Teams that deploy weekly with a CD pipeline are missing most of the benefits. Small batches reduce risk, feature flags enable testing in production, and metrics-driven improvement creates a virtuous cycle of getting better at getting better.

When You’re Ready to Move On

You’re ready for Phase 4: Deliver on Demand when:

Most changes are small enough to deploy independently
Feature flags let you deploy incomplete features safely
Your WIP limits keep work flowing without bottlenecks
You’re measuring and improving your DORA metrics regularly

Next: Phase 4 - Continuous Deployment - remove the last manual gates and deploy on demand.

Phase 2: Pipeline - the previous phase that establishes the deployment pipeline this phase optimizes
Phase 4: Deliver on Demand - the next phase after flow is optimized
Infrequent Releases - a key symptom that the Optimize phase addresses
Too Much WIP - a flow symptom targeted by WIP limits and small batches
DORA Recommended Practices - the research-backed capabilities that drive delivery performance
Deployment Frequency - the primary metric that improves as optimization takes hold

4.1 - Small Batches

Deliver smaller, more frequent changes to reduce risk and increase feedback speed.

Phase 3 - Optimize

Batch size is the single biggest lever for improving delivery performance. This page covers what batch size means at every level - deploy frequency, commit size, and story size - and provides concrete techniques for reducing it.

Why Batch Size Matters

Large batches create large risks. When you deploy 50 changes at once, any failure could be caused by any of those 50 changes. When you deploy 1 change, the cause of any failure is obvious.

This is not a theory. The DORA research consistently shows that elite teams deploy more frequently, with smaller changes, and have both higher throughput and lower failure rates. Small batches are the mechanism that makes this possible.

“If it hurts, do it more often, and bring the pain forward.”
Jez Humble, Continuous Delivery

Three Levels of Batch Size

Batch size is not just about deployments. It operates at three distinct levels, and optimizing only one while ignoring the others limits your improvement.

Level 1: Deploy Frequency

How often you push changes to production.

State	Deploy Frequency	Risk Profile
Starting	Monthly or quarterly	Each deploy is a high-stakes event
Improving	Weekly	Deploys are planned but routine
Optimizing	Daily	Deploys are unremarkable
Elite	Multiple times per day	Deploys are invisible

How to reduce: Remove manual gates, automate approval workflows, build confidence through progressive rollout. If your pipeline is reliable (Phase 2), the only thing preventing more frequent deploys is organizational habit.

Common objections to deploying more often:

“Incomplete features have no value.” Value is not limited to end-user features. Every deployment provides value to other stakeholders: operations verifies that the change is safe, QA confirms quality gates pass, and the team reduces inventory waste by keeping unintegrated work near zero. A partially built feature deployed behind a flag validates the deployment pipeline and reduces the risk of the final release.
“Our customers don’t want changes that frequently.” CD is not about shipping user-visible changes every hour. It is about maintaining the ability to deploy at any time. That ability is what lets you ship an emergency fix in minutes instead of days, roll out a security patch without a war room, and support production without heroics.

Level 2: Commit Size

How much code changes in each commit to trunk.

Indicator	Too Large	Right-Sized
Files changed	20+ files	1-5 files
Lines changed	500+ lines	Under 100 lines
Review time	Hours or days	Minutes
Merge conflicts	Frequent	Rare
Description length	Paragraph needed	One sentence suffices

How to reduce: Practice TDD (write one test, make it pass, commit). Use feature flags to merge incomplete work. Pair program so review happens in real time.

Level 3: Story Size

How much scope each user story or work item contains.

A story that takes a week to complete is a large batch. It means a week of work piles up before integration, a week of assumptions go untested, and a week of inventory sits in progress.

Target: Every story should be completable - coded, tested, reviewed, and integrated - in two days or less. If it cannot be, it needs to be decomposed further.

“If a story is going to take more than a day to complete, it is too big.”
Paul Hammant

This target is not aspirational. Teams that adopt hyper-sprints - iterations as short as 2.5 days - find that the discipline of writing one-day stories forces better decomposition and faster feedback. Teams that make this shift routinely see throughput double, not because people work faster, but because smaller stories flow through the system with less wait time, fewer handoffs, and fewer defects.

Behavior-Driven Development for Decomposition

BDD provides a concrete technique for breaking stories into small, testable increments. The Given-When-Then format forces clarity about scope.

The Given-When-Then Pattern

BDD scenarios for shopping cart discount feature

Feature: Shopping cart discount

  Scenario: Apply percentage discount to cart
    Given a cart with items totaling $100
    When I apply a 10% discount code
    Then the cart total should be $90

  Scenario: Reject expired discount code
    Given a cart with items totaling $100
    When I apply an expired discount code
    Then the cart total should remain $100
    And I should see "This discount code has expired"

  Scenario: Apply discount only to eligible items
    Given a cart with one eligible item at $50 and one ineligible item at $50
    When I apply a 10% discount code
    Then the cart total should be $95

Each scenario becomes a deliverable increment. You can implement and deploy the first scenario before starting the second. This is how you turn a “discount feature” (large batch) into three independent, deployable changes (small batches).

Decomposing Stories Using Scenarios

When a story has too many scenarios, it is too large. Use this process:

Write all the scenarios first. Before any code, enumerate every Given-When-Then for the story.
Group scenarios into deliverable slices. Each slice should be independently valuable or at least independently deployable.
Create one story per slice. Each story has 1-3 scenarios and can be completed in 1-2 days.
Order the slices by value. Deliver the most important behavior first.

Example decomposition:

Original Story	Scenarios	Sliced Into
“As a user, I can manage my profile”	12 scenarios covering name, email, password, avatar, notifications, privacy, deactivation	5 stories: basic info (2 scenarios), password (2), avatar (2), notifications (3), deactivation (3)

ATDD: Connecting Scenarios to Daily Integration

BDD scenarios define what to build. Acceptance Test-Driven Development (ATDD) defines how to build it in small, integrated steps. The workflow is:

Pick one scenario. Choose the next Given-When-Then from your story.
Write the acceptance test first. Automate the scenario so it runs against the real system (or a close approximation). It will fail - this is the RED state.
Write just enough code to pass. Implement the minimum production code to make the acceptance test pass - the GREEN state.
Refactor. Clean up the code while the test stays green.
Commit and integrate. Push to trunk. The pipeline verifies the change.
Repeat. Pick the next scenario.

Each cycle produces a commit that is independently deployable and verified by an automated test. This is how BDD scenarios translate directly into a stream of small, safe integrations rather than a batch of changes delivered at the end of a story.

Key benefits:

Every commit has a corresponding acceptance test, so you know exactly what it does and that it works.
You never go more than a few hours without integrating to trunk.
The acceptance tests accumulate into a regression suite that protects future changes.
If a commit breaks something, the scope of the change is small enough to diagnose quickly.

Service-Level Decomposition Example

ATDD works at the API and service level, not just at the UI level. Here is an example of building an order history endpoint day by day:

Day 1 - Return an empty list for a customer with no orders:

Day 1 scenario: empty order history endpoint

Scenario: Customer with no order history
  Given a customer with no previous orders
  When I request their order history
  Then I receive an empty list with a 200 status

Commit: Implement the endpoint, return an empty JSON array. Acceptance test passes.

Day 2 - Return a single order with basic fields:

Day 2 scenario: return a single order with basic fields

Scenario: Customer with one completed order
  Given a customer with one completed order for $49.99
  When I request their order history
  Then I receive a list with one order showing the total and status

Commit: Query the orders table, serialize basic fields. Previous test still passes.

Day 3 - Return multiple orders sorted by date:

Day 3 scenario: return orders sorted by date

Scenario: Orders returned in reverse chronological order
  Given a customer with orders placed on Jan 1, Feb 1, and Mar 1
  When I request their order history
  Then the orders are returned with the Mar 1 order first

Commit: Add sorting logic and pagination. All three tests pass.

Each day produces a deployable change. The endpoint is usable (though minimal) after day 1. No day requires more than a few hours of coding because the scope is constrained by a single scenario.

Vertical Slicing

A vertical slice cuts through all layers of the system to deliver a thin piece of end-to-end functionality. This is the opposite of horizontal slicing, where you build all the database changes, then all the API changes, then all the UI changes.

Horizontal vs. Vertical Slicing

Horizontal (avoid):

Horizontal slicing: stories split by architectural layer

Story 1: Build the database schema for discounts
Story 2: Build the API endpoints for discounts
Story 3: Build the UI for applying discounts

Problems: Story 1 and 2 deliver no user value. You cannot test end-to-end until story 3 is done. Integration risk accumulates.

Vertical (prefer):

Vertical slicing: stories split by user-observable behavior

Story 1: Apply a simple percentage discount (DB + API + UI for one scenario)
Story 2: Reject expired discount codes (DB + API + UI for one scenario)
Story 3: Apply discounts only to eligible items (DB + API + UI for one scenario)

Benefits: Every story delivers testable, deployable functionality. Integration happens with each story, not at the end. You can ship story 1 and get feedback before building story 2.

How to Slice Vertically

Ask these questions about each proposed story:

Can a user (or another system) observe the change? If not, slice differently.
Can I write an end-to-end test for it? If not, the slice is incomplete.
Does it require all other slices to be useful? If yes, find a thinner first slice.
Can it be deployed independently? If not, check whether feature flags could help.

Vertical slicing in distributed systems

The examples above assume a team that owns the full stack - UI, API, and database. In large distributed systems, most teams own a subdomain and may not be directly user-facing.

The principle is the same. A subdomain product team’s vertical slice cuts through all layers they control: the service API, the business logic, and the data store. “End-to-end” means end-to-end within your domain, not end-to-end across the entire system. The team deploys independently behind a stable contract, without coordinating with other teams.

The key difference is whether the public interface is designed for humans or machines. A full-stack product team owns a human-facing surface - the slice is done when a user can observe the behavior through that interface. A subdomain product team owns a machine-facing surface - the slice is done when the API contract satisfies the agreed behavior for its service consumers.

See Work Decomposition for diagrams of both contexts, and Horizontal Slicing for the failure mode that emerges when distributed teams split work by layer instead of by behavior.

Story Slicing Anti-Patterns

These are common ways teams slice stories that undermine the benefits of small batches:

Wrong: Slice by layer. “Story 1: Build the database. Story 2: Build the API. Story 3: Build the UI.” Right: Slice vertically so each story touches all layers and delivers observable behavior.

Wrong: Slice by activity. “Story 1: Design. Story 2: Implement. Story 3: Test.” Right: Each story includes all activities needed to deliver and verify one behavior.

Wrong: Create dependent stories. “Story 2 cannot start until Story 1 is finished because it depends on the data model.” Right: Each story is independently deployable. Use contracts, feature flags, or stubs to break dependencies between stories.

Wrong: Lose testability. “This story just sets up infrastructure - there is nothing to test yet.” Right: Every story has at least one automated test that verifies its behavior. If you cannot write a test, the slice does not deliver observable value.

Practical Steps for Reducing Batch Size

Step 1: Measure Current State

Before changing anything, measure where you are:

Average commit size (lines changed per commit)
Average story cycle time (time from start to done)
Deploy frequency (how often changes reach production)
Average changes per deploy (how many commits per deployment)

Step 2: Introduce Story Decomposition

Start writing BDD scenarios before implementation
Split any story estimated at more than 2 days
Track the number of stories completed per week (expect this to increase as stories get smaller)

Step 3: Tighten Commit Size

Adopt the discipline of “one logical change per commit”
Use TDD to create a natural commit rhythm: write test, make it pass, commit
Track average commit size and set a team target (e.g., under 100 lines)

Ongoing: Increase Deploy Frequency

Deploy at least once per day, then work toward multiple times per day
Remove any batch-oriented processes (e.g., “we deploy on Tuesdays”)
Make deployment a non-event

Key Pitfalls

1. “Small stories take more overhead to manage”

This is true only if your process adds overhead per story (e.g., heavyweight estimation ceremonies, multi-level approval). The solution is to simplify the process, not to keep stories large. Overhead per story should be near zero for a well-decomposed story.

2. “Some things can’t be done in small batches”

Almost anything can be decomposed further. Database migrations can be done in backward-compatible steps. API changes can use versioning. UI changes can be hidden behind feature flags. The skill is in finding the decomposition, not in deciding whether one exists.

3. “We tried small stories but our throughput dropped”

This usually means the team is still working sequentially. Small stories require limiting WIP and swarming - see Limiting WIP. If the team starts 10 small stories instead of 2 large ones, they have not actually reduced batch size; they have increased WIP.

Measuring Success

Metric	Target	Why It Matters
Development cycle time	< 2 days per story	Confirms stories are small enough to complete quickly
Integration frequency	Multiple times per day	Confirms commits are small and frequent
Release frequency	Daily or more	Confirms deploys are routine
Change fail rate	Decreasing	Confirms small changes reduce failure risk

Next Step

Small batches often require deploying incomplete features to production. Feature Flags provide the mechanism to do this safely.

Infrequent Releases - the symptom of deploying too rarely that small batches directly address
Hardening Sprints - a symptom caused by large batch sizes requiring stabilization periods
Monolithic Work Items - the anti-pattern of stories too large to deliver in small increments
Horizontal Slicing - the anti-pattern of splitting work by layer instead of by value
Work Decomposition - the foundational practice for breaking work into small deliverable pieces
Feature Flags - the mechanism that makes deploying incomplete small batches safe
Small-Batch Agent Sessions - applying the same one-scenario-one-commit discipline to agent-generated work

4.2 - Feature Flags

Decouple deployment from release by using feature flags to control feature visibility.

Phase 3 - Optimize

Feature flags are the mechanism that makes trunk-based development and small batches safe. They let you deploy code to production without exposing it to users, enabling dark launches, gradual rollouts, and instant rollback of features without redeploying.

Why Feature Flags?

In continuous delivery, deployment and release are two separate events:

Deployment is pushing code to production.
Release is making a feature available to users.

Feature flags are the bridge between these two events. They let you deploy frequently (even multiple times a day) without worrying about exposing incomplete or untested features. This separation is what makes continuous deployment possible for teams that ship real products to real users.

When You Need Feature Flags (and When You Don’t)

Not every change requires a feature flag. Flags add complexity, and unnecessary complexity slows you down. Use this decision tree to determine the right approach.

Decision Tree

graph TD
    Start[New Code Change] --> Q1{Is this a large or<br/>high-risk change?}

    Q1 -->|Yes| Q2{Do you need gradual<br/>rollout or testing<br/>in production?}
    Q1 -->|No| Q3{Is the feature<br/>incomplete or spans<br/>multiple releases?}

    Q2 -->|Yes| UseFF1[YES - USE FEATURE FLAG<br/>Enables safe rollout<br/>and quick rollback]
    Q2 -->|No| Q4{Do you need to<br/>test in production<br/>before full release?}

    Q3 -->|Yes| Q3A{Can you use an<br/>alternative pattern?}
    Q3 -->|No| Q5{Do different users/<br/>customers need<br/>different behavior?}

    Q3A -->|New Feature| NoFF_NewFeature[NO FLAG NEEDED<br/>Connect to tests only,<br/>integrate in final commit]
    Q3A -->|Behavior Change| NoFF_Abstraction[NO FLAG NEEDED<br/>Use branch by<br/>abstraction pattern]
    Q3A -->|New API Route| NoFF_API[NO FLAG NEEDED<br/>Build route, expose<br/>as last change]
    Q3A -->|Not Applicable| UseFF2[YES - USE FEATURE FLAG<br/>Enables trunk-based<br/>development]

    Q4 -->|Yes| UseFF3[YES - USE FEATURE FLAG<br/>Dark launch or<br/>beta testing]
    Q4 -->|No| Q6{Is this an<br/>experiment or<br/>A/B test?}

    Q5 -->|Yes| UseFF4[YES - USE FEATURE FLAG<br/>Customer-specific<br/>toggles needed]
    Q5 -->|No| Q7{Does change require<br/>coordination with<br/>other teams/services?}

    Q6 -->|Yes| UseFF5[YES - USE FEATURE FLAG<br/>Required for<br/>experimentation]
    Q6 -->|No| NoFF1[NO FLAG NEEDED<br/>Simple change,<br/>deploy directly]

    Q7 -->|Yes| UseFF6[YES - USE FEATURE FLAG<br/>Enables independent<br/>deployment]
    Q7 -->|No| Q8{Is this a bug fix<br/>or hotfix?}

    Q8 -->|Yes| NoFF2[NO FLAG NEEDED<br/>Deploy immediately]
    Q8 -->|No| NoFF3[NO FLAG NEEDED<br/>Standard deployment<br/>sufficient]

    style UseFF1 fill:#90EE90
    style UseFF2 fill:#90EE90
    style UseFF3 fill:#90EE90
    style UseFF4 fill:#90EE90
    style UseFF5 fill:#90EE90
    style UseFF6 fill:#90EE90
    style NoFF1 fill:#FFB6C6
    style NoFF2 fill:#FFB6C6
    style NoFF3 fill:#FFB6C6
    style NoFF_NewFeature fill:#FFB6C6
    style NoFF_Abstraction fill:#FFB6C6
    style NoFF_API fill:#FFB6C6
    style Start fill:#87CEEB

Alternatives to Feature Flags

Technique	How It Works	When to Use
Branch by Abstraction	Introduce an abstraction layer, build the new implementation behind it, switch when ready	Replacing an existing subsystem or library
Connect Tests Last	Build internal components without connecting them to the UI or API	New backend functionality that has no user-facing impact until connected
Dark Launch	Deploy the code path but do not route any traffic to it	New infrastructure, new services, or new endpoints that are not yet referenced

These alternatives avoid the lifecycle overhead of feature flags while still enabling trunk-based development with incomplete work.

Implementation Approaches

Feature flags can be implemented at different levels of sophistication. Start simple and add complexity only when needed.

Level 1: Static Code-Based Flags

The simplest approach: a boolean constant or configuration value checked in code.

Level 1: Static boolean flag in code

# config.py
FEATURE_NEW_CHECKOUT = False

# checkout.py
from config import FEATURE_NEW_CHECKOUT

def process_checkout(cart, user):
    if FEATURE_NEW_CHECKOUT:
        return new_checkout_flow(cart, user)
    else:
        return legacy_checkout_flow(cart, user)

Pros: Zero infrastructure. Easy to understand. Works everywhere.

Cons: Changing a flag requires a deployment. No per-user targeting. No gradual rollout.

Best for: Teams starting out. Internal tools. Changes that will be fully on or fully off.

Level 2: Dynamic In-Process Flags

Flags stored in a configuration file, database, or environment variable that can be changed at runtime without redeploying.

Level 2: Dynamic in-process flag service with percentage rollout

# flag_service.py
import json

class FeatureFlags:
    def __init__(self, config_path="/etc/flags.json"):
        self._config_path = config_path

    def is_enabled(self, flag_name, context=None):
        flags = json.load(open(self._config_path))
        flag = flags.get(flag_name, {})

        if not flag.get("enabled", False):
            return False

        # Percentage rollout
        if "percentage" in flag and context and "user_id" in context:
            return (hash(context["user_id"]) % 100) < flag["percentage"]

        return True

Level 2: Flag configuration file with percentage rollout

{
  "new-checkout": {
    "enabled": true,
    "percentage": 10
  }
}

Pros: No redeployment needed. Supports percentage rollout. Simple to implement.

Cons: Each instance reads its own config - no centralized view. Limited targeting capabilities.

Best for: Teams that need gradual rollout but do not want to adopt a third-party service yet.

Level 3: Centralized Flag Service

A dedicated service (self-hosted or SaaS) that manages all flags, provides a dashboard, supports targeting rules, and tracks flag usage.

Examples: LaunchDarkly, Unleash, Flagsmith, Split, or a custom internal service.

Level 3: Centralized flag service with user-context targeting

from feature_flag_client import FlagClient

client = FlagClient(api_key="...")

def process_checkout(cart, user):
    if client.is_enabled("new-checkout", user_context={"id": user.id, "plan": user.plan}):
        return new_checkout_flow(cart, user)
    else:
        return legacy_checkout_flow(cart, user)

Pros: Centralized management. Rich targeting (by user, plan, region, etc.). Audit trail. Real-time changes.

Cons: Added dependency. Cost (for SaaS). Network latency for flag evaluation (mitigated by local caching in most SDKs).

Best for: Teams at scale. Products with diverse user segments. Regulated environments needing audit trails.

Level 4: Infrastructure Routing

Instead of checking flags in application code, route traffic at the infrastructure level (load balancer, service mesh, API gateway).

Level 4: Istio VirtualService for infrastructure-level traffic routing

# Istio VirtualService example
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: checkout-service
spec:
  hosts:
    - checkout
  http:
    - match:
        - headers:
            x-feature-group:
              exact: "beta"
      route:
        - destination:
            host: checkout-v2
    - route:
        - destination:
            host: checkout-v1

Pros: No application code changes. Clean separation of routing from logic. Works across services.

Cons: Requires infrastructure investment. Less granular than application-level flags. Harder to target individual users.

Best for: Microservice architectures. Service-level rollouts. A/B testing at the infrastructure layer.

Feature Flag Lifecycle

Every feature flag has a lifecycle. Flags that are not actively managed become technical debt. Follow this lifecycle rigorously.

The Stages

Feature flag lifecycle: the stages from create to remove

1. CREATE       → Define the flag, document its purpose and owner
2. DEPLOY OFF   → Code ships to production with the flag disabled
3. BUILD        → Incrementally add functionality behind the flag
4. DARK LAUNCH  → Enable for internal users or a small test group
5. ROLLOUT      → Gradually increase the percentage of users
6. REMOVE       → Delete the flag and the old code path

Stage 1: Create

Before writing any code, define the flag:

Name: Use a consistent naming convention (e.g., enable-new-checkout, feature.discount-engine)
Owner: Who is responsible for this flag through its lifecycle?
Purpose: One sentence describing what the flag controls
Planned removal date: Set this at creation time. Flags without removal dates become permanent.

Stage 2: Deploy OFF

The first deployment includes the flag check but the flag is disabled. This verifies that:

The flag infrastructure works
The default (off) path is unaffected
The flag check does not introduce performance issues

Stage 3: Build Incrementally

Continue building the feature behind the flag over multiple deploys. Each deploy adds more functionality, but the flag remains off for users. Test both paths in your automated suite:

Testing both flag states: parametrize over enabled and disabled

@pytest.mark.parametrize("flag_enabled", [True, False])
def test_checkout_with_flag(flag_enabled, monkeypatch):
    monkeypatch.setattr(flags, "is_enabled", lambda name, ctx=None: flag_enabled)
    result = process_checkout(cart, user)
    assert result.status == "success"

Stage 4: Dark Launch

Enable the flag for internal users or a specific test group. This is your first validation with real production data and real traffic patterns. Monitor:

Error rates for the flagged group vs. control
Performance metrics (latency, throughput)
Business metrics (conversion, engagement)

Stage 5: Gradual Rollout

Increase exposure systematically:

Step	Audience	Duration	What to Watch
1	1% of users	1-2 hours	Error rates, latency
2	5% of users	4-8 hours	Performance at slightly higher load
3	25% of users	1 day	Business metrics begin to be meaningful
4	50% of users	1-2 days	Statistically significant business impact
5	100% of users	-	Full rollout

At any step, if metrics degrade, roll back by disabling the flag. No redeployment needed.

Stage 6: Remove

This is the most commonly skipped step, and skipping it creates significant technical debt.

Once the feature has been stable at 100% for an agreed period (e.g., 2 weeks):

Remove the flag check from code
Remove the old code path
Remove the flag definition from the flag service
Deploy the simplified code

Set a maximum flag lifetime. A common practice is 90 days. Any flag older than 90 days triggers an automatic review. Stale flags are a maintenance burden and a source of confusion.

Lifecycle Timeline Example

Day	Action	Flag State
1	Deploy flag infrastructure and create removal ticket	OFF
2-5	Build feature behind flag, integrate daily	OFF
6	Enable for internal users (dark launch)	ON for 0.1%
7	Enable for 1% of users	ON for 1%
8	Enable for 5% of users	ON for 5%
9	Enable for 25% of users	ON for 25%
10	Enable for 50% of users	ON for 50%
11	Enable for 100% of users	ON for 100%
12-18	Stability period (monitor)	ON for 100%
19-21	Remove flag from code	DELETED

Total lifecycle: approximately 3 weeks from creation to removal.

Long-Lived Feature Flags

Not all flags are temporary. Some flags are intentionally permanent and should be managed differently from release flags.

Operational Flags (Kill Switches)

Purpose: Disable expensive or non-critical features under load during incidents.

Lifecycle: Permanent.

Management: Treat as system configuration, not as a release mechanism.

Operational kill switch: disable expensive features during incidents

# PERMANENT FLAG - System operational control
# Used to disable expensive features during incidents
if flags.is_enabled("enable-recommendations"):
    recommendations = compute_recommendations(user)
else:
    recommendations = []  # Graceful degradation under load

Customer-Specific Toggles

Purpose: Different customers receive different features based on their subscription or contract.

Lifecycle: Permanent, tied to customer configuration.

Management: Part of the customer entitlement system, not the feature flag system.

Customer entitlement toggle: gate features by subscription level

# PERMANENT FLAG - Customer entitlement
# Controlled by customer subscription level
if customer.subscription.includes("analytics"):
    show_advanced_analytics(customer)

Experimentation Flags

Purpose: A/B testing and experimentation.

Lifecycle: The flag infrastructure is permanent, but individual experiments expire.

Management: Each experiment has its own expiration date and success criteria. The experimentation platform itself persists.

Experimentation flag: route users to A/B test variants

# PERMANENT FLAG - Experimentation platform
# Individual experiments expire, platform remains
variant = experiments.get("checkout-optimization")
if variant == "streamlined":
    return streamlined_checkout(cart, user)
else:
    return standard_checkout(cart, user)

Managing Long-Lived Flags

Long-lived flags need different discipline than temporary ones:

Use a separate naming convention (e.g., KILL_SWITCH_*, ENTITLEMENT_*) to distinguish them from temporary release flags
Document why each flag is permanent so future team members understand the intent
Store them separately from temporary flags in your management system
Review regularly to confirm they are still needed

Key Pitfalls

1. “We have 200 feature flags and nobody knows what they all do”

This is flag debt, and it is as damaging as any other technical debt. Prevent it by enforcing the lifecycle: every flag has an owner, a purpose, and a removal date. Run a monthly flag audit.

2. “We use flags for everything, including configuration”

Feature flags and configuration are different concerns. Flags are temporary (they control unreleased features). Configuration is permanent (it controls operational behavior like timeouts, connection pools, log levels). Mixing them leads to confusion about what can be safely removed.

3. “Testing both paths doubles our test burden”

It does increase test effort, but this is a temporary cost. When the flag is removed, the extra tests go away too. The alternative - deploying untested code paths - is far more expensive.

4. “Nested flags create combinatorial complexity”

Avoid nesting flags whenever possible. If feature B depends on feature A, do not create a separate flag for B. Instead, extend the behavior behind feature A’s flag. If you must nest, document the dependency and test the specific combinations that matter.

Flag Removal Anti-Patterns

These specific patterns are the most common ways teams fail at flag cleanup.

Don’t skip the removal ticket:

WRONG: “We’ll remove it later when we have time”
RIGHT: Create a removal ticket at the same time you create the flag

Don’t leave flags after full rollout:

WRONG: Flag still in code 6 months after 100% rollout
RIGHT: Remove within 2-4 weeks of full rollout

Don’t forget to remove the old code path:

WRONG: Flag removed but old implementation still in the codebase
RIGHT: Remove the flag check AND the old implementation together

Don’t keep flags “just in case”:

WRONG: “Let’s keep it in case we need to roll back in the future”
RIGHT: After the stability period, rollback is handled by deployment, not by re-enabling a flag

Measuring Success

Metric	Target	Why It Matters
Active flag count	Stable or decreasing	Confirms flags are being removed, not accumulating
Average flag age	< 90 days	Catches stale flags before they become permanent
Flag-related incidents	Near zero	Confirms flag management is not causing problems
Time from deploy to release	Hours to days (not weeks)	Confirms flags enable fast, controlled releases

Next Step

Small batches and feature flags let you deploy more frequently, but deploying more means more work in progress. Limiting WIP ensures that increased deploy frequency does not create chaos.

Fear of Deploying - a symptom that feature flags help eliminate by making deployments reversible
Infrequent Releases - the symptom of batching releases that flags help break
Small Batches - the practice that feature flags make safe for incomplete work
Progressive Rollout - the deployment strategy that builds on feature flag capabilities
Trunk-Based Development - the branching strategy that feature flags enable
Limiting WIP - the next step after feature flags to manage increased deployment frequency
Hypothesis-Driven Development - using feature flags to control experiment exposure

4.3 - Limiting Work in Progress

Focus on finishing work over starting new work to improve flow and reduce cycle time.

Phase 3 - Optimize

Work in progress (WIP) is inventory. Like physical inventory, it loses value the longer it sits unfinished. Limiting WIP is the most counterintuitive and most impactful practice in this entire migration: doing less work at once makes you deliver more.

Why Limiting WIP Matters

Every item of work in progress has a cost:

Context switching: Moving between tasks destroys focus. Research consistently shows that switching between two tasks reduces productive time by 20-40%.
Delayed feedback: Work that is started but not finished cannot be validated by users. The longer it sits, the more assumptions go untested.
Hidden dependencies: The more items in progress simultaneously, the more likely they are to conflict, block each other, or require coordination.
Longer cycle time: Little’s Law states that cycle time = WIP / throughput. If throughput is constant, the only way to reduce cycle time is to reduce WIP.

“Stop starting, start finishing.”
Lean saying

How to Set Your WIP Limit

The N+2 Starting Point

A practical starting WIP limit for a team is N+2, where N is the number of team members actively working on delivery.

Team Size	Starting WIP Limit	Rationale
3 developers	5 items	Allows one item per person plus a small buffer
5 developers	7 items	Same principle at larger scale
8 developers	10 items	Buffer becomes proportionally smaller

Why N+2 and not N? Because some items will be blocked waiting for review, testing, or external dependencies. A small buffer prevents team members from being idle when their primary task is blocked. But the buffer should be small - two items, not ten.

Continuously Lower the Limit

The N+2 formula is a starting point, not a destination. Once the team is comfortable with the initial limit, reduce it:

Start at N+2. Run for 2-4 weeks. Observe where work gets stuck.
Reduce to N+1. Tighten the limit. Some team members will occasionally be “idle” - this is a feature, not a bug. They should swarm on blocked items.
Reduce to N. At this point, every team member is working on exactly one thing. Blocked work gets immediate attention because someone is always available to help.
Consider going below N. Some teams find that pairing (two people, one item) further reduces cycle time. A team of 6 with a WIP limit of 3 means everyone is pairing.

Each reduction will feel uncomfortable. That discomfort is the point - it exposes problems in your workflow that were previously hidden by excess WIP.

What Happens When You Hit the Limit

When the team reaches its WIP limit and someone finishes a task, they have two options:

Pull the next highest-priority item (if the WIP limit allows it).
Swarm on an existing item that is blocked, stuck, or nearing its cycle time target.

When the WIP limit is reached and no items are complete:

Do not start new work. This is the hardest part and the most important.
Help unblock existing work. Pair with someone. Review a pull request. Write a missing test. Talk to the person who has the answer to the blocking question.
Improve the process. If nothing is blocked but everything is slow, this is the time to work on automation, tooling, or documentation.

Swarming

Swarming is the practice of multiple team members working together on a single item to get it finished faster. It is the natural complement to WIP limits.

When to Swarm

An item has been in progress for longer than the team’s cycle time target (e.g., more than 2 days)
An item is blocked and the blocker can be resolved by another team member
The WIP limit is reached and someone needs work to do
A critical defect needs to be fixed immediately

How to Swarm Effectively

Approach	How It Works	Best For
Pair programming	Two developers work on the same item at the same machine	Complex logic, knowledge transfer, code that needs review
Mob programming	The whole team works on one item together	Critical path items, complex architectural decisions
Divide and conquer	Break the item into sub-tasks and assign them	Items that can be parallelized (e.g., frontend + backend + tests)
Unblock and return	One person resolves the blocker, then hands back	External dependencies, environment issues, access requests

Why Teams Resist Swarming

The most common objection: “It’s inefficient to have two people on one task.” This is only true if you measure efficiency as “percentage of time each person is writing new code.” If you measure efficiency as “how quickly value reaches production,” swarming is almost always faster because it reduces handoffs, wait time, and rework.

How Limiting WIP Exposes Workflow Issues

One of the most valuable effects of WIP limits is that they make hidden problems visible. When you cannot start new work, you are forced to confront the problems that slow existing work down.

Symptom When WIP Is Limited	Root Cause Exposed
“I’m idle because my PR is waiting for review”	Code review process is too slow
“I’m idle because I’m waiting for the test environment”	Not enough environments, or environments are not self-service
“I’m idle because I’m waiting for the product owner to clarify requirements”	Stories are not refined before being pulled into the sprint
“I’m idle because my build is broken and I can’t figure out why”	Build is not deterministic, or test suite is flaky
“I’m idle because another team hasn’t finished the API I depend on”	Architecture is too tightly coupled (see Architecture Decoupling)

Each of these is a bottleneck that was previously invisible because the team could always start something else. With WIP limits, these bottlenecks become obvious and demand attention.

Implementing WIP Limits

Step 1: Make WIP Visible

Before setting limits, make current WIP visible:

Count the number of items currently “in progress” for the team
Write this number on the board (physical or digital) every day
Most teams are shocked by how high it is. A team of 5 often has 15-20 items in progress.

Step 2: Set the Initial Limit

Calculate N+2 for your team
Add the limit to your board (e.g., a column header that says “In Progress (limit: 7)”)
Agree as a team that when the limit is reached, no new work starts

Step 3: Enforce the Limit

When someone tries to pull new work and the limit is reached, the team helps them find an existing item to work on
Track violations: how often does the team exceed the limit? What causes it?
Discuss in retrospectives: Is the limit too high? Too low? What bottlenecks are exposed?

Step 4: Reduce the Limit (Monthly)

Every month, consider reducing the limit by 1
Each reduction will expose new bottlenecks - this is the intended effect
Stop reducing when the team reaches a sustainable flow where items move from start to done predictably

Key Pitfalls

1. “We set a WIP limit but nobody enforces it”

A WIP limit that is not enforced is not a WIP limit. Enforcement requires a team agreement and a visible mechanism. If the board shows 10 items in progress and the limit is 7, the team should stop and address it immediately. This is a working agreement, not a suggestion.

2. “Developers are idle and management is uncomfortable”

This is the most common failure mode. Management sees “idle” developers and concludes WIP limits are wasteful. In reality, those “idle” developers are either swarming on existing work (which is productive) or the team has hit a genuine bottleneck that needs to be addressed. The discomfort is a signal that the system needs improvement.

3. “We have WIP limits but we also have expedite lanes for everything”

If every urgent request bypasses the WIP limit, you do not have a WIP limit. Expedite lanes should be rare - one per week at most. If everything is urgent, nothing is.

4. “We limit WIP per person but not per team”

Per-person WIP limits miss the point. The goal is to limit team WIP so that team members are incentivized to help each other. A per-person limit of 1 with no team limit still allows the team to have 8 items in progress simultaneously with no swarming.

Measuring Success

Metric	Target	Why It Matters
Work in progress	At or below team limit	Confirms the limit is being respected
Development cycle time	Decreasing	Confirms that less WIP leads to faster delivery
Items completed per week	Stable or increasing	Confirms that finishing more, starting less works
Time items spend blocked	Decreasing	Confirms bottlenecks are being addressed

Next Step

WIP limits expose problems. Metrics-Driven Improvement provides the framework for systematically addressing them.

Content contributed by Dojo Consortium, licensed under CC BY 4.0.

Too Much WIP - the primary symptom that WIP limits address
Work Items Take Too Long - a symptom caused by excess work in progress
PRs Waiting for Review - a bottleneck that WIP limits expose
Unbounded WIP - the anti-pattern of having no limits on work in progress
Push-Based Work Assignment - the anti-pattern of assigning work rather than letting teams pull it
Work in Progress metric - how to measure and track WIP over time

4.4 - Metrics-Driven Improvement

Use DORA metrics and improvement kata to drive systematic delivery improvement.

Phase 3 - Optimize | Original content combining DORA recommendations and improvement kata

Improvement without measurement is guesswork. This page combines the DORA four key metrics with the improvement kata pattern to create a systematic, repeatable approach to getting better at delivery.

The Problem with Ad Hoc Improvement

Most teams improve accidentally. Someone reads a blog post, suggests a change at standup, and the team tries it for a week before forgetting about it. This produces sporadic, unmeasurable progress that is impossible to sustain.

Metrics-driven improvement replaces this with a disciplined cycle: measure where you are, define where you want to be, run a small experiment, measure the result, and repeat. The improvement kata provides the structure. DORA metrics provide the measures.

The Four DORA Metrics

The DORA research program (now part of Google Cloud) has identified four key metrics that predict software delivery performance. These are the metrics you should track throughout your CD migration.

1. Deployment Frequency

How often your team deploys to production.

Performance Level	Deployment Frequency
Elite	On-demand (multiple deploys per day)
High	Between once per day and once per week
Medium	Between once per week and once per month
Low	Between once per month and once every six months

What it tells you: How comfortable your team and pipeline are with deploying. Low frequency usually indicates manual gates, fear of deployment, or large batch sizes.

How to measure: Count the number of successful deployments to production per unit of time. Automated deploys count. Hotfixes count. Rollbacks do not.

2. Lead Time for Changes

The time from a commit being pushed to trunk to that commit running in production.

Performance Level	Lead Time
Elite	Less than one hour
High	Between one day and one week
Medium	Between one week and one month
Low	Between one month and six months

What it tells you: How efficient your pipeline is. Long lead times indicate slow builds, manual approval steps, or infrequent deployment windows.

How to measure: Record the timestamp when a commit merges to trunk and the timestamp when that commit is running in production. The difference is lead time. Track the median, not the mean (outliers distort the mean).

3. Change Failure Rate

The percentage of deployments that cause a failure in production requiring remediation (rollback, hotfix, or patch).

Performance Level	Change Failure Rate
Elite	0-15%
High	16-30%
Medium	16-30%
Low	46-60%

What it tells you: How effective your testing and validation pipeline is. High failure rates indicate gaps in test coverage, insufficient pre-production validation, or overly large changes.

How to measure: Track deployments that result in a degraded service, require rollback, or need a hotfix. Divide by total deployments. A “failure” is defined by the team - typically any incident that requires immediate human intervention.

4. Mean Time to Restore (MTTR)

How long it takes to recover from a failure in production.

Performance Level	Time to Restore
Elite	Less than one hour
High	Less than one day
Medium	Less than one day
Low	Between one week and one month

What it tells you: How resilient your system and team are. Long recovery times indicate manual rollback processes, poor observability, or insufficient incident response practices.

How to measure: Record the timestamp when a production failure is detected and the timestamp when service is fully restored. Track the median.

CI Health Metrics

DORA metrics are outcome metrics - they tell you how delivery is performing overall. CI health metrics are leading indicators that give you earlier feedback on the health of your integration practices. Problems in these metrics show up days or weeks before they surface in DORA numbers.

Track these alongside DORA metrics to catch issues before they compound.

Commits Per Day Per Developer

Aspect	Detail
What it measures	The average number of commits integrated to trunk per developer per day
How to measure	Count trunk commits (or merged pull requests) over a period and divide by the number of active developers and working days
Good target	2 or more per developer per day
Why it matters	Low commit frequency indicates large batch sizes, long-lived branches, or developers waiting to integrate. All of these increase merge risk and slow feedback.

If the number is low: Developers may be working on branches for too long, bundling unrelated changes into single commits, or facing barriers to integration (slow builds, complex merge processes). Investigate branch lifetimes and work decomposition.

If the number is unusually high: Verify that commits represent meaningful work rather than trivial fixes to pass a metric. Commit frequency is a means to smaller batches, not a goal in itself.

Build Success Rate

Aspect	Detail
What it measures	The percentage of CI builds that pass on the first attempt
How to measure	Divide the number of green builds by total builds over a period
Good target	90% or higher
Why it matters	A frequently broken build disrupts the entire team. Developers cannot integrate confidently when the build is unreliable, leading to longer feedback cycles and batching of changes.

If the number is low: Common causes include flaky tests, insufficient local validation before committing, or environmental inconsistencies between developer machines and CI. Start by identifying and quarantining flaky tests, then ensure developers can run a representative build locally before pushing.

If the number is high but DORA metrics are still lagging: The build may pass but take too long, or the build may not cover enough to catch real problems. Check build duration and test coverage.

Time to Fix a Broken Build

Aspect	Detail
What it measures	The elapsed time from a build breaking to the next green build on trunk
How to measure	Record the timestamp of the first red build and the timestamp of the next green build. Track the median.
Good target	Less than 10 minutes
Why it matters	A broken build blocks everyone. The longer it stays broken, the more developers stack changes on top of a broken baseline, compounding the problem. Fast fix times are a sign of strong CI discipline.

If the number is high: The team may not be treating broken builds as a stop-the-line event. Establish a team agreement: when the build breaks, fixing it takes priority over all other work. If builds break frequently and take long to fix, reduce change size so failures are easier to diagnose.

The DORA Recommended Practices

Behind these four metrics are 24 practices that the DORA research has shown to drive performance. They organize into five categories. Use this as a diagnostic tool: when a metric is lagging, look at the related practices to identify what to improve.

Continuous Delivery Practices

These directly affect your pipeline and deployment practices:

Version control for all production artifacts
Automated deployment processes
Continuous integration
Trunk-based development
Test automation
Test data management
Shift-left security
Continuous delivery (the ability to deploy at any time)

Architecture Practices

These affect how easily your system can be changed and deployed:

Loosely coupled architecture
Empowered teams that can choose their own tools
Teams that can test, deploy, and release independently

Product and Process Practices

These affect how work flows through the team:

Customer feedback loops
Value stream visibility
Working in small batches
Team experimentation

Lean Management Practices

These affect how the organization supports delivery:

Lightweight change approval processes
Monitoring and observability
Proactive notification
WIP limits
Visual management of workflow

Cultural Practices

These affect the environment in which teams operate:

Generative organizational culture (Westrum model)
Encouraging and supporting learning
Collaboration within and between teams
Job satisfaction
Transformational leadership

For a detailed breakdown, see the DORA Recommended Practices reference.

The Improvement Kata

The improvement kata is a four-step pattern from lean manufacturing adapted for software delivery. It provides the structure for turning DORA measurements into concrete improvements.

Step 1: Understand the Direction

Where does your CD migration need to go?

This is already defined by the phases of this migration guide. In Phase 3, your direction is: smaller batches, faster flow, and higher confidence in every deployment.

Step 2: Grasp the Current Condition

Measure your current DORA metrics. Be honest - the point is to understand reality, not to look good.

Practical approach:

Collect two weeks of data for all four DORA metrics
Plot the data - do not just calculate averages. Look at the distribution.
Identify which metric is furthest from your target
Investigate the related practices to understand why

Example current condition:

Metric	Current	Target	Gap
Deployment frequency	Weekly	Daily	5x improvement needed
Lead time	3 days	< 1 day	Pipeline is slow or has manual gates
Change failure rate	25%	< 15%	Test coverage or change size issue
MTTR	4 hours	< 1 hour	Rollback is manual

Step 3: Establish the Next Target Condition

Do not try to fix everything at once. Pick one metric and define a specific, measurable, time-bound target.

Good target: “Reduce lead time from 3 days to 1 day within the next 4 weeks.”

Bad target: “Improve our deployment pipeline.” (Too vague, no measure, no deadline.)

Step 4: Experiment Toward the Target

Design a small experiment that you believe will move the metric toward the target. Run it. Measure the result. Adjust.

The experiment format:

Element	Description
Hypothesis	“If we [action], then [metric] will [improve/decrease] because [reason].”
Action	What specifically will you change?
Duration	How long will you run the experiment? (Typically 1-2 weeks)
Measure	How will you know if it worked?
Decision criteria	What result would cause you to keep, modify, or abandon the change?

Example experiment:

Hypothesis: If we parallelize our integration test suite, lead time will drop from 3 days to under 2 days because 60% of lead time is spent waiting for tests to complete.
Action: Split the integration test suite into 4 parallel runners.
Duration: 2 weeks.
Measure: Median lead time for commits merged during the experiment period.
Decision criteria: Keep if lead time drops below 2 days. Modify if it drops but not enough. Abandon if it has no effect or introduces flakiness.

The Cycle Repeats

After each experiment:

Measure the result
Update your understanding of the current condition
If the target is met, pick the next metric to improve
If the target is not met, design another experiment

This creates a continuous improvement loop. Each cycle takes 1-2 weeks. Over months, the cumulative effect is dramatic.

Connecting Metrics to Action

When a metric is lagging, use this guide to identify where to focus.

Low Deployment Frequency

Possible Cause	Investigation	Action
Manual approval gates	Map the approval chain	Automate or eliminate non-value-adding approvals
Fear of deployment	Ask the team what they fear	Address the specific fear (usually testing gaps)
Large batch size	Measure changes per deploy	Implement small batches practices
Deploy process is manual	Time the deploy process	Automate the deployment pipeline

Long Lead Time

Possible Cause	Investigation	Action
Slow builds	Time each pipeline stage	Optimize the slowest stage (often tests)
Waiting for environments	Track environment wait time	Implement self-service environments
Waiting for approval	Track approval wait time	Reduce approval scope or automate
Large changes	Measure commit size	Reduce batch size

High Change Failure Rate

Possible Cause	Investigation	Action
Insufficient test coverage	Measure coverage by area	Add tests for the areas that fail most
Tests pass but production differs	Compare test and prod environments	Make environments more production-like
Large, risky changes	Measure change size	Reduce batch size, use feature flags
Configuration drift	Audit configuration differences	Externalize and version configuration

Long MTTR

Possible Cause	Investigation	Action
Rollback is manual	Time the rollback process	Automate rollback
Hard to identify root cause	Review recent incidents	Improve observability and alerting
Hard to deploy fixes quickly	Measure fix lead time	Ensure pipeline supports rapid hotfix deployment
Dependencies fail in cascade	Map failure domains	Improve architecture decoupling

Pipeline Visibility

Metrics only drive improvement when people see them. Pipeline visibility means making the current state of your build and deployment pipeline impossible to ignore. When the build is red, everyone should know immediately - not when someone checks a dashboard twenty minutes later.

Making Build Status Visible

The most effective teams use ambient visibility - information that is passively available without anyone needing to seek it out.

Build radiators: A large monitor in the team area showing the current pipeline status. Green means the build is passing. Red means it is broken. The radiator should be visible from every desk in the team space. For remote teams, a persistent widget in the team chat channel serves the same purpose.

Browser extensions and desktop notifications: Tools like CCTray, BuildNotify, or CI server plugins can display build status in the system tray or browser toolbar. These provide individual-level ambient awareness without requiring a shared physical space.

Chat integrations: Post build results to the team channel automatically. Keep these concise - a green checkmark or red alert with a link to the build is enough. Verbose build logs in chat become noise.

Notification Best Practices

Notifications are powerful when used well and destructive when overused. The goal is to notify the right people at the right time with the right level of urgency.

When to notify:

Build breaks on trunk - notify the whole team immediately
Build is fixed - notify the whole team (this is a positive signal worth reinforcing)
Deployment succeeds - notify the team channel (low urgency)
Deployment fails - notify the on-call and the person who triggered it

When not to notify:

Every commit or pull request update (too noisy)
Successful builds on feature branches (nobody else needs to know)
Metrics that have not changed (no signal in “things are the same”)

Avoiding notification fatigue: If your team ignores notifications, you have too many of them. Audit your notification channels quarterly. Remove any notification that the team consistently ignores. A notification that nobody reads is worse than no notification at all - it trains people to tune out the channel entirely.

Building a Metrics Dashboard

Make your DORA metrics and CI health metrics visible to the team at all times. A dashboard on a wall monitor or a shared link is ideal.

Essential Information

Organize your dashboard around three categories:

Current status - what is happening right now:

Pipeline status (green/red) for trunk and any active deployments
Current values for all four DORA metrics
Active experiment description and target condition

Trends - where are we heading:

Trend lines showing direction over the past 4-8 weeks
CI health metrics (build success rate, time to fix, commit frequency) plotted over time
Whether the current improvement target is on track

Team health - how is the team doing:

Current improvement target highlighted
Days since last production incident
Number of experiments completed this quarter

Dashboard Anti-Patterns

The vanity dashboard: Displays only metrics that look good. If your dashboard never shows anything concerning, it is not useful. Include metrics that challenge the team, not just ones that reassure management.

The everything dashboard: Crams dozens of metrics, charts, and tables onto one screen. Nobody can parse it at a glance, so nobody looks at it. Limit your dashboard to 6-8 key indicators. If you need more detail, put it on a drill-down page.

The stale dashboard: Data is updated manually and falls behind. Automate data collection wherever possible. A dashboard showing last month’s numbers is worse than no dashboard - it creates false confidence.

The blame dashboard: Ties metrics to individual developers rather than teams. This creates fear and gaming rather than improvement. Always present metrics at the team level.

Keep it simple. A spreadsheet updated weekly is better than a sophisticated dashboard that nobody maintains. The goal is visibility, not tooling sophistication.

Key Pitfalls

1. “We measure but don’t act”

Measurement without action is waste. If you collect metrics but never run experiments, you are creating overhead with no benefit. Every measurement should lead to a hypothesis. Every hypothesis should lead to an experiment. See Hypothesis-Driven Development for the full lifecycle.

2. “We use metrics to compare teams”

DORA metrics are for teams to improve themselves, not for management to rank teams. Using metrics for comparison creates incentives to game the numbers. Each team should own its own metrics and its own improvement targets.

3. “We try to improve all four metrics at once”

Focus on one metric at a time. Improving deployment frequency and change failure rate simultaneously often requires conflicting actions. Pick the biggest bottleneck, address it, then move to the next.

4. “We abandon experiments too quickly”

Most experiments need at least two weeks to show results. One bad day is not a reason to abandon an experiment. Set the duration up front and commit to it.

Measuring Success

Indicator	Target	Why It Matters
Experiments per month	2-4	Confirms the team is actively improving
Metrics trending in the right direction	Consistent improvement over 3+ months	Confirms experiments are having effect
Team can articulate current condition and target	Everyone on the team knows	Confirms improvement is a shared concern
Improvement items in backlog	Always present	Confirms improvement is treated as a deliverable

Next Step

Metrics tell you what to improve. Retrospectives provide the team forum for deciding how to improve it.

Deployment Frequency - one of the four key DORA metrics
Lead Time - one of the four key DORA metrics
Change Fail Rate - one of the four key DORA metrics
Mean Time to Repair - one of the four key DORA metrics
DORA Recommended Practices - the 24 practices that drive delivery performance
Retrospectives - the team forum for acting on what metrics reveal
Hypothesis-Driven Development - the practice of treating every change as a testable experiment

4.5 - Retrospectives

Continuously improve the delivery process through structured reflection.

Phase 3 - Optimize

A retrospective is the team’s primary mechanism for turning observations into improvements. Without effective retrospectives, WIP limits expose problems that nobody addresses, metrics trend in the wrong direction with no response, and the CD migration stalls.

Why Retrospectives Matter for CD Migration

Every practice in this guide - trunk-based development, small batches, WIP limits, metrics-driven improvement - generates signals about what is working and what is not. Retrospectives are where the team processes those signals and decides what to change.

Teams that skip retrospectives or treat them as a checkbox exercise consistently stall at whatever maturity level they first reach. Teams that run effective retrospectives continuously improve, week after week, month after month.

The Five-Part Structure

An effective retrospective follows a structured format that prevents it from devolving into a venting session or a status meeting. This five-part structure ensures the team moves from observation to action.

Part 1: Review the Mission (5 minutes)

Start by reminding the team of the larger goal. In the context of a CD migration, this might be:

“Our mission this quarter is to deploy to production at least once per day.”
“We are working toward eliminating manual gates in our pipeline.”
“Our goal is to reduce lead time from 3 days to under 1 day.”

This grounding prevents the retrospective from focusing on minor irritations and keeps the conversation aligned with what matters.

Part 2: Review the KPIs (10 minutes)

Present the team’s current metrics. For a CD migration, these are typically the DORA metrics plus any team-specific measures from Metrics-Driven Improvement.

Metric	Last Period	This Period	Trend
Deployment frequency	3/week	4/week	Improving
Lead time (median)	2.5 days	2.1 days	Improving
Change failure rate	22%	18%	Improving
MTTR	3 hours	3.5 hours	Declining
WIP (average)	8 items	6 items	Improving

Do not skip this step. Without data, the retrospective becomes a subjective debate where the loudest voice wins. With data, the conversation focuses on what the numbers show and what to do about them.

Part 3: Review Experiments (10 minutes)

Review the outcomes of any experiments the team ran since the last retrospective.

For each experiment:

What was the hypothesis? Remind the team what you were testing.
What happened? Present the data.
What did you learn? Even failed experiments teach you something.
What is the decision? Keep, modify, or abandon.

Example:

Experiment: Parallelize the integration test suite to reduce lead time.
Hypothesis: Lead time would drop from 2.5 days to under 2 days.
Result: Lead time dropped to 2.1 days. The parallelization worked, but environment setup time is now the bottleneck.
Decision: Keep the parallelization. New experiment: investigate self-service test environments.

Part 4: Check Goals (10 minutes)

Review any improvement goals or action items from the previous retrospective.

Completed: Acknowledge and celebrate. This is important - it reinforces that improvement work matters.
In progress: Check for blockers. Does the team need to adjust the approach?
Not started: Why not? Was it deprioritized, blocked, or forgotten? If improvement work is consistently not started, the team is not treating improvement as a deliverable (see below).

Part 5: Open Conversation (25 minutes)

This is the core of the retrospective. The team discusses:

What is working well that we should keep doing?
What is not working that we should change?
What new problems or opportunities have we noticed?

Facilitation techniques for this section:

Technique	How It Works	Best For
Start/Stop/Continue	Each person writes items in three categories	Quick, structured, works with any team
4Ls (Liked, Learned, Lacked, Longed For)	Broader categories that capture emotional responses	Teams that need to process frustration or celebrate wins
Timeline	Plot events on a timeline and discuss turning points	After a particularly eventful sprint or incident
Dot voting	Everyone gets 3 votes to prioritize discussion topics	When there are many items and limited time

From Conversation to Commitment

The open conversation must produce concrete action items. Vague commitments like “we should communicate better” are worthless. Good action items are:

Specific: “Add a Slack notification when the build breaks” (not “improve communication”)
Owned: “Alex will set this up by Wednesday” (not “someone should do this”)
Measurable: “We will know this worked if build break response time drops below 10 minutes”
Time-bound: “We will review the result at the next retrospective”

Limit action items to 1-3 per retrospective. More than three means nothing gets done. One well-executed improvement is worth more than five abandoned ones.

Psychological Safety Is a Prerequisite

A retrospective only works if team members feel safe to speak honestly about what is not working. Without psychological safety, retrospectives produce sanitized, non-actionable discussion.

Signs of Low Psychological Safety

Only senior team members speak
Nobody mentions problems - everything is “fine”
Issues that everyone knows about are never raised
Team members vent privately after the retrospective instead of during it
Action items are always about tools or processes, never about behaviors

Building Psychological Safety

Practice	Why It Helps
Leader speaks last	Prevents the leader’s opinion from anchoring the discussion
Anonymous input	Use sticky notes or digital tools where input is anonymous initially
Blame-free language	“The deploy failed” not “You broke the deploy”
Follow through on raised issues	Nothing destroys safety faster than raising a concern and having it ignored
Acknowledge mistakes openly	Leaders who admit their own mistakes make it safe for others to do the same
Separate retrospective from performance review	If retro content affects reviews, people will not be honest

Treat Improvement as a Deliverable

The most common failure mode for retrospectives is producing action items that never get done. This happens when improvement work is treated as something to do “when we have time” - which means never.

Make Improvement Visible

Add improvement items to the same board as feature work
Include improvement items in WIP limits
Track improvement items through the same workflow as any other deliverable

Allocate Capacity

Reserve a percentage of team capacity for improvement work. Common allocations:

Allocation	Approach
20% continuous	One day per week (or equivalent) dedicated to improvement, tooling, and tech debt
Dedicated improvement sprint	Every 4th sprint is entirely improvement-focused
Improvement as first pull	When someone finishes work and the WIP limit allows, the first option is an improvement item

The specific allocation matters less than having one. A team that explicitly budgets 10% for improvement will improve more than a team that aspires to 20% but never protects the time.

Retrospective Cadence

Cadence	Best For	Caution
Weekly	Teams in active CD migration, teams working through major changes	Can feel like too many meetings if not well-facilitated
Bi-weekly	Teams in steady state with ongoing improvement	Most common cadence
After incidents	Any team	Incident retrospectives (postmortems) are separate from regular retrospectives
Monthly	Mature teams with well-established improvement habits	Too infrequent for teams early in their migration

During active phases of a CD migration (Phases 1-3), weekly retrospectives are recommended. Once the team reaches Phase 4, bi-weekly is usually sufficient.

Running Your First CD Migration Retrospective

If your team has not been running effective retrospectives, start here:

Before the Retrospective

Collect your DORA metrics for the past two weeks
Review any action items from the previous retrospective (if applicable)
Prepare a shared document or board with the five-part structure

During the Retrospective (60 minutes)

Review mission (5 min): State your CD migration goal for this phase
Review KPIs (10 min): Present the DORA metrics. Ask: “What do you notice?”
Review experiments (10 min): Discuss any experiments that were run
Check goals (10 min): Review action items from last time
Open conversation (25 min): Use Start/Stop/Continue for the first time - it is the simplest format

After the Retrospective

Publish the action items where the team will see them daily
Assign owners and due dates
Add improvement items to the team board
Schedule the next retrospective

Key Pitfalls

1. “Our retrospectives always produce the same complaints”

If the same issues surface repeatedly, the team is not executing on its action items. Check whether improvement work is being prioritized alongside feature work. If it is not, no amount of retrospective technique will help.

2. “People don’t want to attend because nothing changes”

This is a symptom of the same problem - action items are not executed. The fix is to start small: commit to one action item per retrospective, execute it completely, and demonstrate the result at the next retrospective. Success builds momentum.

3. “The retrospective turns into a blame session”

The facilitator must enforce blame-free language. Redirect “You did X wrong” to “When X happened, the impact was Y. How can we prevent Y?” If blame is persistent, the team has a psychological safety problem that needs to be addressed separately.

4. “We don’t have time for retrospectives”

A team that does not have time to improve will never improve. A 60-minute retrospective that produces one executed improvement is the highest-leverage hour of the entire sprint.

Measuring Success

Indicator	Target	Why It Matters
Retrospective attendance	100% of team	Confirms the team values the practice
Action items completed	> 80% completion rate	Confirms improvement is treated as a deliverable
DORA metrics trend	Improving quarter over quarter	Confirms retrospectives lead to real improvement
Team engagement	Voluntary contributions increasing	Confirms psychological safety is present

Next Step

With metrics-driven improvement and effective retrospectives, you have the engine for continuous improvement. The final optimization step is Architecture Decoupling - ensuring your system’s architecture does not prevent you from deploying independently.

Content contributed by Dojo Consortium, licensed under CC BY 4.0.

Team Burnout - a symptom that effective retrospectives help detect and address early
Deadline-Driven Development - an anti-pattern that retrospectives can surface and challenge
Velocity as Individual Metric - an anti-pattern that undermines the psychological safety retrospectives require
Metrics-Driven Improvement - provides the data that retrospectives use to drive decisions
Limiting WIP - WIP limits expose problems that retrospectives turn into action items
DORA Recommended Practices - the capability framework that informs improvement priorities

4.6 - Architecture Decoupling

Enable independent deployment of components by decoupling architecture boundaries.

Phase 3 - Optimize | Original content based on Dojo Consortium delivery journey patterns

You cannot deploy independently if your architecture requires coordinated releases. This page describes the three architecture states teams encounter on the journey to continuous deployment and provides practical strategies for moving from entangled to loosely coupled.

Why Architecture Matters for CD

Every practice in this guide - small batches, feature flags, WIP limits - assumes that your team can deploy its changes independently. But if your application is a monolith where changing one module requires retesting everything, or a set of microservices with tightly coupled APIs, independent deployment is impossible regardless of how good your practices are.

Architecture is either an enabler or a blocker for continuous deployment. There is no neutral.

Three Architecture States

The Delivery System Improvement Journey describes three states that teams move through. Most teams start entangled. The goal is to reach loosely coupled.

State 1: Entangled

In an entangled architecture, everything is connected to everything. Changes in one area routinely break other areas. Teams cannot deploy independently.

Characteristics:

Shared database schemas with no ownership boundaries
Circular dependencies between modules or services
Deploying one service requires deploying three others at the same time
Integration testing requires the entire system to be running
A single team’s change can block every other team’s release
“Big bang” releases on a fixed schedule

Impact on delivery:

Metric	Typical State
Deployment frequency	Monthly or quarterly (because coordinating releases is hard)
Lead time	Weeks to months (because changes wait for the next release train)
Change failure rate	High (because big releases mean big risk)
MTTR	Long (because failures cascade across boundaries)

How you got here: Entanglement is the natural result of building quickly without deliberate architectural boundaries. It is not a failure - it is a stage that almost every system passes through.

State 2: Tightly Coupled

In a tightly coupled architecture, there are identifiable boundaries between components, but those boundaries are leaky. Teams have some independence, but coordination is still required for many changes.

Characteristics:

Services exist but share a database or use synchronous point-to-point calls
API contracts exist but are not versioned - breaking changes require simultaneous updates
Teams can deploy some changes independently, but cross-cutting changes require coordination
Integration testing requires multiple services but not the entire system
Release trains still exist but are smaller and more frequent

Impact on delivery:

Metric	Typical State
Deployment frequency	Weekly to bi-weekly
Lead time	Days to a week
Change failure rate	Moderate (improving but still affected by coupling)
MTTR	Hours (failures are more isolated but still cascade sometimes)

State 3: Loosely Coupled

In a loosely coupled architecture, components communicate through well-defined interfaces, own their own data, and can be deployed independently without coordinating with other teams.

Characteristics:

Each service owns its own data store - no shared databases
APIs are versioned; consumers and producers can be updated independently
Asynchronous communication (events, queues) is used where possible
Each team can deploy without coordinating with any other team
Services are designed to degrade gracefully if a dependency is unavailable
No release trains - each team deploys when ready

Impact on delivery:

Metric	Typical State
Deployment frequency	On-demand (multiple times per day)
Lead time	Hours
Change failure rate	Low (small, isolated changes)
MTTR	Minutes (failures are contained within service boundaries)

Moving from Entangled to Tightly Coupled

This is the first and most difficult transition. It requires establishing boundaries where none existed before.

Strategy 1: Identify Natural Seams

Look for places where the system already has natural boundaries, even if they are not enforced:

Different business domains: Orders, payments, inventory, and user accounts are different domains even if they live in the same codebase.
Different rates of change: Code that changes weekly and code that changes yearly should not be in the same deployment unit.
Different scaling needs: Components with different load profiles benefit from separate deployment.
Different team ownership: If different teams work on different parts of the codebase, those parts are candidates for separation.

Strategy 2: Strangler Fig Pattern

Instead of rewriting the system, incrementally extract components from the monolith.

Strangler Fig Pattern: incremental extraction steps

Step 1: Route all traffic through a facade/proxy
Step 2: Build the new component alongside the old
Step 3: Route a small percentage of traffic to the new component
Step 4: Validate correctness and performance
Step 5: Route all traffic to the new component
Step 6: Remove the old code

Key rule: The strangler fig pattern must be done incrementally. If you try to extract everything at once, you are doing a rewrite, not a strangler fig.

Strategy 3: Define Ownership Boundaries

Assign clear ownership of each module or component to a single team. Ownership means:

The owning team decides the API contract
The owning team deploys the component
Other teams consume the API, not the internal implementation
Changes to the API contract require agreement from consumers (but not simultaneous deployment)

What to Avoid

The “big rewrite”: Rewriting a monolith from scratch almost always fails. Use the strangler fig pattern instead.
Premature microservices: Do not split into microservices until you have clear domain boundaries and team ownership. Microservices with unclear boundaries are a distributed monolith - the worst of both worlds.
Shared databases across services: This is the most common coupling mechanism. If two services share a database, they cannot be deployed independently because a schema change in one service can break the other.

Moving from Tightly Coupled to Loosely Coupled

This transition is about hardening the boundaries that were established in the previous step.

Strategy 1: Eliminate Shared Data Stores

If two services share a database, one of three things needs to happen:

One service owns the data, the other calls its API. The dependent service no longer accesses the database directly.
The data is duplicated. Each service maintains its own copy, synchronized via events.
The shared data becomes a dedicated data service. Both services consume from a service that owns the data.

Eliminating shared databases: before and after patterns

BEFORE (shared database):
  Service A → [Shared DB] ← Service B

AFTER (option 1 - API ownership):
  Service A → [DB A]
  Service B → Service A API → [DB A]

AFTER (option 2 - event-driven duplication):
  Service A → [DB A] → Events → Service B → [DB B]

AFTER (option 3 - data service):
  Service A → Data Service → [DB]
  Service B → Data Service → [DB]

Strategy 2: Version Your APIs

API versioning allows consumers and producers to evolve independently.

Rules for API versioning:

Never make a breaking change without a new version. Adding fields is non-breaking. Removing fields is breaking. Changing field types is breaking.
Support at least two versions simultaneously. This gives consumers time to migrate.
Deprecate old versions with a timeline. “Version 1 will be removed on date X.”
Use consumer-driven contract tests to verify compatibility. See Contract Testing.

Strategy 3: Prefer Asynchronous Communication

Synchronous calls (HTTP, gRPC) create temporal coupling: if the downstream service is slow or unavailable, the upstream service is also affected.

Communication Style	Coupling	When to Use
Synchronous (HTTP/gRPC)	Temporal + behavioral	When the caller needs an immediate response
Asynchronous (events/queues)	Behavioral only	When the caller does not need an immediate response
Event-driven (publish/subscribe)	Minimal	When the producer does not need to know about consumers

Prefer asynchronous communication wherever the business requirements allow it. Not every interaction needs to be synchronous.

Strategy 4: Design for Failure

In a loosely coupled system, dependencies will be unavailable sometimes. Design for this:

Circuit breakers: Stop calling a failing dependency after N failures. Return a degraded response instead.
Timeouts: Set aggressive timeouts on all external calls. A 30-second timeout on a service that should respond in 100ms is not a timeout - it is a hang.
Bulkheads: Isolate failures so that one failing dependency does not consume all resources.
Graceful degradation: Define what the user experience should be when a dependency is down. “Recommendations unavailable” is better than a 500 error.

Practical Steps for Architecture Decoupling

Step 1: Map Dependencies

Before changing anything, understand what you have:

Draw a dependency graph. Which components depend on which? Where are the shared databases?
Identify deployment coupling. Which components must be deployed together? Why?
Identify the highest-impact coupling. Which coupling most frequently blocks independent deployment?

Step 2: Establish the First Boundary

Pick one component to decouple. Choose the one with the highest impact and lowest risk:

Apply the strangler fig pattern to extract it
Define a clear API contract
Move its data to its own data store
Deploy it independently

Step 3: Repeat

Take the next highest-impact coupling and address it. Each decoupling makes the next one easier because the team learns the patterns and the remaining system is simpler.

Key Pitfalls

1. “We need to rewrite everything before we can deploy independently”

No. Decoupling is incremental. Extract one component, deploy it independently, prove the pattern works, then continue. A partial decoupling that enables one team to deploy independently is infinitely more valuable than a planned rewrite that never finishes.

2. “We split into microservices but our lead time got worse”

Microservices add operational complexity (more services to deploy, monitor, and debug). If you split without investing in deployment automation, observability, and team autonomy, you will get worse, not better. Microservices are a tool for organizational scaling, not a silver bullet for delivery speed.

3. “Teams keep adding new dependencies that recouple the system”

Architecture decoupling requires governance. Establish architectural principles (e.g., “no shared databases”) and enforce them through automated checks (e.g., dependency analysis in CI) and architecture reviews for cross-boundary changes.

4. “We can’t afford the time to decouple”

You cannot afford not to. Every week spent doing coordinated releases is a week of delivery capacity lost to coordination overhead. The investment in decoupling pays for itself quickly through increased deployment frequency and reduced coordination cost.

Measuring Success

Metric	Target	Why It Matters
Teams that can deploy independently	Increasing	The primary measure of decoupling
Coordinated releases per quarter	Decreasing toward zero	Confirms coupling is being eliminated
Deployment frequency per team	Increasing independently	Confirms teams are not blocked by each other
Cross-team dependencies per feature	Decreasing	Confirms architecture supports independent work

Next Step

With optimized flow, small batches, metrics-driven improvement, and a decoupled architecture, your team is ready for the final phase. Continue to Phase 4: Deliver on Demand.

Coordinated Deployments - the primary symptom that architecture coupling causes
Tightly Coupled Monolith - the anti-pattern of a monolith with no internal boundaries
Distributed Monolith - the anti-pattern of microservices that still require coordinated releases
Premature Microservices - splitting into services before domain boundaries are clear
Contract Testing - the testing approach that enables independent deployment of services
Progressive Rollout - the deployment strategy enabled by a decoupled architecture
Team Alignment to Code - the organizational counterpart: matching team boundaries to the code boundaries that decoupling creates

4.7 - Team Alignment to Code

Match team ownership boundaries to code boundaries so each team can build, test, and deploy its domain independently.

Phase 3 - Optimize | Teams that own a domain end-to-end can deploy independently. Teams organized around technical layers cannot.

How Team Structure Shapes Code

The way an organization communicates produces the architecture it builds. When communication flows between layers - frontend team talks to backend team, backend team talks to database team - the software reflects those communication lines. Requests for the UI layer go to one team. Requests for the API layer go to another. The result is software that is horizontally layered in the same pattern as the organization.

Layer teams produce layered architectures. The layers are coupled not because the engineers chose to couple them but because every feature requires coordination across team boundaries. The coupling is structural, not accidental.

Domain teams produce domain boundaries. When one team owns everything inside a business domain - the user interface, the business logic, the data store, and the deployment pipeline - they can make changes within that domain without coordinating with other teams. The interfaces between domains are explicit and stable because that is how the teams communicate.

This is not a coincidence. Architecture reflects the ownership structure of the people who built it.

What Aligned Ownership Looks Like

A team with aligned ownership can answer yes to all of the following:

Can this team deploy a change to production without waiting for another team?
Does this team own everything inside its domain boundary - all layers, all data, and all consumer interfaces?
Does this team define and version the contracts its domain exposes to other domains?
Is this team responsible for production incidents in its domain?

Two team patterns achieve aligned ownership in practice.

A full-stack product team owns the complete user-facing surface for a feature area - from the UI components a user interacts with down through the business logic and the database. The team has no hard dependency on a separate frontend or backend team. One team ships the entire vertical slice.

A subdomain product team owns a service or set of services representing a bounded business capability. Some subdomain teams own a user-facing surface alongside their backend logic. Others - a tax calculation service, a shipping rates engine, an identity provider - have no UI at all. Their consumer interface is entirely an API, consumed by other teams rather than by end users directly. Both are fully aligned: the team owns everything within the boundary, and the boundary is what its consumers depend on - whether that is a UI, an API, or both. A slice is done when the consumer interface satisfies the agreed behavior for its callers.

Both patterns share the same structure: one team, one deployable, full ownership. The team owns all layers within its boundary, the authority to deploy that boundary independently, and accountability for its operational behavior.

What Misalignment Looks Like

Three patterns consistently produce deployment coupling.

Component or layer teams. A frontend team, a backend team, and a database team all work on the same product. Every feature requires coordination across all three. No team can deploy independently because no team owns a full vertical slice.

Feature teams without domain ownership. Teams are organized around feature areas, but each feature area spans multiple services owned by other teams. The feature team coordinates with service owners for every change. The service owners become a shared resource that feature teams queue against.

The pillar model. A platform team owns all infrastructure. A shared services team owns cross-cutting concerns. Product teams own the business logic but depend on the other two for deployment. A change that touches infrastructure or shared services requires the product team to file a ticket and wait.

The telltale sign in all three cases: a team cannot estimate their own delivery date because it depends on other teams’ schedules.

The Relationship Between Team Alignment and Architecture

Team alignment and architecture reinforce each other. A decoupled architecture makes it possible to draw clean team boundaries. Clean team boundaries prevent the architecture from recoupling.

When team boundaries and code boundaries match:

Each team modifies code that only they own. Merge conflicts between teams disappear.
Each team’s pipeline validates only their domain. Shared pipeline queues disappear.
Each team deploys on their own schedule. Release trains disappear.

When they do not match, architecture and ownership drift together. A team that technically “owns” a service but in practice coordinates with three other teams for every change is not an independent deployment unit regardless of what the org chart says.

See Architecture Decoupling for the technical strategies to establish independent service boundaries. See Tightly Coupled Monolith for the architecture anti-pattern that misaligned ownership produces over time.

graph TD
    classDef aligned fill:#0d7a32,stroke:#0a6128,color:#fff
    classDef misaligned fill:#a63123,stroke:#8a2518,color:#fff
    classDef boundary fill:#224968,stroke:#1a3a54,color:#fff

    subgraph good ["Aligned: Domain Teams"]
        G1["Payments Team\nUI + Logic + DB + Pipeline"]:::aligned
        G2["Inventory Team\nUI + Logic + DB + Pipeline"]:::aligned
        G3["Accounts Team\nUI + Logic + DB + Pipeline"]:::aligned
        G4["Stable API Contracts"]:::boundary
        G1 --> G4
        G2 --> G4
        G3 --> G4
    end

    subgraph bad ["Misaligned: Layer Teams"]
        L1["Frontend Team\nAll UI across all domains"]:::misaligned
        L2["Backend Team\nAll logic across all domains"]:::misaligned
        L3["Database Team\nAll data across all domains"]:::misaligned
        L4["Coordinated Release Required"]:::boundary
        L1 --> L4
        L2 --> L4
        L3 --> L4
    end

How to Align Teams to Code

Step 1: Map who modifies what

Before changing anything, understand the actual ownership pattern. Use commit history to identify which teams (or individuals acting as de facto teams) modify which files and services.

Pull commit history for the last three months: git log --format="%ae %f" | sort | uniq -c
Map authors to their team. Identify the files each team touches most.
Highlight files that multiple teams touch frequently. These are the coupling points.
Identify services or modules where changes from one team consistently require changes from another.

The result is a map of actual ownership versus nominal ownership. In most organizations these diverge significantly.

Step 2: Identify natural domain boundaries

Natural domain boundaries exist in most codebases - they are just not enforced by team structure. Look for:

Business capabilities. What does this system do? Separate business functions - billing, shipping, authentication, reporting - that could be operated independently are candidate domains.
Data ownership. Which tables or data stores does each part of the system read and write? Data that is exclusively owned by one functional area belongs in that domain.
Rate of change. Code that changes weekly for business reasons and code that changes monthly for infrastructure reasons should be in different domains with different teams.
Existing team knowledge. Where do engineers already have strong concentrated expertise? Domain boundaries often match knowledge boundaries.

Draw a candidate domain map. Each domain should be a bounded set of business capability that one team can own end-to-end. Do not force domains to map to the current team structure - let the business capabilities define the boundaries first.

Step 3: Assign end-to-end ownership

For each candidate domain identified in Step 2, assign a single team. The rules:

One team per domain. Shared ownership produces neither ownership. If a domain has two owners, pick one.
Full stack. The owning team is responsible for all layers within the domain - UI, logic, data. If the current team lacks skills at some layer, plan for cross-training or re-staffing, but do not address the skill gap by keeping a separate layer team.
Deployment authority. The owning team merges to trunk and controls the deployment pipeline for their domain. No other team can block their deployment.
Operational accountability. The owning team is paged for production issues in their domain. On-call for the domain is owned by the same people who build it.

Document the domain boundaries explicitly: what services, data stores, and interfaces belong to each team.

Step 4: Define contracts at boundaries

Once teams own their domains, the interfaces between domains must be made explicit. Implicit interfaces - shared databases, undocumented internal calls, assumed response shapes - break independent deployment.

For each boundary between domains:

API contracts. Define the request and response shapes the consuming team depends on. Use OpenAPI or an equivalent schema. Commit it to the producer’s repository.
Event contracts. For asynchronous communication, define the event schema and the guarantees the producer makes (ordering, at-least-once vs. exactly-once, schema evolution rules).
Versioning. Establish a versioning policy. Additive changes are non-breaking. Removing or changing field semantics requires a new version. Both old and new versions are supported for a defined deprecation period.
Contract tests. Write tests that verify the producer honors the contract. Write tests that verify the consumer handles the contract correctly. See Contract Testing for implementation guidance.

Teams should not proceed to separate deployment pipelines until contracts are explicit and tested. An implicit contract that breaks silently is worse than a coordinated deployment.

Step 5: Separate deployment pipelines

With explicit contracts in place, each team can operate an independent pipeline for their domain.

Each team’s pipeline validates only their domain’s tests and contracts.
Pipeline triggers are scoped to the files the team owns - changes to another domain’s files do not trigger this team’s pipeline.
Each team deploys from their pipeline on their own schedule, without waiting for other teams.

For teams that share a repository but own distinct domains, use path-filtered triggers and separate pipeline configurations. See Multiple Teams, Single Deployable for a worked example of this pattern when teams share a modular monolith.

Objection	Response
“We don’t have enough senior engineers to staff every domain team fully.”	Domain teams do not need to be large. A team of two to three engineers with full ownership of a well-scoped domain delivers faster than six engineers on a layer team waiting for each other. Start with the highest-priority domains and staff others incrementally.
“Our engineers are specialists. The frontend people can’t own database code.”	Ownership does not require equal expertise at every layer - it requires the team to be responsible and to develop capability over time. Pair frontend specialists with backend engineers on the same team. The skill gap closes faster inside a team than across team boundaries.
“We tried domain teams before and they reinvented everything separately.”	Reinvention happens when platform capabilities are not shared effectively, not because of domain ownership. Separate domain ownership (what business capabilities each team is responsible for) from platform ownership (shared infrastructure, frameworks, and observability tooling).
“Business stakeholders are used to requesting work from the layer teams.”	Stakeholders adapt quickly when domain teams ship faster and with less coordination. Reframe the conversation: stakeholders talk to the team that owns the outcome, not the team that owns the layer.
“Our architecture doesn’t have clean domain boundaries yet.”	Start with the organizational change anyway. Teams aligned to emerging domain boundaries will drive the architectural cleanup faster than a centralized architecture effort without aligned ownership. The two reinforce each other.

Measuring Success

Metric	Target	Why It Matters
Deployment frequency per team	Increasing per team	Confirms teams can deploy without waiting for others
Cross-team dependencies per feature	Decreasing toward zero	Confirms domain boundaries are holding
Development cycle time	Decreasing	Teams that own their domain wait on fewer external dependencies
Production incidents attributed to another team’s change	Decreasing	Confirms ownership boundaries match deployment boundaries
Teams blocked on a release window they did not control	Decreasing toward zero	The primary organizational symptom of misalignment

Architecture Decoupling - the technical counterpart to team alignment; both must move together
Multiple Teams, Single Deployable - pipeline pattern for teams sharing a modular monolith before full service separation
Horizontal Slicing - the work decomposition anti-pattern that layer team structures encourage
Tightly Coupled Monolith - the architecture anti-pattern that misaligned team ownership produces
Thin Spread Teams - the organizational anti-pattern of distributing engineers too thin across too many services
Work Decomposition - how to slice work vertically within a team’s domain boundary
Contract Testing - how to define and enforce the contracts between domain teams

4.8 - Hypothesis-Driven Development

Treat every change as an experiment with a predicted outcome, measure the result, and adjust future work based on evidence.

Phase 3 - Optimize

Hypothesis-driven development treats every change as an experiment. Instead of building features because someone asked for them and hoping they help, teams state a predicted outcome before writing code, measure the result after deployment, and use the evidence to decide what to do next. Combined with feature flags, small batches, and metrics-driven improvement, this practice closes the loop between shipping and learning.

Why Hypothesis-Driven Development

Most teams ship features without stating what outcome they expect. A product manager requests a feature, developers build it, and everyone moves on to the next item. Weeks later, nobody checks whether the feature actually helped.

This is waste. Teams accumulate features without knowing their impact, backlogs grow based on opinion rather than evidence, and the product drifts in whatever direction the loudest voice demands.

Hypothesis-driven development fixes this by making every change answer a question. If the answer is “yes, it helped,” the team invests further. If the answer is “no,” the team reverts or pivots before sinking more effort into the wrong direction. Over time, this produces a product shaped by evidence rather than assumptions.

The Lifecycle

The hypothesis-driven development lifecycle has five stages. Each stage has a specific purpose and a clear output that feeds the next stage.

1. Form the Hypothesis

A hypothesis is a falsifiable prediction about what a change will accomplish. It follows a specific format:

“We believe [change] will produce [outcome] because [reason].”

The “because” clause is critical. Without it, you have a wish, not a hypothesis. The reason forces the team to articulate the causal model behind the change, which makes it possible to learn even when the experiment fails.

Good hypothesis vs. bad hypothesis

**Good:** "We believe adding a progress indicator to the checkout flow will reduce cart abandonment by 10% because users currently leave when they cannot tell how many steps remain." - Specific change (progress indicator in checkout) - Measurable outcome (10% reduction in cart abandonment) - Stated reason (users leave due to uncertainty about remaining steps) --- **Bad:** "We believe improving the checkout experience will increase conversions." - Vague change (what does "improving" mean?) - No target (how much increase?) - No reason (why would it increase conversions?)

Criteria for a testable hypothesis:

Criterion	Test	Example
Specific change	Can you describe exactly what will be different?	“Add a 3-step progress bar to the checkout page header”
Measurable outcome	Can you define a number that will move?	“Cart abandonment rate drops from 45% to 40%”
Time-bound	Do you know when to check?	“Measured over 2 weeks with at least 5,000 sessions”
Falsifiable	Is it possible for the experiment to fail?	Yes - abandonment could stay the same or increase
Connected to business value	Does the outcome matter to the business?	Reduced abandonment directly increases revenue

2. Design the Experiment

Once the hypothesis is formed, design an experiment that can confirm or reject it.

Scope the change to one variable. If you change the checkout layout and add a progress indicator and reduce the number of form fields at the same time, you cannot attribute the outcome to any single change. Change one thing at a time.

Define success and failure criteria before writing code. This prevents moving the goalposts after seeing the results. Write down what “success” looks like and what “failure” looks like before the first commit.

Experiment design template

**Hypothesis:** Adding a progress indicator will reduce cart abandonment by 10%. **Method:** A/B test - 50% of users see the progress indicator, 50% see the current checkout. **Success criteria:** Abandonment rate in the test group is at least 8% lower than control (allowing a 2% margin). **Failure criteria:** Abandonment rate difference is less than 5%, or the test group shows higher abandonment. **Sample size:** Minimum 5,000 sessions per group. **Time box:** 2 weeks or until sample size is reached, whichever comes first.

Choose the measurement method:

Method	When to Use	Tradeoff
A/B test	You have enough traffic to split users into groups	Most rigorous, but requires sufficient volume
Before/after	Low traffic or infrastructure changes that affect everyone	Simpler, but confounding factors are harder to control
Cohort comparison	Targeting a specific user segment	Good for segment-specific changes, harder to generalize

3. Implement and Deploy

Build the change using the same continuous delivery practices you use for any other work.

Use feature flags to control exposure. The feature flag infrastructure you built earlier in this phase is what makes experiments possible. Deploy the change behind a flag, then use the flag to control which users see the new behavior.

Deploy through the standard CD pipeline. Experiments are not special. They go through the same build, test, and deployment process as every other change. This ensures the experiment code meets the same quality bar as production code.

Keep the change small. A hypothesis-driven change should follow the same small batch discipline as any other work. If the experiment requires weeks of development, the scope is too large. Break it into smaller experiments that can each be measured independently.

Example implementation:

Feature flag controlling an A/B experiment

public class CheckoutController {

    private final FeatureFlagService flags;
    private final MetricsService metrics;

    public CheckoutController(FeatureFlagService flags, MetricsService metrics) {
        this.flags = flags;
        this.metrics = metrics;
    }

    public CheckoutPage renderCheckout(User user, Cart cart) {
        boolean showProgress = flags.isEnabled("experiment-checkout-progress", user);

        metrics.record("checkout-started", Map.of(
            "variant", showProgress ? "with-progress" : "control",
            "userId", user.getId()
        ));

        if (showProgress) {
            return new CheckoutPage(cart, new ProgressIndicator(3));
        }
        return new CheckoutPage(cart);
    }
}

4. Measure Results

After the time box expires or the sample size is reached, compare the results against the predefined success criteria.

Compare against your criteria, not against your hopes. If the success criterion was “8% reduction in abandonment” and you achieved 3%, that is a failure by your own definition, even if 3% sounds nice. Rigorous criteria prevent confirmation bias.

Account for confounding factors. Did a marketing campaign run during the experiment? Was there a holiday? Did another team ship a change that affects the same flow? Document anything that might have influenced the results.

Record the outcome regardless of success or failure. Failed experiments are as valuable as successful ones. They update the team’s understanding of how the product works and prevent repeating the same mistakes.

Experiment result record

**Hypothesis:** Progress indicator reduces cart abandonment by 10%. **Result:** Abandonment dropped 4% in the test group (not statistically significant at p < 0.05). **Verdict:** Failed - did not meet the 8% threshold. **Confounding factors:** A site-wide sale ran during week 2, which may have increased checkout motivation in both groups. **Learning:** Progress visibility alone is not sufficient to address abandonment. Exit survey data suggests price comparison (leaving to check competitors) is the primary driver, not checkout confusion. **Next action:** Design a new experiment targeting price confidence instead of checkout flow.

5. Adjust

The final stage closes the loop. Based on the results, the team takes one of three actions:

If validated: Remove the feature flag and make the change permanent. Update the product documentation. Feed the learning into the next hypothesis - what else could you improve now that this change is in place?

If invalidated: Revert the change by disabling the flag. Document what was learned and why the hypothesis was wrong. Use the learning to form a better hypothesis. Do not treat invalidation as failure - a team that never invalidates a hypothesis is not running real experiments.

If inconclusive: Decide whether to extend the experiment (more time, more traffic) or abandon it. If confounding factors were identified, consider rerunning the experiment under cleaner conditions. Set a hard limit on reruns to avoid indefinite experimentation.

Common Pitfalls

Pitfall	What Happens	How to Avoid It
No success criteria defined upfront	Team rationalizes any result as a win	Write success and failure criteria before the first commit
Changing multiple variables at once	Cannot attribute the outcome to any single change	Scope each experiment to one variable
Abandoning experiments too early	Insufficient data leads to wrong conclusions	Set a minimum sample size and time box; commit to both
Never invalidating a hypothesis	Experiments are performative, not real	Celebrate invalidations - they prevent wasted effort
Skipping the record step	Team repeats failed experiments or forgets what worked	Maintain an experiment log that is part of the team’s knowledge base
Hypothesis disconnected from business outcomes	Team optimizes technical metrics nobody cares about	Every hypothesis must connect to a metric the business tracks
Experiments that are too large	Weeks of development before any measurement	Apply small batch discipline to experiments too

Measuring Success

Indicator	Target	Why It Matters
Experiments completed per quarter	4 or more	Confirms the team is running experiments, not just shipping features
Percentage of experiments with predefined success criteria	100%	Confirms rigor - no experiment should start without criteria
Ratio of validated to invalidated hypotheses	Between 40-70% validated	Too high means hypotheses are not bold enough; too low means the team is guessing
Time from hypothesis to result	2-4 weeks	Confirms experiments are scoped small enough to get fast answers
Decisions changed by experiment results	Increasing	Confirms experiments actually influence product direction

Next Step

Experiments generate learnings, but learnings only turn into improvements when the team discusses them. Retrospectives provide the forum where the team reviews experiment results, decides what to do next, and adjusts the process itself.

Metrics-Driven Improvement - the measurement infrastructure that hypothesis-driven development depends on
Small Batches - the practice that keeps experiments small enough to measure
Feature Flags - the mechanism that controls experiment exposure
Retrospectives - where the team discusses experiment results and decides next steps
First-Class Artifacts - how ACD formalizes experiment artifacts for agent-assisted workflows
Agent-Assisted Specification - how agents can help generate and evaluate hypotheses

5 - Phase 4: Deliver on Demand

The capability to deploy any change to production at any time, using the delivery strategy that fits your context.

Key question: “Can we deliver any change to production when the business needs it?”

This is the destination: you can deploy any change that passes the pipeline to production whenever you choose. Some teams will auto-deploy every commit (continuous deployment). Others will deploy on demand when the business is ready. Both are valid - the capability is what matters, not the trigger.

What You’ll Do

Deploy on demand - Remove the last manual gates so any green build can reach production
Use progressive rollout - Canary, blue-green, and percentage-based deployments
Explore ACD - AI-assisted continuous delivery patterns
Learn from experience reports - How other teams made the journey

Continuous Delivery vs. Continuous Deployment

These terms are often confused. The distinction matters for this phase:

Continuous delivery means every commit that passes the pipeline could be deployed to production at any time. The capability exists. A human or business process decides when.
Continuous deployment means every commit that passes the pipeline is deployed to production automatically. No human decision is involved.

Continuous delivery is the goal of this migration guide. Continuous deployment is one delivery strategy that works well for certain contexts - SaaS products, internal tools, services behind feature flags. It is not a higher level of maturity. A team that deploys on demand with a one-click deploy is just as capable as a team that auto-deploys every commit.

Why This Phase Matters

When your foundations are solid, your pipeline is reliable, and your batch sizes are small, deploying any change becomes low-risk. The remaining barriers are organizational, not technical: approval processes, change windows, release coordination. This phase addresses those barriers so the team has the option to deploy whenever the business needs it.

Signs You’ve Arrived

Any commit that passes the pipeline can reach production within minutes
The team deploys frequently (daily or more) with no drama
Mean time to recovery is measured in minutes
The team has confidence that any deployment can be safely rolled back
New team members can deploy on their first day
The deployment strategy (on-demand or automatic) is a team choice, not a constraint

Phase 3: Optimize - the previous phase that establishes small batches, feature flags, and flow improvements
Fear of Deploying - a deployment symptom that this phase eliminates by making deployment routine and low-risk
Infrequent Releases - a symptom directly addressed by delivering on demand
DORA Recommended Practices - the research-backed capabilities that underpin delivery performance
Deployment Frequency - the primary metric that reflects delivery-on-demand capability
Mean Time to Repair - the recovery metric that progressive rollout and automated rollback improve

5.1 - Deploy on Demand

Remove the last manual gates and deploy every change that passes the pipeline.

Phase 4 - Deliver on Demand | Original content

Deploy on demand means that any change which passes the full automated pipeline can reach production without waiting for a human to press a button, open a ticket, or schedule a window. This page covers the prerequisites, the transition from continuous delivery to continuous deployment, and how to address the organizational concerns that are the real barriers.

Continuous Delivery vs. Continuous Deployment

These two terms are often confused. The distinction matters:

Continuous Delivery: Every commit that passes the pipeline could be deployed to production. A human decides when to deploy.
Continuous Deployment: Every commit that passes the pipeline is deployed to production. No human decision is required.

If you have completed Phases 1-3 of this migration, you have continuous delivery. This page is about removing that last manual decision and moving to continuous deployment.

Why Remove the Last Gate?

The manual deployment decision feels safe. It gives someone a chance to “eyeball” the change before it goes to production. In practice, it does the opposite.

The Problems with Manual Gates

Problem	Why It Happens	Impact
Batching	If deploys are manual, teams batch changes to reduce the number of deploy events	Larger batches increase risk and make rollback harder
Delay	Changes wait for someone to approve, which may take hours or days	Longer lead time, delayed feedback
False confidence	The approver cannot meaningfully review what the automated pipeline already tested	The gate provides the illusion of safety without actual safety
Bottleneck	One person or team becomes the deploy gatekeeper	Creates a single point of failure for the entire delivery flow
Deploy fear	Infrequent deploys mean each deploy is higher stakes	Teams become more cautious, batches get larger, risk increases

The Paradox of Manual Safety

The more you rely on manual deployment gates, the less safe your deployments become. This is because manual gates lead to batching, batching increases risk, and increased risk justifies more manual gates. It is a vicious cycle.

Continuous deployment breaks this cycle. Small, frequent, automated deployments are individually low-risk. If one fails, the blast radius is small and recovery is fast.

Prerequisites for Deploy on Demand

Before removing manual gates, verify that these conditions are met. Each one is covered in earlier phases of this migration.

Non-Negotiable Prerequisites

Prerequisite	What It Means	Where to Build It
Comprehensive automated tests	The test suite catches real defects, not just trivial cases	Testing Fundamentals
Fast, reliable pipeline	The pipeline completes in under 15 minutes and rarely fails for non-code reasons	Deterministic Pipeline
Automated rollback	You can roll back a bad deployment in minutes without manual intervention	Rollback
Feature flags	Incomplete features are hidden from users via flags, not deployment timing	Feature Flags
Small batch sizes	Each deployment contains 1-3 small changes, not dozens	Small Batches
Production-like environments	Test environments match production closely enough that test results are trustworthy	Production-Like Environments
Observability	You can detect production issues within minutes through monitoring and alerting	Metrics-Driven Improvement

Assessment: Are You Ready?

Answer these questions honestly:

When was the last time your pipeline caught a real bug? If the answer is “I don’t remember,” your test suite may not be trustworthy enough.
How long does a rollback take? If the answer is more than 15 minutes, automate it first.
Do deploys ever fail for non-code reasons? (Environment issues, credential problems, network flakiness.) If yes, stabilize your pipeline first.
Does the team trust the pipeline? If team members regularly say “let me check one more thing before we deploy,” trust is not there yet. Build it through retrospectives and transparent metrics.

The Transition: Three Approaches

Approach 1: Shadow Mode

Run continuous deployment alongside manual deployment. Every change that passes the pipeline is automatically deployed to a shadow production environment (or a canary group). A human still approves the “real” production deployment.

Duration: 2-4 weeks.

What you learn: How often the automated deployment would have been correct. If the answer is “every time” (or close to it), the manual gate is not adding value.

Transition: Once the team sees that the shadow deployments are consistently safe, remove the manual gate.

Approach 2: Opt-In per Team

Allow individual teams to adopt continuous deployment while others continue with manual gates. This works well in organizations with multiple teams at different maturity levels.

Duration: Ongoing. Teams opt in when they are ready.

What you learn: Which teams are ready and which need more foundation work. Early adopters demonstrate the pattern for the rest of the organization.

Transition: As more teams succeed, continuous deployment becomes the default. Remaining teams are supported in reaching readiness.

Approach 3: Direct Switchover

Remove the manual gate for all teams at once. This is appropriate when the organization has high confidence in its pipeline and all teams have completed Phases 1-3.

Duration: Immediate.

What you learn: Quickly reveals any hidden dependencies on the manual gate (e.g., deploy coordination between teams, configuration changes that ride along with deployments).

Transition: Be prepared to temporarily revert if unforeseen issues arise. Have a clear rollback plan for the process change itself.

Addressing Organizational Concerns

The technical prerequisites are usually met before the organizational ones. These are the conversations you will need to have.

“What about change management / ITIL?”

Change management frameworks like ITIL define a “standard change” category: a pre-approved, low-risk, well-understood change that does not require a Change Advisory Board (CAB) review. Continuous deployment changes qualify as standard changes because they are:

Small (one to a few commits)
Automated (same pipeline every time)
Reversible (automated rollback)
Well-tested (comprehensive automated tests)

Work with your change management team to classify pipeline-passing deployments as standard changes. This preserves the governance framework while removing the bottleneck.

“What about compliance and audit?”

Continuous deployment does not eliminate audit trails - it strengthens them. Every deployment is:

Traceable: Tied to a specific commit, which is tied to a specific story or ticket
Reproducible: The same pipeline produces the same result every time
Recorded: Pipeline logs capture every test that passed, every approval that was automated
Reversible: Rollback history shows when and why a deployment was reverted

Provide auditors with access to pipeline logs, deployment history, and the automated test suite. This is a more complete audit trail than a manual approval signature.

“What about database migrations?”

Database migrations require special care in continuous deployment because they cannot be rolled back as easily as code changes.

Rules for database migrations in CD:

Migrations must be backward-compatible. The previous version of the code must work with the new schema.
Use expand/contract pattern. First deploy the new column/table (expand). Then deploy the code that uses it. Then remove the old column/table (contract). Each step is a separate deployment.
Never drop a column in the same deployment that stops using it. There is always a window where both old and new code run simultaneously.
Test migrations in production-like environments before they reach production.

“What if we deploy a breaking change?”

This is why you have automated rollback and observability. The sequence is:

Deployment happens automatically
Monitoring detects an issue (error rate spike, latency increase, health check failure)
Automated rollback triggers (or on-call engineer triggers manual rollback)
The team investigates and fixes the issue
The fix goes through the pipeline and deploys automatically

The key insight: this sequence takes minutes with continuous deployment. With manual deployment on a weekly schedule, the same breaking change would take days to detect and fix.

After the Transition

What Changes for the Team

Before	After
“Are we deploying today?”	Deploys happen automatically, all the time
“Who’s doing the deploy?”	Nobody - the pipeline does it
“Can I get this into the next release?”	Every merge to trunk is the next release
“We need to coordinate the deploy with team X”	Teams deploy independently
“Let’s wait for the deploy window”	There are no deploy windows

What Stays the Same

Code review still happens (before merge to trunk)
Automated tests still run (in the pipeline)
Feature flags still control feature visibility (decoupling deploy from release)
Monitoring still catches issues (but now recovery is faster)
The team still owns its deployments (but the manual step is gone)

The First Week

The first week of continuous deployment will feel uncomfortable. This is normal. The team will instinctively want to “check” deployments that happen automatically. Resist the urge to add manual checks back. Instead:

Watch the monitoring dashboards more closely than usual
Have the team discuss each automatic deployment in standup for the first week
Celebrate the first deployment that goes out without anyone noticing - that is the goal

Key Pitfalls

1. “We adopted continuous deployment but kept the approval step ‘just in case’”

If the approval step exists, it will be used, and you have not actually adopted continuous deployment. Remove the gate completely. If something goes wrong, use rollback - do not use a pre-deployment gate.

2. “Our deploy cadence didn’t actually increase”

Continuous deployment only increases deploy frequency if the team is integrating to trunk frequently. If the team still merges weekly, they will deploy weekly - automatically, but still weekly. Revisit Trunk-Based Development and Small Batches.

3. “We have continuous deployment for the application but not the database/infrastructure”

Partial continuous deployment creates a split experience: application changes flow freely but infrastructure changes still require manual coordination. Extend the pipeline to cover infrastructure as code, database migrations, and configuration changes.

Measuring Success

Metric	Target	Why It Matters
Deployment frequency	Multiple per day	Confirms the pipeline is deploying every change
Lead time	< 1 hour from commit to production	Confirms no manual gates are adding delay
Manual interventions per deploy	Zero	Confirms the process is fully automated
Change failure rate	Stable or improving	Confirms automation is not introducing new failures
MTTR	< 15 minutes	Confirms automated rollback is working

Next Step

Continuous deployment deploys every change, but not every change needs to go to every user at once. Progressive Rollout strategies let you control who sees a change and how quickly it spreads.

Infrequent Releases - the primary symptom that deploy on demand resolves
Merge Freeze - a symptom caused by manual deployment gates that disappears with continuous deployment
Fear of Deploying - a cultural symptom that fades as automated deployments become routine
CAB Gates - an organizational anti-pattern that this guide addresses through standard change classification
Manual Deployments - the pipeline anti-pattern that deploy on demand eliminates
Deployment Frequency - the key metric for measuring deploy-on-demand adoption

5.2 - Progressive Rollout

Use canary, blue-green, and percentage-based deployments to reduce deployment risk.

Phase 4 - Deliver on Demand | Original content

Progressive rollout strategies let you deploy to production without deploying to all users simultaneously. By exposing changes to a small group first and expanding gradually, you catch problems before they affect your entire user base. This page covers the three major strategies, when to use each, and how to implement automated rollback.

Why Progressive Rollout?

Even with comprehensive tests, production-like environments, and small batch sizes, some issues only surface under real production traffic. Progressive rollout is the final safety layer: it limits the blast radius of any deployment by exposing the change to a small audience first.

This is not a replacement for testing. It is an addition. Your automated tests should catch the vast majority of issues. Progressive rollout catches the rest - the issues that depend on real user behavior, real data volumes, or real infrastructure conditions that cannot be fully replicated in test environments.

The Three Strategies

Strategy 1: Canary Deployment

A canary deployment routes a small percentage of production traffic to the new version while the majority continues to hit the old version. If the canary shows no problems, traffic is gradually shifted.

Canary deployment traffic split diagram

┌─────────────────┐
                   5%   │  New Version     │  ← Canary
                ┌──────►│  (v2)            │
                │       └─────────────────┘
  Traffic ──────┤
                │       ┌─────────────────┐
                └──────►│  Old Version     │  ← Stable
                  95%   │  (v1)            │
                        └─────────────────┘

How it works:

Deploy the new version alongside the old version
Route 1-5% of traffic to the new version
Compare key metrics (error rate, latency, business metrics) between canary and stable
If metrics are healthy, increase traffic to 25%, 50%, 100%
If metrics degrade, route all traffic back to the old version

When to use canary:

Changes that affect request handling (API changes, performance optimizations)
Changes where you want to compare metrics between old and new versions
Services with high traffic volume (you need enough canary traffic for statistical significance)

When canary is not ideal:

Changes that affect batch processing or background jobs (no “traffic” to route)
Very low traffic services (the canary may not get enough traffic to detect issues)
Database schema changes (both versions must work with the same schema)

Implementation options:

Infrastructure	How to Route Traffic
Kubernetes + service mesh (Istio, Linkerd)	Weighted routing rules in VirtualService
Load balancer (ALB, NGINX)	Weighted target groups
CDN (CloudFront, Fastly)	Origin routing rules
Application-level	Feature flag with percentage rollout

Strategy 2: Blue-Green Deployment

Blue-green deployment maintains two identical production environments. At any time, one (blue) serves live traffic and the other (green) is idle or staging.

Blue-green deployment traffic switch diagram

BEFORE:
    Traffic ──────► [Blue - v1] (ACTIVE)
                    [Green]     (IDLE)

  DEPLOY:
    Traffic ──────► [Blue - v1] (ACTIVE)
                    [Green - v2] (DEPLOYING / SMOKE TESTING)

  SWITCH:
    Traffic ──────► [Green - v2] (ACTIVE)
                    [Blue - v1]  (STANDBY / ROLLBACK TARGET)

How it works:

Deploy the new version to the idle environment (green)
Run smoke tests against green to verify basic functionality
Switch the router/load balancer to point all traffic at green
Keep blue running as an instant rollback target
After a stability period, repurpose blue for the next deployment

When to use blue-green:

You need instant, complete rollback (switch the router back)
You want to test the deployment in a full production environment before routing traffic
Your infrastructure supports running two parallel environments cost-effectively

When blue-green is not ideal:

Stateful applications where both environments share mutable state
Database migrations (the new version’s schema must work for both environments during transition)
Cost-sensitive environments (maintaining two full production environments doubles infrastructure cost)

Rollback speed: Seconds. Switching the router back is the fastest rollback mechanism available.

Strategy 3: Percentage-Based Rollout

Percentage-based rollout gradually increases the number of users who see the new version. Unlike canary (which is traffic-based), percentage rollout is typically user-based - a specific user always sees the same version during the rollout period.

Percentage-based rollout schedule

Hour 0:   1% of users  → v2,  99% → v1
  Hour 2:   5% of users  → v2,  95% → v1
  Hour 8:  25% of users  → v2,  75% → v1
  Day 2:   50% of users  → v2,  50% → v1
  Day 3:  100% of users  → v2

How it works:

Enable the new version for a small percentage of users (using feature flags or infrastructure routing)
Monitor metrics for the affected group
Gradually increase the percentage over hours or days
At any point, reduce the percentage back to 0% if issues are detected

When to use percentage rollout:

User-facing feature changes where you want consistent user experience (a user always sees v1 or v2, not a random mix)
Changes that benefit from A/B testing data (compare user behavior between groups)
Long-running rollouts where you want to collect business metrics before full exposure

When percentage rollout is not ideal:

Backend infrastructure changes with no user-visible impact
Changes that affect all users equally (e.g., API response format changes)

Implementation: Percentage rollout is typically implemented through Feature Flags (Level 2 or Level 3), using the user ID as the hash key to ensure consistent assignment.

Choosing the Right Strategy

Factor	Canary	Blue-Green	Percentage
Rollback speed	Seconds (reroute traffic)	Seconds (switch environments)	Seconds (disable flag)
Infrastructure cost	Low (runs alongside existing)	High (two full environments)	Low (same infrastructure)
Metric comparison	Strong (side-by-side comparison)	Weak (before/after only)	Strong (group comparison)
User consistency	No (each request may hit different version)	Yes (all users see same version)	Yes (each user sees consistent version)
Complexity	Moderate	Moderate	Low (if you have feature flags)
Best for	API changes, performance changes	Full environment validation	User-facing features

Many teams use more than one strategy. A common pattern:

Blue-green for infrastructure and platform changes
Canary for service-level changes
Percentage rollout for user-facing feature changes

Automated Rollback

Progressive rollout is only effective if rollback is automated. A human noticing a problem at 3 AM is not a reliable rollback mechanism.

Metrics to Monitor

Define automated rollback triggers before deploying. Common triggers:

Metric	Trigger Condition	Example
Error rate	Canary error rate > 2x stable error rate	Stable: 0.1%, Canary: 0.3% -> rollback
Latency (p99)	Canary p99 > 1.5x stable p99	Stable: 200ms, Canary: 400ms -> rollback
Health check	Any health check failure	HTTP 500 on /health -> rollback
Business metric	Conversion rate drops > 5% for canary group	10% conversion -> 4% conversion -> rollback
Saturation	CPU or memory exceeds threshold	CPU > 90% for 5 minutes -> rollback

Automated Rollback Flow

Automated rollback flow diagram

Deploy new version
       │
       ▼
Route 5% of traffic to new version
       │
       ▼
Monitor for 15 minutes
       │
       ├── Metrics healthy ──────► Increase to 25%
       │                                │
       │                                ▼
       │                          Monitor for 30 minutes
       │                                │
       │                                ├── Metrics healthy ──────► Increase to 100%
       │                                │
       │                                └── Metrics degraded ─────► ROLLBACK
       │
       └── Metrics degraded ─────► ROLLBACK

Implementation Tools

Tool	How It Helps
Argo Rollouts	Kubernetes-native progressive delivery with automated analysis and rollback
Flagger	Progressive delivery operator for Kubernetes with Istio, Linkerd, or App Mesh
Spinnaker	Multi-cloud deployment platform with canary analysis
Custom scripts	Query your metrics system, compare thresholds, trigger rollback via API

The specific tool matters less than the principle: define rollback criteria before deploying, monitor automatically, and roll back without human intervention.

Implementing Progressive Rollout

Step 1: Choose Your First Strategy

Pick the strategy that matches your infrastructure:

If you already have feature flags: start with percentage-based rollout
If you have Kubernetes with a service mesh: start with canary
If you have parallel environments: start with blue-green

Step 2: Define Rollback Criteria

Before your first progressive deployment:

Identify the 3-5 metrics that define “healthy” for your service
Define numerical thresholds for each metric
Define the monitoring window (how long to wait before advancing)
Document the rollback procedure (even if automated, document it for human understanding)

Step 3: Run a Manual Progressive Rollout

Before automating, run the process manually:

Deploy to a canary or small percentage
A team member monitors the dashboard for the defined window
The team member decides to advance or rollback
Document what they checked and how they decided

This manual practice builds understanding of what the automation will do.

Step 4: Automate the Rollout

Replace the manual monitoring with automated checks:

Implement metric queries that check your rollback criteria
Implement automated traffic shifting (advance or rollback based on metrics)
Implement alerting so the team knows when a rollback occurs
Test the automation by intentionally deploying a known-bad change (in a controlled way)

Key Pitfalls

1. “Our canary doesn’t get enough traffic for meaningful metrics”

If your service handles 100 requests per hour, a 5% canary gets 5 requests per hour - not enough to detect problems statistically. Solutions: use a higher canary percentage (25-50%), use longer monitoring windows, or use blue-green instead (which does not require traffic splitting).

2. “We have progressive rollout but rollback is still manual”

Progressive rollout without automated rollback is half a solution. If the canary shows problems at 2 AM and nobody is watching, the damage occurs before anyone responds. Automated rollback is the essential companion to progressive rollout.

3. “We treat progressive rollout as a replacement for testing”

Progressive rollout is the last line of defense, not the first. If you are regularly catching bugs in canary that your test suite should have caught, your test suite needs improvement. Progressive rollout should catch rare, production-specific issues - not common bugs.

4. “Our rollout takes days because we’re too cautious”

A rollout that takes a week negates the benefits of continuous deployment. If your confidence in the pipeline is low enough to require a week-long rollout, the issue is pipeline quality, not rollout speed. Address the root cause through better testing and more production-like environments.

Measuring Success

Metric	Target	Why It Matters
Automated rollbacks per month	Low and stable	Confirms the pipeline catches most issues before production
Time from deploy to full rollout	Hours, not days	Confirms the team has confidence in the process
Incidents caught by progressive rollout	Tracked (any number)	Confirms the progressive rollout is providing value
Manual interventions during rollout	Zero	Confirms the process is fully automated

Next Step

With deploy on demand and progressive rollout, your technical deployment infrastructure is complete. ACD explores how AI-assisted patterns can extend these practices further.

Fear of Deploying - a symptom that progressive rollout eliminates by limiting blast radius
Production Issues Found by Customers - a visibility problem that automated canary analysis helps detect before users are affected
Staging Passes, Production Fails - a symptom that progressive rollout mitigates by catching production-specific issues early
Feature Flags - the foundation for percentage-based rollout strategies
Blind Operations - an anti-pattern that must be resolved before automated rollback can work
Change Failure Rate - the metric that progressive rollout helps keep low by catching issues before full exposure

5.3 - Experience Reports

Real-world stories from teams that have made the journey to continuous deployment.

Phase 4 - Deliver on Demand

Theory is necessary but insufficient. This page collects experience reports from organizations that have adopted continuous deployment at scale, including the challenges they faced, the approaches they took, and the results they achieved. These reports demonstrate that CD is not limited to startups or greenfield projects - it works in large, complex, regulated environments.

Why Experience Reports Matter

Every team considering continuous deployment faces the same objection: “That works for [Google / Netflix / small startups], but our situation is different.” Experience reports counter this objection with evidence. They show that organizations of every size, in every industry, with every kind of legacy system, have found a path to continuous deployment.

No experience report will match your situation exactly. That is not the point. The point is to extract patterns: what obstacles did these teams encounter, and how did they overcome them?

Walmart: CD at Retail Scale

Context

Walmart operates one of the world’s largest e-commerce platforms alongside its massive physical retail infrastructure. Changes to the platform affect millions of transactions per day. The organization had a traditional release process with weekly deployment windows and multi-stage manual approval.

The Challenge

Scale: Thousands of developers across hundreds of teams
Risk tolerance: Any outage affects revenue in real time
Legacy: Decades of existing systems with deep interdependencies
Regulation: PCI compliance requirements for payment processing

What They Did

Invested in a centralized deployment platform (OneOps, later Concord) that standardized the deployment pipeline across all teams
Broke the monolithic release into independent service deployments
Implemented automated canary analysis for every deployment
Moved from weekly release trains to on-demand deployment per team

Key Lessons

Platform investment pays off. Building a shared deployment platform let hundreds of teams adopt CD without each team solving the same infrastructure problems.
Compliance and CD are compatible. Automated pipelines with full audit trails satisfied PCI requirements more reliably than manual approval processes.
Cultural change is harder than technical change. Teams that had operated on weekly release cycles for years needed coaching and support to trust automated deployment.

Microsoft: From Waterfall to Daily Deploys

Context

Microsoft’s Azure DevOps (formerly Visual Studio Team Services) team made a widely documented transformation from 3-year waterfall releases to deploying multiple times per day. This transformation happened within one of the largest software organizations in the world.

The Challenge

History: Decades of waterfall development culture
Product complexity: A platform used by millions of developers
Organizational size: Thousands of engineers across multiple time zones
Customer expectations: Enterprise customers expected stability and predictability

What They Did

Broke the product into independently deployable services (ring-based deployment)
Implemented a ring-based rollout: Ring 0 (team), Ring 1 (internal Microsoft users), Ring 2 (select external users), Ring 3 (all users)
Invested heavily in automated testing, achieving thousands of tests running in minutes
Moved from a fixed release cadence to continuous deployment with feature flags controlling release
Used telemetry to detect issues in real-time and automated rollback when metrics degraded

Key Lessons

Ring-based deployment is progressive rollout. Microsoft’s ring model is an implementation of the progressive rollout strategies described in this guide.
Feature flags enabled decoupling. By deploying frequently but releasing features incrementally via flags, the team could deploy without worrying about feature completeness.
The transformation took years, not months. Moving from 3-year cycles to daily deployment was a multi-year journey with incremental progress at each step.

Google: Engineering Productivity at Scale

Context

Google is often cited as the canonical example of continuous deployment, deploying changes to production thousands of times per day across its vast service portfolio.

The Challenge

Scale: Billions of users, millions of servers
Monorepo: Most of Google operates from a single repository with billions of lines of code
Interdependencies: Changes in shared libraries can affect thousands of services
Velocity: Thousands of engineers committing changes every day

What They Did

Built a culture of automated testing where tests are a first-class deliverable, not an afterthought
Implemented a submit queue that runs automated tests on every change before it merges to the trunk
Invested in build infrastructure (Blaze/Bazel) that can build and test only the affected portions of the codebase
Used percentage-based rollout for user-facing changes
Made rollback a one-click operation available to every team

Key Lessons

Test infrastructure is critical infrastructure. Google’s ability to deploy frequently depends entirely on its ability to test quickly and reliably.
Monorepo and CD are compatible. The common assumption that CD requires microservices with separate repos is false. Google deploys from a monorepo.
Invest in tooling before process. Google built the tooling (build systems, test infrastructure, deployment automation) that made good practices the path of least resistance.

Amazon: Two-Pizza Teams and Ownership

Context

Amazon’s transformation to service-oriented architecture and team ownership is one of the most influential in the industry. The “two-pizza team” model and “you build it, you run it” philosophy directly enabled continuous deployment.

The Challenge

Organizational size: Hundreds of thousands of employees
System complexity: Thousands of services powering amazon.com and AWS
Availability requirements: Even brief outages are front-page news
Pace of innovation: Competitive pressure demands rapid feature delivery

What They Did

Decomposed the system into independently deployable services, each owned by a small team
Gave teams full ownership: build, test, deploy, operate, and support
Built internal deployment tooling (Apollo) that automates canary analysis, rollback, and one-click deployment
Established the practice of deploying every commit that passes the pipeline, with automated rollback on metric degradation

Key Lessons

Ownership drives quality. When the team that writes the code also operates it in production, they write better code and build better monitoring.
Small teams move faster. Two-pizza teams (6-10 people) can make decisions without bureaucratic overhead.
Automation eliminates toil. Amazon’s internal deployment tooling means that deploying is not a skilled activity - any team member can deploy (and the pipeline usually deploys automatically).

HP: CD in Hardware-Adjacent Software

Context

HP’s LaserJet firmware team demonstrated that continuous delivery principles apply even to embedded software, a domain often considered incompatible with frequent deployment.

The Challenge

Embedded software: Firmware that runs on physical printers
Long development cycles: Firmware releases had traditionally been annual
Quality requirements: Firmware bugs require physical recalls or complex update procedures
Team size: Large, distributed teams with varying skill levels

What They Did

Invested in automated testing infrastructure for firmware
Reduced build times from days to under an hour
Moved from annual releases to frequent incremental updates
Implemented continuous integration with automated test suites running on simulator and hardware

Key Lessons

CD principles are universal. Even embedded firmware can benefit from small batches, automated testing, and continuous integration.
Build time is a critical constraint. Reducing build time from days to under an hour unlocked the ability to test frequently, which enabled frequent integration, which enabled frequent delivery.
Results were dramatic: Development costs reduced by approximately 40%, programs delivered on schedule increased by roughly 140%.

Flickr: “10+ Deploys Per Day”

Context

Flickr’s 2009 presentation “10+ Deploys Per Day: Dev and Ops Cooperation” is credited with helping launch the DevOps movement. At a time when most organizations deployed quarterly, Flickr was deploying more than ten times per day.

The Challenge

Web-scale service: Serving billions of photos to millions of users
Ops/Dev divide: Traditional separation between development and operations teams
Fear of change: Deployments were infrequent because they were risky

What They Did

Built automated infrastructure provisioning and deployment
Implemented feature flags to decouple deployment from release
Created a culture of shared responsibility between development and operations
Made deployment a routine, low-ceremony event that anyone could trigger
Used IRC bots (and later chat-based tools) to coordinate and log deployments

Key Lessons

Culture is the enabler. Flickr’s technical practices were important, but the cultural shift - developers and operations working together, shared responsibility, mutual respect - was what made frequent deployment possible.
Tooling should reduce friction. Flickr’s deployment tools were designed to make deploying as easy as possible. The easier it is to deploy, the more often people deploy, and the smaller each deployment becomes.
Transparency builds trust. Logging every deployment in a shared channel let everyone see what was deploying, who deployed it, and whether it caused problems. This transparency built organizational trust in frequent deployment.

VXS: “CD: Superhuman Efforts are the New Normal”

Context

VXS Decision is a startup like thousands of others: founder-led vision, under-funded, time crunch, resource crunch, but when targeting Enterprise customers: How do you deliver reliable, Enterprise-grade software without the resources of an Enterprise? This led to the discovery of the framework of principles and patterns now formulated as “Agentic CD.”

The Challenge

produce demoware or build to use?
fast output leads to structural inconsistency
architectural drift
how and what to document?
keeping the codebase maintainable

What They Did

Experimented with LLM for code generation
Applied rigorous CD practices to the work with AI agents
Mandated additional first-class artifacts in the repo
Standardized the approach of working with AI agents
Crunched Agentic CD pipeline cycles to deliver entire features in hours

Key Lessons

Agents Drift. Documentation on top of the codebases provides containment for inconsistency and duplication.
You need to extend your definition of ‘deliverable’. Code must not merely exist and pass the tests, it must be consistent with documented architecture and descriptions.
First-class artifacts are the true product. These include intent, behaviour, design, and decisions. With these, an LLM can reconstruct the product even without having access to the code itself.
You need a third folder in your repo. Where formally, /src and /test did the entire work, the /docs folder becomes your lifeline.

Agentic CD Additions

Additional practices required for LLM-assisted development:

Intent-first workflow. Anchor the implementation with a proper intent statement: what, why, for whom.
Delta & overlap analysis. Agents can compare new features against the existing system, detect redundancy, conflict, structural drift. The most interesting question becomes: “How does this relate to what we currently do?”
Structured documentation layers. User guides, feature descriptions, architectural decision records (ADRs) and system structure documentation become the glue of your system.
Human In the Loop. Key artifacts can be generated by Agents, but HITL is necessary to capture drift. Intent and decisions are human territory, behaviour and design must be actively guided by humans.
The docs are for the machine, not for humans. Documentation artifacts must be structured to guide Agents in implementation with minimal context windows, not to “read nicely” for humans.
- ASCII art beats photos, illustrations or doodles.
- Short paragraphs, no filler words. Consistent language.
- Optimize documentation to reference paragraphs to the Agents quickly and effectively.
- Cross-reference documents to reduce Agentic search efforts.

Outcomes

Delivery Speed measured in end-to-end cycle time:
- less than 1 hour for small changes and roughly 1 day for a large feature set
- sustained 10x-30x increase in development throughput, consistent over months
Quality: Every feature ships with: documentation, test coverage, linting, security review, architectural consistency, avoiding typical “AI slop” patterns
Operational Confidence boosted by ensuring every change is integrated, validated, reproducible, and deployable from a technical, organizational and product perspective alike.
Team Scalability:
- approach teachable to new joiners within days
- getting the startup out of the “resource pickle.”

Key Lessons

LLMs without CD discipline create entropy: speed without structure degrades system integrity
Agentic CD principles are scale-independent: the same patterns apply in a startup as in an enterprise. The startup even benefits more, because it can scale/pivot within hours.
Agentic development requires additional artifacts: those documents you thought you can skip to speed things up? They become your product!
The bottleneck moves from typing code to maintaining coherence: You will be investing more time keeping your first-class documents correct and consistent than into writing code. Referencing the right document sections becomes your steering panel.

The VXS Journey to Discover Agentic CD

In 2023, early experiments with LLM-generated code looked promising but quickly broke down in practice. The models produced working code, but integration was tedious, structure drifted, and quality was inconsistent. Available tooling accelerated output but also amplified architectural chaos. Attempts to adopt community conventions created additional noise and documentation bloat rather than clarity. The result was a clear pattern: without structure, AI increases speed but destroys coherence.

The breakthrough came from systematically applying Continuous Delivery principles directly to agentic development. Every feature began with an explicit intent, aligned against existing system structure, documented, tested, and only then implemented. Documentation, ADRs, and tests became first-class artifacts in the repository, acting as control surfaces for the AI. With a single pipeline and strict definition of “deployable,” the system stabilized. The outcome was sustained 10x-30x delivery performance with consistent quality. This showed that Continuous Delivery is not dependent on scale or large platform teams - its principles hold even in a startup using agentic development.

Common Patterns Across Reports

Despite the diversity of these organizations, several patterns emerge consistently:

1. Investment in Automation Precedes Cultural Change

Every organization built the tooling first. Automated testing, automated deployment, automated rollback - these created the conditions where frequent deployment was possible. Cultural change followed when people saw that the automation worked.

2. Incremental Adoption, Not Big Bang

No organization switched to continuous deployment overnight. They all moved incrementally: shorter release cycles first, then weekly deploys, then daily, then on-demand. Each step built confidence for the next.

3. Team Ownership Is Essential

Organizations that gave teams ownership of their deployments (build it, run it) moved faster than those that kept deployment as a centralized function. Ownership creates accountability, which drives quality.

4. Feature Flags Are Universal

Every organization in these reports uses feature flags to decouple deployment from release. This is not optional for continuous deployment - it is foundational.

5. The Results Are Consistent

Regardless of industry, size, or starting point, organizations that adopt continuous deployment consistently report:

Higher deployment frequency (daily or more)
Lower change failure rate (small changes fail less)
Faster recovery (automated rollback, small blast radius)
Higher developer satisfaction (less toil, more impact)
Better business outcomes (faster time to market, reduced costs)

Applying These Lessons to Your Migration

You do not need to be Google-sized to benefit from these patterns. Extract what applies:

Start with automation. Build the pipeline, the tests, the rollback mechanism.
Adopt incrementally. Move from monthly to weekly to daily. Do not try to jump to 10 deploys per day on day one.
Give teams ownership. Let teams deploy their own services.
Use feature flags. Decouple deployment from release.
Measure and improve. Track DORA metrics. Run experiments. Use retrospectives.

These are the practices covered throughout this migration guide. The experience reports confirm that they work - not in theory, but in production, at scale, in the real world.

Additional Experience Reports

These reports did not fit neatly into the case studies above but provide valuable perspectives:

Ken Mugrage on trunk-based development as part of modern Continuous Delivery - A practitioner’s view of how TBD enables CD in practice
Integrating Security Feedback into a BDD-Driven Minimum CD Pipeline - A detailed walk-through of building a CD pipeline with security testing integrated from the start

6 - Migrating Brownfield to CD

Already have a running system? A phased approach to migrating existing applications and teams to continuous delivery.

Most teams adopting CD are not starting from scratch. They have existing codebases, existing processes, existing habits, and existing pain. This section provides the phased migration path from where you are today to continuous delivery, without stopping feature delivery along the way.

The Reality of Brownfield Migration

Migrating an existing system to CD is harder than building CD into a greenfield project. You are working against inertia: existing branching strategies, existing test suites (or lack thereof), existing deployment processes, and existing team habits. Every change has to be made incrementally, alongside regular delivery work.

The good news: every team that has successfully adopted CD has done it this way. The practices in this guide are designed for incremental adoption, not big-bang transformation.

The Migration Phases

The migration is organized into five phases. Each phase builds on the previous one. Start with Phase 0 to understand where you are, then work through the phases in order.

Phase	Name	Goal	Key Question
0	Assess	Understand where you are	“How far are we from CD?”
1	Foundations	Daily integration, testing, small work	“Can we integrate safely every day?”
2	Pipeline	Automated path to production	“Can we deploy any commit automatically?”
3	Optimize	Improve flow, reduce batch size	“Can we deliver small changes quickly?”
4	Deliver on Demand	Deploy any change when needed	“Can we deliver any change to production when needed?”

Where to Start

If you don’t know where you stand

Start with Phase 0 - Assess. Complete the value stream mapping exercise, take baseline metrics, and fill out the current-state checklist. These activities tell you exactly where you stand and which phase to begin with.

If you know your biggest pain point

Start with Anti-Patterns. Find the problem your team feels most, and follow the links to the practices and migration phases that address it.

Quick self-assessment

If you don’t have time for a full assessment, answer these questions:

Do all developers integrate to trunk at least daily? If no, start with Phase 1.
Do you have a single automated pipeline that every change goes through? If no, start with Phase 2.
Can you deploy any green build to production on demand? If no, focus on the gap between your current state and Phase 2 completion criteria.
Do you deploy at least weekly? If no, look at Phase 3 for batch size and flow optimization.

Principles for Brownfield Migration

Do not stop delivering features

The migration is done alongside regular delivery work, not instead of it. Each practice is adopted incrementally. You do not stop the world to rewrite your test suite or redesign your pipeline.

Fix the biggest constraint first

Use your value stream map and metrics to identify which blocker is the current constraint. Fix that one thing. Then find the next constraint and fix that. Do not try to fix everything at once.

See Identify Constraints and the CD Dependency Tree.

Make progress visible

Track your DORA metrics from day one: deployment frequency, lead time for changes, change failure rate, and mean time to restore. These metrics show whether your changes are working and build the case for continued investment.

See Baseline Metrics.

Start with one team

CD adoption works best when a single team can experiment, learn, and iterate without waiting for organizational consensus. Once one team demonstrates results, other teams have a concrete example to follow.

Common Brownfield Challenges

These challenges are specific to migrating existing systems. For the full catalog of problems teams face, see Anti-Patterns.

Challenge	Why it’s hard	Approach
Large codebase with no tests	Writing tests retroactively is expensive and the ROI feels unclear	Do not try to add tests to the whole codebase. Add tests to every file you touch. Use the test-for-every-bug-fix rule. Coverage grows where it matters most.
Long-lived feature branches	The team has been using feature branches for years and the workflow feels safe	Reduce branch lifetime gradually: from two weeks to one week to two days to same-day. Do not switch to trunk overnight.
Manual deployment process	The “deployment expert” has a 50-step runbook in their head	Document the manual process first. Then automate one step at a time, starting with the most error-prone step.
Flaky test suite	Tests that randomly fail have trained the team to ignore failures	Quarantine all flaky tests immediately. They do not block the build until they are fixed. Zero tolerance for new flaky tests.
Tightly coupled architecture	Changing one module breaks others unpredictably	You do not need microservices. You need clear boundaries. Start by identifying and enforcing module boundaries within the monolith.
Organizational resistance	“We’ve always done it this way”	Start small, show results, build the case with data. One team deploying daily with lower failure rates is more persuasive than any slide deck.

Anti-Patterns - Start with the problem you feel most
Phase 0 - Assess - Understand your current state

6.1 - Document Your Current Process

Before formal value stream mapping, get the team to write down every step from “ready to push” to “running in production.” Quick wins surface immediately; the documented process becomes better input for the value stream mapping session.

The Brownfield CD overview covers the migration phases, principles, and common challenges. This page covers the first practical step - documenting what actually happens today between a developer finishing a change and that change running in production.

Why Document Before Mapping

Value stream mapping is a powerful tool for systemic improvement. It requires measurement, cross-team coordination, and careful analysis. That takes time to do well, and it should not be rushed.

But you do not need a value stream map to spot obvious friction. Manual steps that could be automated, wait times caused by batching, handoffs that exist only because of process - these are visible the moment you write the process down.

Document your current process first. This gives you two things:

Quick wins you can fix this week. Obvious waste that requires no measurement or cross-team coordination to remove.
Better input for value stream mapping. When you do the formal mapping session, the team is not starting from a blank whiteboard. They have a shared, written description of what actually happens, and they have already removed the most obvious friction.

Quick wins build momentum. Teams that see immediate improvements are more willing to invest in the deeper systemic work that value stream mapping reveals.

How to Do It

Get the team together. Pick a recent change that went through the full process from “ready to push” to “running in production.” Walk through every step that happened, in order.

The rules:

Document what actually happens, not what should happen. If the official process says “automated deployment” but someone actually SSH-es into a server and runs a script, write down the SSH step.
Include the invisible steps. The Slack message asking for review. The email requesting deploy approval. The wait for the Tuesday deploy window. These are often the biggest sources of delay and they are usually missing from official process documentation.
Get the whole team in the room. Different people see different parts of the process. The developer who writes the code may not know what happens after the merge. The ops person who runs the deploy may not know about the QA handoff. You need every perspective.
Write it down as an ordered list. Not a flowchart, not a diagram, not a wiki page with sections. A simple numbered list of steps in the order they actually happen.

What to Capture for Each Step

For every step in the process, capture these details:

Field	What to Write	Example
Step name	What happens, in plain language	“QA runs manual regression tests”
Who does it	Person or role responsible	“QA engineer on rotation”
Manual or automated	Is this step done by a human or by a tool?	“Manual”
Typical duration	How long the step itself takes	“4 hours”
Wait time before it starts	How long the change sits before this step begins	“1-2 days (waits for QA availability)”
What can go wrong	Common failure modes for this step	“Tests find a bug, change goes back to dev”

The wait time column is usually more revealing than the duration column. A deploy that takes 10 minutes but only happens on Tuesdays has up to 7 days of wait time. The step itself is not the problem - the batching is.

Example: A Typical Brownfield Process

This is a realistic example of what a brownfield team’s process might look like before any CD practices are adopted. Your process will differ, but the pattern of manual steps and wait times is common.

#	Step	Who	Manual/Auto	Duration	Wait Before	What Can Go Wrong
1	Push to feature branch	Developer	Manual	Minutes	None	Merge conflicts with other branches
2	Open pull request	Developer	Manual	10 min	None	Forgot to update tests
3	Wait for code review	Developer (waiting)	Manual	-	4 hours to 2 days	Reviewer is busy, PR sits
4	Address review feedback	Developer	Manual	30 min to 2 hours	-	Multiple rounds of feedback
5	Merge to main branch	Developer	Manual	Minutes	-	Merge conflicts from stale branch
6	CI runs (build + unit tests)	CI server	Automated	15 min	Minutes	Flaky tests cause false failures
7	QA picks up ticket from board	QA engineer	Manual	-	1-3 days	QA backlog, other priorities
8	Manual functional testing	QA engineer	Manual	2-4 hours	-	Finds bug, sends back to dev
9	Request deploy approval	Team lead	Manual	5 min	-	Approver is on vacation
10	Wait for deploy window	Everyone (waiting)	-	-	1-7 days (deploys on Tuesdays)	Window missed, wait another week
11	Ops runs deployment	Ops engineer	Manual	30 min	-	Script fails, manual rollback
12	Smoke test in production	Ops engineer	Manual	15 min	-	Finds issue, emergency rollback

Total typical time: 3 to 14 days from “ready to push” to “running in production.”

Even before measurement or analysis, patterns jump out:

Steps 3, 7, and 10 are pure wait time - nothing is happening to the change.
Steps 8 and 12 are manual testing that could potentially be automated.
Step 10 is artificial batching - deploys happen on a schedule, not on demand.
Step 9 might be a rubber-stamp approval that adds delay without adding safety.

Spotting Quick Wins

Once the process is documented, look for these patterns. Each one is a potential quick win that the team can fix without a formal improvement initiative.

Automation targets

Steps that are purely manual but have well-known automation:

Code formatting and linting. If reviewers spend time on style issues, add a linter to CI. This saves reviewer time on every single PR.
Running tests. If someone manually runs tests before merging, make CI run them automatically on every push.
Build and package. If someone manually builds artifacts, automate the build in the pipeline.
Smoke tests. If someone manually clicks through the app after deploy, write a small set of automated smoke tests.

Batching delays

Steps where changes wait for a scheduled event:

Deploy windows. “We deploy on Tuesdays” means every change waits an average of 3.5 days. Moving to deploy-on-demand (even if still manual) removes this wait entirely.
QA batches. “QA tests the release candidate” means changes queue up. Testing each change as it merges removes the batch.
CAB meetings. “The change advisory board meets on Thursdays” adds up to a week of wait time per change.

Process-only handoffs

Steps where work moves between people not because of a skill requirement, but because of process:

QA sign-off that is a rubber stamp. If QA always approves and never finds issues, the sign-off is not adding value.
Approval steps that are never rejected. Track the rejection rate. If an approval step has a 0% rejection rate over the last 6 months, it is ceremony, not a gate.
Handoffs between people who sit next to each other. If the developer could do the step themselves but “process says” someone else has to, question the process.

Unnecessary steps

Steps that exist because of historical reasons and no longer serve a purpose:

Manual steps that duplicate automated checks. If CI runs the tests and someone also runs them manually “just to be sure,” the manual run is waste.
Approvals for low-risk changes. Not every change needs the same level of scrutiny. A typo fix in documentation does not need a CAB review.

Quick Wins vs. Value Stream Improvements

Not everything you find in the documented process is a quick win. Distinguish between the two:

	Quick Wins	Value Stream Improvements
Scope	Single team can fix	Requires cross-team coordination
Timeline	Days to a week	Weeks to months
Measurement	Obvious before/after	Requires baseline metrics and tracking
Risk	Low - small, reversible changes	Higher - systemic process changes
Examples	Add linter to CI, remove rubber-stamp approval, enable on-demand deploys	Restructure testing strategy, redesign deployment pipeline, change team topology

Do the quick wins now. Do not wait for the value stream mapping session. Every manual step you remove this week is one less step cluttering the value stream map and one less source of friction for the team.

Bring the documented process to the value stream mapping session. The team has already aligned on what actually happens, removed the obvious waste, and built some momentum. The value stream mapping session can focus on the systemic issues that require measurement, cross-team coordination, and deeper analysis.

What Comes Next

Fix the quick wins. Assign each one to someone with a target of this week or next week. Do not create a backlog of improvements that sits untouched.
Schedule the value stream mapping session. Use the documented process as the starting point. See Value Stream Mapping.
Start the replacement cycle. For manual validations that are not quick wins, use the Replacing Manual Validations cycle to systematically automate and remove them.

Value Stream Mapping - The formal analysis tool for systemic improvements
Replacing Manual Validations - The cycle for automating and removing manual steps
Identify Constraints - Prioritize which bottleneck to fix first
Baseline Metrics - Measure your starting point before making changes

6.2 - Replacing Manual Validations with Automation

The repeating mechanical cycle at the heart of every brownfield CD migration: identify a manual validation, automate it, prove the automation works, and remove the manual step.

The Brownfield CD overview covers the migration phases, principles, and common challenges. This page covers the core mechanical process - the specific, repeating cycle of replacing manual validations with automation that drives every phase forward.

The Replacement Cycle

Every brownfield CD migration follows the same four-step cycle, repeated until no manual validations remain between commit and production:

Identify a manual validation in the delivery process.
Automate the check so it runs in the pipeline without human intervention.
Validate that the automation catches the same problems the manual step caught.
Remove the manual step from the process.

Then pick the next manual validation and repeat.

Two rules make this cycle work:

Do not skip “validate.” Run the manual and automated checks in parallel long enough to prove the automation catches what the manual step caught. Without this evidence, the team will not trust the automation, and the manual step will creep back.
Do not skip “remove.” Keeping both the manual and automated checks adds cost without removing it. The goal is replacement, not duplication. Once the automated check is proven, retire the manual step explicitly.

Inventory Your Manual Validations

Before you can replace manual validations, you need to know what they are. A value stream map is the fastest way to find them. Walk the path from commit to production and mark every point where a human has to inspect, approve, verify, or execute something before the change can move forward.

Common manual validations and where they typically live:

Manual Validation	Where It Lives	What It Catches
Manual regression testing	QA team runs test cases before release	Functional regressions in existing features
Code style review	PR review checklist	Formatting, naming, structural consistency
Security review	Security team sign-off before deploy	Vulnerable dependencies, injection risks, auth gaps
Environment configuration	Ops team configures target environment	Missing env vars, wrong connection strings, incorrect feature flags
Smoke testing	Someone clicks through the app after deploy	Deployment-specific failures, broken integrations
Change advisory board	CAB meeting approves production changes	Risk assessment, change coordination, rollback planning
Database migration review	DBA reviews and runs migration scripts	Schema conflicts, data loss, performance regressions

Your inventory will include items not on this list. That is expected. The list above covers the most common ones, but every team has process-specific manual steps that accumulated over time.

Prioritize by Effort and Friction

Not all manual validations are equal. Some cause significant delay on every release. Others are quick and infrequent. Prioritize by mapping each validation on two axes:

Friction (vertical axis - how much pain the manual step causes):

How often does it run? (every commit, every release, quarterly)
How long does it take? (minutes, hours, days)
How often does it produce errors? (rarely, sometimes, frequently)

High-frequency, long-duration, error-prone validations cause the most friction.

Effort to automate (horizontal axis - how hard is the automation):

Is the codebase ready? (clean interfaces vs. tightly coupled)
Do tools exist? (linters, test frameworks, scanning tools)
Is the validation well-defined? (clear pass/fail vs. subjective judgment)

Start with high-friction, low-effort validations. These give you the fastest return and build momentum for harder automations later. This is the same constraint-based thinking described in Identify Constraints - fix the biggest bottleneck first.

	Low Effort	High Effort
High Friction	Start here - fastest return	Plan these - high value but need investment
Low Friction	Do these opportunistically	Defer - low return for high cost

Walkthrough: Replacing Manual Regression Testing

A concrete example of the full cycle applied to a common brownfield problem.

Starting state

The QA team runs 200 manual test cases before every release. The full regression suite takes three days. Releases happen every two weeks, so the team spends roughly 20% of every sprint on manual regression testing.

Step 1: Identify

The value stream map shows the 3-day manual regression cycle as the single largest wait time between “code complete” and “deployed.” This is the constraint.

Step 2: Automate (start small)

Do not attempt to automate all 200 test cases at once. Rank the test cases by two criteria:

Failure frequency: Which tests actually catch bugs? (In most suites, a small number of tests catch the majority of real regressions.)
Business criticality: Which tests cover the highest-risk functionality?

Pick the top 20 test cases by these criteria. Write automated tests for those 20 first. This is enough to start the validation step.

Step 3: Validate (parallel run)

Run the 20 automated tests alongside the full manual regression suite for two or three release cycles. Compare results:

Did the automated tests catch the same failures the manual tests caught?
Did the automated tests miss anything the manual tests caught?
Did the automated tests catch anything the manual tests missed?

Track these results explicitly. They are the evidence the team needs to trust the automation.

Step 4: Remove

Once the automated tests have proven equivalent for those 20 test cases across multiple cycles, remove those 20 test cases from the manual regression suite. The manual suite is now 180 test cases - taking roughly 2.7 days instead of 3.

Repeat

Pick the next 20 highest-value test cases. Automate them. Validate with parallel runs. Remove the manual cases. The manual suite shrinks with each cycle:

Cycle	Manual Test Cases	Manual Duration	Automated Tests
Start	200	3.0 days	0
1	180	2.7 days	20
2	160	2.4 days	40
3	140	2.1 days	60
4	120	1.8 days	80
5	100	1.5 days	100

Each cycle also gets faster because the team builds skill and the test infrastructure matures. For more on structuring automated tests effectively, see Testing Fundamentals and Functional Testing.

When Refactoring Is a Prerequisite

Sometimes you cannot automate a validation because the code is not structured for it. In these cases, refactoring is a prerequisite step within the replacement cycle - not a separate initiative.

Code-Level Blocker	Why It Prevents Automation	Refactoring Approach
Tight coupling between modules	Cannot test one module without setting up the entire system	Extract interfaces at module boundaries so modules can be tested in isolation
Hardcoded configuration	Cannot run the same code in test and production environments	Extract configuration into environment variables or config files
No clear entry points	Cannot call business logic without going through the UI	Extract business logic into callable functions or services
Shared mutable state	Test results depend on execution order and are not repeatable	Isolate state by passing dependencies explicitly instead of using globals
Scattered database access	Cannot test logic without a running database and specific data	Consolidate data access behind a repository layer that can be substituted in tests

The key discipline: refactor only the minimum needed for the specific validation you are automating. Do not expand the refactoring scope beyond what the current cycle requires. This keeps the refactoring small, low-risk, and tied to a concrete outcome.

For more on decoupling strategies, see Architecture Decoupling.

The Compounding Effect

Each completed replacement cycle frees time that was previously spent on manual validation. That freed time becomes available for the next automation cycle. The pace of migration accelerates as you progress:

Cycle	Manual Time per Release	Time Available for Automation	Cumulative Automated Checks
Start	5 days	Limited (squeezed between feature work)	0
After 2 cycles	4 days	1 day freed	2 validations automated
After 4 cycles	3 days	2 days freed	4 validations automated
After 6 cycles	2 days	3 days freed	6 validations automated
After 8 cycles	1 day	4 days freed	8 validations automated

Early cycles are the hardest because you have the least available time. This is why starting with the highest-friction, lowest-effort validation matters - it frees the most time for the least investment.

The same compounding dynamic applies to small batches - smaller changes are easier to validate, which makes each cycle faster, which enables even smaller changes.

Small Steps in Everything

The replacement cycle embodies the same small-batch discipline that CD itself requires. The principle applies at every level of the migration:

Automate one validation at a time. Do not try to build the entire pipeline in one sprint.
Refactor one module at a time. Do not launch a “tech debt initiative” to restructure the whole codebase before you can automate anything.
Remove one manual check at a time. Do not announce “we are eliminating manual QA” and try to do it all at once.

The risk of big-step migration:

The work stalls because the scope is too large to complete alongside feature delivery.
ROI is distant because nothing is automated until everything is automated.
Feature delivery suffers because the team is consumed by a transformation project instead of delivering value.

This connects directly to the brownfield migration principle: do not stop delivering features. The replacement cycle is designed to produce value at every iteration, not only at the end.

For more on decomposing work into small steps, see Work Decomposition.

Measuring Progress

Track these metrics to gauge migration progress. Start collecting them from baseline before you begin replacing validations.

Metric	What It Tells You	Target Direction
Manual validations remaining	How many manual steps still exist between commit and production	Down to zero
Time spent on manual validation per release	How much calendar time manual checks consume each release cycle	Decreasing each quarter
Pipeline coverage %	What percentage of validations are automated in the pipeline	Increasing toward 100%
Deployment frequency	How often you deploy to production	Increasing
Lead time for changes	Time from commit to production	Decreasing

If manual validations remaining is decreasing but deployment frequency is not increasing, you may be automating low-friction validations that are not on the critical path. Revisit your prioritization and focus on the validations that are actually blocking faster delivery.

Value Stream Mapping - Find your manual validations
Identify Constraints - Prioritize which validation to replace first
Baseline Metrics - Measure your starting point
Testing Fundamentals - Build automated tests that replace manual testing
Work Decomposition - Break migration work into small steps
Small Batches - The principle behind incremental replacement
Architecture Decoupling - Refactoring strategies for testability
Deterministic Pipeline - Where automated validations live
Functional Testing - Structuring automated functional tests
Pipeline Enforcement and Expert Agents - Applying the replacement cycle to AI expert validation agents

7 - CD for Greenfield Projects

Starting a new project? Build continuous delivery in from day one instead of retrofitting it later.

Starting with CD is dramatically easier than migrating to it. When there is no legacy process, no existing test suite to fix, and no entrenched habits to change, you can build the right practices from the first commit. This section shows you how.

Why Start with CD

Teams that build CD into a new project from the beginning avoid the most painful parts of the migration journey. There is no test suite to rewrite, no branching strategy to unwind, no deployment process to automate after the fact. Every practice described in this guide can be adopted on day one when there is no existing codebase to constrain you.

The cost of adopting CD practices in a greenfield project is near zero. The cost of retrofitting them into a mature codebase can be months of work. The earlier you start, the less it costs.

What to Build from Day One

Pipeline first

Before writing application code, set up your delivery pipeline. The pipeline is feature zero. Your first commit should include:

A build script that compiles, tests, and packages the application
A CI configuration that runs on every push to trunk
A deployment mechanism (even if the first “deployment” is to a local environment)
Every validation you know you will need from the start

The validations you put in the pipeline on day one define the quality standard for the application. They are not overhead you add later - they are the mold that shapes every line of code that follows. If you add linting after 10,000 lines of code, you are fixing 10,000 lines of code. If you add it before the first line, every line is written to the standard.

Feature zero validations:

Code style and formatting - Enforce a formatter (Prettier, Black, gofmt) so style is never a code review conversation. The pipeline rejects code that is not formatted.
Linting - Static analysis rules for your language (ESLint, pylint, golangci-lint). Catches bugs, enforces idioms, and prevents anti-patterns before review.
Type checking - If your language supports static types (TypeScript, mypy, Java), enable strict mode from the start. Relaxing later is easy. Tightening later is painful.
Test framework - The test runner is configured and a first test exists, even if it only asserts that the application starts. The team should never have to set up testing infrastructure - it is already there.
Security scanning - Dependency vulnerability scanning (Dependabot, Snyk, Trivy) and basic SAST rules. Security findings block the build from day one, so the team never accumulates a backlog of vulnerabilities.
Commit message or PR conventions - If you enforce conventional commits, changelog generation, or PR title formats, add the check now.

Every one of these is trivial to add to an empty project and expensive to retrofit into a mature codebase. The pipeline enforces them automatically, so the team never has to argue about them in review. The conversation shifts from “should we fix this?” to “the pipeline already enforces this.”

The pipeline should exist before the first feature. Every feature you build will flow through it and meet every standard you defined on day one.

Deploy “hello world” to production

Your first deployment should happen before your first feature. Deploy the simplest possible application - a health check endpoint, a static page, a “hello world” - all the way to production through your pipeline. This is the single most important validation you can do early because it proves the entire path works: build, test, package, deploy, verify.

Why production, not staging: The goal is to prove the full path works end-to-end. If you deploy only to a staging environment, you have proven that the pipeline works up to staging. You have not proven that production credentials, network routes, DNS, load balancers, permissions, and deployment targets are correctly configured. Every gap between your test environment and production is an assumption that will be tested for the first time under pressure, when it matters most.

Deploy “hello world” to production on day one, and you will discover:

Whether the team has the access and permissions to deploy
Whether the infrastructure provisioning actually works
Whether the deployment mechanism handles a real production environment
Whether monitoring and health checks are wired up correctly
Whether rollback works before you need it in an emergency

All of these are problems you want to find with a “hello world,” not with a real feature under a deadline.

Warning: deploying only to lower environments

If organizational constraints prevent you from deploying to production immediately, deploy as close to production as you can. But be explicit about what this means: every environment that is not production is an approximation. Lower environments may differ in network topology, security policies, resource capacity, data volume, and third-party integrations. Each difference is a gap in your confidence.

Track these gaps. Document every known difference between your deployment target and production. Treat closing each gap as a priority, because until you have deployed to production through your pipeline, you have not fully validated the path. The longer you wait, the more assumptions accumulate, and the riskier the first real production deployment becomes.

Trunk-based development from the start

There is no reason to start with long-lived branches. From commit one:

All work happens on trunk (or short-lived branches that merge to trunk within a day)
The pipeline runs on every integration to trunk
Trunk is always in a deployable state

See Trunk-Based Development for the practices.

Test architecture from the start

Design your test architecture before you have tests to migrate. Establish:

Unit tests for all business logic
Integration tests for every external boundary (databases, APIs, message queues)
Functional tests that exercise your service in isolation with test doubles for dependencies
Contract tests for every external dependency
A clear rule: everything that blocks deployment is deterministic

See Testing Fundamentals for the full test architecture.

Small, vertical slices from the start

Decompose the first features into small, independently deployable increments. Establish the habit of delivering thin vertical slices before the team has a chance to develop a batch mindset.

See Work Decomposition for slicing techniques.

Greenfield Checklist

Use this checklist to verify your new project is set up for CD from the start.

Pipeline Basics

CI pipeline runs on every push to trunk
Build, test, and package happen with a single command
First unit test exists and passes
All work integrates to trunk at least daily
Deployment to at least one environment is automated

Quality Gates

Test architecture established (unit, integration, functional layers)
External dependencies use test doubles in the deterministic test suite
Contract tests exist for at least one external dependency
Pipeline deploys to a production-like environment
Rollback is tested and works
Application configuration is externalized
Artifacts are immutable (build once, deploy everywhere)

Production Readiness

Pipeline deploys to production
Every commit that passes the pipeline is a deployment candidate
Deployment is a routine, low-risk event
Feature flags decouple deployment from release
DORA metrics are tracked (deployment frequency, lead time, change failure rate, MTTR)

Common Mistakes in Greenfield Projects

Mistake	Why it happens	What to do instead
“We’ll add tests later”	Pressure to show progress on features	Write the first test before the first feature. TDD from day one.
“We’ll set up the pipeline later”	Pipeline feels like overhead when there’s little code	The pipeline is the first thing you build. Features flow through it.
Starting with feature branches	Habit from previous projects	Trunk-based development from commit one. No reason to start with branches.
Designing for scale before you have users	Over-engineering from the start	Build the simplest thing that works. Deploy frequently. Evolve the architecture based on real feedback.
Skipping contract tests because “we own both services”	Feels redundant when one team owns everything	You will not own everything forever. Contract tests are cheap to add early and expensive to add later.

Testing Fundamentals - Build the right test architecture from the start
Trunk-Based Development - The branching model for CD
Pipeline Architecture - Design your pipeline structure
Work Decomposition - Deliver in small, vertical slices
Feature Flags - Decouple deployment from release

Migrate to CD

The Phases

Where to Start

Related Content

1 - Phase 0: Assess

What You’ll Do

Why This Phase Matters

When You’re Ready to Move On

Related Content

1.1 - Value Stream Mapping

What Is a Value Stream Map?

Prerequisites

Choose Your Mapping Approach

Bottom-Up: Map at the Team Level First

Top-Down: Map Across Dependent Teams

Combining Both Approaches

How to Run the Session

Step 1: Start From Delivery, Work Backward

Step 2: Capture Process Time and Wait Time for Each Step

Step 3: Calculate %C/A at Each Step

Step 4: Identify Constraints (Kaizen Bursts)

Reading the Results

What Good Looks Like

Next Step

Related Content

1.2 - Baseline Metrics

Why Measure Before Changing

The Four Essential Metrics

1. Deployment Frequency

2. Lead Time for Changes

3. Change Failure Rate

4. Mean Time to Restore (MTTR)

Capturing Your Baselines

What Your Baselines Tell You About Where to Focus

A Warning About Metrics

Next Step

Related Content

1.3 - Identify Constraints

The Theory of Constraints

Common Constraint Categories

Testing Bottlenecks

Deployment Gates

Environment Provisioning

Code Review Delays

Manual Handoffs

Using Your Value Stream Map to Find the Constraint

Step 1: Rank Steps by Wait Time

Step 2: Look for Rework Loops

Step 3: Count Handoffs

Step 4: Cross-Reference with Metrics

Prioritizing: Fix the Biggest One First

The Next Constraint

Next Step

Related Content

1.4 - Current State Checklist

How to Use This Checklist

Trunk-Based Development

Continuous Integration

Pipeline Practices

Deployment

Quality

Scoring Guide

Putting It All Together

Next Step

Related Content

2 - Phase 1: Foundations

What You’ll Do

Why This Phase Matters

When You’re Ready to Move On

Related Content

2.1 - Trunk-Based Development

What Is Trunk-Based Development?

What TBD Is Not

What Trunk-Based Development Improves

Two Migration Paths

Path 1: Short-Lived Branches

Path 2: Direct Trunk Commits

How to Choose Your Path

Essential Supporting Practices

Feature Flags