Phase 3: Optimize
Improve flow by reducing batch size, limiting work in progress, and using metrics to drive improvement.
Key question: “Can we deliver small changes quickly?”
With a working pipeline in place, this phase focuses on optimizing the flow of changes
through it. Smaller batches, feature flags, and WIP limits reduce risk and increase
delivery frequency.
What You’ll Do
- Reduce batch size - Deliver smaller, more frequent changes
- Use feature flags - Decouple deployment from release
- Limit work in progress - Focus on finishing over starting
- Drive improvement with metrics - Use DORA metrics and improvement kata
- Run effective retrospectives - Continuously improve the delivery process
- Decouple architecture - Enable independent deployment of components
- Align teams to code - Match team ownership to code boundaries for independent deployment
Why This Phase Matters
Having a pipeline isn’t enough - you need to optimize the flow through it. Teams that
deploy weekly with a CD pipeline are missing most of the benefits. Small batches reduce
risk, feature flags enable testing in production, and metrics-driven improvement creates
a virtuous cycle of getting better at getting better.
When You’re Ready to Move On
You’re ready for Phase 4: Deliver on Demand when:
- Most changes are small enough to deploy independently
- Feature flags let you deploy incomplete features safely
- Your WIP limits keep work flowing without bottlenecks
- You’re measuring and improving your DORA metrics regularly
Next: Phase 4 - Deliver on Demand - remove the last manual gates and deploy on demand.
1 - Small Batches
Deliver smaller, more frequent changes to reduce risk and increase feedback speed.
Batch size is the single biggest lever for improving delivery performance. This page covers what batch size means at every level - deploy frequency, commit size, and story size - and provides concrete techniques for reducing it.
Why Batch Size Matters
Large batches create large risks. When you deploy 50 changes at once, any failure could be caused by any of those 50 changes. When you deploy 1 change, the cause of any failure is obvious.
This is not a theory. The DORA research consistently shows that elite teams deploy more frequently, with smaller changes, and have both higher throughput and lower failure rates. Small batches are the mechanism that makes this possible.
“If it hurts, do it more often, and bring the pain forward.”
- Jez Humble, Continuous Delivery
Three Levels of Batch Size
Batch size is not just about deployments. It operates at three distinct levels, and optimizing only one while ignoring the others limits your improvement.
Level 1: Deploy Frequency
How often you push changes to production.
| State | Deploy Frequency | Risk Profile |
|---|---|---|
| Starting | Monthly or quarterly | Each deploy is a high-stakes event |
| Improving | Weekly | Deploys are planned but routine |
| Optimizing | Daily | Deploys are unremarkable |
| Elite | Multiple times per day | Deploys are invisible |
How to reduce: Remove manual gates, automate approval workflows, build confidence through progressive rollout. If your pipeline is reliable (Phase 2), the only thing preventing more frequent deploys is organizational habit.
Common objections to deploying more often:
- “Incomplete features have no value.” Value is not limited to end-user features. Every deployment provides value to other stakeholders: operations verifies that the change is safe, QA confirms quality gates pass, and the team reduces inventory waste by keeping unintegrated work near zero. A partially built feature deployed behind a flag validates the deployment pipeline and reduces the risk of the final release.
- “Our customers don’t want changes that frequently.” CD is not about shipping user-visible changes every hour. It is about maintaining the ability to deploy at any time. That ability is what lets you ship an emergency fix in minutes instead of days, roll out a security patch without a war room, and support production without heroics.
Level 2: Commit Size
How much code changes in each commit to trunk.
| Indicator | Too Large | Right-Sized |
|---|---|---|
| Files changed | 20+ files | 1-5 files |
| Lines changed | 500+ lines | Under 100 lines |
| Review time | Hours or days | Minutes |
| Merge conflicts | Frequent | Rare |
| Description length | Paragraph needed | One sentence suffices |
How to reduce: Practice TDD (write one test, make it pass, commit). Use feature flags to merge incomplete work. Pair program so review happens in real time.
Level 3: Story Size
How much scope each user story or work item contains.
A story that takes a week to complete is a large batch: a week of work piles up before integration, a week of assumptions goes untested, and a week of inventory sits in progress.
Target: Every story should be completable - coded, tested, reviewed, and integrated - in two days or less. If it cannot be, it needs to be decomposed further.
“If a story is going to take more than a day to complete, it is too big.”
This target is not aspirational. Teams that adopt hyper-sprints - iterations as short as 2.5 days - find that the discipline of writing one-day stories forces better decomposition and faster feedback. Teams that make this shift routinely see throughput double, not because people work faster, but because smaller stories flow through the system with less wait time, fewer handoffs, and fewer defects.
Behavior-Driven Development for Decomposition
BDD provides a concrete technique for breaking stories into small, testable increments. The Given-When-Then format forces clarity about scope.
The Given-When-Then Pattern
Each scenario becomes a deliverable increment. You can implement and deploy the first scenario before starting the second. This is how you turn a “discount feature” (large batch) into three independent, deployable changes (small batches).
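To make the pattern concrete, here is a minimal sketch in Python; the discount domain, the `apply_discount` helper, and the code names are hypothetical, chosen only to mirror the discount example above. Each test function is one Given-When-Then scenario, and each is a deliverable increment.

```python
# Hypothetical discount logic used to illustrate Given-When-Then scenarios.
def apply_discount(cart_total, code, valid_codes):
    """Apply a percentage discount if the code is valid; otherwise raise."""
    if code not in valid_codes:
        raise ValueError("invalid or expired code")
    return round(cart_total * (1 - valid_codes[code]), 2)

# Scenario 1: apply a simple percentage discount
def test_valid_code_reduces_total():
    # Given a cart totaling 100 and a 10% code "SAVE10"
    valid = {"SAVE10": 0.10}
    # When the customer applies "SAVE10"
    total = apply_discount(100.0, "SAVE10", valid)
    # Then the total is reduced to 90
    assert total == 90.0

# Scenario 2: reject expired discount codes
def test_expired_code_is_rejected():
    # Given a cart and no currently valid codes
    # When the customer applies an expired code
    # Then the discount is refused
    try:
        apply_discount(100.0, "OLD", {})
        assert False, "expected rejection"
    except ValueError:
        pass
```

Scenario 1 can be implemented, tested, and deployed before scenario 2 is started.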
Decomposing Stories Using Scenarios
When a story has too many scenarios, it is too large. Use this process:
- Write all the scenarios first. Before any code, enumerate every Given-When-Then for the story.
- Group scenarios into deliverable slices. Each slice should be independently valuable or at least independently deployable.
- Create one story per slice. Each story has 1-3 scenarios and can be completed in 1-2 days.
- Order the slices by value. Deliver the most important behavior first.
Example decomposition:
| Original Story | Scenarios | Sliced Into |
|---|---|---|
| “As a user, I can manage my profile” | 12 scenarios covering name, email, password, avatar, notifications, privacy, deactivation | 5 stories: basic info (2 scenarios), password (2), avatar (2), notifications (3), deactivation (3) |
ATDD: Connecting Scenarios to Daily Integration
BDD scenarios define what to build. Acceptance Test-Driven Development (ATDD) defines how to build it in small, integrated steps. The workflow is:
- Pick one scenario. Choose the next Given-When-Then from your story.
- Write the acceptance test first. Automate the scenario so it runs against the real system (or a close approximation). It will fail - this is the RED state.
- Write just enough code to pass. Implement the minimum production code to make the acceptance test pass - the GREEN state.
- Refactor. Clean up the code while the test stays green.
- Commit and integrate. Push to trunk. The pipeline verifies the change.
- Repeat. Pick the next scenario.
Each cycle produces a commit that is independently deployable and verified by an automated test. This is how BDD scenarios translate directly into a stream of small, safe integrations rather than a batch of changes delivered at the end of a story.
Key benefits:
- Every commit has a corresponding acceptance test, so you know exactly what it does and that it works.
- You never go more than a few hours without integrating to trunk.
- The acceptance tests accumulate into a regression suite that protects future changes.
- If a commit breaks something, the scope of the change is small enough to diagnose quickly.
Service-Level Decomposition Example
ATDD works at the API and service level, not just at the UI level. Here is an example of building an order history endpoint day by day:
Day 1 - Return an empty list for a customer with no orders:
Commit: Implement the endpoint, return an empty JSON array. Acceptance test passes.
Day 2 - Return a single order with basic fields:
Commit: Query the orders table, serialize basic fields. Previous test still passes.
Day 3 - Return multiple orders sorted by date:
Commit: Add sorting logic and pagination. All three tests pass.
Each day produces a deployable change. The endpoint is usable (though minimal) after day 1. No day requires more than a few hours of coding because the scope is constrained by a single scenario.
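The Day 1 increment might look like this in Python; the `ORDERS` dict and all names are hypothetical stand-ins for the real database and endpoint, kept to stdlib so the sketch is self-contained.

```python
import json

# Hypothetical in-memory store standing in for the real orders table.
ORDERS = {}  # customer_id -> list of order dicts

def get_order_history(customer_id):
    # Day 1: just enough code to make the first scenario pass.
    return json.dumps(ORDERS.get(customer_id, []))

# Day 1 acceptance test, written first (RED), then made to pass (GREEN).
def test_customer_with_no_orders_gets_empty_list():
    assert json.loads(get_order_history("cust-42")) == []
```

Day 2 would add a real query and field serialization behind the same function, keeping this test green.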
Vertical Slicing
A vertical slice cuts through all layers of the system to deliver a thin piece of end-to-end functionality. This is the opposite of horizontal slicing, where you build all the database changes, then all the API changes, then all the UI changes.
Horizontal vs. Vertical Slicing
Horizontal (avoid):
Story 1: Build the database schema for discounts
Story 2: Build the API endpoints for discounts
Story 3: Build the UI for applying discounts
Problems: Stories 1 and 2 deliver no user value. You cannot test end-to-end until story 3 is done. Integration risk accumulates.
Vertical (prefer):
Story 1: Apply a simple percentage discount (DB + API + UI for one scenario)
Story 2: Reject expired discount codes (DB + API + UI for one scenario)
Story 3: Apply discounts only to eligible items (DB + API + UI for one scenario)
Benefits: Every story delivers testable, deployable functionality. Integration happens with each story, not at the end. You can ship story 1 and get feedback before building story 2.
How to Slice Vertically
Ask these questions about each proposed story:
- Can a user (or another system) observe the change? If not, slice differently.
- Can I write an end-to-end test for it? If not, the slice is incomplete.
- Does it require all other slices to be useful? If yes, find a thinner first slice.
- Can it be deployed independently? If not, check whether feature flags could help.
Vertical slicing in distributed systems
The examples above assume a team that owns the full stack - UI, API, and database. In large distributed systems, most teams own a subdomain and may not be directly user-facing.
The principle is the same. A subdomain product team’s vertical slice cuts through all layers they control: the service API, the business logic, and the data store. “End-to-end” means end-to-end within your domain, not end-to-end across the entire system. The team deploys independently behind a stable contract, without coordinating with other teams.
The key difference is whether the public interface is designed for humans or machines. A full-stack product team owns a human-facing surface - the slice is done when a user can observe the behavior through that interface. A subdomain product team owns a machine-facing surface - the slice is done when the API contract satisfies the agreed behavior for its service consumers.
See Work Decomposition for diagrams of both contexts, and Horizontal Slicing for the failure mode that emerges when distributed teams split work by layer instead of by behavior.
Story Slicing Anti-Patterns
These are common ways teams slice stories that undermine the benefits of small batches:
Wrong: Slice by layer.
“Story 1: Build the database. Story 2: Build the API. Story 3: Build the UI.”
Right: Slice vertically so each story touches all layers and delivers observable behavior.
Wrong: Slice by activity.
“Story 1: Design. Story 2: Implement. Story 3: Test.”
Right: Each story includes all activities needed to deliver and verify one behavior.
Wrong: Create dependent stories.
“Story 2 cannot start until Story 1 is finished because it depends on the data model.”
Right: Each story is independently deployable. Use contracts, feature flags, or stubs to break dependencies between stories.
Wrong: Lose testability.
“This story just sets up infrastructure - there is nothing to test yet.”
Right: Every story has at least one automated test that verifies its behavior. If you cannot write a test, the slice does not deliver observable value.
Practical Steps for Reducing Batch Size
Step 1: Measure Current State
Before changing anything, measure where you are:
- Average commit size (lines changed per commit)
- Average story cycle time (time from start to done)
- Deploy frequency (how often changes reach production)
- Average changes per deploy (how many commits per deployment)
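As one hedged example of gathering the first metric, average commit size can be computed from `git log --format=%H --numstat` output. The parser below is a sketch and assumes exactly that output format (hash lines followed by tab-separated per-file counts):

```python
def average_commit_size(numstat_output):
    """Average lines changed per commit.

    Parses the output of `git log --format=%H --numstat`, where each
    commit hash appears on its own line followed by per-file lines of
    the form "<added><TAB><deleted><TAB><path>". Binary files report
    "-" for the counts and are skipped.
    """
    commits = 0
    total_lines = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and parts[0] != "-":
            total_lines += int(parts[0]) + int(parts[1])
        elif len(parts) == 1 and line.strip():
            commits += 1  # a commit hash line
    return total_lines / commits if commits else 0.0
```

Feed it `git log` output piped from your repository and track the number week over week.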
Step 2: Introduce Story Decomposition
- Start writing BDD scenarios before implementation
- Split any story estimated at more than 2 days
- Track the number of stories completed per week (expect this to increase as stories get smaller)
Step 3: Tighten Commit Size
- Adopt the discipline of “one logical change per commit”
- Use TDD to create a natural commit rhythm: write test, make it pass, commit
- Track average commit size and set a team target (e.g., under 100 lines)
Ongoing: Increase Deploy Frequency
- Deploy at least once per day, then work toward multiple times per day
- Remove any batch-oriented processes (e.g., “we deploy on Tuesdays”)
- Make deployment a non-event
Key Pitfalls
1. “Small stories take more overhead to manage”
This is true only if your process adds overhead per story (e.g., heavyweight estimation ceremonies, multi-level approval). The solution is to simplify the process, not to keep stories large. Overhead per story should be near zero for a well-decomposed story.
2. “Some things can’t be done in small batches”
Almost anything can be decomposed further. Database migrations can be done in backward-compatible steps. API changes can use versioning. UI changes can be hidden behind feature flags. The skill is in finding the decomposition, not in deciding whether one exists.
3. “We tried small stories but our throughput dropped”
This usually means the team is still working sequentially. Small stories require limiting WIP and swarming - see Limiting WIP. If the team starts 10 small stories instead of 2 large ones, they have not actually reduced batch size; they have increased WIP.
Measuring Success
Next Step
Small batches often require deploying incomplete features to production. Feature Flags provide the mechanism to do this safely.
2 - Feature Flags
Decouple deployment from release by using feature flags to control feature visibility.
Feature flags are the mechanism that makes trunk-based development and small batches safe. They let you deploy code to production without exposing it to users, enabling dark launches, gradual rollouts, and instant rollback of features without redeploying.
Why Feature Flags?
In continuous delivery, deployment and release are two separate events:
- Deployment is pushing code to production.
- Release is making a feature available to users.
Feature flags are the bridge between these two events. They let you deploy frequently (even multiple times a day) without worrying about exposing incomplete or untested features. This separation is what makes continuous deployment possible for teams that ship real products to real users.
When You Need Feature Flags (and When You Don’t)
Not every change requires a feature flag. Flags add complexity, and unnecessary complexity slows you down. Use this decision tree to determine the right approach.
Decision Tree
graph TD
Start[New Code Change] --> Q1{Is this a large or<br/>high-risk change?}
Q1 -->|Yes| Q2{Do you need gradual<br/>rollout or testing<br/>in production?}
Q1 -->|No| Q3{Is the feature<br/>incomplete or spans<br/>multiple releases?}
Q2 -->|Yes| UseFF1[YES - USE FEATURE FLAG<br/>Enables safe rollout<br/>and quick rollback]
Q2 -->|No| Q4{Do you need to<br/>test in production<br/>before full release?}
Q3 -->|Yes| Q3A{Can you use an<br/>alternative pattern?}
Q3 -->|No| Q5{Do different users/<br/>customers need<br/>different behavior?}
Q3A -->|New Feature| NoFF_NewFeature[NO FLAG NEEDED<br/>Connect to tests only,<br/>integrate in final commit]
Q3A -->|Behavior Change| NoFF_Abstraction[NO FLAG NEEDED<br/>Use branch by<br/>abstraction pattern]
Q3A -->|New API Route| NoFF_API[NO FLAG NEEDED<br/>Build route, expose<br/>as last change]
Q3A -->|Not Applicable| UseFF2[YES - USE FEATURE FLAG<br/>Enables trunk-based<br/>development]
Q4 -->|Yes| UseFF3[YES - USE FEATURE FLAG<br/>Dark launch or<br/>beta testing]
Q4 -->|No| Q6{Is this an<br/>experiment or<br/>A/B test?}
Q5 -->|Yes| UseFF4[YES - USE FEATURE FLAG<br/>Customer-specific<br/>toggles needed]
Q5 -->|No| Q7{Does change require<br/>coordination with<br/>other teams/services?}
Q6 -->|Yes| UseFF5[YES - USE FEATURE FLAG<br/>Required for<br/>experimentation]
Q6 -->|No| NoFF1[NO FLAG NEEDED<br/>Simple change,<br/>deploy directly]
Q7 -->|Yes| UseFF6[YES - USE FEATURE FLAG<br/>Enables independent<br/>deployment]
Q7 -->|No| Q8{Is this a bug fix<br/>or hotfix?}
Q8 -->|Yes| NoFF2[NO FLAG NEEDED<br/>Deploy immediately]
Q8 -->|No| NoFF3[NO FLAG NEEDED<br/>Standard deployment<br/>sufficient]
style UseFF1 fill:#90EE90
style UseFF2 fill:#90EE90
style UseFF3 fill:#90EE90
style UseFF4 fill:#90EE90
style UseFF5 fill:#90EE90
style UseFF6 fill:#90EE90
style NoFF1 fill:#FFB6C6
style NoFF2 fill:#FFB6C6
style NoFF3 fill:#FFB6C6
style NoFF_NewFeature fill:#FFB6C6
style NoFF_Abstraction fill:#FFB6C6
style NoFF_API fill:#FFB6C6
style Start fill:#87CEEB
Alternatives to Feature Flags
| Technique | How It Works | When to Use |
|---|---|---|
| Branch by Abstraction | Introduce an abstraction layer, build the new implementation behind it, switch when ready | Replacing an existing subsystem or library |
| Connect Tests Last | Build internal components without connecting them to the UI or API | New backend functionality that has no user-facing impact until connected |
| Dark Launch | Deploy the code path but do not route any traffic to it | New infrastructure, new services, or new endpoints that are not yet referenced |
These alternatives avoid the lifecycle overhead of feature flags while still enabling trunk-based development with incomplete work.
Implementation Approaches
Feature flags can be implemented at different levels of sophistication. Start simple and add complexity only when needed.
Level 1: Static Code-Based Flags
The simplest approach: a boolean constant or configuration value checked in code.
Pros: Zero infrastructure. Easy to understand. Works everywhere.
Cons: Changing a flag requires a deployment. No per-user targeting. No gradual rollout.
Best for: Teams starting out. Internal tools. Changes that will be fully on or fully off.
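A Level 1 flag might look like this; the flag constant and both checkout paths are hypothetical names used only for illustration:

```python
# Level 1: a static flag is just a constant checked at the call site.
NEW_CHECKOUT_ENABLED = False  # flipping this requires a deployment

def legacy_checkout(cart):
    return {"flow": "legacy", "items": len(cart)}

def new_checkout(cart):
    return {"flow": "new", "items": len(cart)}

def checkout(cart):
    if NEW_CHECKOUT_ENABLED:
        return new_checkout(cart)  # in-progress path, deployed but dark
    return legacy_checkout(cart)   # current behavior, unchanged
```

The new path ships to production on every deploy but stays dark until the constant is flipped in a later commit.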
Level 2: Dynamic In-Process Flags
Flags stored in a configuration file, database, or environment variable that can be changed at runtime without redeploying.
Pros: No redeployment needed. Supports percentage rollout. Simple to implement.
Cons: Each instance reads its own config - no centralized view. Limited targeting capabilities.
Best for: Teams that need gradual rollout but do not want to adopt a third-party service yet.
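One possible Level 2 sketch reads a JSON config file on every check, so an operator (or config management) can flip a flag by editing the file rather than redeploying. The file format and class name are illustrative assumptions:

```python
import json

class FileBackedFlags:
    """Flags read from a JSON file on each check, e.g. {"new-checkout": true}.

    Because the file is re-read at evaluation time, changing a flag needs
    no redeploy; each instance sees its own copy of the config.
    """
    def __init__(self, path):
        self.path = path

    def is_enabled(self, name, default=False):
        try:
            with open(self.path) as f:
                return bool(json.load(f).get(name, default))
        except FileNotFoundError:
            return default
```

The same shape works with an environment variable, a database row, or a shared config service as the backing store.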
Level 3: Centralized Flag Service
A dedicated service (self-hosted or SaaS) that manages all flags, provides a dashboard, supports targeting rules, and tracks flag usage.
Examples: LaunchDarkly, Unleash, Flagsmith, Split, or a custom internal service.
Pros: Centralized management. Rich targeting (by user, plan, region, etc.). Audit trail. Real-time changes.
Cons: Added dependency. Cost (for SaaS). Network latency for flag evaluation (mitigated by local caching in most SDKs).
Best for: Teams at scale. Products with diverse user segments. Regulated environments needing audit trails.
Level 4: Infrastructure Routing
Instead of checking flags in application code, route traffic at the infrastructure level (load balancer, service mesh, API gateway).
Pros: No application code changes. Clean separation of routing from logic. Works across services.
Cons: Requires infrastructure investment. Less granular than application-level flags. Harder to target individual users.
Best for: Microservice architectures. Service-level rollouts. A/B testing at the infrastructure layer.
Feature Flag Lifecycle
Every feature flag has a lifecycle. Flags that are not actively managed become technical debt. Follow this lifecycle rigorously.
The Stages
1. CREATE → Define the flag, document its purpose and owner
2. DEPLOY OFF → Code ships to production with the flag disabled
3. BUILD → Incrementally add functionality behind the flag
4. DARK LAUNCH → Enable for internal users or a small test group
5. ROLLOUT → Gradually increase the percentage of users
6. REMOVE → Delete the flag and the old code path
Stage 1: Create
Before writing any code, define the flag:
- Name: Use a consistent naming convention (e.g., `enable-new-checkout`, `feature.discount-engine`)
- Owner: Who is responsible for this flag through its lifecycle?
- Purpose: One sentence describing what the flag controls
- Planned removal date: Set this at creation time. Flags without removal dates become permanent.
Stage 2: Deploy OFF
The first deployment includes the flag check but the flag is disabled. This verifies that:
- The flag infrastructure works
- The default (off) path is unaffected
- The flag check does not introduce performance issues
Stage 3: Build Incrementally
Continue building the feature behind the flag over multiple deploys. Each deploy adds more functionality, but the flag remains off for users. Test both paths in your automated suite.
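A sketch of what "test both paths" can look like; the pricing function and flag name are hypothetical. Passing the flag value in explicitly makes both branches trivial to exercise regardless of the flag's state in production:

```python
# Hypothetical flagged function with the flag value passed in explicitly.
def price_with_discounts(cart_total, discount_engine_enabled):
    if discount_engine_enabled:
        return round(cart_total * 0.9, 2)  # new path behind the flag
    return cart_total                      # old path, still the default

# Run both on every pipeline execution.
def test_flag_off_preserves_existing_behavior():
    assert price_with_discounts(100.0, discount_engine_enabled=False) == 100.0

def test_flag_on_applies_discount():
    assert price_with_discounts(100.0, discount_engine_enabled=True) == 90.0
```

When the flag is eventually removed, the "off" test is deleted along with the old path.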
Stage 4: Dark Launch
Enable the flag for internal users or a specific test group. This is your first validation with real production data and real traffic patterns. Monitor:
- Error rates for the flagged group vs. control
- Performance metrics (latency, throughput)
- Business metrics (conversion, engagement)
Stage 5: Gradual Rollout
Increase exposure systematically:
| Step | Audience | Duration | What to Watch |
|---|---|---|---|
| 1 | 1% of users | 1-2 hours | Error rates, latency |
| 2 | 5% of users | 4-8 hours | Performance at slightly higher load |
| 3 | 25% of users | 1 day | Business metrics begin to be meaningful |
| 4 | 50% of users | 1-2 days | Statistically significant business impact |
| 5 | 100% of users | - | Full rollout |
At any step, if metrics degrade, roll back by disabling the flag. No redeployment needed.
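Percentage rollout needs deterministic bucketing so a given user's experience stays stable as the percentage grows. One common sketch hashes the user id together with the flag name; the function below is illustrative, not a real flag-service API:

```python
import hashlib

def in_rollout(user_id, flag_name, percentage):
    """Deterministically bucket users into the first `percentage` percent.

    Hashing the user id together with the flag name keeps a user's bucket
    stable for a given flag (raising 5% to 25% only adds users, never
    removes them), while different flags reach different slices of users.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable value in 0..99
    return bucket < percentage
```

Rolling back at any step is then just lowering the percentage to 0; no redeployment needed.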
Stage 6: Remove
This is the most commonly skipped step, and skipping it creates significant technical debt.
Once the feature has been stable at 100% for an agreed period (e.g., 2 weeks):
- Remove the flag check from code
- Remove the old code path
- Remove the flag definition from the flag service
- Deploy the simplified code
Set a maximum flag lifetime. A common practice is 90 days. Any flag older than 90 days triggers an automatic review. Stale flags are a maintenance burden and a source of confusion.
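The 90-day audit is easy to automate. The registry shape below (flag name mapped to creation date) is an assumption, not a real flag-service API:

```python
from datetime import date

def stale_flags(flag_created_dates, today, max_age_days=90):
    """Return names of flags older than max_age_days, for the periodic audit.

    `flag_created_dates` maps flag name -> creation date; in practice this
    would come from your flag service's metadata.
    """
    return sorted(
        name
        for name, created in flag_created_dates.items()
        if (today - created).days > max_age_days
    )
```

Run this in a scheduled job and open a review ticket for every name it returns.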
Lifecycle Timeline Example
| Day | Action | Flag State |
|---|---|---|
| 1 | Deploy flag infrastructure and create removal ticket | OFF |
| 2-5 | Build feature behind flag, integrate daily | OFF |
| 6 | Enable for internal users (dark launch) | ON for 0.1% |
| 7 | Enable for 1% of users | ON for 1% |
| 8 | Enable for 5% of users | ON for 5% |
| 9 | Enable for 25% of users | ON for 25% |
| 10 | Enable for 50% of users | ON for 50% |
| 11 | Enable for 100% of users | ON for 100% |
| 12-18 | Stability period (monitor) | ON for 100% |
| 19-21 | Remove flag from code | DELETED |
Total lifecycle: approximately 3 weeks from creation to removal.
Long-Lived Feature Flags
Not all flags are temporary. Some flags are intentionally permanent and should be managed differently from release flags.
Operational Flags (Kill Switches)
Purpose: Disable expensive or non-critical features under load during incidents.
Lifecycle: Permanent.
Management: Treat as system configuration, not as a release mechanism.
Customer-Specific Toggles
Purpose: Different customers receive different features based on their subscription or contract.
Lifecycle: Permanent, tied to customer configuration.
Management: Part of the customer entitlement system, not the feature flag system.
Experimentation Flags
Purpose: A/B testing and experimentation.
Lifecycle: The flag infrastructure is permanent, but individual experiments expire.
Management: Each experiment has its own expiration date and success criteria. The experimentation platform itself persists.
Managing Long-Lived Flags
Long-lived flags need different discipline than temporary ones:
- Use a separate naming convention (e.g., `KILL_SWITCH_*`, `ENTITLEMENT_*`) to distinguish them from temporary release flags
- Document why each flag is permanent so future team members understand the intent
- Store them separately from temporary flags in your management system
- Review regularly to confirm they are still needed
Key Pitfalls
1. “We have 200 feature flags and nobody knows what they all do”
This is flag debt, and it is as damaging as any other technical debt. Prevent it by enforcing the lifecycle: every flag has an owner, a purpose, and a removal date. Run a monthly flag audit.
2. “We use flags for everything, including configuration”
Feature flags and configuration are different concerns. Flags are temporary (they control unreleased features). Configuration is permanent (it controls operational behavior like timeouts, connection pools, log levels). Mixing them leads to confusion about what can be safely removed.
3. “Testing both paths doubles our test burden”
It does increase test effort, but this is a temporary cost. When the flag is removed, the extra tests go away too. The alternative - deploying untested code paths - is far more expensive.
4. “Nested flags create combinatorial complexity”
Avoid nesting flags whenever possible. If feature B depends on feature A, do not create a separate flag for B. Instead, extend the behavior behind feature A’s flag. If you must nest, document the dependency and test the specific combinations that matter.
Flag Removal Anti-Patterns
These specific patterns are the most common ways teams fail at flag cleanup.
Don’t skip the removal ticket:
- WRONG: “We’ll remove it later when we have time”
- RIGHT: Create a removal ticket at the same time you create the flag
Don’t leave flags after full rollout:
- WRONG: Flag still in code 6 months after 100% rollout
- RIGHT: Remove within 2-4 weeks of full rollout
Don’t forget to remove the old code path:
- WRONG: Flag removed but old implementation still in the codebase
- RIGHT: Remove the flag check AND the old implementation together
Don’t keep flags “just in case”:
- WRONG: “Let’s keep it in case we need to roll back in the future”
- RIGHT: After the stability period, rollback is handled by deployment, not by re-enabling a flag
Measuring Success
| Metric | Target | Why It Matters |
|---|---|---|
| Active flag count | Stable or decreasing | Confirms flags are being removed, not accumulating |
| Average flag age | < 90 days | Catches stale flags before they become permanent |
| Flag-related incidents | Near zero | Confirms flag management is not causing problems |
| Time from deploy to release | Hours to days (not weeks) | Confirms flags enable fast, controlled releases |
Next Step
Small batches and feature flags let you deploy more frequently, but deploying more means more work in progress. Limiting WIP ensures that increased deploy frequency does not create chaos.
3 - Limiting Work in Progress
Focus on finishing work over starting new work to improve flow and reduce cycle time.
Work in progress (WIP) is inventory. Like physical inventory, it loses value the longer it sits unfinished. Limiting WIP is the most counterintuitive and most impactful practice in this entire migration: doing less work at once makes you deliver more.
Why Limiting WIP Matters
Every item of work in progress has a cost:
- Context switching: Moving between tasks destroys focus. Research consistently shows that switching between two tasks reduces productive time by 20-40%.
- Delayed feedback: Work that is started but not finished cannot be validated by users. The longer it sits, the more assumptions go untested.
- Hidden dependencies: The more items in progress simultaneously, the more likely they are to conflict, block each other, or require coordination.
- Longer cycle time: Little’s Law states that cycle time = WIP / throughput. If throughput is constant, the only way to reduce cycle time is to reduce WIP.
“Stop starting, start finishing.”
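Little's Law, cited above, is easy to sanity-check with arithmetic:

```python
def cycle_time(wip, throughput_per_week):
    """Little's Law: average cycle time = WIP / throughput."""
    return wip / throughput_per_week

# A team finishing 5 items per week with 15 items in progress averages a
# 3-week cycle time; cutting WIP to 5 brings it down to 1 week.
assert cycle_time(15, 5) == 3.0
assert cycle_time(5, 5) == 1.0
```

Since throughput is hard to change quickly, reducing WIP is the lever the team actually controls.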
How to Set Your WIP Limit
The N+2 Starting Point
A practical starting WIP limit for a team is N+2, where N is the number of team members actively working on delivery.
| Team Size | Starting WIP Limit | Rationale |
|---|---|---|
| 3 developers | 5 items | Allows one item per person plus a small buffer |
| 5 developers | 7 items | Same principle at larger scale |
| 8 developers | 10 items | Buffer becomes proportionally smaller |
Why N+2 and not N? Because some items will be blocked waiting for review, testing, or external dependencies. A small buffer prevents team members from being idle when their primary task is blocked. But the buffer should be small - two items, not ten.
Continuously Lower the Limit
The N+2 formula is a starting point, not a destination. Once the team is comfortable with the initial limit, reduce it:
- Start at N+2. Run for 2-4 weeks. Observe where work gets stuck.
- Reduce to N+1. Tighten the limit. Some team members will occasionally be “idle” - this is a feature, not a bug. They should swarm on blocked items.
- Reduce to N. At this point, every team member is working on exactly one thing. Blocked work gets immediate attention because someone is always available to help.
- Consider going below N. Some teams find that pairing (two people, one item) further reduces cycle time. A team of 6 with a WIP limit of 3 means everyone is pairing.
Each reduction will feel uncomfortable. That discomfort is the point - it exposes problems in your workflow that were previously hidden by excess WIP.
What Happens When You Hit the Limit
When the team reaches its WIP limit and someone finishes a task, they have two options:
- Pull the next highest-priority item (if the WIP limit allows it).
- Swarm on an existing item that is blocked, stuck, or nearing its cycle time target.
When the WIP limit is reached and no items are complete:
- Do not start new work. This is the hardest part and the most important.
- Help unblock existing work. Pair with someone. Review a pull request. Write a missing test. Talk to the person who has the answer to the blocking question.
- Improve the process. If nothing is blocked but everything is slow, this is the time to work on automation, tooling, or documentation.
Swarming
Swarming is the practice of multiple team members working together on a single item to get it finished faster. It is the natural complement to WIP limits.
When to Swarm
- An item has been in progress for longer than the team’s cycle time target (e.g., more than 2 days)
- An item is blocked and the blocker can be resolved by another team member
- The WIP limit is reached and someone needs work to do
- A critical defect needs to be fixed immediately
How to Swarm Effectively
| Approach | How It Works | Best For |
|---|---|---|
| Pair programming | Two developers work on the same item at the same machine | Complex logic, knowledge transfer, code that needs review |
| Mob programming | The whole team works on one item together | Critical path items, complex architectural decisions |
| Divide and conquer | Break the item into sub-tasks and assign them | Items that can be parallelized (e.g., frontend + backend + tests) |
| Unblock and return | One person resolves the blocker, then hands back | External dependencies, environment issues, access requests |
Why Teams Resist Swarming
The most common objection: “It’s inefficient to have two people on one task.” This is only true if you measure efficiency as “percentage of time each person is writing new code.” If you measure efficiency as “how quickly value reaches production,” swarming is almost always faster because it reduces handoffs, wait time, and rework.
How Limiting WIP Exposes Workflow Issues
One of the most valuable effects of WIP limits is that they make hidden problems visible. When you cannot start new work, you are forced to confront the problems that slow existing work down.
| Symptom When WIP Is Limited | Root Cause Exposed |
|---|---|
| “I’m idle because my PR is waiting for review” | Code review process is too slow |
| “I’m idle because I’m waiting for the test environment” | Not enough environments, or environments are not self-service |
| “I’m idle because I’m waiting for the product owner to clarify requirements” | Stories are not refined before being pulled into the sprint |
| “I’m idle because my build is broken and I can’t figure out why” | Build is not deterministic, or test suite is flaky |
| “I’m idle because another team hasn’t finished the API I depend on” | Architecture is too tightly coupled (see Architecture Decoupling) |
Each of these is a bottleneck that was previously invisible because the team could always start something else. With WIP limits, these bottlenecks become obvious and demand attention.
Implementing WIP Limits
Step 1: Make WIP Visible
Before setting limits, make current WIP visible:
- Count the number of items currently “in progress” for the team
- Write this number on the board (physical or digital) every day
- Most teams are shocked by how high it is. A team of 5 often has 15-20 items in progress.
Step 2: Set the Initial Limit
- Calculate N+2 for your team
- Add the limit to your board (e.g., a column header that says “In Progress (limit: 7)”)
- Agree as a team that when the limit is reached, no new work starts
Step 3: Enforce the Limit
- When someone tries to pull new work and the limit is reached, the team helps them find an existing item to work on
- Track violations: how often does the team exceed the limit? What causes it?
- Discuss in retrospectives: Is the limit too high? Too low? What bottlenecks are exposed?
Step 4: Reduce the Limit (Monthly)
- Every month, consider reducing the limit by 1
- Each reduction will expose new bottlenecks - this is the intended effect
- Stop reducing when the team reaches a sustainable flow where items move from start to done predictably
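The four steps above amount to a simple daily check. The sketch below is a hypothetical illustration, assuming a board export you can count and the N+2 starting limit (with N as team size, matching the limit of 7 for the team of 5 mentioned earlier); the function and field names are made up for this example.

```python
# Hypothetical sketch of the daily WIP check in Steps 1-3.
# Assumes N = team size and the N+2 starting limit described above.
def wip_status(in_progress_items, team_size):
    """Compare the current in-progress count to the team's agreed limit."""
    limit = team_size + 2                  # Step 2: the N+2 starting point
    count = len(in_progress_items)
    return {
        "count": count,
        "limit": limit,
        "over_limit": count > limit,       # Step 3: stop and address it
        "can_pull_new_work": count < limit,
    }
```

For the team of 5 above with 9 items on the board, this reports a limit of 7, flags the team as over the limit, and signals that no new work should be pulled.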
Key Pitfalls
1. “We set a WIP limit but nobody enforces it”
A WIP limit that is not enforced is not a WIP limit. Enforcement requires a team agreement and a visible mechanism. If the board shows 10 items in progress and the limit is 7, the team should stop and address it immediately. This is a working agreement, not a suggestion.
2. “Developers are idle and management is uncomfortable”
This is the most common failure mode. Management sees “idle” developers and concludes WIP limits are wasteful. In reality, those “idle” developers are either swarming on existing work (which is productive) or the team has hit a genuine bottleneck that needs to be addressed. The discomfort is a signal that the system needs improvement.
3. “We have WIP limits but we also have expedite lanes for everything”
If every urgent request bypasses the WIP limit, you do not have a WIP limit. Expedite lanes should be rare - one per week at most. If everything is urgent, nothing is.
4. “We limit WIP per person but not per team”
Per-person WIP limits miss the point. The goal is to limit team WIP so that team members are incentivized to help each other. A per-person limit of 1 with no team limit still allows the team to have 8 items in progress simultaneously with no swarming.
Measuring Success
| Metric | Target | Why It Matters |
|---|---|---|
| Work in progress | At or below team limit | Confirms the limit is being respected |
| Development cycle time | Decreasing | Confirms that less WIP leads to faster delivery |
| Items completed per week | Stable or increasing | Confirms that "finish more, start less" is working |
| Time items spend blocked | Decreasing | Confirms bottlenecks are being addressed |
Next Step
WIP limits expose problems. Metrics-Driven Improvement provides the framework for systematically addressing them.
Content contributed by Dojo Consortium, licensed under CC BY 4.0.
Related Content
4 - Metrics-Driven Improvement
Use DORA metrics and improvement kata to drive systematic delivery improvement.
Phase 3 - Optimize | Original content combining DORA recommendations and improvement kata
Improvement without measurement is guesswork. This page combines the DORA four key metrics with the improvement kata pattern to create a systematic, repeatable approach to getting better at delivery.
The Problem with Ad Hoc Improvement
Most teams improve accidentally. Someone reads a blog post, suggests a change at standup, and the team tries it for a week before forgetting about it. This produces sporadic, unmeasurable progress that is impossible to sustain.
Metrics-driven improvement replaces this with a disciplined cycle: measure where you are, define where you want to be, run a small experiment, measure the result, and repeat. The improvement kata provides the structure. DORA metrics provide the measures.
The Four DORA Metrics
The DORA research program (now part of Google Cloud) has identified four key metrics that predict software delivery performance. These are the metrics you should track throughout your CD migration.
1. Deployment Frequency
How often your team deploys to production.
| Performance Level | Deployment Frequency |
|---|---|
| Elite | On-demand (multiple deploys per day) |
| High | Between once per day and once per week |
| Medium | Between once per week and once per month |
| Low | Between once per month and once every six months |
What it tells you: How comfortable your team and pipeline are with deploying. Low frequency usually indicates manual gates, fear of deployment, or large batch sizes.
How to measure: Count the number of successful deployments to production per unit of time. Automated deploys count. Hotfixes count. Rollbacks do not.
2. Lead Time for Changes
The time from a commit being pushed to trunk to that commit running in production.
| Performance Level | Lead Time |
|---|---|
| Elite | Less than one hour |
| High | Between one day and one week |
| Medium | Between one week and one month |
| Low | Between one month and six months |
What it tells you: How efficient your pipeline is. Long lead times indicate slow builds, manual approval steps, or infrequent deployment windows.
How to measure: Record the timestamp when a commit merges to trunk and the timestamp when that commit is running in production. The difference is lead time. Track the median, not the mean (outliers distort the mean).
3. Change Failure Rate
The percentage of deployments that cause a failure in production requiring remediation (rollback, hotfix, or patch).
| Performance Level | Change Failure Rate |
|---|---|
| Elite | 0-15% |
| High | 16-30% |
| Medium | 16-30% |
| Low | 46-60% |
What it tells you: How effective your testing and validation pipeline is. High failure rates indicate gaps in test coverage, insufficient pre-production validation, or overly large changes.
How to measure: Track deployments that result in a degraded service, require rollback, or need a hotfix. Divide by total deployments. A “failure” is defined by the team - typically any incident that requires immediate human intervention.
4. Mean Time to Restore (MTTR)
How long it takes to recover from a failure in production.
| Performance Level | Time to Restore |
|---|---|
| Elite | Less than one hour |
| High | Less than one day |
| Medium | Less than one day |
| Low | Between one week and one month |
What it tells you: How resilient your system and team are. Long recovery times indicate manual rollback processes, poor observability, or insufficient incident response practices.
How to measure: Record the timestamp when a production failure is detected and the timestamp when service is fully restored. Track the median.
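The four measurement rules above can be sketched as plain functions over timestamp pairs. This assumes you can export simple event records (merge time, deploy time, incident detected/restored) from your CI/CD and incident tooling; the function names are illustrative, not a standard API.

```python
from datetime import datetime, timedelta
from statistics import median

def deployment_frequency(deploys, days):
    """Successful production deploys per day (rollbacks excluded upstream)."""
    return len(deploys) / days

def median_lead_time(pairs):
    """Median trunk-merge-to-production time. Median, not mean, so
    outliers do not distort the number."""
    return median(deployed - merged for merged, deployed in pairs)

def change_failure_rate(total_deploys, failed_deploys):
    """Share of deploys that needed rollback, hotfix, or patch."""
    return failed_deploys / total_deploys

def median_time_to_restore(incidents):
    """Median detected-to-restored time for production failures."""
    return median(restored - detected for detected, restored in incidents)

# Example: a single commit merged at 09:00 and live at 13:00 has a
# four-hour lead time.
merged = datetime(2024, 5, 1, 9, 0)
lead = median_lead_time([(merged, merged + timedelta(hours=4))])
```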
CI Health Metrics
DORA metrics are outcome metrics - they tell you how delivery is performing overall. CI health metrics are leading indicators that give you earlier feedback on the health of your integration practices. Problems in these metrics show up days or weeks before they surface in DORA numbers.
Track these alongside DORA metrics to catch issues before they compound.
Commits Per Day Per Developer
| Aspect | Detail |
|---|---|
| What it measures | The average number of commits integrated to trunk per developer per day |
| How to measure | Count trunk commits (or merged pull requests) over a period and divide by the number of active developers and working days |
| Good target | 2 or more per developer per day |
| Why it matters | Low commit frequency indicates large batch sizes, long-lived branches, or developers waiting to integrate. All of these increase merge risk and slow feedback. |
If the number is low: Developers may be working on branches for too long, bundling unrelated changes into single commits, or facing barriers to integration (slow builds, complex merge processes). Investigate branch lifetimes and work decomposition.
If the number is unusually high: Verify that commits represent meaningful work rather than trivial fixes to pass a metric. Commit frequency is a means to smaller batches, not a goal in itself.
Build Success Rate
| Aspect | Detail |
|---|---|
| What it measures | The percentage of CI builds that pass on the first attempt |
| How to measure | Divide the number of green builds by total builds over a period |
| Good target | 90% or higher |
| Why it matters | A frequently broken build disrupts the entire team. Developers cannot integrate confidently when the build is unreliable, leading to longer feedback cycles and batching of changes. |
If the number is low: Common causes include flaky tests, insufficient local validation before committing, or environmental inconsistencies between developer machines and CI. Start by identifying and quarantining flaky tests, then ensure developers can run a representative build locally before pushing.
If the number is high but DORA metrics are still lagging: The build may pass but take too long, or the build may not cover enough to catch real problems. Check build duration and test coverage.
Time to Fix a Broken Build
| Aspect | Detail |
|---|---|
| What it measures | The elapsed time from a build breaking to the next green build on trunk |
| How to measure | Record the timestamp of the first red build and the timestamp of the next green build. Track the median. |
| Good target | Less than 10 minutes |
| Why it matters | A broken build blocks everyone. The longer it stays broken, the more developers stack changes on top of a broken baseline, compounding the problem. Fast fix times are a sign of strong CI discipline. |
If the number is high: The team may not be treating broken builds as a stop-the-line event. Establish a team agreement: when the build breaks, fixing it takes priority over all other work. If builds break frequently and take long to fix, reduce change size so failures are easier to diagnose.
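The three CI health measurements above reduce to the same kind of arithmetic. A minimal sketch, assuming the raw counts and durations are exported from your CI server (all names here are illustrative):

```python
from statistics import median

def commits_per_dev_per_day(trunk_commits, developers, working_days):
    """Target: 2 or more per developer per day."""
    return trunk_commits / (developers * working_days)

def build_success_rate(green_builds, total_builds):
    """First-attempt green builds over all builds. Target: 90% or higher."""
    return green_builds / total_builds

def median_time_to_fix(red_to_green_minutes):
    """Median minutes from first red build to the next green build
    on trunk. Target: under 10 minutes."""
    return median(red_to_green_minutes)
```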
The DORA Recommended Practices
Behind these four metrics are 24 practices that the DORA research has shown to drive performance. They organize into five categories. Use this as a diagnostic tool: when a metric is lagging, look at the related practices to identify what to improve.
Continuous Delivery Practices
These directly affect your pipeline and deployment practices:
- Version control for all production artifacts
- Automated deployment processes
- Continuous integration
- Trunk-based development
- Test automation
- Test data management
- Shift-left security
- Continuous delivery (the ability to deploy at any time)
Architecture Practices
These affect how easily your system can be changed and deployed:
- Loosely coupled architecture
- Empowered teams that can choose their own tools
- Teams that can test, deploy, and release independently
Product and Process Practices
These affect how work flows through the team:
- Customer feedback loops
- Value stream visibility
- Working in small batches
- Team experimentation
Lean Management Practices
These affect how the organization supports delivery:
- Lightweight change approval processes
- Monitoring and observability
- Proactive notification
- WIP limits
- Visual management of workflow
Cultural Practices
These affect the environment in which teams operate:
- Generative organizational culture (Westrum model)
- Encouraging and supporting learning
- Collaboration within and between teams
- Job satisfaction
- Transformational leadership
For a detailed breakdown, see the DORA Recommended Practices reference.
The Improvement Kata
The improvement kata is a four-step pattern from lean manufacturing adapted for software delivery. It provides the structure for turning DORA measurements into concrete improvements.
Step 1: Understand the Direction
Where does your CD migration need to go?
This is already defined by the phases of this migration guide. In Phase 3, your direction is: smaller batches, faster flow, and higher confidence in every deployment.
Step 2: Grasp the Current Condition
Measure your current DORA metrics. Be honest - the point is to understand reality, not to look good.
Practical approach:
- Collect two weeks of data for all four DORA metrics
- Plot the data - do not just calculate averages. Look at the distribution.
- Identify which metric is furthest from your target
- Investigate the related practices to understand why
Example current condition:
| Metric | Current | Target | Gap |
|---|---|---|---|
| Deployment frequency | Weekly | Daily | 5x improvement needed |
| Lead time | 3 days | < 1 day | Pipeline is slow or has manual gates |
| Change failure rate | 25% | < 15% | Test coverage or change size issue |
| MTTR | 4 hours | < 1 hour | Rollback is manual |
Step 3: Establish the Next Target Condition
Do not try to fix everything at once. Pick one metric and define a specific, measurable, time-bound target.
Good target: “Reduce lead time from 3 days to 1 day within the next 4 weeks.”
Bad target: “Improve our deployment pipeline.” (Too vague, no measure, no deadline.)
Step 4: Experiment Toward the Target
Design a small experiment that you believe will move the metric toward the target. Run it. Measure the result. Adjust.
The experiment format:
| Element | Description |
|---|---|
| Hypothesis | “If we [action], then [metric] will [improve/decrease] because [reason].” |
| Action | What specifically will you change? |
| Duration | How long will you run the experiment? (Typically 1-2 weeks) |
| Measure | How will you know if it worked? |
| Decision criteria | What result would cause you to keep, modify, or abandon the change? |
Example experiment:
Hypothesis: If we parallelize our integration test suite, lead time will drop from 3 days to under 2 days because 60% of lead time is spent waiting for tests to complete.
Action: Split the integration test suite into 4 parallel runners.
Duration: 2 weeks.
Measure: Median lead time for commits merged during the experiment period.
Decision criteria: Keep if lead time drops below 2 days. Modify if it drops but not enough. Abandon if it has no effect or introduces flakiness.
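The decision criteria in the example are mechanical enough to write down, which removes the temptation to re-litigate them after the experiment. A sketch using the example's 2-day target and 3-day baseline (both taken from the experiment above; substitute your own numbers):

```python
# Decision criteria from the example experiment above:
# keep if under target, modify if improved but short of target,
# abandon if there was no measurable effect.
def lead_time_decision(median_days, target_days=2.0, baseline_days=3.0):
    if median_days < target_days:
        return "keep"      # target met
    if median_days < baseline_days:
        return "modify"    # improved, but not enough
    return "abandon"       # no measurable effect
```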
The Cycle Repeats
After each experiment:
- Measure the result
- Update your understanding of the current condition
- If the target is met, pick the next metric to improve
- If the target is not met, design another experiment
This creates a continuous improvement loop. Each cycle takes 1-2 weeks. Over months, the cumulative effect is dramatic.
Connecting Metrics to Action
When a metric is lagging, use this guide to identify where to focus.
Low Deployment Frequency
| Possible Cause | Investigation | Action |
|---|---|---|
| Manual approval gates | Map the approval chain | Automate or eliminate non-value-adding approvals |
| Fear of deployment | Ask the team what they fear | Address the specific fear (usually testing gaps) |
| Large batch size | Measure changes per deploy | Implement small batches practices |
| Deploy process is manual | Time the deploy process | Automate the deployment pipeline |
Long Lead Time
| Possible Cause | Investigation | Action |
|---|---|---|
| Slow builds | Time each pipeline stage | Optimize the slowest stage (often tests) |
| Waiting for environments | Track environment wait time | Implement self-service environments |
| Waiting for approval | Track approval wait time | Reduce approval scope or automate |
| Large changes | Measure commit size | Reduce batch size |
High Change Failure Rate
| Possible Cause | Investigation | Action |
|---|---|---|
| Insufficient test coverage | Measure coverage by area | Add tests for the areas that fail most |
| Tests pass but production differs | Compare test and prod environments | Make environments more production-like |
| Large, risky changes | Measure change size | Reduce batch size, use feature flags |
| Configuration drift | Audit configuration differences | Externalize and version configuration |
Long MTTR
| Possible Cause | Investigation | Action |
|---|---|---|
| Rollback is manual | Time the rollback process | Automate rollback |
| Hard to identify root cause | Review recent incidents | Improve observability and alerting |
| Hard to deploy fixes quickly | Measure fix lead time | Ensure pipeline supports rapid hotfix deployment |
| Dependencies fail in cascade | Map failure domains | Improve architecture decoupling |
Pipeline Visibility
Metrics only drive improvement when people see them. Pipeline visibility means making the current state of your build and deployment pipeline impossible to ignore. When the build is red, everyone should know immediately - not when someone checks a dashboard twenty minutes later.
Making Build Status Visible
The most effective teams use ambient visibility - information that is passively available without anyone needing to seek it out.
Build radiators: A large monitor in the team area showing the current pipeline status. Green means the build is passing. Red means it is broken. The radiator should be visible from every desk in the team space. For remote teams, a persistent widget in the team chat channel serves the same purpose.
Browser extensions and desktop notifications: Tools like CCTray, BuildNotify, or CI server plugins can display build status in the system tray or browser toolbar. These provide individual-level ambient awareness without requiring a shared physical space.
Chat integrations: Post build results to the team channel automatically. Keep these concise - a green checkmark or red alert with a link to the build is enough. Verbose build logs in chat become noise.
Notification Best Practices
Notifications are powerful when used well and destructive when overused. The goal is to notify the right people at the right time with the right level of urgency.
When to notify:
- Build breaks on trunk - notify the whole team immediately
- Build is fixed - notify the whole team (this is a positive signal worth reinforcing)
- Deployment succeeds - notify the team channel (low urgency)
- Deployment fails - notify the on-call and the person who triggered it
When not to notify:
- Every commit or pull request update (too noisy)
- Successful builds on feature branches (nobody else needs to know)
- Metrics that have not changed (no signal in “things are the same”)
Avoiding notification fatigue: If your team ignores notifications, you have too many of them. Audit your notification channels quarterly. Remove any notification that the team consistently ignores. A notification that nobody reads is worse than no notification at all - it trains people to tune out the channel entirely.
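The notify/do-not-notify rules above are easy to encode as an explicit table, which also makes the quarterly audit concrete: any rule nobody acts on gets deleted. The event names below are illustrative, not tied to any particular CI server.

```python
# Sketch of the notification rules above. Events absent from the table
# (feature-branch builds, per-commit updates, unchanged metrics) are
# deliberately silent.
NOTIFY_RULES = {
    "trunk_build_broken": ("whole-team", "immediate"),
    "trunk_build_fixed":  ("whole-team", "immediate"),  # positive signal worth reinforcing
    "deploy_succeeded":   ("team-channel", "low"),
    "deploy_failed":      ("on-call+triggerer", "immediate"),
}

def notification_target(event):
    """Return (audience, urgency), or None when no notification is sent."""
    return NOTIFY_RULES.get(event)
```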
Building a Metrics Dashboard
Make your DORA metrics and CI health metrics visible to the team at all times. A dashboard on a wall monitor or a shared link is ideal.
Organize your dashboard around three categories:
Current status - what is happening right now:
- Pipeline status (green/red) for trunk and any active deployments
- Current values for all four DORA metrics
- Active experiment description and target condition
Trends - where are we heading:
- Trend lines showing direction over the past 4-8 weeks
- CI health metrics (build success rate, time to fix, commit frequency) plotted over time
- Whether the current improvement target is on track
Team health - how is the team doing:
- Current improvement target highlighted
- Days since last production incident
- Number of experiments completed this quarter
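Under those three categories, a minimal layout might look like the following. The indicator names are examples only; the point is that the whole layout stays small enough to parse at a glance.

```python
# Illustrative dashboard layout for the three categories above.
# Indicator names are examples, not a required schema.
DASHBOARD = {
    "current_status": ["pipeline_status", "dora_current", "active_experiment"],
    "trends":         ["dora_trend_8w", "ci_health_trend", "target_on_track"],
    "team_health":    ["days_since_incident", "experiments_this_quarter"],
}

def indicator_count(layout):
    """Total indicators on the dashboard; keep this in the 6-8 range."""
    return sum(len(indicators) for indicators in layout.values())
```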
Dashboard Anti-Patterns
The vanity dashboard: Displays only metrics that look good. If your dashboard never shows anything concerning, it is not useful. Include metrics that challenge the team, not just ones that reassure management.
The everything dashboard: Crams dozens of metrics, charts, and tables onto one screen. Nobody can parse it at a glance, so nobody looks at it. Limit your dashboard to 6-8 key indicators. If you need more detail, put it on a drill-down page.
The stale dashboard: Data is updated manually and falls behind. Automate data collection wherever possible. A dashboard showing last month’s numbers is worse than no dashboard - it creates false confidence.
The blame dashboard: Ties metrics to individual developers rather than teams. This creates fear and gaming rather than improvement. Always present metrics at the team level.
Keep it simple. A spreadsheet updated weekly is better than a sophisticated dashboard that nobody maintains. The goal is visibility, not tooling sophistication.
Key Pitfalls
1. “We measure but don’t act”
Measurement without action is waste. If you collect metrics but never run experiments, you are creating overhead with no benefit. Every measurement should lead to a hypothesis. Every hypothesis should lead to an experiment. See Hypothesis-Driven Development for the full lifecycle.
2. “We use metrics to compare teams”
DORA metrics are for teams to improve themselves, not for management to rank teams. Using metrics for comparison creates incentives to game the numbers. Each team should own its own metrics and its own improvement targets.
3. “We try to improve all four metrics at once”
Focus on one metric at a time. Improving deployment frequency and change failure rate simultaneously often requires conflicting actions. Pick the biggest bottleneck, address it, then move to the next.
4. “We abandon experiments too quickly”
Most experiments need at least two weeks to show results. One bad day is not a reason to abandon an experiment. Set the duration up front and commit to it.
Measuring Success
| Indicator | Target | Why It Matters |
|---|---|---|
| Experiments per month | 2-4 | Confirms the team is actively improving |
| Metrics trending in the right direction | Consistent improvement over 3+ months | Confirms experiments are having effect |
| Team can articulate current condition and target | Everyone on the team knows | Confirms improvement is a shared concern |
| Improvement items in backlog | Always present | Confirms improvement is treated as a deliverable |
Next Step
Metrics tell you what to improve. Retrospectives provide the team forum for deciding how to improve it.
Related Content
5 - Retrospectives
Continuously improve the delivery process through structured reflection.
Phase 3 - Optimize
A retrospective is the team’s primary mechanism for turning observations into improvements. Without effective retrospectives, WIP limits expose problems that nobody addresses, metrics trend in the wrong direction with no response, and the CD migration stalls.
Why Retrospectives Matter for CD Migration
Every practice in this guide - trunk-based development, small batches, WIP limits, metrics-driven improvement - generates signals about what is working and what is not. Retrospectives are where the team processes those signals and decides what to change.
Teams that skip retrospectives or treat them as a checkbox exercise consistently stall at whatever maturity level they first reach. Teams that run effective retrospectives continuously improve, week after week, month after month.
The Five-Part Structure
An effective retrospective follows a structured format that prevents it from devolving into a venting session or a status meeting. This five-part structure ensures the team moves from observation to action.
Part 1: Review the Mission (5 minutes)
Start by reminding the team of the larger goal. In the context of a CD migration, this might be:
- “Our mission this quarter is to deploy to production at least once per day.”
- “We are working toward eliminating manual gates in our pipeline.”
- “Our goal is to reduce lead time from 3 days to under 1 day.”
This grounding prevents the retrospective from focusing on minor irritations and keeps the conversation aligned with what matters.
Part 2: Review the KPIs (10 minutes)
Present the team’s current metrics. For a CD migration, these are typically the DORA metrics plus any team-specific measures from Metrics-Driven Improvement.
| Metric | Last Period | This Period | Trend |
|---|---|---|---|
| Deployment frequency | 3/week | 4/week | Improving |
| Lead time (median) | 2.5 days | 2.1 days | Improving |
| Change failure rate | 22% | 18% | Improving |
| MTTR | 3 hours | 3.5 hours | Worsening |
| WIP (average) | 8 items | 6 items | Improving |
Do not skip this step. Without data, the retrospective becomes a subjective debate where the loudest voice wins. With data, the conversation focuses on what the numbers show and what to do about them.
Part 3: Review Experiments (10 minutes)
Review the outcomes of any experiments the team ran since the last retrospective.
For each experiment:
- What was the hypothesis? Remind the team what you were testing.
- What happened? Present the data.
- What did you learn? Even failed experiments teach you something.
- What is the decision? Keep, modify, or abandon.
Example:
Experiment: Parallelize the integration test suite to reduce lead time.
Hypothesis: Lead time would drop from 2.5 days to under 2 days.
Result: Lead time dropped to 2.1 days. The parallelization worked, but environment setup time is now the bottleneck.
Decision: Keep the parallelization. New experiment: investigate self-service test environments.
Part 4: Check Goals (10 minutes)
Review any improvement goals or action items from the previous retrospective.
- Completed: Acknowledge and celebrate. This is important - it reinforces that improvement work matters.
- In progress: Check for blockers. Does the team need to adjust the approach?
- Not started: Why not? Was it deprioritized, blocked, or forgotten? If improvement work is consistently not started, the team is not treating improvement as a deliverable (see below).
Part 5: Open Conversation (25 minutes)
This is the core of the retrospective. The team discusses:
- What is working well that we should keep doing?
- What is not working that we should change?
- What new problems or opportunities have we noticed?
Facilitation techniques for this section:
| Technique | How It Works | Best For |
|---|---|---|
| Start/Stop/Continue | Each person writes items in three categories | Quick, structured, works with any team |
| 4Ls (Liked, Learned, Lacked, Longed For) | Broader categories that capture emotional responses | Teams that need to process frustration or celebrate wins |
| Timeline | Plot events on a timeline and discuss turning points | After a particularly eventful sprint or incident |
| Dot voting | Everyone gets 3 votes to prioritize discussion topics | When there are many items and limited time |
From Conversation to Commitment
The open conversation must produce concrete action items. Vague commitments like “we should communicate better” are worthless. Good action items are:
- Specific: “Add a Slack notification when the build breaks” (not “improve communication”)
- Owned: “Alex will set this up by Wednesday” (not “someone should do this”)
- Measurable: “We will know this worked if build break response time drops below 10 minutes”
- Time-bound: “We will review the result at the next retrospective”
Limit action items to 1-3 per retrospective. More than three means nothing gets done. One well-executed improvement is worth more than five abandoned ones.
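One way to make the four criteria stick is to encode them as required fields, so a vague action item simply cannot be recorded. A sketch, with illustrative field names and the example values from above:

```python
from dataclasses import dataclass
from datetime import date

# Each field corresponds to one of the four criteria above; a dataclass
# makes all of them mandatory at creation time.
@dataclass
class ActionItem:
    action: str           # specific: what exactly will change
    owner: str            # owned: one named person
    success_measure: str  # measurable: how we know it worked
    review_date: date     # time-bound: reviewed at the next retrospective

item = ActionItem(
    action="Add a Slack notification when the build breaks",
    owner="Alex",
    success_measure="build break response time drops below 10 minutes",
    review_date=date(2024, 6, 12),  # example date for the next retro
)
```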
Psychological Safety Is a Prerequisite
A retrospective only works if team members feel safe to speak honestly about what is not working. Without psychological safety, retrospectives produce sanitized, non-actionable discussion.
Signs of Low Psychological Safety
- Only senior team members speak
- Nobody mentions problems - everything is “fine”
- Issues that everyone knows about are never raised
- Team members vent privately after the retrospective instead of during it
- Action items are always about tools or processes, never about behaviors
Building Psychological Safety
| Practice | Why It Helps |
|---|---|
| Leader speaks last | Prevents the leader’s opinion from anchoring the discussion |
| Anonymous input | Use sticky notes or digital tools where input is anonymous initially |
| Blame-free language | “The deploy failed” not “You broke the deploy” |
| Follow through on raised issues | Nothing destroys safety faster than raising a concern and having it ignored |
| Acknowledge mistakes openly | Leaders who admit their own mistakes make it safe for others to do the same |
| Separate retrospective from performance review | If retro content affects reviews, people will not be honest |
Treat Improvement as a Deliverable
The most common failure mode for retrospectives is producing action items that never get done. This happens when improvement work is treated as something to do “when we have time” - which means never.
Make Improvement Visible
- Add improvement items to the same board as feature work
- Include improvement items in WIP limits
- Track improvement items through the same workflow as any other deliverable
Allocate Capacity
Reserve a percentage of team capacity for improvement work. Common allocations:
| Allocation | Approach |
|---|---|
| 20% continuous | One day per week (or equivalent) dedicated to improvement, tooling, and tech debt |
| Dedicated improvement sprint | Every 4th sprint is entirely improvement-focused |
| Improvement as first pull | When someone finishes work and the WIP limit allows, the first option is an improvement item |
The specific allocation matters less than having one. A team that explicitly budgets 10% for improvement will improve more than a team that aspires to 20% but never protects the time.
Retrospective Cadence
| Cadence | Best For | Caution |
|---|---|---|
| Weekly | Teams in active CD migration, teams working through major changes | Can feel like too many meetings if not well-facilitated |
| Bi-weekly | Teams in steady state with ongoing improvement | Most common cadence |
| After incidents | Any team | Incident retrospectives (postmortems) are separate from regular retrospectives |
| Monthly | Mature teams with well-established improvement habits | Too infrequent for teams early in their migration |
During active phases of a CD migration (Phases 1-3), weekly retrospectives are recommended. Once the team reaches Phase 4, bi-weekly is usually sufficient.
Running Your First CD Migration Retrospective
If your team has not been running effective retrospectives, start here:
Before the Retrospective
- Collect your DORA metrics for the past two weeks
- Review any action items from the previous retrospective (if applicable)
- Prepare a shared document or board with the five-part structure
During the Retrospective (60 minutes)
- Review mission (5 min): State your CD migration goal for this phase
- Review KPIs (10 min): Present the DORA metrics. Ask: “What do you notice?”
- Review experiments (10 min): Discuss any experiments that were run
- Check goals (10 min): Review action items from last time
- Open conversation (25 min): Use Start/Stop/Continue for the first time - it is the simplest format
After the Retrospective
- Publish the action items where the team will see them daily
- Assign owners and due dates
- Add improvement items to the team board
- Schedule the next retrospective
Key Pitfalls
1. “Our retrospectives always produce the same complaints”
If the same issues surface repeatedly, the team is not executing on its action items. Check whether improvement work is being prioritized alongside feature work. If it is not, no amount of retrospective technique will help.
2. “People don’t want to attend because nothing changes”
This is a symptom of the same problem - action items are not executed. The fix is to start small: commit to one action item per retrospective, execute it completely, and demonstrate the result at the next retrospective. Success builds momentum.
3. “The retrospective turns into a blame session”
The facilitator must enforce blame-free language. Redirect “You did X wrong” to “When X happened, the impact was Y. How can we prevent Y?” If blame is persistent, the team has a psychological safety problem that needs to be addressed separately.
4. “We don’t have time for retrospectives”
A team that does not have time to improve will never improve. A 60-minute retrospective that produces one executed improvement is the highest-leverage hour of the entire sprint.
Measuring Success
| Indicator | Target | Why It Matters |
|---|---|---|
| Retrospective attendance | 100% of team | Confirms the team values the practice |
| Action items completed | > 80% completion rate | Confirms improvement is treated as a deliverable |
| DORA metrics trend | Improving quarter over quarter | Confirms retrospectives lead to real improvement |
| Team engagement | Voluntary contributions increasing | Confirms psychological safety is present |
Next Step
With metrics-driven improvement and effective retrospectives, you have the engine for continuous improvement. The final optimization step is Architecture Decoupling - ensuring your system’s architecture does not prevent you from deploying independently.
Content contributed by Dojo Consortium, licensed under CC BY 4.0.
Related Content
6 - Architecture Decoupling
Enable independent deployment of components by decoupling architecture boundaries.
Phase 3 - Optimize | Original content based on Dojo Consortium delivery journey patterns
You cannot deploy independently if your architecture requires coordinated releases. This page describes the three architecture states teams encounter on the journey to continuous deployment and provides practical strategies for moving from entangled to loosely coupled.
Why Architecture Matters for CD
Every practice in this guide - small batches, feature flags, WIP limits - assumes that your team can deploy its changes independently. But if your application is a monolith where changing one module requires retesting everything, or a set of microservices with tightly coupled APIs, independent deployment is impossible regardless of how good your practices are.
Architecture is either an enabler or a blocker for continuous deployment. There is no neutral.
Three Architecture States
The Delivery System Improvement Journey describes three states that teams move through. Most teams start entangled. The goal is to reach loosely coupled.
State 1: Entangled
In an entangled architecture, everything is connected to everything. Changes in one area routinely break other areas. Teams cannot deploy independently.
Characteristics:
- Shared database schemas with no ownership boundaries
- Circular dependencies between modules or services
- Deploying one service requires deploying three others at the same time
- Integration testing requires the entire system to be running
- A single team’s change can block every other team’s release
- “Big bang” releases on a fixed schedule
Impact on delivery:
| Metric | Typical State |
|---|---|
| Deployment frequency | Monthly or quarterly (because coordinating releases is hard) |
| Lead time | Weeks to months (because changes wait for the next release train) |
| Change failure rate | High (because big releases mean big risk) |
| MTTR | Long (because failures cascade across boundaries) |
How you got here: Entanglement is the natural result of building quickly without deliberate architectural boundaries. It is not a failure - it is a stage that almost every system passes through.
State 2: Tightly Coupled
In a tightly coupled architecture, there are identifiable boundaries between components, but those boundaries are leaky. Teams have some independence, but coordination is still required for many changes.
Characteristics:
- Services exist but share a database or use synchronous point-to-point calls
- API contracts exist but are not versioned - breaking changes require simultaneous updates
- Teams can deploy some changes independently, but cross-cutting changes require coordination
- Integration testing requires multiple services but not the entire system
- Release trains still exist but are smaller and more frequent
Impact on delivery:
| Metric | Typical State |
|---|---|
| Deployment frequency | Weekly to bi-weekly |
| Lead time | Days to a week |
| Change failure rate | Moderate (improving but still affected by coupling) |
| MTTR | Hours (failures are more isolated but still cascade sometimes) |
State 3: Loosely Coupled
In a loosely coupled architecture, components communicate through well-defined interfaces, own their own data, and can be deployed independently without coordinating with other teams.
Characteristics:
- Each service owns its own data store - no shared databases
- APIs are versioned; consumers and producers can be updated independently
- Asynchronous communication (events, queues) is used where possible
- Each team can deploy without coordinating with any other team
- Services are designed to degrade gracefully if a dependency is unavailable
- No release trains - each team deploys when ready
Impact on delivery:
| Metric | Typical State |
|---|---|
| Deployment frequency | On-demand (multiple times per day) |
| Lead time | Hours |
| Change failure rate | Low (small, isolated changes) |
| MTTR | Minutes (failures are contained within service boundaries) |
Moving from Entangled to Tightly Coupled
This is the first and most difficult transition. It requires establishing boundaries where none existed before.
Strategy 1: Identify Natural Seams
Look for places where the system already has natural boundaries, even if they are not enforced:
- Different business domains: Orders, payments, inventory, and user accounts are different domains even if they live in the same codebase.
- Different rates of change: Code that changes weekly and code that changes yearly should not be in the same deployment unit.
- Different scaling needs: Components with different load profiles benefit from separate deployment.
- Different team ownership: If different teams work on different parts of the codebase, those parts are candidates for separation.
Strategy 2: Strangler Fig Pattern
Instead of rewriting the system, incrementally extract components from the monolith.
Step 1: Route all traffic through a facade/proxy
Step 2: Build the new component alongside the old
Step 3: Route a small percentage of traffic to the new component
Step 4: Validate correctness and performance
Step 5: Route all traffic to the new component
Step 6: Remove the old code
Key rule: The strangler fig pattern must be done incrementally. If you try to extract everything at once, you are doing a rewrite, not a strangler fig.
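Steps 1 through 3 hinge on a routing facade that sends a controlled share of traffic to the new component. A minimal sketch, assuming a hash-based percentage rollout; the class and handler names are illustrative, not a specific proxy product:

```python
import hashlib

class StranglerRouter:
    """Facade that routes a configurable percentage of traffic to the
    new component, leaving the rest on the legacy code path."""

    def __init__(self, legacy_handler, new_handler, rollout_percent=0):
        self.legacy_handler = legacy_handler
        self.new_handler = new_handler
        self.rollout_percent = rollout_percent  # 0..100

    def route(self, request_id, payload):
        # Hash the caller's id so the same caller is routed consistently
        # while the rollout percentage stays fixed.
        bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
        if bucket < self.rollout_percent:
            return self.new_handler(payload)
        return self.legacy_handler(payload)

# Step 3: start small (e.g. 10%), then increase as validation passes.
router = StranglerRouter(lambda p: ("legacy", p),
                         lambda p: ("new", p),
                         rollout_percent=10)
```

Raising `rollout_percent` to 100 completes Step 5; Step 6 is deleting the legacy handler and the router itself.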
Strategy 3: Define Ownership Boundaries
Assign clear ownership of each module or component to a single team. Ownership means:
- The owning team decides the API contract
- The owning team deploys the component
- Other teams consume the API, not the internal implementation
- Changes to the API contract require agreement from consumers (but not simultaneous deployment)
What to Avoid
- The “big rewrite”: Rewriting a monolith from scratch almost always fails. Use the strangler fig pattern instead.
- Premature microservices: Do not split into microservices until you have clear domain boundaries and team ownership. Microservices with unclear boundaries are a distributed monolith - the worst of both worlds.
- Shared databases across services: This is the most common coupling mechanism. If two services share a database, they cannot be deployed independently because a schema change in one service can break the other.
Moving from Tightly Coupled to Loosely Coupled
This transition is about hardening the boundaries that were established in the previous step.
Strategy 1: Eliminate Shared Data Stores
If two services share a database, one of three things needs to happen:
- One service owns the data, the other calls its API. The dependent service no longer accesses the database directly.
- The data is duplicated. Each service maintains its own copy, synchronized via events.
- The shared data becomes a dedicated data service. Both services consume from a service that owns the data.
Strategy 2: Version Your APIs
API versioning allows consumers and producers to evolve independently.
Rules for API versioning:
- Never make a breaking change without a new version. Adding fields is non-breaking. Removing fields is breaking. Changing field types is breaking.
- Support at least two versions simultaneously. This gives consumers time to migrate.
- Deprecate old versions with a timeline. “Version 1 will be removed on date X.”
- Use consumer-driven contract tests to verify compatibility. See Contract Testing.
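These rules can be sketched as two versions of the same resource served side by side; the field names and route table here are hypothetical stand-ins for a real HTTP framework:

```python
# One resource, two supported versions, so consumers migrate on their
# own schedule. Field names are illustrative.
ORDER = {"id": 42, "total_cents": 1999, "currency": "USD"}

def get_order_v1(order):
    # v1 exposed a floating-point total; kept alive until the announced
    # deprecation date, then removed.
    return {"id": order["id"], "total": order["total_cents"] / 100}

def get_order_v2(order):
    # Within v2, changes stay additive: new fields may appear, but
    # existing fields are never removed or retyped.
    return {"id": order["id"],
            "total_cents": order["total_cents"],
            "currency": order["currency"]}

ROUTES = {
    "/v1/orders": get_order_v1,  # deprecated, removal date published
    "/v2/orders": get_order_v2,
}
```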
Strategy 3: Prefer Asynchronous Communication
Synchronous calls (HTTP, gRPC) create temporal coupling: if the downstream service is slow or unavailable, the upstream service is also affected.
| Communication Style | Coupling | When to Use |
|---|---|---|
| Synchronous (HTTP/gRPC) | Temporal + behavioral | When the caller needs an immediate response |
| Asynchronous (events/queues) | Behavioral only | When the caller does not need an immediate response |
| Event-driven (publish/subscribe) | Minimal | When the producer does not need to know about consumers |
Prefer asynchronous communication wherever the business requirements allow it. Not every interaction needs to be synchronous.
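The publish/subscribe row can be illustrated with a minimal in-memory event bus; in production this would be a broker such as a message queue, and the topic and event names here are hypothetical:

```python
from collections import defaultdict

class EventBus:
    """In-memory stand-in for a broker. The producer publishes to a
    topic and never references its consumers, so coupling is minimal."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
received = []
bus.subscribe("order.created", received.append)   # e.g. inventory service
bus.subscribe("order.created", lambda e: None)    # e.g. analytics service
bus.publish("order.created", {"order_id": 42})    # producer knows only the topic
```

Note that adding a third consumer requires no change to the producer, which is exactly the property the table describes.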
Strategy 4: Design for Failure
In a loosely coupled system, dependencies will be unavailable sometimes. Design for this:
- Circuit breakers: Stop calling a failing dependency after N failures. Return a degraded response instead.
- Timeouts: Set aggressive timeouts on all external calls. A 30-second timeout on a service that should respond in 100ms is not a timeout - it is a hang.
- Bulkheads: Isolate failures so that one failing dependency does not consume all resources.
- Graceful degradation: Define what the user experience should be when a dependency is down. “Recommendations unavailable” is better than a 500 error.
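A circuit breaker combining the first and last items above can be sketched as follows; this is a simplified illustration of the pattern, not a substitute for a hardened resilience library:

```python
import time

class CircuitBreaker:
    """Stops calling a failing dependency after `max_failures` consecutive
    failures, returning a degraded fallback until `reset_after` seconds
    have passed, then tries the dependency again (half-open)."""

    def __init__(self, call, fallback, max_failures=3, reset_after=30.0):
        self.call = call
        self.fallback = fallback
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def __call__(self, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return self.fallback(*args)   # open: degrade, don't call
            self.opened_at = None             # half-open: try again
            self.failures = 0
        try:
            result = self.call(*args)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return self.fallback(*args)
```

A fallback like `lambda _: "Recommendations unavailable"` gives exactly the graceful degradation described above instead of a 500 error.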
Practical Steps for Architecture Decoupling
Step 1: Map Dependencies
Before changing anything, understand what you have:
- Draw a dependency graph. Which components depend on which? Where are the shared databases?
- Identify deployment coupling. Which components must be deployed together? Why?
- Identify the highest-impact coupling. Which coupling most frequently blocks independent deployment?
Step 2: Establish the First Boundary
Pick one component to decouple. Choose the one with the highest impact and lowest risk:
- Apply the strangler fig pattern to extract it
- Define a clear API contract
- Move its data to its own data store
- Deploy it independently
Step 3: Repeat
Take the next highest-impact coupling and address it. Each decoupling makes the next one easier because the team learns the patterns and the remaining system is simpler.
Key Pitfalls
1. “We need to rewrite everything before we can deploy independently”
No. Decoupling is incremental. Extract one component, deploy it independently, prove the pattern works, then continue. A partial decoupling that enables one team to deploy independently is infinitely more valuable than a planned rewrite that never finishes.
2. “We split into microservices but our lead time got worse”
Microservices add operational complexity (more services to deploy, monitor, and debug). If you split without investing in deployment automation, observability, and team autonomy, you will get worse, not better. Microservices are a tool for organizational scaling, not a silver bullet for delivery speed.
3. “Teams keep adding new dependencies that recouple the system”
Architecture decoupling requires governance. Establish architectural principles (e.g., “no shared databases”) and enforce them through automated checks (e.g., dependency analysis in CI) and architecture reviews for cross-boundary changes.
4. “We can’t afford the time to decouple”
You cannot afford not to. Every week spent doing coordinated releases is a week of delivery capacity lost to coordination overhead. The investment in decoupling pays for itself quickly through increased deployment frequency and reduced coordination cost.
Measuring Success
| Metric | Target | Why It Matters |
|---|---|---|
| Teams that can deploy independently | Increasing | The primary measure of decoupling |
| Coordinated releases per quarter | Decreasing toward zero | Confirms coupling is being eliminated |
| Deployment frequency per team | Increasing independently | Confirms teams are not blocked by each other |
| Cross-team dependencies per feature | Decreasing | Confirms architecture supports independent work |
Next Step
With optimized flow, small batches, metrics-driven improvement, and a decoupled architecture, your team is ready for the final phase. Continue to Phase 4: Deliver on Demand.
Related Content
7 - Team Alignment to Code
Match team ownership boundaries to code boundaries so each team can build, test, and deploy its domain independently.
Phase 3 - Optimize | Teams that own a domain end-to-end can deploy independently. Teams organized around technical layers cannot.
How Team Structure Shapes Code
The way an organization communicates produces the architecture it builds. When communication flows
between layers - frontend team talks to backend team, backend team talks to database team - the
software reflects those communication lines. Requests for the UI layer go to one team. Requests for
the API layer go to another. The result is software that is horizontally layered in the same pattern
as the organization.
Layer teams produce layered architectures. The layers are coupled not because the engineers chose
to couple them but because every feature requires coordination across team boundaries. The coupling
is structural, not accidental.
Domain teams produce domain boundaries. When one team owns everything inside a business domain -
the user interface, the business logic, the data store, and the deployment pipeline - they can
make changes within that domain without coordinating with other teams. The interfaces between
domains are explicit and stable because that is how the teams communicate.
This is not a coincidence. Architecture reflects the ownership structure of the people who built
it.
What Aligned Ownership Looks Like
A team with aligned ownership can answer yes to all of the following:
- Can this team deploy a change to production without waiting for another team?
- Does this team own everything inside its domain boundary - all layers, all data, and all consumer interfaces?
- Does this team define and version the contracts its domain exposes to other domains?
- Is this team responsible for production incidents in its domain?
Two team patterns achieve aligned ownership in practice.
A full-stack product team owns the complete user-facing surface for a feature area - from
the UI components a user interacts with down through the business logic and the database. The team
has no hard dependency on a separate frontend or backend team. One team ships the entire vertical
slice.
A subdomain product team owns a service or set of services representing a bounded business
capability. Some subdomain teams own a user-facing surface alongside their backend logic. Others -
a tax calculation service, a shipping rates engine, an identity provider - have no UI at all.
Their consumer interface is entirely an API, consumed by other teams rather than by end users
directly. Both are fully aligned: the team owns everything within the boundary, and the boundary
is what its consumers depend on - whether that is a UI, an API, or both. A slice is done when the
consumer interface satisfies the agreed behavior for its callers.
Both patterns share the same structure: one team, one deployable, full ownership. The team
owns all layers within its boundary, the authority to deploy that boundary independently, and
accountability for its operational behavior.
What Misalignment Looks Like
Three patterns consistently produce deployment coupling.
Component or layer teams. A frontend team, a backend team, and a database team all work on the
same product. Every feature requires coordination across all three. No team can deploy
independently because no team owns a full vertical slice.
Feature teams without domain ownership. Teams are organized around feature areas, but each
feature area spans multiple services owned by other teams. The feature team coordinates with
service owners for every change. The service owners become a shared resource that feature teams
queue against.
The pillar model. A platform team owns all infrastructure. A shared services team owns
cross-cutting concerns. Product teams own the business logic but depend on the other two for
deployment. A change that touches infrastructure or shared services requires the product team to
file a ticket and wait.
The telltale sign in all three cases: a team cannot estimate its own delivery date because it depends on other teams' schedules.
The Relationship Between Team Alignment and Architecture
Team alignment and architecture reinforce each other. A decoupled architecture makes it possible
to draw clean team boundaries. Clean team boundaries prevent the architecture from recoupling.
When team boundaries and code boundaries match:
- Each team modifies code that only they own. Merge conflicts between teams disappear.
- Each team’s pipeline validates only their domain. Shared pipeline queues disappear.
- Each team deploys on their own schedule. Release trains disappear.
When they do not match, architecture and ownership drift together. A team that technically “owns”
a service but in practice coordinates with three other teams for every change is not an independent
deployment unit regardless of what the org chart says.
See Architecture Decoupling for the technical strategies to establish
independent service boundaries. See Tightly Coupled Monolith
for the architecture anti-pattern that misaligned ownership produces over time.
```mermaid
graph TD
    classDef aligned fill:#0d7a32,stroke:#0a6128,color:#fff
    classDef misaligned fill:#a63123,stroke:#8a2518,color:#fff
    classDef boundary fill:#224968,stroke:#1a3a54,color:#fff
    subgraph good ["Aligned: Domain Teams"]
        G1["Payments Team\nUI + Logic + DB + Pipeline"]:::aligned
        G2["Inventory Team\nUI + Logic + DB + Pipeline"]:::aligned
        G3["Accounts Team\nUI + Logic + DB + Pipeline"]:::aligned
        G4["Stable API Contracts"]:::boundary
        G1 --> G4
        G2 --> G4
        G3 --> G4
    end
    subgraph bad ["Misaligned: Layer Teams"]
        L1["Frontend Team\nAll UI across all domains"]:::misaligned
        L2["Backend Team\nAll logic across all domains"]:::misaligned
        L3["Database Team\nAll data across all domains"]:::misaligned
        L4["Coordinated Release Required"]:::boundary
        L1 --> L4
        L2 --> L4
        L3 --> L4
    end
```
How to Align Teams to Code
Step 1: Map who modifies what
Before changing anything, understand the actual ownership pattern. Use commit history to identify
which teams (or individuals acting as de facto teams) modify which files and services.
- Pull commit history for the last three months: `git log --format="%ae %f" | sort | uniq -c`
- Map authors to their team. Identify the files each team touches most.
- Highlight files that multiple teams touch frequently. These are the coupling points.
- Identify services or modules where changes from one team consistently require changes from another.
The result is a map of actual ownership versus nominal ownership. In most organizations these
diverge significantly.
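The aggregation step can be sketched as a small script over the `git log` output; the author-to-team mapping below is hypothetical and would normally be generated from your org data:

```python
import subprocess
from collections import Counter

# Hypothetical author-to-team mapping; generate this from your org chart.
TEAM_OF = {"alice@example.com": "payments", "bob@example.com": "inventory"}

def count_by_team(author_emails, team_of):
    """Aggregate one-email-per-commit output into per-team commit counts.
    Unmapped authors are flagged so the mapping can be completed."""
    return Counter(team_of.get(email, "unmapped") for email in author_emails)

def authors_since(repo=".", since="3 months ago"):
    """One author email per commit in the window, newest first."""
    out = subprocess.run(
        ["git", "-C", repo, "log", f"--since={since}", "--format=%ae"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()
```

Running `count_by_team(authors_since("."), TEAM_OF)` over each service directory (via `git log -- <path>`) surfaces the files and services that multiple teams touch.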
Step 2: Identify natural domain boundaries
Natural domain boundaries exist in most codebases - they are just not enforced by team structure.
Look for:
- Business capabilities. What does this system do? Separate business functions - billing,
shipping, authentication, reporting - that could be operated independently are candidate domains.
- Data ownership. Which tables or data stores does each part of the system read and write?
Data that is exclusively owned by one functional area belongs in that domain.
- Rate of change. Code that changes weekly for business reasons and code that changes monthly
for infrastructure reasons should be in different domains with different teams.
- Existing team knowledge. Where do engineers already have strong concentrated expertise?
Domain boundaries often match knowledge boundaries.
Draw a candidate domain map. Each domain should be a bounded business capability that one
team can own end-to-end. Do not force domains to map to the current team structure - let the
business capabilities define the boundaries first.
Step 3: Assign end-to-end ownership
For each candidate domain identified in Step 2, assign a single team. The rules:
- One team per domain. Shared ownership means no ownership. If a domain has two owners,
pick one.
- Full stack. The owning team is responsible for all layers within the domain - UI, logic, data.
If the current team lacks skills at some layer, plan for cross-training or re-staffing, but do
not address the skill gap by keeping a separate layer team.
- Deployment authority. The owning team merges to trunk and controls the deployment pipeline for
their domain. No other team can block their deployment.
- Operational accountability. The owning team is paged for production issues in their domain.
On-call for the domain is owned by the same people who build it.
Document the domain boundaries explicitly: what services, data stores, and interfaces belong to
each team.
Step 4: Define contracts at boundaries
Once teams own their domains, the interfaces between domains must be made explicit. Implicit
interfaces - shared databases, undocumented internal calls, assumed response shapes - break
independent deployment.
For each boundary between domains:
- API contracts. Define the request and response shapes the consuming team depends on.
Use OpenAPI or an equivalent schema. Commit it to the producer’s repository.
- Event contracts. For asynchronous communication, define the event schema and the guarantees
the producer makes (ordering, at-least-once vs. exactly-once, schema evolution rules).
- Versioning. Establish a versioning policy. Additive changes are non-breaking. Removing or
changing field semantics requires a new version. Both old and new versions are supported for a
defined deprecation period.
- Contract tests. Write tests that verify the producer honors the contract. Write tests that
verify the consumer handles the contract correctly. See Contract Testing
for implementation guidance.
Teams should not proceed to separate deployment pipelines until contracts are explicit and tested.
An implicit contract that breaks silently is worse than a coordinated deployment.
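A provider-side contract check can be sketched as below, assuming the contract is committed as a simple field-to-type mapping (a stand-in for a full OpenAPI schema); the contract and response names are illustrative:

```python
# The consumer's expectations, committed to the producer's repository.
ORDER_CONTRACT_V1 = {"id": int, "status": str, "total_cents": int}

def satisfies_contract(response, contract):
    """True if every field the consumer depends on is present with the
    expected type. Extra fields are allowed: additive changes are
    non-breaking, per the versioning policy above."""
    return all(
        field in response and isinstance(response[field], expected_type)
        for field, expected_type in contract.items()
    )

# Producer-side test: the real handler's response must honor the contract.
response = {"id": 7, "status": "shipped", "total_cents": 1250, "carrier": "x"}
assert satisfies_contract(response, ORDER_CONTRACT_V1)
```

The mirror-image consumer test feeds a canned contract-shaped response into the consumer's client code and asserts it is handled correctly.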
Step 5: Separate deployment pipelines
With explicit contracts in place, each team can operate an independent pipeline for their domain.
- Each team’s pipeline validates only their domain’s tests and contracts.
- Pipeline triggers are scoped to the files the team owns - changes to another domain’s files do
not trigger this team’s pipeline.
- Each team deploys from their pipeline on their own schedule, without waiting for other teams.
For teams that share a repository but own distinct domains, use path-filtered triggers and separate
pipeline configurations. See Multiple Teams, Single Deployable
for a worked example of this pattern when teams share a modular monolith.
| Objection | Response |
|---|---|
| “We don’t have enough senior engineers to staff every domain team fully.” | Domain teams do not need to be large. A team of two to three engineers with full ownership of a well-scoped domain delivers faster than six engineers on a layer team waiting for each other. Start with the highest-priority domains and staff others incrementally. |
| “Our engineers are specialists. The frontend people can’t own database code.” | Ownership does not require equal expertise at every layer - it requires the team to be responsible and to develop capability over time. Pair frontend specialists with backend engineers on the same team. The skill gap closes faster inside a team than across team boundaries. |
| “We tried domain teams before and they reinvented everything separately.” | Reinvention happens when platform capabilities are not shared effectively, not because of domain ownership. Separate domain ownership (what business capabilities each team is responsible for) from platform ownership (shared infrastructure, frameworks, and observability tooling). |
| “Business stakeholders are used to requesting work from the layer teams.” | Stakeholders adapt quickly when domain teams ship faster and with less coordination. Reframe the conversation: stakeholders talk to the team that owns the outcome, not the team that owns the layer. |
| “Our architecture doesn’t have clean domain boundaries yet.” | Start with the organizational change anyway. Teams aligned to emerging domain boundaries will drive the architectural cleanup faster than a centralized architecture effort without aligned ownership. The two reinforce each other. |
Measuring Success
| Metric | Target | Why It Matters |
|---|---|---|
| Deployment frequency per team | Increasing per team | Confirms teams can deploy without waiting for others |
| Cross-team dependencies per feature | Decreasing toward zero | Confirms domain boundaries are holding |
| Development cycle time | Decreasing | Teams that own their domain wait on fewer external dependencies |
| Production incidents attributed to another team’s change | Decreasing | Confirms ownership boundaries match deployment boundaries |
| Teams blocked on a release window they did not control | Decreasing toward zero | The primary organizational symptom of misalignment |
Related Content
8 - Hypothesis-Driven Development
Treat every change as an experiment with a predicted outcome, measure the result, and adjust future work based on evidence.
Phase 3 - Optimize
Hypothesis-driven development treats every change as an experiment. Instead of building features because someone asked for them and hoping they help, teams state a predicted outcome before writing code, measure the result after deployment, and use the evidence to decide what to do next. Combined with feature flags, small batches, and metrics-driven improvement, this practice closes the loop between shipping and learning.
Why Hypothesis-Driven Development
Most teams ship features without stating what outcome they expect. A product manager requests a feature, developers build it, and everyone moves on to the next item. Weeks later, nobody checks whether the feature actually helped.
This is waste. Teams accumulate features without knowing their impact, backlogs grow based on opinion rather than evidence, and the product drifts in whatever direction the loudest voice demands.
Hypothesis-driven development fixes this by making every change answer a question. If the answer is “yes, it helped,” the team invests further. If the answer is “no,” the team reverts or pivots before sinking more effort into the wrong direction. Over time, this produces a product shaped by evidence rather than assumptions.
The Lifecycle
The hypothesis-driven development lifecycle has five stages. Each stage has a specific purpose and a clear output that feeds the next stage.
1. Form the Hypothesis
A hypothesis is a falsifiable prediction about what a change will accomplish. It follows a specific format:
“We believe [change] will produce [outcome] because [reason].”
The “because” clause is critical. Without it, you have a wish, not a hypothesis. The reason forces the team to articulate the causal model behind the change, which makes it possible to learn even when the experiment fails.
**Good:** "We believe adding a progress indicator to the checkout flow will reduce cart abandonment by 10% because users currently leave when they cannot tell how many steps remain."
- Specific change (progress indicator in checkout)
- Measurable outcome (10% reduction in cart abandonment)
- Stated reason (users leave due to uncertainty about remaining steps)
---
**Bad:** "We believe improving the checkout experience will increase conversions."
- Vague change (what does "improving" mean?)
- No target (how much increase?)
- No reason (why would it increase conversions?)
Criteria for a testable hypothesis:
| Criterion | Test | Example |
|---|---|---|
| Specific change | Can you describe exactly what will be different? | “Add a 3-step progress bar to the checkout page header” |
| Measurable outcome | Can you define a number that will move? | “Cart abandonment rate drops from 45% to 40%” |
| Time-bound | Do you know when to check? | “Measured over 2 weeks with at least 5,000 sessions” |
| Falsifiable | Is it possible for the experiment to fail? | Yes - abandonment could stay the same or increase |
| Connected to business value | Does the outcome matter to the business? | Reduced abandonment directly increases revenue |
2. Design the Experiment
Once the hypothesis is formed, design an experiment that can confirm or reject it.
Scope the change to one variable. If you change the checkout layout and add a progress indicator and reduce the number of form fields at the same time, you cannot attribute the outcome to any single change. Change one thing at a time.
Define success and failure criteria before writing code. This prevents moving the goalposts after seeing the results. Write down what “success” looks like and what “failure” looks like before the first commit.
**Hypothesis:** Adding a progress indicator will reduce cart abandonment by 10%.
**Method:** A/B test - 50% of users see the progress indicator, 50% see the current checkout.
**Success criteria:** Abandonment rate in the test group is at least 8% lower than control (allowing a 2% margin).
**Failure criteria:** Abandonment rate difference is less than 5%, or the test group shows higher abandonment.
**Sample size:** Minimum 5,000 sessions per group.
**Time box:** 2 weeks or until sample size is reached, whichever comes first.
Choose the measurement method:
| Method | When to Use | Tradeoff |
|---|---|---|
| A/B test | You have enough traffic to split users into groups | Most rigorous, but requires sufficient volume |
| Before/after | Low traffic or infrastructure changes that affect everyone | Simpler, but confounding factors are harder to control |
| Cohort comparison | Targeting a specific user segment | Good for segment-specific changes, harder to generalize |
3. Implement and Deploy
Build the change using the same continuous delivery practices you use for any other work.
Use feature flags to control exposure. The feature flag infrastructure you built earlier in this phase is what makes experiments possible. Deploy the change behind a flag, then use the flag to control which users see the new behavior.
Deploy through the standard CD pipeline. Experiments are not special. They go through the same build, test, and deployment process as every other change. This ensures the experiment code meets the same quality bar as production code.
Keep the change small. A hypothesis-driven change should follow the same small batch discipline as any other work. If the experiment requires weeks of development, the scope is too large. Break it into smaller experiments that can each be measured independently.
Example implementation:
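A minimal sketch, assuming deterministic hash-based 50/50 assignment for the progress-indicator experiment described above; the function names and hashing scheme are illustrative, not a specific feature-flag product's API:

```python
import hashlib

def experiment_group(user_id, experiment="checkout-progress-indicator"):
    """Deterministic 50/50 assignment: the same user always lands in the
    same group for the lifetime of the experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "test" if int(digest, 16) % 2 == 0 else "control"

def render_checkout(user_id, step, total_steps=3):
    if experiment_group(user_id) == "test":
        # New behavior, behind the flag: show the progress indicator.
        return f"Checkout step {step} of {total_steps}"
    # Control: the current checkout, unchanged.
    return f"Checkout step {step}"
```

Disabling the experiment is a flag change, not a deployment: route everyone to the control branch and the new code lies dormant until it is removed or made permanent.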
4. Measure Results
After the time box expires or the sample size is reached, compare the results against the predefined success criteria.
Compare against your criteria, not against your hopes. If the success criterion was “8% reduction in abandonment” and you achieved 3%, that is a failure by your own definition, even if 3% sounds nice. Rigorous criteria prevent confirmation bias.
Account for confounding factors. Did a marketing campaign run during the experiment? Was there a holiday? Did another team ship a change that affects the same flow? Document anything that might have influenced the results.
Record the outcome regardless of success or failure. Failed experiments are as valuable as successful ones. They update the team’s understanding of how the product works and prevent repeating the same mistakes.
Example experiment record:
**Hypothesis:** Progress indicator reduces cart abandonment by 10%.
**Result:** Abandonment dropped 4% in the test group (not statistically significant at p < 0.05).
**Verdict:** Failed - did not meet the 8% threshold.
**Confounding factors:** A site-wide sale ran during week 2, which may have increased checkout motivation in both groups.
**Learning:** Progress visibility alone is not sufficient to address abandonment. Exit survey data suggests price comparison (leaving to check competitors) is the primary driver, not checkout confusion.
**Next action:** Design a new experiment targeting price confidence instead of checkout flow.
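A verdict like "not statistically significant" comes from a significance test on the two groups. Here is a minimal two-proportion z-test sketch; the counts below (20% baseline abandonment, 5,000 sessions per group) are illustrative assumptions, not the numbers from the record above:

```python
from math import sqrt, erf

def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions,
    e.g. abandonment counts x out of n sessions in each group."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    phi = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

# Illustrative: 20% abandonment in control vs. a 4% relative drop
p = two_proportion_p_value(1000, 5000, 960, 5000)
print(p)  # well above 0.05, so not significant
```

For real experiments, prefer a maintained statistics library over hand-rolled formulas, but the principle holds: the p-value, not the raw percentage drop, tells you whether the effect is distinguishable from noise.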
5. Adjust
The final stage closes the loop. Based on the results, the team takes one of three actions:
**If validated:** Remove the feature flag and make the change permanent. Update the product documentation. Feed the learning into the next hypothesis - what else could you improve now that this change is in place?
**If invalidated:** Revert the change by disabling the flag. Document what was learned and why the hypothesis was wrong. Use the learning to form a better hypothesis. Do not treat invalidation as failure - a team that never invalidates a hypothesis is not running real experiments.
**If inconclusive:** Decide whether to extend the experiment (more time, more traffic) or abandon it. If confounding factors were identified, consider rerunning the experiment under cleaner conditions. Set a hard limit on reruns to avoid indefinite experimentation.
Common Pitfalls
| Pitfall | What Happens | How to Avoid It |
|---|---|---|
| No success criteria defined upfront | Team rationalizes any result as a win | Write success and failure criteria before the first commit |
| Changing multiple variables at once | Cannot attribute the outcome to any single change | Scope each experiment to one variable |
| Abandoning experiments too early | Insufficient data leads to wrong conclusions | Set a minimum sample size and time box; commit to both |
| Never invalidating a hypothesis | Experiments are performative, not real | Celebrate invalidations - they prevent wasted effort |
| Skipping the record step | Team repeats failed experiments or forgets what worked | Maintain an experiment log that is part of the team’s knowledge base |
| Hypothesis disconnected from business outcomes | Team optimizes technical metrics nobody cares about | Every hypothesis must connect to a metric the business tracks |
| Experiments that are too large | Weeks of development before any measurement | Apply small batch discipline to experiments too |
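One lightweight way to keep the experiment log the table recommends is a structured record per experiment, stored alongside the team's documentation. This sketch is a hypothetical format whose fields mirror the example record earlier on this page, not the schema of any particular tool:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ExperimentRecord:
    """One entry in the team's experiment log."""
    hypothesis: str
    success_criteria: str
    verdict: str                      # "validated" | "invalidated" | "inconclusive"
    confounding_factors: list = field(default_factory=list)
    learning: str = ""
    next_action: str = ""

record = ExperimentRecord(
    hypothesis="Progress indicator reduces cart abandonment by 10%",
    success_criteria=">= 8% reduction in abandonment, p < 0.05",
    verdict="invalidated",
    confounding_factors=["Site-wide sale during week 2"],
    learning="Price comparison, not checkout confusion, drives abandonment",
    next_action="Design an experiment targeting price confidence",
)
print(json.dumps(asdict(record), indent=2))  # append to the team's log
```

A consistent structure makes invalidated experiments as searchable as validated ones, which is what prevents the "team repeats failed experiments" pitfall.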
Measuring Success
| Indicator | Target | Why It Matters |
|---|---|---|
| Experiments completed per quarter | 4 or more | Confirms the team is running experiments, not just shipping features |
| Percentage of experiments with predefined success criteria | 100% | Confirms rigor - no experiment should start without criteria |
| Ratio of validated to invalidated hypotheses | Between 40-70% validated | Too high means hypotheses are not bold enough; too low means the team is guessing |
| Time from hypothesis to result | 2-4 weeks | Confirms experiments are scoped small enough to get fast answers |
| Decisions changed by experiment results | Increasing | Confirms experiments actually influence product direction |
Next Step
Experiments generate learnings, but learnings only turn into improvements when the team discusses them. Retrospectives provide the forum where the team reviews experiment results, decides what to do next, and adjusts the process itself.
Related Content