Phase 3: Optimize
Improve flow by reducing batch size, limiting work in progress, and using metrics to drive improvement.
Key question: “Can we deliver small changes quickly?”
With a working pipeline in place, this phase focuses on optimizing the flow of changes
through it. Smaller batches, feature flags, and WIP limits reduce risk and increase
delivery frequency.
What You’ll Do
- Reduce batch size - Deliver smaller, more frequent changes
- Use feature flags - Decouple deployment from release
- Limit work in progress - Focus on finishing over starting
- Drive improvement with metrics - Use DORA metrics and improvement kata
- Run effective retrospectives - Continuously improve the delivery process
- Decouple architecture - Enable independent deployment of components
- Align teams to code - Match team ownership to code boundaries for independent deployment
Why This Phase Matters
Having a pipeline isn’t enough - you need to optimize the flow through it. Teams that
deploy weekly with a CD pipeline are missing most of the benefits. Small batches reduce
risk, feature flags enable testing in production, and metrics-driven improvement creates
a virtuous cycle of getting better at getting better.
When You’re Ready to Move On
You’re ready for Phase 4: Deliver on Demand when:
- Most changes are small enough to deploy independently
- Feature flags let you deploy incomplete features safely
- Your WIP limits keep work flowing without bottlenecks
- You’re measuring and improving your DORA metrics regularly
Next: Phase 4 - Deliver on Demand - remove the last manual gates and deploy on demand.
1 - Small Batches
Deliver smaller, more frequent changes to reduce risk and increase feedback speed.
Batch size is the single biggest lever for improving delivery performance. This page covers what batch size means at every level - deploy frequency, commit size, and story size - and provides concrete techniques for reducing it.
Why Batch Size Matters
Large batches create large risks. When you deploy 50 changes at once, any failure could be caused by any of those 50 changes. When you deploy 1 change, the cause of any failure is obvious.
This is not a theory. The DORA research consistently shows that elite teams deploy more frequently, with smaller changes, and have both higher throughput and lower failure rates. Small batches are the mechanism that makes this possible.
“If it hurts, do it more often, and bring the pain forward.”
- Jez Humble, Continuous Delivery
Three Levels of Batch Size
Batch size is not just about deployments. It operates at three distinct levels, and optimizing only one while ignoring the others limits your improvement.
Level 1: Deploy Frequency
How often you push changes to production.
| State | Deploy Frequency | Risk Profile |
|---|---|---|
| Starting | Monthly or quarterly | Each deploy is a high-stakes event |
| Improving | Weekly | Deploys are planned but routine |
| Optimizing | Daily | Deploys are unremarkable |
| Elite | Multiple times per day | Deploys are invisible |
How to reduce: Remove manual gates, automate approval workflows, build confidence through progressive rollout. If your pipeline is reliable (Phase 2), the only thing preventing more frequent deploys is organizational habit.
Common objections to deploying more often:
- “Incomplete features have no value.” Value is not limited to end-user features. Every deployment provides value to other stakeholders: operations verifies that the change is safe, QA confirms quality gates pass, and the team reduces inventory waste by keeping unintegrated work near zero. A partially built feature deployed behind a flag validates the deployment pipeline and reduces the risk of the final release.
- “Our customers don’t want changes that frequently.” CD is not about shipping user-visible changes every hour. It is about maintaining the ability to deploy at any time. That ability is what lets you ship an emergency fix in minutes instead of days, roll out a security patch without a war room, and support production without heroics.
Level 2: Commit Size
How much code changes in each commit to trunk.
| Indicator | Too Large | Right-Sized |
|---|---|---|
| Files changed | 20+ files | 1-5 files |
| Lines changed | 500+ lines | Under 100 lines |
| Review time | Hours or days | Minutes |
| Merge conflicts | Frequent | Rare |
| Description length | Paragraph needed | One sentence suffices |
How to reduce: Practice TDD (write one test, make it pass, commit). Use feature flags to merge incomplete work. Pair program so review happens in real time.
Level 3: Story Size
How much scope each user story or work item contains.
A story that takes a week to complete is a large batch: a week of work piles up before integration, a week of assumptions goes untested, and a week of inventory sits in progress.
Target: Every story should be completable - coded, tested, reviewed, and integrated - in two days or less. If it cannot be, it needs to be decomposed further.
“If a story is going to take more than a day to complete, it is too big.”
This target is not aspirational. Teams that adopt hyper-sprints - iterations as short as 2.5 days - find that the discipline of writing one-day stories forces better decomposition and faster feedback. Teams that make this shift routinely see throughput double, not because people work faster, but because smaller stories flow through the system with less wait time, fewer handoffs, and fewer defects.
Behavior-Driven Development for Decomposition
BDD provides a concrete technique for breaking stories into small, testable increments. The Given-When-Then format forces clarity about scope.
The Given-When-Then Pattern
Each scenario becomes a deliverable increment. You can implement and deploy the first scenario before starting the second. This is how you turn a “discount feature” (large batch) into three independent, deployable changes (small batches).
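To make the pattern concrete, here is a minimal sketch in Python; the discount domain, the `apply_discount` helper, and the code names are hypothetical, chosen only to mirror the discount example above. Each test function is one Given-When-Then scenario, and each is a deliverable increment.

```python
# Hypothetical discount logic used to illustrate Given-When-Then scenarios.
def apply_discount(cart_total, code, valid_codes):
    """Apply a percentage discount if the code is valid; otherwise raise."""
    if code not in valid_codes:
        raise ValueError("invalid or expired code")
    return round(cart_total * (1 - valid_codes[code]), 2)

# Scenario 1: apply a simple percentage discount
def test_valid_code_reduces_total():
    # Given a cart totaling 100 and a 10% code "SAVE10"
    valid = {"SAVE10": 0.10}
    # When the customer applies "SAVE10"
    total = apply_discount(100.0, "SAVE10", valid)
    # Then the total is reduced to 90
    assert total == 90.0

# Scenario 2: reject expired discount codes
def test_expired_code_is_rejected():
    # Given a cart and no currently valid codes
    # When the customer applies an expired code
    # Then the discount is refused
    try:
        apply_discount(100.0, "OLD", {})
        assert False, "expected rejection"
    except ValueError:
        pass
```

Scenario 1 can be implemented, tested, and deployed before scenario 2 is started.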
Decomposing Stories Using Scenarios
When a story has too many scenarios, it is too large. Use this process:
- Write all the scenarios first. Before any code, enumerate every Given-When-Then for the story.
- Group scenarios into deliverable slices. Each slice should be independently valuable or at least independently deployable.
- Create one story per slice. Each story has 1-3 scenarios and can be completed in 1-2 days.
- Order the slices by value. Deliver the most important behavior first.
Example decomposition:
| Original Story | Scenarios | Sliced Into |
|---|---|---|
| “As a user, I can manage my profile” | 12 scenarios covering name, email, password, avatar, notifications, privacy, deactivation | 5 stories: basic info (2 scenarios), password (2), avatar (2), notifications (3), deactivation (3) |
ATDD: Connecting Scenarios to Daily Integration
BDD scenarios define what to build. Acceptance Test-Driven Development (ATDD) defines how to build it in small, integrated steps. The workflow is:
- Pick one scenario. Choose the next Given-When-Then from your story.
- Write the acceptance test first. Automate the scenario so it runs against the real system (or a close approximation). It will fail - this is the RED state.
- Write just enough code to pass. Implement the minimum production code to make the acceptance test pass - the GREEN state.
- Refactor. Clean up the code while the test stays green.
- Commit and integrate. Push to trunk. The pipeline verifies the change.
- Repeat. Pick the next scenario.
Each cycle produces a commit that is independently deployable and verified by an automated test. This is how BDD scenarios translate directly into a stream of small, safe integrations rather than a batch of changes delivered at the end of a story.
Key benefits:
- Every commit has a corresponding acceptance test, so you know exactly what it does and that it works.
- You never go more than a few hours without integrating to trunk.
- The acceptance tests accumulate into a regression suite that protects future changes.
- If a commit breaks something, the scope of the change is small enough to diagnose quickly.
Service-Level Decomposition Example
ATDD works at the API and service level, not just at the UI level. Here is an example of building an order history endpoint day by day:
Day 1 - Return an empty list for a customer with no orders:
Commit: Implement the endpoint, return an empty JSON array. Acceptance test passes.
Day 2 - Return a single order with basic fields:
Commit: Query the orders table, serialize basic fields. Previous test still passes.
Day 3 - Return multiple orders sorted by date:
Commit: Add sorting logic and pagination. All three tests pass.
Each day produces a deployable change. The endpoint is usable (though minimal) after day 1. No day requires more than a few hours of coding because the scope is constrained by a single scenario.
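The Day 1 increment might look like this in Python; the `ORDERS` dict and all names are hypothetical stand-ins for the real database and endpoint, kept to stdlib so the sketch is self-contained.

```python
import json

# Hypothetical in-memory store standing in for the real orders table.
ORDERS = {}  # customer_id -> list of order dicts

def get_order_history(customer_id):
    # Day 1: just enough code to make the first scenario pass.
    return json.dumps(ORDERS.get(customer_id, []))

# Day 1 acceptance test, written first (RED), then made to pass (GREEN).
def test_customer_with_no_orders_gets_empty_list():
    assert json.loads(get_order_history("cust-42")) == []
```

Day 2 would add a real query and field serialization behind the same function, keeping this test green.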
Vertical Slicing
A vertical slice cuts through all layers of the system to deliver a thin piece of end-to-end functionality. This is the opposite of horizontal slicing, where you build all the database changes, then all the API changes, then all the UI changes.
Horizontal vs. Vertical Slicing
Horizontal (avoid):
Story 1: Build the database schema for discounts
Story 2: Build the API endpoints for discounts
Story 3: Build the UI for applying discounts
Problems: Stories 1 and 2 deliver no user value. You cannot test end-to-end until story 3 is done. Integration risk accumulates.
Vertical (prefer):
Story 1: Apply a simple percentage discount (DB + API + UI for one scenario)
Story 2: Reject expired discount codes (DB + API + UI for one scenario)
Story 3: Apply discounts only to eligible items (DB + API + UI for one scenario)
Benefits: Every story delivers testable, deployable functionality. Integration happens with each story, not at the end. You can ship story 1 and get feedback before building story 2.
How to Slice Vertically
Ask these questions about each proposed story:
- Can a user (or another system) observe the change? If not, slice differently.
- Can I write an end-to-end test for it? If not, the slice is incomplete.
- Does it require all other slices to be useful? If yes, find a thinner first slice.
- Can it be deployed independently? If not, check whether feature flags could help.
Vertical slicing in distributed systems
The examples above assume a team that owns the full stack - UI, API, and database. In large distributed systems, most teams own a subdomain and may not be directly user-facing.
The principle is the same. A subdomain product team’s vertical slice cuts through all layers they control: the service API, the business logic, and the data store. “End-to-end” means end-to-end within your domain, not end-to-end across the entire system. The team deploys independently behind a stable contract, without coordinating with other teams.
The key difference is whether the public interface is designed for humans or machines. A full-stack product team owns a human-facing surface - the slice is done when a user can observe the behavior through that interface. A subdomain product team owns a machine-facing surface - the slice is done when the API contract satisfies the agreed behavior for its service consumers.
See Work Decomposition for diagrams of both contexts, and Horizontal Slicing for the failure mode that emerges when distributed teams split work by layer instead of by behavior.
Story Slicing Anti-Patterns
These are common ways teams slice stories that undermine the benefits of small batches:
Wrong: Slice by layer.
“Story 1: Build the database. Story 2: Build the API. Story 3: Build the UI.”
Right: Slice vertically so each story touches all layers and delivers observable behavior.
Wrong: Slice by activity.
“Story 1: Design. Story 2: Implement. Story 3: Test.”
Right: Each story includes all activities needed to deliver and verify one behavior.
Wrong: Create dependent stories.
“Story 2 cannot start until Story 1 is finished because it depends on the data model.”
Right: Each story is independently deployable. Use contracts, feature flags, or stubs to break dependencies between stories.
Wrong: Lose testability.
“This story just sets up infrastructure - there is nothing to test yet.”
Right: Every story has at least one automated test that verifies its behavior. If you cannot write a test, the slice does not deliver observable value.
Practical Steps for Reducing Batch Size
Step 1: Measure Current State
Before changing anything, measure where you are:
- Average commit size (lines changed per commit)
- Average story cycle time (time from start to done)
- Deploy frequency (how often changes reach production)
- Average changes per deploy (how many commits per deployment)
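As one hedged example of gathering the first metric, average commit size can be computed from `git log --format=%H --numstat` output. The parser below is a sketch and assumes exactly that output format (hash lines followed by tab-separated per-file counts):

```python
def average_commit_size(numstat_output):
    """Average lines changed per commit.

    Parses the output of `git log --format=%H --numstat`, where each
    commit hash appears on its own line followed by per-file lines of
    the form "<added><TAB><deleted><TAB><path>". Binary files report
    "-" for the counts and are skipped.
    """
    commits = 0
    total_lines = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and parts[0] != "-":
            total_lines += int(parts[0]) + int(parts[1])
        elif len(parts) == 1 and line.strip():
            commits += 1  # a commit hash line
    return total_lines / commits if commits else 0.0
```

Feed it `git log` output piped from your repository and track the number week over week.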
Step 2: Introduce Story Decomposition
- Start writing BDD scenarios before implementation
- Split any story estimated at more than 2 days
- Track the number of stories completed per week (expect this to increase as stories get smaller)
Step 3: Tighten Commit Size
- Adopt the discipline of “one logical change per commit”
- Use TDD to create a natural commit rhythm: write test, make it pass, commit
- Track average commit size and set a team target (e.g., under 100 lines)
Ongoing: Increase Deploy Frequency
- Deploy at least once per day, then work toward multiple times per day
- Remove any batch-oriented processes (e.g., “we deploy on Tuesdays”)
- Make deployment a non-event
Key Pitfalls
1. “Small stories take more overhead to manage”
This is true only if your process adds overhead per story (e.g., heavyweight estimation ceremonies, multi-level approval). The solution is to simplify the process, not to keep stories large. Overhead per story should be near zero for a well-decomposed story.
2. “Some things can’t be done in small batches”
Almost anything can be decomposed further. Database migrations can be done in backward-compatible steps. API changes can use versioning. UI changes can be hidden behind feature flags. The skill is in finding the decomposition, not in deciding whether one exists.
3. “We tried small stories but our throughput dropped”
This usually means the team is still working sequentially. Small stories require limiting WIP and swarming - see Limiting WIP. If the team starts 10 small stories instead of 2 large ones, they have not actually reduced batch size; they have increased WIP.
Measuring Success
Next Step
Small batches often require deploying incomplete features to production. Feature Flags provide the mechanism to do this safely.
2 - Feature Flags
Decouple deployment from release by using feature flags to control feature visibility.
Feature flags are the mechanism that makes trunk-based development and small batches safe. They let you deploy code to production without exposing it to users, enabling dark launches, gradual rollouts, and instant rollback of features without redeploying.
Why Feature Flags?
In continuous delivery, deployment and release are two separate events:
- Deployment is pushing code to production.
- Release is making a feature available to users.
Feature flags are the bridge between these two events. They let you deploy frequently (even multiple times a day) without worrying about exposing incomplete or untested features. This separation is what makes continuous deployment possible for teams that ship real products to real users.
When You Need Feature Flags (and When You Don’t)
Not every change requires a feature flag. Flags add complexity, and unnecessary complexity slows you down. Use this decision tree to determine the right approach.
Decision Tree
graph TD
Start[New Code Change] --> Q1{Is this a large or<br/>high-risk change?}
Q1 -->|Yes| Q2{Do you need gradual<br/>rollout or testing<br/>in production?}
Q1 -->|No| Q3{Is the feature<br/>incomplete or spans<br/>multiple releases?}
Q2 -->|Yes| UseFF1[YES - USE FEATURE FLAG<br/>Enables safe rollout<br/>and quick rollback]
Q2 -->|No| Q4{Do you need to<br/>test in production<br/>before full release?}
Q3 -->|Yes| Q3A{Can you use an<br/>alternative pattern?}
Q3 -->|No| Q5{Do different users/<br/>customers need<br/>different behavior?}
Q3A -->|New Feature| NoFF_NewFeature[NO FLAG NEEDED<br/>Connect to tests only,<br/>integrate in final commit]
Q3A -->|Behavior Change| NoFF_Abstraction[NO FLAG NEEDED<br/>Use branch by<br/>abstraction pattern]
Q3A -->|New API Route| NoFF_API[NO FLAG NEEDED<br/>Build route, expose<br/>as last change]
Q3A -->|Not Applicable| UseFF2[YES - USE FEATURE FLAG<br/>Enables trunk-based<br/>development]
Q4 -->|Yes| UseFF3[YES - USE FEATURE FLAG<br/>Dark launch or<br/>beta testing]
Q4 -->|No| Q6{Is this an<br/>experiment or<br/>A/B test?}
Q5 -->|Yes| UseFF4[YES - USE FEATURE FLAG<br/>Customer-specific<br/>toggles needed]
Q5 -->|No| Q7{Does change require<br/>coordination with<br/>other teams/services?}
Q6 -->|Yes| UseFF5[YES - USE FEATURE FLAG<br/>Required for<br/>experimentation]
Q6 -->|No| NoFF1[NO FLAG NEEDED<br/>Simple change,<br/>deploy directly]
Q7 -->|Yes| UseFF6[YES - USE FEATURE FLAG<br/>Enables independent<br/>deployment]
Q7 -->|No| Q8{Is this a bug fix<br/>or hotfix?}
Q8 -->|Yes| NoFF2[NO FLAG NEEDED<br/>Deploy immediately]
Q8 -->|No| NoFF3[NO FLAG NEEDED<br/>Standard deployment<br/>sufficient]
style UseFF1 fill:#90EE90
style UseFF2 fill:#90EE90
style UseFF3 fill:#90EE90
style UseFF4 fill:#90EE90
style UseFF5 fill:#90EE90
style UseFF6 fill:#90EE90
style NoFF1 fill:#FFB6C6
style NoFF2 fill:#FFB6C6
style NoFF3 fill:#FFB6C6
style NoFF_NewFeature fill:#FFB6C6
style NoFF_Abstraction fill:#FFB6C6
style NoFF_API fill:#FFB6C6
style Start fill:#87CEEB
Alternatives to Feature Flags
| Technique | How It Works | When to Use |
|---|---|---|
| Branch by Abstraction | Introduce an abstraction layer, build the new implementation behind it, switch when ready | Replacing an existing subsystem or library |
| Connect Tests Last | Build internal components without connecting them to the UI or API | New backend functionality that has no user-facing impact until connected |
| Dark Launch | Deploy the code path but do not route any traffic to it | New infrastructure, new services, or new endpoints that are not yet referenced |
These alternatives avoid the lifecycle overhead of feature flags while still enabling trunk-based development with incomplete work.
Implementation Approaches
Feature flags can be implemented at different levels of sophistication. Start simple and add complexity only when needed.
Level 1: Static Code-Based Flags
The simplest approach: a boolean constant or configuration value checked in code.
Pros: Zero infrastructure. Easy to understand. Works everywhere.
Cons: Changing a flag requires a deployment. No per-user targeting. No gradual rollout.
Best for: Teams starting out. Internal tools. Changes that will be fully on or fully off.
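A Level 1 flag might look like this; the flag constant and both checkout paths are hypothetical names used only for illustration:

```python
# Level 1: a static flag is just a constant checked at the call site.
NEW_CHECKOUT_ENABLED = False  # flipping this requires a deployment

def legacy_checkout(cart):
    return {"flow": "legacy", "items": len(cart)}

def new_checkout(cart):
    return {"flow": "new", "items": len(cart)}

def checkout(cart):
    if NEW_CHECKOUT_ENABLED:
        return new_checkout(cart)  # in-progress path, deployed but dark
    return legacy_checkout(cart)   # current behavior, unchanged
```

The new path ships to production on every deploy but stays dark until the constant is flipped in a later commit.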
Level 2: Dynamic In-Process Flags
Flags stored in a configuration file, database, or environment variable that can be changed at runtime without redeploying.
Pros: No redeployment needed. Supports percentage rollout. Simple to implement.
Cons: Each instance reads its own config - no centralized view. Limited targeting capabilities.
Best for: Teams that need gradual rollout but do not want to adopt a third-party service yet.
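One possible Level 2 sketch reads a JSON config file on every check, so an operator (or config management) can flip a flag by editing the file rather than redeploying. The file format and class name are illustrative assumptions:

```python
import json

class FileBackedFlags:
    """Flags read from a JSON file on each check, e.g. {"new-checkout": true}.

    Because the file is re-read at evaluation time, changing a flag needs
    no redeploy; each instance sees its own copy of the config.
    """
    def __init__(self, path):
        self.path = path

    def is_enabled(self, name, default=False):
        try:
            with open(self.path) as f:
                return bool(json.load(f).get(name, default))
        except FileNotFoundError:
            return default
```

The same shape works with an environment variable, a database row, or a shared config service as the backing store.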
Level 3: Centralized Flag Service
A dedicated service (self-hosted or SaaS) that manages all flags, provides a dashboard, supports targeting rules, and tracks flag usage.
Examples: LaunchDarkly, Unleash, Flagsmith, Split, or a custom internal service.
Pros: Centralized management. Rich targeting (by user, plan, region, etc.). Audit trail. Real-time changes.
Cons: Added dependency. Cost (for SaaS). Network latency for flag evaluation (mitigated by local caching in most SDKs).
Best for: Teams at scale. Products with diverse user segments. Regulated environments needing audit trails.
Level 4: Infrastructure Routing
Instead of checking flags in application code, route traffic at the infrastructure level (load balancer, service mesh, API gateway).
Pros: No application code changes. Clean separation of routing from logic. Works across services.
Cons: Requires infrastructure investment. Less granular than application-level flags. Harder to target individual users.
Best for: Microservice architectures. Service-level rollouts. A/B testing at the infrastructure layer.
Feature Flag Lifecycle
Every feature flag has a lifecycle. Flags that are not actively managed become technical debt. Follow this lifecycle rigorously.
The Stages
1. CREATE → Define the flag, document its purpose and owner
2. DEPLOY OFF → Code ships to production with the flag disabled
3. BUILD → Incrementally add functionality behind the flag
4. DARK LAUNCH → Enable for internal users or a small test group
5. ROLLOUT → Gradually increase the percentage of users
6. REMOVE → Delete the flag and the old code path
Stage 1: Create
Before writing any code, define the flag:
- Name: Use a consistent naming convention (e.g., `enable-new-checkout`, `feature.discount-engine`)
- Owner: Who is responsible for this flag through its lifecycle?
- Purpose: One sentence describing what the flag controls
- Planned removal date: Set this at creation time. Flags without removal dates become permanent.
Stage 2: Deploy OFF
The first deployment includes the flag check but the flag is disabled. This verifies that:
- The flag infrastructure works
- The default (off) path is unaffected
- The flag check does not introduce performance issues
Stage 3: Build Incrementally
Continue building the feature behind the flag over multiple deploys. Each deploy adds more functionality, but the flag remains off for users. Test both paths in your automated suite.
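A sketch of what "test both paths" can look like; the pricing function and flag name are hypothetical. Passing the flag value in explicitly makes both branches trivial to exercise regardless of the flag's state in production:

```python
# Hypothetical flagged function with the flag value passed in explicitly.
def price_with_discounts(cart_total, discount_engine_enabled):
    if discount_engine_enabled:
        return round(cart_total * 0.9, 2)  # new path behind the flag
    return cart_total                      # old path, still the default

# Run both on every pipeline execution.
def test_flag_off_preserves_existing_behavior():
    assert price_with_discounts(100.0, discount_engine_enabled=False) == 100.0

def test_flag_on_applies_discount():
    assert price_with_discounts(100.0, discount_engine_enabled=True) == 90.0
```

When the flag is eventually removed, the "off" test is deleted along with the old path.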
Stage 4: Dark Launch
Enable the flag for internal users or a specific test group. This is your first validation with real production data and real traffic patterns. Monitor:
- Error rates for the flagged group vs. control
- Performance metrics (latency, throughput)
- Business metrics (conversion, engagement)
Stage 5: Gradual Rollout
Increase exposure systematically:
| Step | Audience | Duration | What to Watch |
|---|---|---|---|
| 1 | 1% of users | 1-2 hours | Error rates, latency |
| 2 | 5% of users | 4-8 hours | Performance at slightly higher load |
| 3 | 25% of users | 1 day | Business metrics begin to be meaningful |
| 4 | 50% of users | 1-2 days | Statistically significant business impact |
| 5 | 100% of users | - | Full rollout |
At any step, if metrics degrade, roll back by disabling the flag. No redeployment needed.
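Percentage rollout needs deterministic bucketing so a given user's experience stays stable as the percentage grows. One common sketch hashes the user id together with the flag name; the function below is illustrative, not a real flag-service API:

```python
import hashlib

def in_rollout(user_id, flag_name, percentage):
    """Deterministically bucket users into the first `percentage` percent.

    Hashing the user id together with the flag name keeps a user's bucket
    stable for a given flag (raising 5% to 25% only adds users, never
    removes them), while different flags reach different slices of users.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable value in 0..99
    return bucket < percentage
```

Rolling back at any step is then just lowering the percentage to 0; no redeployment needed.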
Stage 6: Remove
This is the most commonly skipped step, and skipping it creates significant technical debt.
Once the feature has been stable at 100% for an agreed period (e.g., 2 weeks):
- Remove the flag check from code
- Remove the old code path
- Remove the flag definition from the flag service
- Deploy the simplified code
Set a maximum flag lifetime. A common practice is 90 days. Any flag older than 90 days triggers an automatic review. Stale flags are a maintenance burden and a source of confusion.
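The 90-day audit is easy to automate. The registry shape below (flag name mapped to creation date) is an assumption, not a real flag-service API:

```python
from datetime import date

def stale_flags(flag_created_dates, today, max_age_days=90):
    """Return names of flags older than max_age_days, for the periodic audit.

    `flag_created_dates` maps flag name -> creation date; in practice this
    would come from your flag service's metadata.
    """
    return sorted(
        name
        for name, created in flag_created_dates.items()
        if (today - created).days > max_age_days
    )
```

Run this in a scheduled job and open a review ticket for every name it returns.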
Lifecycle Timeline Example
| Day | Action | Flag State |
|---|---|---|
| 1 | Deploy flag infrastructure and create removal ticket | OFF |
| 2-5 | Build feature behind flag, integrate daily | OFF |
| 6 | Enable for internal users (dark launch) | ON for 0.1% |
| 7 | Enable for 1% of users | ON for 1% |
| 8 | Enable for 5% of users | ON for 5% |
| 9 | Enable for 25% of users | ON for 25% |
| 10 | Enable for 50% of users | ON for 50% |
| 11 | Enable for 100% of users | ON for 100% |
| 12-18 | Stability period (monitor) | ON for 100% |
| 19-21 | Remove flag from code | DELETED |
Total lifecycle: approximately 3 weeks from creation to removal.
Long-Lived Feature Flags
Not all flags are temporary. Some flags are intentionally permanent and should be managed differently from release flags.
Operational Flags (Kill Switches)
Purpose: Disable expensive or non-critical features under load during incidents.
Lifecycle: Permanent.
Management: Treat as system configuration, not as a release mechanism.
Customer-Specific Toggles
Purpose: Different customers receive different features based on their subscription or contract.
Lifecycle: Permanent, tied to customer configuration.
Management: Part of the customer entitlement system, not the feature flag system.
Experimentation Flags
Purpose: A/B testing and experimentation.
Lifecycle: The flag infrastructure is permanent, but individual experiments expire.
Management: Each experiment has its own expiration date and success criteria. The experimentation platform itself persists.
Managing Long-Lived Flags
Long-lived flags need different discipline than temporary ones:
- Use a separate naming convention (e.g., `KILL_SWITCH_*`, `ENTITLEMENT_*`) to distinguish them from temporary release flags
- Document why each flag is permanent so future team members understand the intent
- Store them separately from temporary flags in your management system
- Review regularly to confirm they are still needed
Key Pitfalls
1. “We have 200 feature flags and nobody knows what they all do”
This is flag debt, and it is as damaging as any other technical debt. Prevent it by enforcing the lifecycle: every flag has an owner, a purpose, and a removal date. Run a monthly flag audit.
2. “We use flags for everything, including configuration”
Feature flags and configuration are different concerns. Flags are temporary (they control unreleased features). Configuration is permanent (it controls operational behavior like timeouts, connection pools, log levels). Mixing them leads to confusion about what can be safely removed.
3. “Testing both paths doubles our test burden”
It does increase test effort, but this is a temporary cost. When the flag is removed, the extra tests go away too. The alternative - deploying untested code paths - is far more expensive.
4. “Nested flags create combinatorial complexity”
Avoid nesting flags whenever possible. If feature B depends on feature A, do not create a separate flag for B. Instead, extend the behavior behind feature A’s flag. If you must nest, document the dependency and test the specific combinations that matter.
Flag Removal Anti-Patterns
These specific patterns are the most common ways teams fail at flag cleanup.
Don’t skip the removal ticket:
- WRONG: “We’ll remove it later when we have time”
- RIGHT: Create a removal ticket at the same time you create the flag
Don’t leave flags after full rollout:
- WRONG: Flag still in code 6 months after 100% rollout
- RIGHT: Remove within 2-4 weeks of full rollout
Don’t forget to remove the old code path:
- WRONG: Flag removed but old implementation still in the codebase
- RIGHT: Remove the flag check AND the old implementation together
Don’t keep flags “just in case”:
- WRONG: “Let’s keep it in case we need to roll back in the future”
- RIGHT: After the stability period, rollback is handled by deployment, not by re-enabling a flag
Measuring Success
| Metric | Target | Why It Matters |
|---|---|---|
| Active flag count | Stable or decreasing | Confirms flags are being removed, not accumulating |
| Average flag age | < 90 days | Catches stale flags before they become permanent |
| Flag-related incidents | Near zero | Confirms flag management is not causing problems |
| Time from deploy to release | Hours to days (not weeks) | Confirms flags enable fast, controlled releases |
Next Step
Small batches and feature flags let you deploy more frequently, but deploying more means more work in progress. Limiting WIP ensures that increased deploy frequency does not create chaos.
3 - Limiting Work in Progress
Focus on finishing work over starting new work to improve flow and reduce cycle time.
Work in progress (WIP) is inventory. Like physical inventory, it loses value the longer it sits unfinished. Limiting WIP is the most counterintuitive and most impactful practice in this entire migration: doing less work at once makes you deliver more.
Why Limiting WIP Matters
Every item of work in progress has a cost:
- Context switching: Moving between tasks destroys focus. Research consistently shows that switching between two tasks reduces productive time by 20-40%.
- Delayed feedback: Work that is started but not finished cannot be validated by users. The longer it sits, the more assumptions go untested.
- Hidden dependencies: The more items in progress simultaneously, the more likely they are to conflict, block each other, or require coordination.
- Longer cycle time: Little’s Law states that cycle time = WIP / throughput. If throughput is constant, the only way to reduce cycle time is to reduce WIP.
“Stop starting, start finishing.”
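Little's Law, cited above, is easy to sanity-check with arithmetic:

```python
def cycle_time(wip, throughput_per_week):
    """Little's Law: average cycle time = WIP / throughput."""
    return wip / throughput_per_week

# A team finishing 5 items per week with 15 items in progress averages a
# 3-week cycle time; cutting WIP to 5 brings it down to 1 week.
assert cycle_time(15, 5) == 3.0
assert cycle_time(5, 5) == 1.0
```

Since throughput is hard to change quickly, reducing WIP is the lever the team actually controls.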
How to Set Your WIP Limit
The N+2 Starting Point
A practical starting WIP limit for a team is N+2, where N is the number of team members actively working on delivery.
| Team Size | Starting WIP Limit | Rationale |
|---|---|---|
| 3 developers | 5 items | Allows one item per person plus a small buffer |
| 5 developers | 7 items | Same principle at larger scale |
| 8 developers | 10 items | Buffer becomes proportionally smaller |
Why N+2 and not N? Because some items will be blocked waiting for review, testing, or external dependencies. A small buffer prevents team members from being idle when their primary task is blocked. But the buffer should be small - two items, not ten.
Continuously Lower the Limit
The N+2 formula is a starting point, not a destination. Once the team is comfortable with the initial limit, reduce it:
- Start at N+2. Run for 2-4 weeks. Observe where work gets stuck.
- Reduce to N+1. Tighten the limit. Some team members will occasionally be “idle” - this is a feature, not a bug. They should swarm on blocked items.
- Reduce to N. At this point, every team member is working on exactly one thing. Blocked work gets immediate attention because someone is always available to help.
- Consider going below N. Some teams find that pairing (two people, one item) further reduces cycle time. A team of 6 with a WIP limit of 3 means everyone is pairing.
Each reduction will feel uncomfortable. That discomfort is the point - it exposes problems in your workflow that were previously hidden by excess WIP.
What Happens When You Hit the Limit
When the team reaches its WIP limit and someone finishes a task, they have two options:
- Pull the next highest-priority item (if the WIP limit allows it).
- Swarm on an existing item that is blocked, stuck, or nearing its cycle time target.
When the WIP limit is reached and no items are complete:
- Do not start new work. This is the hardest part and the most important.
- Help unblock existing work. Pair with someone. Review a pull request. Write a missing test. Talk to the person who has the answer to the blocking question.
- Improve the process. If nothing is blocked but everything is slow, this is the time to work on automation, tooling, or documentation.
Swarming
Swarming is the practice of multiple team members working together on a single item to get it finished faster. It is the natural complement to WIP limits.
When to Swarm
- An item has been in progress for longer than the team’s cycle time target (e.g., more than 2 days)
- An item is blocked and the blocker can be resolved by another team member
- The WIP limit is reached and someone needs work to do
- A critical defect needs to be fixed immediately
How to Swarm Effectively
| Approach | How It Works | Best For |
|---|---|---|
| Pair programming | Two developers work on the same item at the same machine | Complex logic, knowledge transfer, code that needs review |
| Mob programming | The whole team works on one item together | Critical path items, complex architectural decisions |
| Divide and conquer | Break the item into sub-tasks and assign them | Items that can be parallelized (e.g., frontend + backend + tests) |
| Unblock and return | One person resolves the blocker, then hands back | External dependencies, environment issues, access requests |
Why Teams Resist Swarming
The most common objection: “It’s inefficient to have two people on one task.” This is only true if you measure efficiency as “percentage of time each person is writing new code.” If you measure efficiency as “how quickly value reaches production,” swarming is almost always faster because it reduces handoffs, wait time, and rework.
How Limiting WIP Exposes Workflow Issues
One of the most valuable effects of WIP limits is that they make hidden problems visible. When you cannot start new work, you are forced to confront the problems that slow existing work down.
| Symptom When WIP Is Limited | Root Cause Exposed |
|---|---|
| “I’m idle because my PR is waiting for review” | Code review process is too slow |
| “I’m idle because I’m waiting for the test environment” | Not enough environments, or environments are not self-service |
| “I’m idle because I’m waiting for the product owner to clarify requirements” | Stories are not refined before being pulled into the sprint |
| “I’m idle because my build is broken and I can’t figure out why” | Build is not deterministic, or test suite is flaky |
| “I’m idle because another team hasn’t finished the API I depend on” | Architecture is too tightly coupled (see Architecture Decoupling) |
Each of these is a bottleneck that was previously invisible because the team could always start something else. With WIP limits, these bottlenecks become obvious and demand attention.
Implementing WIP Limits
Step 1: Make WIP Visible
Before setting limits, make current WIP visible:
- Count the number of items currently “in progress” for the team
- Write this number on the board (physical or digital) every day
- Most teams are shocked by how high it is. A team of 5 often has 15-20 items in progress.
Step 2: Set the Initial Limit
- Calculate N+2 for your team
- Add the limit to your board (e.g., a column header that says “In Progress (limit: 7)”)
- Agree as a team that when the limit is reached, no new work starts
Step 3: Enforce the Limit
- When someone tries to pull new work and the limit is reached, the team helps them find an existing item to work on
- Track violations: how often does the team exceed the limit? What causes it?
- Discuss in retrospectives: Is the limit too high? Too low? What bottlenecks are exposed?
Step 4: Reduce the Limit (Monthly)
- Every month, consider reducing the limit by 1
- Each reduction will expose new bottlenecks - this is the intended effect
- Stop reducing when the team reaches a sustainable flow where items move from start to done predictably
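The four steps above amount to a simple daily check. The sketch below is a hypothetical illustration, assuming a board export you can count and the N+2 starting limit (with N as team size, matching the limit of 7 for the team of 5 mentioned earlier); the function and field names are made up for this example.

```python
# Hypothetical sketch of the daily WIP check in Steps 1-3.
# Assumes N = team size and the N+2 starting limit described above.
def wip_status(in_progress_items, team_size):
    """Compare the current in-progress count to the team's agreed limit."""
    limit = team_size + 2                  # Step 2: the N+2 starting point
    count = len(in_progress_items)
    return {
        "count": count,
        "limit": limit,
        "over_limit": count > limit,       # Step 3: stop and address it
        "can_pull_new_work": count < limit,
    }
```

For the team of 5 above with 9 items on the board, this reports a limit of 7, flags the team as over the limit, and signals that no new work should be pulled.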
Key Pitfalls
1. “We set a WIP limit but nobody enforces it”
A WIP limit that is not enforced is not a WIP limit. Enforcement requires a team agreement and a visible mechanism. If the board shows 10 items in progress and the limit is 7, the team should stop and address it immediately. This is a working agreement, not a suggestion.
2. “Developers are idle and management is uncomfortable”
This is the most common failure mode. Management sees “idle” developers and concludes WIP limits are wasteful. In reality, those “idle” developers are either swarming on existing work (which is productive) or the team has hit a genuine bottleneck that needs to be addressed. The discomfort is a signal that the system needs improvement.
3. “We have WIP limits but we also have expedite lanes for everything”
If every urgent request bypasses the WIP limit, you do not have a WIP limit. Expedite lanes should be rare - one per week at most. If everything is urgent, nothing is.
4. “We limit WIP per person but not per team”
Per-person WIP limits miss the point. The goal is to limit team WIP so that team members are incentivized to help each other. A per-person limit of 1 with no team limit still allows the team to have 8 items in progress simultaneously with no swarming.
Measuring Success
| Metric | Target | Why It Matters |
|---|---|---|
| Work in progress | At or below team limit | Confirms the limit is being respected |
| Development cycle time | Decreasing | Confirms that less WIP leads to faster delivery |
| Items completed per week | Stable or increasing | Confirms that "finish more, start less" is working |
| Time items spend blocked | Decreasing | Confirms bottlenecks are being addressed |
Next Step
WIP limits expose problems. Metrics-Driven Improvement provides the framework for systematically addressing them.
Content contributed by Dojo Consortium, licensed under CC BY 4.0.
Related Content
4 - Metrics-Driven Improvement
Use DORA metrics and improvement kata to drive systematic delivery improvement.
Phase 3 - Optimize | Original content combining DORA recommendations and improvement kata
Improvement without measurement is guesswork. This page combines the DORA four key metrics with the improvement kata pattern to create a systematic, repeatable approach to getting better at delivery.
The Problem with Ad Hoc Improvement
Most teams improve accidentally. Someone reads a blog post, suggests a change at standup, and the team tries it for a week before forgetting about it. This produces sporadic, unmeasurable progress that is impossible to sustain.
Metrics-driven improvement replaces this with a disciplined cycle: measure where you are, define where you want to be, run a small experiment, measure the result, and repeat. The improvement kata provides the structure. DORA metrics provide the measures.
The Four DORA Metrics
The DORA research program (now part of Google Cloud) has identified four key metrics that predict software delivery performance. These are the metrics you should track throughout your CD migration.
1. Deployment Frequency
How often your team deploys to production.
| Performance Level | Deployment Frequency |
|---|---|
| Elite | On-demand (multiple deploys per day) |
| High | Between once per day and once per week |
| Medium | Between once per week and once per month |
| Low | Between once per month and once every six months |
What it tells you: How comfortable your team and pipeline are with deploying. Low frequency usually indicates manual gates, fear of deployment, or large batch sizes.
How to measure: Count the number of successful deployments to production per unit of time. Automated deploys count. Hotfixes count. Rollbacks do not.
2. Lead Time for Changes
The time from a commit being pushed to trunk to that commit running in production.
| Performance Level | Lead Time |
|---|---|
| Elite | Less than one hour |
| High | Between one day and one week |
| Medium | Between one week and one month |
| Low | Between one month and six months |
What it tells you: How efficient your pipeline is. Long lead times indicate slow builds, manual approval steps, or infrequent deployment windows.
How to measure: Record the timestamp when a commit merges to trunk and the timestamp when that commit is running in production. The difference is lead time. Track the median, not the mean (outliers distort the mean).
3. Change Failure Rate
The percentage of deployments that cause a failure in production requiring remediation (rollback, hotfix, or patch).
| Performance Level | Change Failure Rate |
|---|---|
| Elite | 0-15% |
| High | 16-30% |
| Medium | 16-30% |
| Low | 46-60% |
What it tells you: How effective your testing and validation pipeline is. High failure rates indicate gaps in test coverage, insufficient pre-production validation, or overly large changes.
How to measure: Track deployments that result in a degraded service, require rollback, or need a hotfix. Divide by total deployments. A “failure” is defined by the team - typically any incident that requires immediate human intervention.
4. Mean Time to Restore (MTTR)
How long it takes to recover from a failure in production.
| Performance Level | Time to Restore |
|---|---|
| Elite | Less than one hour |
| High | Less than one day |
| Medium | Less than one day |
| Low | Between one week and one month |
What it tells you: How resilient your system and team are. Long recovery times indicate manual rollback processes, poor observability, or insufficient incident response practices.
How to measure: Record the timestamp when a production failure is detected and the timestamp when service is fully restored. Track the median.
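The four measurement rules above can be sketched as plain functions over timestamp pairs. This assumes you can export simple event records (merge time, deploy time, incident detected/restored) from your CI/CD and incident tooling; the function names are illustrative, not a standard API.

```python
from datetime import datetime, timedelta
from statistics import median

def deployment_frequency(deploys, days):
    """Successful production deploys per day (rollbacks excluded upstream)."""
    return len(deploys) / days

def median_lead_time(pairs):
    """Median trunk-merge-to-production time. Median, not mean, so
    outliers do not distort the number."""
    return median(deployed - merged for merged, deployed in pairs)

def change_failure_rate(total_deploys, failed_deploys):
    """Share of deploys that needed rollback, hotfix, or patch."""
    return failed_deploys / total_deploys

def median_time_to_restore(incidents):
    """Median detected-to-restored time for production failures."""
    return median(restored - detected for detected, restored in incidents)

# Example: a single commit merged at 09:00 and live at 13:00 has a
# four-hour lead time.
merged = datetime(2024, 5, 1, 9, 0)
lead = median_lead_time([(merged, merged + timedelta(hours=4))])
```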
CI Health Metrics
DORA metrics are outcome metrics - they tell you how delivery is performing overall. CI health metrics are leading indicators that give you earlier feedback on the health of your integration practices. Problems in these metrics show up days or weeks before they surface in DORA numbers.
Track these alongside DORA metrics to catch issues before they compound.
Commits Per Day Per Developer
| Aspect | Detail |
|---|---|
| What it measures | The average number of commits integrated to trunk per developer per day |
| How to measure | Count trunk commits (or merged pull requests) over a period and divide by the number of active developers and working days |
| Good target | 2 or more per developer per day |
| Why it matters | Low commit frequency indicates large batch sizes, long-lived branches, or developers waiting to integrate. All of these increase merge risk and slow feedback. |
If the number is low: Developers may be working on branches for too long, bundling unrelated changes into single commits, or facing barriers to integration (slow builds, complex merge processes). Investigate branch lifetimes and work decomposition.
If the number is unusually high: Verify that commits represent meaningful work rather than trivial fixes to pass a metric. Commit frequency is a means to smaller batches, not a goal in itself.
Build Success Rate
| Aspect | Detail |
|---|---|
| What it measures | The percentage of CI builds that pass on the first attempt |
| How to measure | Divide the number of green builds by total builds over a period |
| Good target | 90% or higher |
| Why it matters | A frequently broken build disrupts the entire team. Developers cannot integrate confidently when the build is unreliable, leading to longer feedback cycles and batching of changes. |
If the number is low: Common causes include flaky tests, insufficient local validation before committing, or environmental inconsistencies between developer machines and CI. Start by identifying and quarantining flaky tests, then ensure developers can run a representative build locally before pushing.
If the number is high but DORA metrics are still lagging: The build may pass but take too long, or the build may not cover enough to catch real problems. Check build duration and test coverage.
Time to Fix a Broken Build
| Aspect | Detail |
|---|---|
| What it measures | The elapsed time from a build breaking to the next green build on trunk |
| How to measure | Record the timestamp of the first red build and the timestamp of the next green build. Track the median. |
| Good target | Less than 10 minutes |
| Why it matters | A broken build blocks everyone. The longer it stays broken, the more developers stack changes on top of a broken baseline, compounding the problem. Fast fix times are a sign of strong CI discipline. |
If the number is high: The team may not be treating broken builds as a stop-the-line event. Establish a team agreement: when the build breaks, fixing it takes priority over all other work. If builds break frequently and take long to fix, reduce change size so failures are easier to diagnose.
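The three CI health measurements above reduce to the same kind of arithmetic. A minimal sketch, assuming the raw counts and durations are exported from your CI server (all names here are illustrative):

```python
from statistics import median

def commits_per_dev_per_day(trunk_commits, developers, working_days):
    """Target: 2 or more per developer per day."""
    return trunk_commits / (developers * working_days)

def build_success_rate(green_builds, total_builds):
    """First-attempt green builds over all builds. Target: 90% or higher."""
    return green_builds / total_builds

def median_time_to_fix(red_to_green_minutes):
    """Median minutes from first red build to the next green build
    on trunk. Target: under 10 minutes."""
    return median(red_to_green_minutes)
```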
The DORA Recommended Practices
Behind these four metrics are 24 practices that the DORA research has shown to drive performance. They organize into five categories. Use this as a diagnostic tool: when a metric is lagging, look at the related practices to identify what to improve.
Continuous Delivery Practices
These directly affect your pipeline and deployment practices:
- Version control for all production artifacts
- Automated deployment processes
- Continuous integration
- Trunk-based development
- Test automation
- Test data management
- Shift-left security
- Continuous delivery (the ability to deploy at any time)
Architecture Practices
These affect how easily your system can be changed and deployed:
- Loosely coupled architecture
- Empowered teams that can choose their own tools
- Teams that can test, deploy, and release independently
Product and Process Practices
These affect how work flows through the team:
- Customer feedback loops
- Value stream visibility
- Working in small batches
- Team experimentation
Lean Management Practices
These affect how the organization supports delivery:
- Lightweight change approval processes
- Monitoring and observability
- Proactive notification
- WIP limits
- Visual management of workflow
Cultural Practices
These affect the environment in which teams operate:
- Generative organizational culture (Westrum model)
- Encouraging and supporting learning
- Collaboration within and between teams
- Job satisfaction
- Transformational leadership
For a detailed breakdown, see the DORA Recommended Practices reference.
The Improvement Kata
The improvement kata is a four-step pattern from lean manufacturing adapted for software delivery. It provides the structure for turning DORA measurements into concrete improvements.
Step 1: Understand the Direction
Where does your CD migration need to go?
This is already defined by the phases of this migration guide. In Phase 3, your direction is: smaller batches, faster flow, and higher confidence in every deployment.
Step 2: Grasp the Current Condition
Measure your current DORA metrics. Be honest - the point is to understand reality, not to look good.
Practical approach:
- Collect two weeks of data for all four DORA metrics
- Plot the data - do not just calculate averages. Look at the distribution.
- Identify which metric is furthest from your target
- Investigate the related practices to understand why
Example current condition:
| Metric | Current | Target | Gap |
|---|---|---|---|
| Deployment frequency | Weekly | Daily | 5x improvement needed |
| Lead time | 3 days | < 1 day | Pipeline is slow or has manual gates |
| Change failure rate | 25% | < 15% | Test coverage or change size issue |
| MTTR | 4 hours | < 1 hour | Rollback is manual |
Step 3: Establish the Next Target Condition
Do not try to fix everything at once. Pick one metric and define a specific, measurable, time-bound target.
Good target: “Reduce lead time from 3 days to 1 day within the next 4 weeks.”
Bad target: “Improve our deployment pipeline.” (Too vague, no measure, no deadline.)
Step 4: Experiment Toward the Target
Design a small experiment that you believe will move the metric toward the target. Run it. Measure the result. Adjust.
The experiment format:
| Element | Description |
|---|---|
| Hypothesis | “If we [action], then [metric] will [improve/decrease] because [reason].” |
| Action | What specifically will you change? |
| Duration | How long will you run the experiment? (Typically 1-2 weeks) |
| Measure | How will you know if it worked? |
| Decision criteria | What result would cause you to keep, modify, or abandon the change? |
Example experiment:
Hypothesis: If we parallelize our integration test suite, lead time will drop from 3 days to under 2 days because 60% of lead time is spent waiting for tests to complete.
Action: Split the integration test suite into 4 parallel runners.
Duration: 2 weeks.
Measure: Median lead time for commits merged during the experiment period.
Decision criteria: Keep if lead time drops below 2 days. Modify if it drops but not enough. Abandon if it has no effect or introduces flakiness.
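The decision criteria in the example are mechanical enough to write down, which removes the temptation to re-litigate them after the experiment. A sketch using the example's 2-day target and 3-day baseline (both taken from the experiment above; substitute your own numbers):

```python
# Decision criteria from the example experiment above:
# keep if under target, modify if improved but short of target,
# abandon if there was no measurable effect.
def lead_time_decision(median_days, target_days=2.0, baseline_days=3.0):
    if median_days < target_days:
        return "keep"      # target met
    if median_days < baseline_days:
        return "modify"    # improved, but not enough
    return "abandon"       # no measurable effect
```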
The Cycle Repeats
After each experiment:
- Measure the result
- Update your understanding of the current condition
- If the target is met, pick the next metric to improve
- If the target is not met, design another experiment
This creates a continuous improvement loop. Each cycle takes 1-2 weeks. Over months, the cumulative effect is dramatic.
Connecting Metrics to Action
When a metric is lagging, use this guide to identify where to focus.
Low Deployment Frequency
| Possible Cause | Investigation | Action |
|---|---|---|
| Manual approval gates | Map the approval chain | Automate or eliminate non-value-adding approvals |
| Fear of deployment | Ask the team what they fear | Address the specific fear (usually testing gaps) |
| Large batch size | Measure changes per deploy | Implement small batches practices |
| Deploy process is manual | Time the deploy process | Automate the deployment pipeline |
Long Lead Time
| Possible Cause | Investigation | Action |
|---|---|---|
| Slow builds | Time each pipeline stage | Optimize the slowest stage (often tests) |
| Waiting for environments | Track environment wait time | Implement self-service environments |
| Waiting for approval | Track approval wait time | Reduce approval scope or automate |
| Large changes | Measure commit size | Reduce batch size |
High Change Failure Rate
| Possible Cause | Investigation | Action |
|---|---|---|
| Insufficient test coverage | Measure coverage by area | Add tests for the areas that fail most |
| Tests pass but production differs | Compare test and prod environments | Make environments more production-like |
| Large, risky changes | Measure change size | Reduce batch size, use feature flags |
| Configuration drift | Audit configuration differences | Externalize and version configuration |
Long MTTR
| Possible Cause | Investigation | Action |
|---|---|---|
| Rollback is manual | Time the rollback process | Automate rollback |
| Hard to identify root cause | Review recent incidents | Improve observability and alerting |
| Hard to deploy fixes quickly | Measure fix lead time | Ensure pipeline supports rapid hotfix deployment |
| Dependencies fail in cascade | Map failure domains | Improve architecture decoupling |
Pipeline Visibility
Metrics only drive improvement when people see them. Pipeline visibility means making the current state of your build and deployment pipeline impossible to ignore. When the build is red, everyone should know immediately - not when someone checks a dashboard twenty minutes later.
Making Build Status Visible
The most effective teams use ambient visibility - information that is passively available without anyone needing to seek it out.
Build radiators: A large monitor in the team area showing the current pipeline status. Green means the build is passing. Red means it is broken. The radiator should be visible from every desk in the team space. For remote teams, a persistent widget in the team chat channel serves the same purpose.
Browser extensions and desktop notifications: Tools like CCTray, BuildNotify, or CI server plugins can display build status in the system tray or browser toolbar. These provide individual-level ambient awareness without requiring a shared physical space.
Chat integrations: Post build results to the team channel automatically. Keep these concise - a green checkmark or red alert with a link to the build is enough. Verbose build logs in chat become noise.
Notification Best Practices
Notifications are powerful when used well and destructive when overused. The goal is to notify the right people at the right time with the right level of urgency.
When to notify:
- Build breaks on trunk - notify the whole team immediately
- Build is fixed - notify the whole team (this is a positive signal worth reinforcing)
- Deployment succeeds - notify the team channel (low urgency)
- Deployment fails - notify the on-call and the person who triggered it
When not to notify:
- Every commit or pull request update (too noisy)
- Successful builds on feature branches (nobody else needs to know)
- Metrics that have not changed (no signal in “things are the same”)
Avoiding notification fatigue: If your team ignores notifications, you have too many of them. Audit your notification channels quarterly. Remove any notification that the team consistently ignores. A notification that nobody reads is worse than no notification at all - it trains people to tune out the channel entirely.
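The notify/do-not-notify rules above are easy to encode as an explicit table, which also makes the quarterly audit concrete: any rule nobody acts on gets deleted. The event names below are illustrative, not tied to any particular CI server.

```python
# Sketch of the notification rules above. Events absent from the table
# (feature-branch builds, per-commit updates, unchanged metrics) are
# deliberately silent.
NOTIFY_RULES = {
    "trunk_build_broken": ("whole-team", "immediate"),
    "trunk_build_fixed":  ("whole-team", "immediate"),  # positive signal worth reinforcing
    "deploy_succeeded":   ("team-channel", "low"),
    "deploy_failed":      ("on-call+triggerer", "immediate"),
}

def notification_target(event):
    """Return (audience, urgency), or None when no notification is sent."""
    return NOTIFY_RULES.get(event)
```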
Building a Metrics Dashboard
Make your DORA metrics and CI health metrics visible to the team at all times. A dashboard on a wall monitor or a shared link is ideal.
Organize your dashboard around three categories:
Current status - what is happening right now:
- Pipeline status (green/red) for trunk and any active deployments
- Current values for all four DORA metrics
- Active experiment description and target condition
Trends - where are we heading:
- Trend lines showing direction over the past 4-8 weeks
- CI health metrics (build success rate, time to fix, commit frequency) plotted over time
- Whether the current improvement target is on track
Team health - how is the team doing:
- Current improvement target highlighted
- Days since last production incident
- Number of experiments completed this quarter
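Under those three categories, a minimal layout might look like the following. The indicator names are examples only; the point is that the whole layout stays small enough to parse at a glance.

```python
# Illustrative dashboard layout for the three categories above.
# Indicator names are examples, not a required schema.
DASHBOARD = {
    "current_status": ["pipeline_status", "dora_current", "active_experiment"],
    "trends":         ["dora_trend_8w", "ci_health_trend", "target_on_track"],
    "team_health":    ["days_since_incident", "experiments_this_quarter"],
}

def indicator_count(layout):
    """Total indicators on the dashboard; keep this in the 6-8 range."""
    return sum(len(indicators) for indicators in layout.values())
```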
Dashboard Anti-Patterns
The vanity dashboard: Displays only metrics that look good. If your dashboard never shows anything concerning, it is not useful. Include metrics that challenge the team, not just ones that reassure management.
The everything dashboard: Crams dozens of metrics, charts, and tables onto one screen. Nobody can parse it at a glance, so nobody looks at it. Limit your dashboard to 6-8 key indicators. If you need more detail, put it on a drill-down page.
The stale dashboard: Data is updated manually and falls behind. Automate data collection wherever possible. A dashboard showing last month’s numbers is worse than no dashboard - it creates false confidence.
The blame dashboard: Ties metrics to individual developers rather than teams. This creates fear and gaming rather than improvement. Always present metrics at the team level.
Keep it simple. A spreadsheet updated weekly is better than a sophisticated dashboard that nobody maintains. The goal is visibility, not tooling sophistication.
Key Pitfalls
1. “We measure but don’t act”
Measurement without action is waste. If you collect metrics but never run experiments, you are creating overhead with no benefit. Every measurement should lead to a hypothesis. Every hypothesis should lead to an experiment. See Hypothesis-Driven Development for the full lifecycle.
2. “We use metrics to compare teams”
DORA metrics are for teams to improve themselves, not for management to rank teams. Using metrics for comparison creates incentives to game the numbers. Each team should own its own metrics and its own improvement targets.
3. “We try to improve all four metrics at once”
Focus on one metric at a time. Improving deployment frequency and change failure rate simultaneously often requires conflicting actions. Pick the biggest bottleneck, address it, then move to the next.
4. “We abandon experiments too quickly”
Most experiments need at least two weeks to show results. One bad day is not a reason to abandon an experiment. Set the duration up front and commit to it.
Measuring Success
| Indicator | Target | Why It Matters |
|---|---|---|
| Experiments per month | 2-4 | Confirms the team is actively improving |
| Metrics trending in the right direction | Consistent improvement over 3+ months | Confirms experiments are having effect |
| Team can articulate current condition and target | Everyone on the team knows | Confirms improvement is a shared concern |
| Improvement items in backlog | Always present | Confirms improvement is treated as a deliverable |
Next Step
Metrics tell you what to improve. Retrospectives provide the team forum for deciding how to improve it.
Related Content
5 - Retrospectives
Continuously improve the delivery process through structured reflection.
Phase 3 - Optimize
A retrospective is the team’s primary mechanism for turning observations into improvements. Without effective retrospectives, WIP limits expose problems that nobody addresses, metrics trend in the wrong direction with no response, and the CD migration stalls.
Why Retrospectives Matter for CD Migration
Every practice in this guide - trunk-based development, small batches, WIP limits, metrics-driven improvement - generates signals about what is working and what is not. Retrospectives are where the team processes those signals and decides what to change.
Teams that skip retrospectives or treat them as a checkbox exercise consistently stall at whatever maturity level they first reach. Teams that run effective retrospectives continuously improve, week after week, month after month.
The Five-Part Structure
An effective retrospective follows a structured format that prevents it from devolving into a venting session or a status meeting. This five-part structure ensures the team moves from observation to action.
Part 1: Review the Mission (5 minutes)
Start by reminding the team of the larger goal. In the context of a CD migration, this might be:
- “Our mission this quarter is to deploy to production at least once per day.”
- “We are working toward eliminating manual gates in our pipeline.”
- “Our goal is to reduce lead time from 3 days to under 1 day.”
This grounding prevents the retrospective from focusing on minor irritations and keeps the conversation aligned with what matters.
Part 2: Review the KPIs (10 minutes)
Present the team’s current metrics. For a CD migration, these are typically the DORA metrics plus any team-specific measures from Metrics-Driven Improvement.
| Metric | Last Period | This Period | Trend |
|---|---|---|---|
| Deployment frequency | 3/week | 4/week | Improving |
| Lead time (median) | 2.5 days | 2.1 days | Improving |
| Change failure rate | 22% | 18% | Improving |
| MTTR | 3 hours | 3.5 hours | Worsening |
| WIP (average) | 8 items | 6 items | Improving |
Do not skip this step. Without data, the retrospective becomes a subjective debate where the loudest voice wins. With data, the conversation focuses on what the numbers show and what to do about them.
Part 3: Review Experiments (10 minutes)
Review the outcomes of any experiments the team ran since the last retrospective.
For each experiment:
- What was the hypothesis? Remind the team what you were testing.
- What happened? Present the data.
- What did you learn? Even failed experiments teach you something.
- What is the decision? Keep, modify, or abandon.
Example:
Experiment: Parallelize the integration test suite to reduce lead time.
Hypothesis: Lead time would drop from 2.5 days to under 2 days.
Result: Lead time dropped to 2.1 days. The parallelization worked, but environment setup time is now the bottleneck.
Decision: Keep the parallelization. New experiment: investigate self-service test environments.
Part 4: Check Goals (10 minutes)
Review any improvement goals or action items from the previous retrospective.
- Completed: Acknowledge and celebrate. This is important - it reinforces that improvement work matters.
- In progress: Check for blockers. Does the team need to adjust the approach?
- Not started: Why not? Was it deprioritized, blocked, or forgotten? If improvement work is consistently not started, the team is not treating improvement as a deliverable (see below).
Part 5: Open Conversation (25 minutes)
This is the core of the retrospective. The team discusses:
- What is working well that we should keep doing?
- What is not working that we should change?
- What new problems or opportunities have we noticed?
Facilitation techniques for this section:
| Technique | How It Works | Best For |
|---|---|---|
| Start/Stop/Continue | Each person writes items in three categories | Quick, structured, works with any team |
| 4Ls (Liked, Learned, Lacked, Longed For) | Broader categories that capture emotional responses | Teams that need to process frustration or celebrate wins |
| Timeline | Plot events on a timeline and discuss turning points | After a particularly eventful sprint or incident |
| Dot voting | Everyone gets 3 votes to prioritize discussion topics | When there are many items and limited time |
From Conversation to Commitment
The open conversation must produce concrete action items. Vague commitments like “we should communicate better” are worthless. Good action items are:
- Specific: “Add a Slack notification when the build breaks” (not “improve communication”)
- Owned: “Alex will set this up by Wednesday” (not “someone should do this”)
- Measurable: “We will know this worked if build break response time drops below 10 minutes”
- Time-bound: “We will review the result at the next retrospective”
Limit action items to 1-3 per retrospective. More than three means nothing gets done. One well-executed improvement is worth more than five abandoned ones.
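One way to make the four criteria stick is to encode them as required fields, so a vague action item simply cannot be recorded. A sketch, with illustrative field names and the example values from above:

```python
from dataclasses import dataclass
from datetime import date

# Each field corresponds to one of the four criteria above; a dataclass
# makes all of them mandatory at creation time.
@dataclass
class ActionItem:
    action: str           # specific: what exactly will change
    owner: str            # owned: one named person
    success_measure: str  # measurable: how we know it worked
    review_date: date     # time-bound: reviewed at the next retrospective

item = ActionItem(
    action="Add a Slack notification when the build breaks",
    owner="Alex",
    success_measure="build break response time drops below 10 minutes",
    review_date=date(2024, 6, 12),  # example date for the next retro
)
```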
Psychological Safety Is a Prerequisite
A retrospective only works if team members feel safe to speak honestly about what is not working. Without psychological safety, retrospectives produce sanitized, non-actionable discussion.
Signs of Low Psychological Safety
- Only senior team members speak
- Nobody mentions problems - everything is “fine”
- Issues that everyone knows about are never raised
- Team members vent privately after the retrospective instead of during it
- Action items are always about tools or processes, never about behaviors
Building Psychological Safety
| Practice | Why It Helps |
|---|---|
| Leader speaks last | Prevents the leader’s opinion from anchoring the discussion |
| Anonymous input | Use sticky notes or digital tools where input is anonymous initially |
| Blame-free language | “The deploy failed” not “You broke the deploy” |
| Follow through on raised issues | Nothing destroys safety faster than raising a concern and having it ignored |
| Acknowledge mistakes openly | Leaders who admit their own mistakes make it safe for others to do the same |
| Separate retrospective from performance review | If retro content affects reviews, people will not be honest |
Treat Improvement as a Deliverable
The most common failure mode for retrospectives is producing action items that never get done. This happens when improvement work is treated as something to do “when we have time” - which means never.
Make Improvement Visible
- Add improvement items to the same board as feature work
- Include improvement items in WIP limits
- Track improvement items through the same workflow as any other deliverable
Allocate Capacity
Reserve a percentage of team capacity for improvement work. Common allocations:
| Allocation | Approach |
|---|---|
| 20% continuous | One day per week (or equivalent) dedicated to improvement, tooling, and tech debt |
| Dedicated improvement sprint | Every 4th sprint is entirely improvement-focused |
| Improvement as first pull | When someone finishes work and the WIP limit allows, the first option is an improvement item |
The specific allocation matters less than having one. A team that explicitly budgets 10% for improvement will improve more than a team that aspires to 20% but never protects the time.
Retrospective Cadence
| Cadence | Best For | Caution |
|---|---|---|
| Weekly | Teams in active CD migration, teams working through major changes | Can feel like too many meetings if not well-facilitated |
| Bi-weekly | Teams in steady state with ongoing improvement | Most common cadence |
| After incidents | Any team | Incident retrospectives (postmortems) are separate from regular retrospectives |
| Monthly | Mature teams with well-established improvement habits | Too infrequent for teams early in their migration |
During active phases of a CD migration (Phases 1-3), weekly retrospectives are recommended. Once the team reaches Phase 4, bi-weekly is usually sufficient.
Running Your First CD Migration Retrospective
If your team has not been running effective retrospectives, start here:
Before the Retrospective
- Collect your DORA metrics for the past two weeks
- Review any action items from the previous retrospective (if applicable)
- Prepare a shared document or board with the five-part structure
During the Retrospective (60 minutes)
- Review mission (5 min): State your CD migration goal for this phase
- Review KPIs (10 min): Present the DORA metrics. Ask: “What do you notice?”
- Review experiments (10 min): Discuss any experiments that were run
- Check goals (10 min): Review action items from last time
- Open conversation (25 min): Use Start/Stop/Continue for the first time - it is the simplest format
After the Retrospective
- Publish the action items where the team will see them daily
- Assign owners and due dates
- Add improvement items to the team board
- Schedule the next retrospective
Key Pitfalls
1. “Our retrospectives always produce the same complaints”
If the same issues surface repeatedly, the team is not executing on its action items. Check whether improvement work is being prioritized alongside feature work. If it is not, no amount of retrospective technique will help.
2. “People don’t want to attend because nothing changes”
This is a symptom of the same problem - action items are not executed. The fix is to start small: commit to one action item per retrospective, execute it completely, and demonstrate the result at the next retrospective. Success builds momentum.
3. “The retrospective turns into a blame session”
The facilitator must enforce blame-free language. Redirect “You did X wrong” to “When X happened, the impact was Y. How can we prevent Y?” If blame is persistent, the team has a psychological safety problem that needs to be addressed separately.
4. “We don’t have time for retrospectives”
A team that does not have time to improve will never improve. A 60-minute retrospective that produces one executed improvement is the highest-leverage hour of the entire sprint.
Measuring Success
| Indicator | Target | Why It Matters |
|---|---|---|
| Retrospective attendance | 100% of team | Confirms the team values the practice |
| Action items completed | > 80% completion rate | Confirms improvement is treated as a deliverable |
| DORA metrics trend | Improving quarter over quarter | Confirms retrospectives lead to real improvement |
| Team engagement | Voluntary contributions increasing | Confirms psychological safety is present |
Next Step
With metrics-driven improvement and effective retrospectives, you have the engine for continuous improvement. The final optimization step is Architecture Decoupling - ensuring your system’s architecture does not prevent you from deploying independently.
Content contributed by Dojo Consortium, licensed under CC BY 4.0.
Related Content
6 - Architecture Decoupling
Enable independent deployment of components by decoupling architecture boundaries.
Phase 3 - Optimize | Original content based on Dojo Consortium delivery journey patterns
You cannot deploy independently if your architecture requires coordinated releases. This page describes the three architecture states teams encounter on the journey to continuous deployment and provides practical strategies for moving from entangled to loosely coupled.
Why Architecture Matters for CD
Every practice in this guide - small batches, feature flags, WIP limits - assumes that your team can deploy its changes independently. But if your application is a monolith where changing one module requires retesting everything, or a set of microservices with tightly coupled APIs, independent deployment is impossible regardless of how good your practices are.
Architecture is either an enabler or a blocker for continuous deployment. There is no neutral.
Three Architecture States
The Delivery System Improvement Journey describes three states that teams move through. Most teams start entangled. The goal is to reach loosely coupled.
State 1: Entangled
In an entangled architecture, everything is connected to everything. Changes in one area routinely break other areas. Teams cannot deploy independently.
Characteristics:
- Shared database schemas with no ownership boundaries
- Circular dependencies between modules or services
- Deploying one service requires deploying three others at the same time
- Integration testing requires the entire system to be running
- A single team’s change can block every other team’s release
- “Big bang” releases on a fixed schedule
Impact on delivery:
| Metric | Typical State |
|---|---|
| Deployment frequency | Monthly or quarterly (because coordinating releases is hard) |
| Lead time | Weeks to months (because changes wait for the next release train) |
| Change failure rate | High (because big releases mean big risk) |
| MTTR | Long (because failures cascade across boundaries) |
How you got here: Entanglement is the natural result of building quickly without deliberate architectural boundaries. It is not a failure - it is a stage that almost every system passes through.
State 2: Tightly Coupled
In a tightly coupled architecture, there are identifiable boundaries between components, but those boundaries are leaky. Teams have some independence, but coordination is still required for many changes.
Characteristics:
- Services exist but share a database or use synchronous point-to-point calls
- API contracts exist but are not versioned - breaking changes require simultaneous updates
- Teams can deploy some changes independently, but cross-cutting changes require coordination
- Integration testing requires multiple services but not the entire system
- Release trains still exist but are smaller and more frequent
Impact on delivery:
| Metric | Typical State |
|---|---|
| Deployment frequency | Weekly to bi-weekly |
| Lead time | Days to a week |
| Change failure rate | Moderate (improving but still affected by coupling) |
| MTTR | Hours (failures are more isolated but still cascade sometimes) |
State 3: Loosely Coupled
In a loosely coupled architecture, components communicate through well-defined interfaces, own their own data, and can be deployed independently without coordinating with other teams.
Characteristics:
- Each service owns its own data store - no shared databases
- APIs are versioned; consumers and producers can be updated independently
- Asynchronous communication (events, queues) is used where possible
- Each team can deploy without coordinating with any other team
- Services are designed to degrade gracefully if a dependency is unavailable
- No release trains - each team deploys when ready
Impact on delivery:
| Metric | Typical State |
|---|---|
| Deployment frequency | On-demand (multiple times per day) |
| Lead time | Hours |
| Change failure rate | Low (small, isolated changes) |
| MTTR | Minutes (failures are contained within service boundaries) |
Moving from Entangled to Tightly Coupled
This is the first and most difficult transition. It requires establishing boundaries where none existed before.
Strategy 1: Identify Natural Seams
Look for places where the system already has natural boundaries, even if they are not enforced:
- Different business domains: Orders, payments, inventory, and user accounts are different domains even if they live in the same codebase.
- Different rates of change: Code that changes weekly and code that changes yearly should not be in the same deployment unit.
- Different scaling needs: Components with different load profiles benefit from separate deployment.
- Different team ownership: If different teams work on different parts of the codebase, those parts are candidates for separation.
Strategy 2: Strangler Fig Pattern
Instead of rewriting the system, incrementally extract components from the monolith.
Step 1: Route all traffic through a facade/proxy
Step 2: Build the new component alongside the old
Step 3: Route a small percentage of traffic to the new component
Step 4: Validate correctness and performance
Step 5: Route all traffic to the new component
Step 6: Remove the old code
Key rule: The strangler fig pattern must be done incrementally. If you try to extract everything at once, you are doing a rewrite, not a strangler fig.
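Steps 1 through 3 hinge on a routing facade that sends a controlled share of traffic to the new component. A minimal sketch, assuming a hash-based percentage rollout; the class and handler names are illustrative, not a specific proxy product:

```python
import hashlib

class StranglerRouter:
    """Facade that routes a configurable percentage of traffic to the
    new component, leaving the rest on the legacy code path."""

    def __init__(self, legacy_handler, new_handler, rollout_percent=0):
        self.legacy_handler = legacy_handler
        self.new_handler = new_handler
        self.rollout_percent = rollout_percent  # 0..100

    def route(self, request_id, payload):
        # Hash the caller's id so the same caller is routed consistently
        # while the rollout percentage stays fixed.
        bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
        if bucket < self.rollout_percent:
            return self.new_handler(payload)
        return self.legacy_handler(payload)

# Step 3: start small (e.g. 10%), then increase as validation passes.
router = StranglerRouter(lambda p: ("legacy", p),
                         lambda p: ("new", p),
                         rollout_percent=10)
```

Raising `rollout_percent` to 100 completes Step 5; Step 6 is deleting the legacy handler and the router itself.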
Strategy 3: Define Ownership Boundaries
Assign clear ownership of each module or component to a single team. Ownership means:
- The owning team decides the API contract
- The owning team deploys the component
- Other teams consume the API, not the internal implementation
- Changes to the API contract require agreement from consumers (but not simultaneous deployment)
What to Avoid
- The “big rewrite”: Rewriting a monolith from scratch almost always fails. Use the strangler fig pattern instead.
- Premature microservices: Do not split into microservices until you have clear domain boundaries and team ownership. Microservices with unclear boundaries are a distributed monolith - the worst of both worlds.
- Shared databases across services: This is the most common coupling mechanism. If two services share a database, they cannot be deployed independently because a schema change in one service can break the other.
Moving from Tightly Coupled to Loosely Coupled
This transition is about hardening the boundaries that were established in the previous step.
Strategy 1: Eliminate Shared Data Stores
If two services share a database, one of three things needs to happen:
- One service owns the data, the other calls its API. The dependent service no longer accesses the database directly.
- The data is duplicated. Each service maintains its own copy, synchronized via events.
- The shared data becomes a dedicated data service. Both services consume from a service that owns the data.
Strategy 2: Version Your APIs
API versioning allows consumers and producers to evolve independently.
Rules for API versioning:
- Never make a breaking change without a new version. Adding fields is non-breaking. Removing fields is breaking. Changing field types is breaking.
- Support at least two versions simultaneously. This gives consumers time to migrate.
- Deprecate old versions with a timeline. “Version 1 will be removed on date X.”
- Use consumer-driven contract tests to verify compatibility. See Contract Testing.
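These rules can be sketched as two versions of the same resource served side by side; the field names and route table here are hypothetical stand-ins for a real HTTP framework:

```python
# One resource, two supported versions, so consumers migrate on their
# own schedule. Field names are illustrative.
ORDER = {"id": 42, "total_cents": 1999, "currency": "USD"}

def get_order_v1(order):
    # v1 exposed a floating-point total; kept alive until the announced
    # deprecation date, then removed.
    return {"id": order["id"], "total": order["total_cents"] / 100}

def get_order_v2(order):
    # Within v2, changes stay additive: new fields may appear, but
    # existing fields are never removed or retyped.
    return {"id": order["id"],
            "total_cents": order["total_cents"],
            "currency": order["currency"]}

ROUTES = {
    "/v1/orders": get_order_v1,  # deprecated, removal date published
    "/v2/orders": get_order_v2,
}
```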
Strategy 3: Prefer Asynchronous Communication
Synchronous calls (HTTP, gRPC) create temporal coupling: if the downstream service is slow or unavailable, the upstream service is also affected.
| Communication Style | Coupling | When to Use |
|---|---|---|
| Synchronous (HTTP/gRPC) | Temporal + behavioral | When the caller needs an immediate response |
| Asynchronous (events/queues) | Behavioral only | When the caller does not need an immediate response |
| Event-driven (publish/subscribe) | Minimal | When the producer does not need to know about consumers |
Prefer asynchronous communication wherever the business requirements allow it. Not every interaction needs to be synchronous.
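The publish/subscribe row can be illustrated with a minimal in-memory event bus; in production this would be a broker such as a message queue, and the topic and event names here are hypothetical:

```python
from collections import defaultdict

class EventBus:
    """In-memory stand-in for a broker. The producer publishes to a
    topic and never references its consumers, so coupling is minimal."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
received = []
bus.subscribe("order.created", received.append)   # e.g. inventory service
bus.subscribe("order.created", lambda e: None)    # e.g. analytics service
bus.publish("order.created", {"order_id": 42})    # producer knows only the topic
```

Note that adding a third consumer requires no change to the producer, which is exactly the property the table describes.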
Strategy 4: Design for Failure
In a loosely coupled system, dependencies will be unavailable sometimes. Design for this:
- Circuit breakers: Stop calling a failing dependency after N failures. Return a degraded response instead.
- Timeouts: Set aggressive timeouts on all external calls. A 30-second timeout on a service that should respond in 100ms is not a timeout - it is a hang.
- Bulkheads: Isolate failures so that one failing dependency does not consume all resources.
- Graceful degradation: Define what the user experience should be when a dependency is down. “Recommendations unavailable” is better than a 500 error.
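A circuit breaker combining the first and last items above can be sketched as follows; this is a simplified illustration of the pattern, not a substitute for a hardened resilience library:

```python
import time

class CircuitBreaker:
    """Stops calling a failing dependency after `max_failures` consecutive
    failures, returning a degraded fallback until `reset_after` seconds
    have passed, then tries the dependency again (half-open)."""

    def __init__(self, call, fallback, max_failures=3, reset_after=30.0):
        self.call = call
        self.fallback = fallback
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def __call__(self, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return self.fallback(*args)   # open: degrade, don't call
            self.opened_at = None             # half-open: try again
            self.failures = 0
        try:
            result = self.call(*args)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return self.fallback(*args)
```

A fallback like `lambda _: "Recommendations unavailable"` gives exactly the graceful degradation described above instead of a 500 error.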
Practical Steps for Architecture Decoupling
Step 1: Map Dependencies
Before changing anything, understand what you have:
- Draw a dependency graph. Which components depend on which? Where are the shared databases?
- Identify deployment coupling. Which components must be deployed together? Why?
- Identify the highest-impact coupling. Which coupling most frequently blocks independent deployment?
Step 2: Establish the First Boundary
Pick one component to decouple. Choose the one with the highest impact and lowest risk:
- Apply the strangler fig pattern to extract it
- Define a clear API contract
- Move its data to its own data store
- Deploy it independently
Step 3: Repeat
Take the next highest-impact coupling and address it. Each decoupling makes the next one easier because the team learns the patterns and the remaining system is simpler.
Key Pitfalls
1. “We need to rewrite everything before we can deploy independently”
No. Decoupling is incremental. Extract one component, deploy it independently, prove the pattern works, then continue. A partial decoupling that enables one team to deploy independently is infinitely more valuable than a planned rewrite that never finishes.
2. “We split into microservices but our lead time got worse”
Microservices add operational complexity (more services to deploy, monitor, and debug). If you split without investing in deployment automation, observability, and team autonomy, you will get worse, not better. Microservices are a tool for organizational scaling, not a silver bullet for delivery speed.
3. “Teams keep adding new dependencies that recouple the system”
Architecture decoupling requires governance. Establish architectural principles (e.g., “no shared databases”) and enforce them through automated checks (e.g., dependency analysis in CI) and architecture reviews for cross-boundary changes.
4. “We can’t afford the time to decouple”
You cannot afford not to. Every week spent doing coordinated releases is a week of delivery capacity lost to coordination overhead. The investment in decoupling pays for itself quickly through increased deployment frequency and reduced coordination cost.
Measuring Success
| Metric | Target | Why It Matters |
|---|---|---|
| Teams that can deploy independently | Increasing | The primary measure of decoupling |
| Coordinated releases per quarter | Decreasing toward zero | Confirms coupling is being eliminated |
| Deployment frequency per team | Increasing independently | Confirms teams are not blocked by each other |
| Cross-team dependencies per feature | Decreasing | Confirms architecture supports independent work |
Next Step
With optimized flow, small batches, metrics-driven improvement, and a decoupled architecture, your team is ready for the final phase. Continue to Phase 4: Deliver on Demand.
Related Content
7 - Team Alignment to Code
Match team ownership boundaries to code boundaries so each team can build, test, and deploy its domain independently.
Phase 3 - Optimize | Teams that own a domain end-to-end can deploy independently. Teams organized around technical layers cannot.
How Team Structure Shapes Code
The way an organization communicates produces the architecture it builds. When communication flows
between layers - frontend team talks to backend team, backend team talks to database team - the
software reflects those communication lines. Requests for the UI layer go to one team. Requests for
the API layer go to another. The result is software that is horizontally layered in the same pattern
as the organization.
Layer teams produce layered architectures. The layers are coupled not because the engineers chose
to couple them but because every feature requires coordination across team boundaries. The coupling
is structural, not accidental.
Domain teams produce domain boundaries. When one team owns everything inside a business domain -
the user interface, the business logic, the data store, and the deployment pipeline - they can
make changes within that domain without coordinating with other teams. The interfaces between
domains are explicit and stable because that is how the teams communicate.
This is not a coincidence. Architecture reflects the ownership structure of the people who built
it.
What Aligned Ownership Looks Like
A team with aligned ownership can answer yes to all of the following:
- Can this team deploy a change to production without waiting for another team?
- Does this team own everything inside its domain boundary - all layers, all data, and all consumer interfaces?
- Does this team define and version the contracts its domain exposes to other domains?
- Is this team responsible for production incidents in its domain?
Two team patterns achieve aligned ownership in practice.
A full-stack product team owns the complete user-facing surface for a feature area - from
the UI components a user interacts with down through the business logic and the database. The team
has no hard dependency on a separate frontend or backend team. One team ships the entire vertical
slice.
A subdomain product team owns a service or set of services representing a bounded business
capability. Some subdomain teams own a user-facing surface alongside their backend logic. Others -
a tax calculation service, a shipping rates engine, an identity provider - have no UI at all.
Their consumer interface is entirely an API, consumed by other teams rather than by end users
directly. Both are fully aligned: the team owns everything within the boundary, and the boundary
is what its consumers depend on - whether that is a UI, an API, or both. A slice is done when the
consumer interface satisfies the agreed behavior for its callers.
Both patterns share the same structure: one team, one deployable, full ownership. The team
owns all layers within its boundary, the authority to deploy that boundary independently, and
accountability for its operational behavior.
What Misalignment Looks Like
Three patterns consistently produce deployment coupling.
Component or layer teams. A frontend team, a backend team, and a database team all work on the
same product. Every feature requires coordination across all three. No team can deploy
independently because no team owns a full vertical slice.
Feature teams without domain ownership. Teams are organized around feature areas, but each
feature area spans multiple services owned by other teams. The feature team coordinates with
service owners for every change. The service owners become a shared resource that feature teams
queue against.
The pillar model. A platform team owns all infrastructure. A shared services team owns
cross-cutting concerns. Product teams own the business logic but depend on the other two for
deployment. A change that touches infrastructure or shared services requires the product team to
file a ticket and wait.
The telltale sign in all three cases: a team cannot estimate its own delivery date because it depends on other teams' schedules.
The Relationship Between Team Alignment and Architecture
Team alignment and architecture reinforce each other. A decoupled architecture makes it possible
to draw clean team boundaries. Clean team boundaries prevent the architecture from recoupling.
When team boundaries and code boundaries match:
- Each team modifies code that only they own. Merge conflicts between teams disappear.
- Each team’s pipeline validates only their domain. Shared pipeline queues disappear.
- Each team deploys on their own schedule. Release trains disappear.
When they do not match, architecture and ownership drift together. A team that technically “owns”
a service but in practice coordinates with three other teams for every change is not an independent
deployment unit regardless of what the org chart says.
See Architecture Decoupling for the technical strategies to establish
independent service boundaries. See Tightly Coupled Monolith
for the architecture anti-pattern that misaligned ownership produces over time.
```mermaid
graph TD
    classDef aligned fill:#0d7a32,stroke:#0a6128,color:#fff
    classDef misaligned fill:#a63123,stroke:#8a2518,color:#fff
    classDef boundary fill:#224968,stroke:#1a3a54,color:#fff
    subgraph good ["Aligned: Domain Teams"]
        G1["Payments Team\nUI + Logic + DB + Pipeline"]:::aligned
        G2["Inventory Team\nUI + Logic + DB + Pipeline"]:::aligned
        G3["Accounts Team\nUI + Logic + DB + Pipeline"]:::aligned
        G4["Stable API Contracts"]:::boundary
        G1 --> G4
        G2 --> G4
        G3 --> G4
    end
    subgraph bad ["Misaligned: Layer Teams"]
        L1["Frontend Team\nAll UI across all domains"]:::misaligned
        L2["Backend Team\nAll logic across all domains"]:::misaligned
        L3["Database Team\nAll data across all domains"]:::misaligned
        L4["Coordinated Release Required"]:::boundary
        L1 --> L4
        L2 --> L4
        L3 --> L4
    end
```
How to Align Teams to Code
Step 1: Map who modifies what
Before changing anything, understand the actual ownership pattern. Use commit history to identify
which teams (or individuals acting as de facto teams) modify which files and services.
- Pull commit history for the last three months: `git log --format="%ae %f" | sort | uniq -c`
- Map authors to their team. Identify the files each team touches most.
- Highlight files that multiple teams touch frequently. These are the coupling points.
- Identify services or modules where changes from one team consistently require changes from another.
The result is a map of actual ownership versus nominal ownership. In most organizations these
diverge significantly.
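The aggregation step can be sketched as a small script over the `git log` output; the author-to-team mapping below is hypothetical and would normally be generated from your org data:

```python
import subprocess
from collections import Counter

# Hypothetical author-to-team mapping; generate this from your org chart.
TEAM_OF = {"alice@example.com": "payments", "bob@example.com": "inventory"}

def count_by_team(author_emails, team_of):
    """Aggregate one-email-per-commit output into per-team commit counts.
    Unmapped authors are flagged so the mapping can be completed."""
    return Counter(team_of.get(email, "unmapped") for email in author_emails)

def authors_since(repo=".", since="3 months ago"):
    """One author email per commit in the window, newest first."""
    out = subprocess.run(
        ["git", "-C", repo, "log", f"--since={since}", "--format=%ae"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()
```

Running `count_by_team(authors_since("."), TEAM_OF)` over each service directory (via `git log -- <path>`) surfaces the files and services that multiple teams touch.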
Step 2: Identify natural domain boundaries
Natural domain boundaries exist in most codebases - they are just not enforced by team structure.
Look for:
- Business capabilities. What does this system do? Separate business functions - billing,
shipping, authentication, reporting - that could be operated independently are candidate domains.
- Data ownership. Which tables or data stores does each part of the system read and write?
Data that is exclusively owned by one functional area belongs in that domain.
- Rate of change. Code that changes weekly for business reasons and code that changes monthly
for infrastructure reasons should be in different domains with different teams.
- Existing team knowledge. Where do engineers already have strong concentrated expertise?
Domain boundaries often match knowledge boundaries.
Draw a candidate domain map. Each domain should be a bounded business capability that one
team can own end-to-end. Do not force domains to map to the current team structure - let the
business capabilities define the boundaries first.
Step 3: Assign end-to-end ownership
For each candidate domain identified in Step 2, assign a single team. The rules:
- One team per domain. Shared ownership means no ownership. If a domain has two owners,
pick one.
- Full stack. The owning team is responsible for all layers within the domain - UI, logic, data.
If the current team lacks skills at some layer, plan for cross-training or re-staffing, but do
not address the skill gap by keeping a separate layer team.
- Deployment authority. The owning team merges to trunk and controls the deployment pipeline for
their domain. No other team can block their deployment.
- Operational accountability. The owning team is paged for production issues in their domain.
On-call for the domain is owned by the same people who build it.
Document the domain boundaries explicitly: what services, data stores, and interfaces belong to
each team.
Step 4: Define contracts at boundaries
Once teams own their domains, the interfaces between domains must be made explicit. Implicit
interfaces - shared databases, undocumented internal calls, assumed response shapes - break
independent deployment.
For each boundary between domains:
- API contracts. Define the request and response shapes the consuming team depends on.
Use OpenAPI or an equivalent schema. Commit it to the producer’s repository.
- Event contracts. For asynchronous communication, define the event schema and the guarantees
the producer makes (ordering, at-least-once vs. exactly-once, schema evolution rules).
- Versioning. Establish a versioning policy. Additive changes are non-breaking. Removing or
changing field semantics requires a new version. Both old and new versions are supported for a
defined deprecation period.
- Contract tests. Write tests that verify the producer honors the contract. Write tests that
verify the consumer handles the contract correctly. See Contract Testing
for implementation guidance.
Teams should not proceed to separate deployment pipelines until contracts are explicit and tested.
An implicit contract that breaks silently is worse than a coordinated deployment.
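A provider-side contract check can be sketched as below, assuming the contract is committed as a simple field-to-type mapping (a stand-in for a full OpenAPI schema); the contract and response names are illustrative:

```python
# The consumer's expectations, committed to the producer's repository.
ORDER_CONTRACT_V1 = {"id": int, "status": str, "total_cents": int}

def satisfies_contract(response, contract):
    """True if every field the consumer depends on is present with the
    expected type. Extra fields are allowed: additive changes are
    non-breaking, per the versioning policy above."""
    return all(
        field in response and isinstance(response[field], expected_type)
        for field, expected_type in contract.items()
    )

# Producer-side test: the real handler's response must honor the contract.
response = {"id": 7, "status": "shipped", "total_cents": 1250, "carrier": "x"}
assert satisfies_contract(response, ORDER_CONTRACT_V1)
```

The mirror-image consumer test feeds a canned contract-shaped response into the consumer's client code and asserts it is handled correctly.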
Step 5: Separate deployment pipelines
With explicit contracts in place, each team can operate an independent pipeline for their domain.
- Each team’s pipeline validates only their domain’s tests and contracts.
- Pipeline triggers are scoped to the files the team owns - changes to another domain’s files do
not trigger this team’s pipeline.
- Each team deploys from their pipeline on their own schedule, without waiting for other teams.
For teams that share a repository but own distinct domains, use path-filtered triggers and separate
pipeline configurations. See Multiple Teams, Single Deployable
for a worked example of this pattern when teams share a modular monolith.
| Objection | Response |
|---|---|
| “We don’t have enough senior engineers to staff every domain team fully.” | Domain teams do not need to be large. A team of two to three engineers with full ownership of a well-scoped domain delivers faster than six engineers on a layer team waiting for each other. Start with the highest-priority domains and staff others incrementally. |
| “Our engineers are specialists. The frontend people can’t own database code.” | Ownership does not require equal expertise at every layer - it requires the team to be responsible and to develop capability over time. Pair frontend specialists with backend engineers on the same team. The skill gap closes faster inside a team than across team boundaries. |
| “We tried domain teams before and they reinvented everything separately.” | Reinvention happens when platform capabilities are not shared effectively, not because of domain ownership. Separate domain ownership (what business capabilities each team is responsible for) from platform ownership (shared infrastructure, frameworks, and observability tooling). |
| “Business stakeholders are used to requesting work from the layer teams.” | Stakeholders adapt quickly when domain teams ship faster and with less coordination. Reframe the conversation: stakeholders talk to the team that owns the outcome, not the team that owns the layer. |
| “Our architecture doesn’t have clean domain boundaries yet.” | Start with the organizational change anyway. Teams aligned to emerging domain boundaries will drive the architectural cleanup faster than a centralized architecture effort without aligned ownership. The two reinforce each other. |
Measuring Success
| Metric | Target | Why It Matters |
|---|---|---|
| Deployment frequency per team | Increasing per team | Confirms teams can deploy without waiting for others |
| Cross-team dependencies per feature | Decreasing toward zero | Confirms domain boundaries are holding |
| Development cycle time | Decreasing | Teams that own their domain wait on fewer external dependencies |
| Production incidents attributed to another team’s change | Decreasing | Confirms ownership boundaries match deployment boundaries |
| Teams blocked on a release window they did not control | Decreasing toward zero | The primary organizational symptom of misalignment |
Related Content
8 - Hypothesis-Driven Development
Treat every change as an experiment with a predicted outcome, measure the result, and adjust future work based on evidence.
Phase 3 - Optimize
Hypothesis-driven development treats every change as an experiment. Instead of building features because someone asked for them and hoping they help, teams state a predicted outcome before writing code, measure the result after deployment, and use the evidence to decide what to do next. Combined with feature flags, small batches, and metrics-driven improvement, this practice closes the loop between shipping and learning.
Why Hypothesis-Driven Development
Most teams ship features without stating what outcome they expect. A product manager requests a feature, developers build it, and everyone moves on to the next item. Weeks later, nobody checks whether the feature actually helped.
This is waste. Teams accumulate features without knowing their impact, backlogs grow based on opinion rather than evidence, and the product drifts in whatever direction the loudest voice demands.
Hypothesis-driven development fixes this by making every change answer a question. If the answer is “yes, it helped,” the team invests further. If the answer is “no,” the team reverts or pivots before sinking more effort into the wrong direction. Over time, this produces a product shaped by evidence rather than assumptions.
The Lifecycle
The hypothesis-driven development lifecycle has five stages. Each stage has a specific purpose and a clear output that feeds the next stage.
1. Form the Hypothesis
A hypothesis is a falsifiable prediction about what a change will accomplish. It follows a specific format:
“We believe [change] will produce [outcome] because [reason].”
The “because” clause is critical. Without it, you have a wish, not a hypothesis. The reason forces the team to articulate the causal model behind the change, which makes it possible to learn even when the experiment fails.
**Good:** "We believe adding a progress indicator to the checkout flow will reduce cart abandonment by 10% because users currently leave when they cannot tell how many steps remain."
- Specific change (progress indicator in checkout)
- Measurable outcome (10% reduction in cart abandonment)
- Stated reason (users leave due to uncertainty about remaining steps)
---
**Bad:** "We believe improving the checkout experience will increase conversions."
- Vague change (what does "improving" mean?)
- No target (how much increase?)
- No reason (why would it increase conversions?)
Criteria for a testable hypothesis:
| Criterion | Test | Example |
|---|---|---|
| Specific change | Can you describe exactly what will be different? | “Add a 3-step progress bar to the checkout page header” |
| Measurable outcome | Can you define a number that will move? | “Cart abandonment rate drops from 45% to 40%” |
| Time-bound | Do you know when to check? | “Measured over 2 weeks with at least 5,000 sessions” |
| Falsifiable | Is it possible for the experiment to fail? | Yes - abandonment could stay the same or increase |
| Connected to business value | Does the outcome matter to the business? | Reduced abandonment directly increases revenue |
2. Design the Experiment
Once the hypothesis is formed, design an experiment that can confirm or reject it.
Scope the change to one variable. If you change the checkout layout and add a progress indicator and reduce the number of form fields at the same time, you cannot attribute the outcome to any single change. Change one thing at a time.
Define success and failure criteria before writing code. This prevents moving the goalposts after seeing the results. Write down what “success” looks like and what “failure” looks like before the first commit.
**Hypothesis:** Adding a progress indicator will reduce cart abandonment by 10%.
**Method:** A/B test - 50% of users see the progress indicator, 50% see the current checkout.
**Success criteria:** Abandonment rate in the test group is at least 8% lower than control (allowing a 2% margin).
**Failure criteria:** Abandonment rate difference is less than 5%, or the test group shows higher abandonment.
**Sample size:** Minimum 5,000 sessions per group.
**Time box:** 2 weeks or until sample size is reached, whichever comes first.
Choose the measurement method:
| Method | When to Use | Tradeoff |
|---|---|---|
| A/B test | You have enough traffic to split users into groups | Most rigorous, but requires sufficient volume |
| Before/after | Low traffic or infrastructure changes that affect everyone | Simpler, but confounding factors are harder to control |
| Cohort comparison | Targeting a specific user segment | Good for segment-specific changes, harder to generalize |
3. Implement and Deploy
Build the change using the same continuous delivery practices you use for any other work.
Use feature flags to control exposure. The feature flag infrastructure you built earlier in this phase is what makes experiments possible. Deploy the change behind a flag, then use the flag to control which users see the new behavior.
Deploy through the standard CD pipeline. Experiments are not special. They go through the same build, test, and deployment process as every other change. This ensures the experiment code meets the same quality bar as production code.
Keep the change small. A hypothesis-driven change should follow the same small batch discipline as any other work. If the experiment requires weeks of development, the scope is too large. Break it into smaller experiments that can each be measured independently.
Example implementation:
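A minimal sketch, assuming deterministic hash-based 50/50 assignment for the progress-indicator experiment described above; the function names and hashing scheme are illustrative, not a specific feature-flag product's API:

```python
import hashlib

def experiment_group(user_id, experiment="checkout-progress-indicator"):
    """Deterministic 50/50 assignment: the same user always lands in the
    same group for the lifetime of the experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "test" if int(digest, 16) % 2 == 0 else "control"

def render_checkout(user_id, step, total_steps=3):
    if experiment_group(user_id) == "test":
        # New behavior, behind the flag: show the progress indicator.
        return f"Checkout step {step} of {total_steps}"
    # Control: the current checkout, unchanged.
    return f"Checkout step {step}"
```

Disabling the experiment is a flag change, not a deployment: route everyone to the control branch and the new code lies dormant until it is removed or made permanent.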
4. Measure Results
After the time box expires or the sample size is reached, compare the results against the predefined success criteria.
Compare against your criteria, not against your hopes. If the success criterion was “8% reduction in abandonment” and you achieved 3%, that is a failure by your own definition, even if 3% sounds nice. Rigorous criteria prevent confirmation bias.
Account for confounding factors. Did a marketing campaign run during the experiment? Was there a holiday? Did another team ship a change that affects the same flow? Document anything that might have influenced the results.
Record the outcome regardless of success or failure. Failed experiments are as valuable as successful ones. They update the team’s understanding of how the product works and prevent repeating the same mistakes.
Example experiment record:
**Hypothesis:** Progress indicator reduces cart abandonment by 10%.
**Result:** Abandonment dropped 4% in the test group (not statistically significant at p < 0.05).
**Verdict:** Failed - did not meet the 8% threshold.
**Confounding factors:** A site-wide sale ran during week 2, which may have increased checkout motivation in both groups.
**Learning:** Progress visibility alone is not sufficient to address abandonment. Exit survey data suggests price comparison (leaving to check competitors) is the primary driver, not checkout confusion.
**Next action:** Design a new experiment targeting price confidence instead of checkout flow.
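A verdict like "not statistically significant" comes from a significance test on the two groups. Here is a minimal two-proportion z-test sketch; the counts below (20% baseline abandonment, 5,000 sessions per group) are illustrative assumptions, not the numbers from the record above:

```python
from math import sqrt, erf

def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions,
    e.g. abandonment counts x out of n sessions in each group."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    phi = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

# Illustrative: 20% abandonment in control vs. a 4% relative drop
p = two_proportion_p_value(1000, 5000, 960, 5000)
print(p)  # well above 0.05, so not significant
```

For real experiments, prefer a maintained statistics library over hand-rolled formulas, but the principle holds: the p-value, not the raw percentage drop, tells you whether the effect is distinguishable from noise.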
5. Adjust
The final stage closes the loop. Based on the results, the team takes one of three actions:
**If validated:** Remove the feature flag and make the change permanent. Update the product documentation. Feed the learning into the next hypothesis - what else could you improve now that this change is in place?
**If invalidated:** Revert the change by disabling the flag. Document what was learned and why the hypothesis was wrong. Use the learning to form a better hypothesis. Do not treat invalidation as failure - a team that never invalidates a hypothesis is not running real experiments.
**If inconclusive:** Decide whether to extend the experiment (more time, more traffic) or abandon it. If confounding factors were identified, consider rerunning the experiment under cleaner conditions. Set a hard limit on reruns to avoid indefinite experimentation.
Common Pitfalls
| Pitfall | What Happens | How to Avoid It |
|---|---|---|
| No success criteria defined upfront | Team rationalizes any result as a win | Write success and failure criteria before the first commit |
| Changing multiple variables at once | Cannot attribute the outcome to any single change | Scope each experiment to one variable |
| Abandoning experiments too early | Insufficient data leads to wrong conclusions | Set a minimum sample size and time box; commit to both |
| Never invalidating a hypothesis | Experiments are performative, not real | Celebrate invalidations - they prevent wasted effort |
| Skipping the record step | Team repeats failed experiments or forgets what worked | Maintain an experiment log that is part of the team’s knowledge base |
| Hypothesis disconnected from business outcomes | Team optimizes technical metrics nobody cares about | Every hypothesis must connect to a metric the business tracks |
| Experiments that are too large | Weeks of development before any measurement | Apply small batch discipline to experiments too |
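One lightweight way to keep the experiment log the table recommends is a structured record per experiment, stored alongside the team's documentation. This sketch is a hypothetical format whose fields mirror the example record earlier on this page, not the schema of any particular tool:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ExperimentRecord:
    """One entry in the team's experiment log."""
    hypothesis: str
    success_criteria: str
    verdict: str                      # "validated" | "invalidated" | "inconclusive"
    confounding_factors: list = field(default_factory=list)
    learning: str = ""
    next_action: str = ""

record = ExperimentRecord(
    hypothesis="Progress indicator reduces cart abandonment by 10%",
    success_criteria=">= 8% reduction in abandonment, p < 0.05",
    verdict="invalidated",
    confounding_factors=["Site-wide sale during week 2"],
    learning="Price comparison, not checkout confusion, drives abandonment",
    next_action="Design an experiment targeting price confidence",
)
print(json.dumps(asdict(record), indent=2))  # append to the team's log
```

A consistent structure makes invalidated experiments as searchable as validated ones, which is what prevents the "team repeats failed experiments" pitfall.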
Measuring Success
| Indicator | Target | Why It Matters |
|---|---|---|
| Experiments completed per quarter | 4 or more | Confirms the team is running experiments, not just shipping features |
| Percentage of experiments with predefined success criteria | 100% | Confirms rigor - no experiment should start without criteria |
| Ratio of validated to invalidated hypotheses | Between 40-70% validated | Too high means hypotheses are not bold enough; too low means the team is guessing |
| Time from hypothesis to result | 2-4 weeks | Confirms experiments are scoped small enough to get fast answers |
| Decisions changed by experiment results | Increasing | Confirms experiments actually influence product direction |
Next Step
Experiments generate learnings, but learnings only turn into improvements when the team discusses them. Retrospectives provide the forum where the team reviews experiment results, decides what to do next, and adjusts the process itself.
Related Content