Dysfunction Symptoms

Start from what you observe. Find the anti-patterns causing it.

Not sure which anti-pattern is hurting your team? Start here. Choose the path that fits how you want to explore.

Find your symptom

Answer a few questions to narrow down which symptoms match your situation.

Start the triage questions

Browse by category

Jump directly to the area where you are experiencing problems.

Start from your role

Each role sees different symptoms first. Find the ones most relevant to your daily work.

  • For Developers - Symptoms you hit while writing, testing, and shipping code - from flaky tests to painful merges
  • For Managers - Symptoms that show up as unpredictable delivery, quality gaps, and team health problems

Explore by theme

Symptoms and anti-patterns share common themes. Browse by tag to see connections across categories.

View all tags

1 - Test Suite Problems

Symptoms related to test reliability, coverage effectiveness, speed, and environment consistency.

These symptoms indicate problems with your testing strategy. Unreliable or slow tests erode confidence and slow delivery. Each page describes what you are seeing and links to the anti-patterns most likely causing it.

How to use this section

Start with the symptom that matches what your team experiences. Each symptom page explains what you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic questions to narrow down which cause applies to your situation. Follow the anti-pattern link to find concrete fix steps.

Related anti-pattern categories: Testing Anti-Patterns, Pipeline Anti-Patterns

Related guide: Testing Fundamentals

1.1 - AI-Generated Code Ships Without Developer Understanding

Developers accept AI-generated code without verifying it against acceptance criteria, and functional bugs and security vulnerabilities reach production unchallenged.

What you are seeing

A developer asks an AI assistant to implement a feature. The generated code looks plausible. The tests pass. The developer commits it. Two weeks later, a security review finds the code accepts unsanitized input in a path nobody specified as an acceptance criterion. When asked what the change was supposed to do, the developer says, “It implements the feature.” When asked how they validated it, they say, “The tests passed.”

This is not an occasional gap. It is a pattern. Developers use AI to produce code faster, but they do not define what “correct” means before generating code, verify the output against specific acceptance criteria, or consider how they would detect a failure in production. The code compiles. The tests pass. Nobody validated it against the actual requirements.

The symptoms compound over time. Defects appear in AI-generated code that the team cannot diagnose quickly because nobody defined what the code was supposed to do beyond “implement the feature.” Fixes are made by asking the AI to fix its own output without re-examining the original acceptance criteria. Security vulnerabilities - injection flaws, broken access controls, exposed credentials - ship because nobody asked “what are the security constraints for this change?” before or after generation.

Common causes

Rubber-Stamping AI-Generated Code

When there is no expectation that developers own what a change does and how they validated it - regardless of who or what wrote the code - AI output gets the same cursory glance as a trivial formatting change. The team treats “AI wrote it and the tests pass” as sufficient evidence of correctness. It is not. Passing tests prove the code satisfies the test cases. They do not prove the code meets the actual requirements or handles the constraints the team cares about.

Read more: Rubber-Stamping AI-Generated Code
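The gap between "the tests pass" and "the requirements are met" can be made concrete with a small sketch. Everything here is hypothetical - the table, the handler, and the inputs are illustrative, not from any real codebase - but the pattern is the classic one: generated code that satisfies every specified test case while violating an unstated security constraint.

```python
import sqlite3

def lookup_user(conn, username):
    # Plausible-looking generated code: it builds the query by string
    # concatenation. It passes every happy-path test case, but nobody
    # asked "what are the security constraints?" - and it is injectable.
    query = "SELECT id FROM users WHERE name = '" + username + "'"
    return conn.execute(query).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

# The test suite checks only the specified acceptance case - and passes:
assert lookup_user(conn, "alice") == (1,)

# An attacker-shaped input changes the query's meaning and returns a
# user row without knowing any valid name. No test asserted this
# constraint, so "tests pass" said nothing about it:
assert lookup_user(conn, "x' OR '1'='1") == (1,)
```

Passing tests proved the code satisfies the test cases; only a verification step against explicit security criteria (or a parameterized query) would have caught the flaw.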

Missing Acceptance Criteria

When the work item lacks concrete acceptance criteria - specific inputs, expected outputs, security constraints, edge cases - neither the developer nor the AI has a clear target. The AI generates something that looks right. The developer has no checklist to verify it against. The review is a subjective “does this seem okay?” rather than an objective “does this satisfy every stated requirement?”

Read more: Monolithic Work Items

Inverted Test Pyramid

When the test suite relies heavily on end-to-end tests and lacks targeted unit and functional tests, AI-generated code can pass the suite without its internal logic being verified. A comprehensive functional test suite would catch the cases where the AI’s implementation diverges from the domain rules. Without it, “tests pass” is a weak signal.

Read more: Inverted Test Pyramid

How to narrow it down

  1. Can developers explain what their recent changes do and how they validated them? Pick three recent AI-assisted commits at random and ask the committing developer: what does this change accomplish, what acceptance criteria did you verify, and how would you detect if it were wrong? If they cannot answer, the review process is not catching unexamined code. Start with Rubber-Stamping AI-Generated Code.
  2. Do your work items include specific, testable acceptance criteria before implementation starts? If acceptance criteria are vague or added after the fact, neither the AI nor the developer has a clear target. Start with Monolithic Work Items.
  3. Does your test suite include functional tests that verify business rules with specific inputs and outputs? If the suite is mostly end-to-end or integration tests, AI-generated code can satisfy them without being correct at the rule level. Start with Inverted Test Pyramid.

Ready to fix this? The most common cause is Rubber-Stamping AI-Generated Code. Start with its How to Fix It section for week-by-week steps.

1.2 - Tests Pass in One Environment but Fail in Another

Tests pass locally but fail in CI, or pass in CI but fail in staging. Environment differences cause unpredictable failures.

What you are seeing

A developer runs the tests locally and they pass. They push to CI and the same tests fail. Or the CI pipeline is green but the tests fail in the staging environment. The failures are not caused by a code defect. They are caused by differences between environments: a different OS version, a different database version, a different timezone setting, a missing environment variable, or a service that is available locally but not in CI.

The developer spends time debugging the failure and discovers the root cause is environmental, not logical. They add a workaround (skip the test in CI, add an environment check, adjust a timeout) and move on. These workarounds accumulate over time. The test suite becomes littered with environment-specific conditionals and skipped tests.

The team loses confidence in the test suite because results depend on where the tests run rather than whether the code is correct.

Common causes

Snowflake Environments

When each environment is configured by hand and maintained independently, they drift apart over time. The developer’s laptop has one version of a database driver. The CI server has another. The staging environment has a third. These differences are invisible until a test exercises a code path that behaves differently across versions. The fix is not to harmonize configurations manually (they will drift again) but to provision all environments from the same infrastructure code.

Read more: Snowflake Environments

Manual Deployments

When deployment and environment setup are manual processes, subtle differences creep in. One developer installed a dependency a particular way. The CI server was configured by a different person with slightly different settings. The staging environment was set up months ago and has not been updated. Manual processes are never identical twice, and the variance causes environment-dependent behavior.

Read more: Manual Deployments

Tightly Coupled Monolith

When the application has hidden dependencies on external state (filesystem paths, network services, system configuration), tests that work in one environment fail in another because the external state differs. Well-isolated code with explicit dependencies is portable across environments. Tightly coupled code that reaches into its environment for implicit dependencies is fragile.

Read more: Tightly Coupled Monolith

How to narrow it down

  1. Are all environments provisioned from the same infrastructure code? If not, environment drift is the most likely cause. Start with Snowflake Environments.
  2. Are environment setup and configuration manual? If different people configured different environments, the variance is a direct result of manual processes. Start with Manual Deployments.
  3. Do the failing tests depend on external services, filesystem paths, or system configuration? If tests assume specific external state rather than declaring explicit dependencies, the code’s coupling to its environment is the issue. Start with Tightly Coupled Monolith.

Ready to fix this? The most common cause is Snowflake Environments. Start with its How to Fix It section for week-by-week steps.

1.3 - High Coverage but Tests Miss Defects

Test coverage numbers look healthy but defects still reach production.

What you are seeing

Your dashboard shows 80% or 90% code coverage, but bugs keep getting through. Defects show up in production that feel like they should have been caught. The team points to the coverage number as proof that testing is solid, yet the results tell a different story.

People start losing trust in the test suite. Some developers stop running tests locally because they do not believe the tests will catch anything useful. Others add more tests, pushing coverage higher, without the defect rate improving.

Common causes

Inverted Test Pyramid

When most of your tests are end-to-end or integration tests, they exercise many code paths in a single run - which inflates coverage numbers. But these tests often verify that a workflow completes without errors, not that each piece of logic produces the correct result. A test that clicks through a form and checks for a success message covers dozens of functions without validating any of them in detail.

Read more: Inverted Test Pyramid

Pressure to Skip Testing

When teams face pressure to hit a coverage target, testing becomes theater. Developers write tests with trivial assertions - checking that a function returns without throwing, or that a value is not null - just to get the number up. The coverage metric looks healthy, but the tests do not actually verify behavior. They exist to satisfy a gate, not to catch defects.

Read more: Pressure to Skip Testing

Code Coverage Mandates

When the organization gates the pipeline on a coverage target, teams optimize for the number rather than for defect detection. Developers write assertion-free tests, cover trivial code, or add single integration tests that execute hundreds of lines without validating any of them. The coverage metric rises while the tests remain unable to catch meaningful defects.

Read more: Code Coverage Mandates

Manual Testing Only

When test automation is absent or minimal, teams sometimes generate superficial tests or rely on coverage from integration-level runs that touch many lines without asserting meaningful outcomes. The coverage tool counts every line that executes, regardless of whether any test validates the result.

Read more: Manual Testing Only

How to narrow it down

  1. Do most tests assert on behavior and expected outcomes, or do they just verify that code runs without errors? If tests mostly check for no-exceptions or non-null returns, the problem is testing theater - tests written to hit a number, not to catch defects. Start with Pressure to Skip Testing.
  2. Are the majority of your tests end-to-end or integration tests? If most of the suite runs through a browser, API, or multi-service flow rather than testing units of logic directly, start with Inverted Test Pyramid.
  3. Does the pipeline gate on a specific coverage percentage? If the team writes tests primarily to keep coverage above a mandated threshold, start with Code Coverage Mandates.
  4. Were tests added retroactively to meet a coverage target? If the bulk of tests were written after the code to satisfy a coverage gate rather than to verify design decisions, start with Pressure to Skip Testing.

Ready to fix this? The most common cause is Code Coverage Mandates. Start with its How to Fix It section for week-by-week steps.


1.4 - A Large Codebase Has No Automated Tests

Zero test coverage in a production system being actively modified. Nobody is confident enough to change the code safely.

What you are seeing

Every modification to this codebase is a gamble. The system has no automated tests. Changes are validated through manual testing, if they are validated at all. Developers work carefully but know that any change could trigger failures in code they did not touch, because the system has no seams and no isolation. The only way to know if a change works is to deploy it and observe what breaks.

Refactoring is effectively off the table. Improving the design of the code requires changing it in ways that should not alter behavior - but with no tests, there is no way to verify that behavior was preserved. Developers choose to add code around existing code rather than improve it, because change is unsafe. The codebase grows more complex with every feature because improving the underlying structure carries too much risk.

The team knows the situation is unsustainable but cannot see a path out. “We should write tests” appears in every retrospective. The problem is that adding tests to an untestable codebase requires refactoring first - and refactoring requires tests to do safely. The team is stuck in a loop with no obvious entry point.

Common causes

Manual testing only

The team has relied on manual testing as the primary quality gate. Automated tests were never required, never prioritized, and never resourced. The codebase was built without testability as a design constraint, which means the architecture does not accommodate automated testing without structural change.

Making the transition requires making a deliberate commitment: new code is always written with tests, existing code gets tests when it is modified, and high-risk areas are prioritized for retrofitted coverage. Over months, the areas of the codebase where developers can no longer safely make changes shrink, and the cycle of deploying to discover breakage is replaced by a test suite that catches failures before production.

Read more: Manual testing only

Tightly coupled monolith

Code without dependency injection, without interfaces, and without clear module boundaries cannot be tested without a major structural overhaul. Every function calls other functions directly. Every component reaches into every other component. Writing a test for one function requires instantiating the entire system.

Introducing seams - interfaces, dependency injection, module boundaries - makes code testable. This work is not glamorous and its value is invisible until tests start getting written. But it is the prerequisite for meaningful test coverage in a tightly coupled system. Once the seams exist, functions can be tested in isolation rather than requiring a full application instantiation - and developers stop needing to deploy to find out if a change is safe.

Read more: Tightly coupled monolith
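As a minimal sketch of what a seam looks like (the order and email names are illustrative, not from any specific codebase): instead of reaching directly into a global database and mail service, the logic receives its dependencies through its constructor, so a test can fill both seams with in-memory fakes.

```python
# Before: no seam - the function calls db.load() and email.send()
# directly, so testing it means instantiating the whole application.
# After: dependencies are injected, so the logic runs in isolation.
class OrderShipper:
    def __init__(self, load_order, send_email):
        self._load_order = load_order    # seam: any callable will do
        self._send_email = send_email    # seam: real mailer or fake

    def ship(self, order_id):
        order = self._load_order(order_id)
        if order["paid"]:
            self._send_email(order["customer"], "Your order has shipped")
            return "shipped"
        return "held"

# A test fills both seams with lightweight fakes - no database,
# no mail server, no full application startup:
sent = []
shipper = OrderShipper(
    load_order=lambda oid: {"paid": True, "customer": "a@example.com"},
    send_email=lambda to, msg: sent.append((to, msg)),
)
assert shipper.ship(42) == "shipped"
assert sent == [("a@example.com", "Your order has shipped")]
```

The production wiring passes the real database loader and mailer to the same constructor; the code under test does not change.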

Pressure to skip testing

If management has historically prioritized features over tests, the codebase will reflect that history. Tests were deferred sprint by sprint. Technical debt accumulated. The team that exists today is inheriting the decisions of teams that operated under different constraints, but the codebase carries the record of every time testing lost to deadline pressure.

Reversing this requires organizational commitment to treat test coverage as a delivery requirement, not as optional work that gets squeezed out when time is short. Without that commitment, the same pressure that created the untested codebase will prevent escaping it - and developers will keep gambling on every deploy.

Read more: Pressure to skip testing

How to narrow it down

  1. Can any single function in the codebase be tested without instantiating the entire application? If not, the architecture does not have the seams needed for unit tests. Start with Tightly coupled monolith.
  2. Has the team ever had a sustained period of writing tests as part of normal development? If not, the practice was never established. Start with Manual testing only.
  3. Did historical management decisions consistently deprioritize testing? If test debt accumulated from external pressure, the organizational habit needs to change before the technical situation can improve. Start with Pressure to skip testing.

Ready to fix this? The most common cause is Manual testing only. Start with its How to Fix It section for week-by-week steps.

1.5 - Refactoring Breaks Tests

Internal code changes that do not alter behavior cause widespread test failures.

What you are seeing

A developer renames a method, extracts a class, or reorganizes modules - changes that should not affect external behavior. But dozens of tests fail. The failures are not catching real bugs. They are breaking because the tests depend on implementation details that changed.

Developers start avoiding refactoring because the cost of updating tests is too high. Code quality degrades over time because cleanup work is too expensive. When someone does refactor, they spend more time fixing tests than improving the code.

Common causes

Inverted Test Pyramid

When the test suite is dominated by end-to-end and integration tests, those tests tend to be tightly coupled to implementation details - CSS selectors, API response shapes, DOM structure, or specific sequences of internal calls. A refactoring that changes none of the observable behavior still breaks these tests because they assert on how the system works rather than what it does.

Unit tests focused on behavior (“given this input, expect this output”) survive refactoring. Tests coupled to implementation (“this method was called with these arguments”) do not.

Read more: Inverted Test Pyramid
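The difference can be shown with a small hypothetical pricing example. The first test asserts only on inputs and outputs and survives any internal refactoring; the second asserts on how the result was produced and breaks the moment the implementation changes shape.

```python
from unittest.mock import MagicMock

# Illustrative class: computes a total using an injected tax lookup.
class PriceCalculator:
    def __init__(self, tax_lookup):
        self._tax_lookup = tax_lookup

    def total(self, price, region):
        return round(price * (1 + self._tax_lookup(region)), 2)

# Behavior-focused: "given this input, expect this output."
# Renaming internals or caching lookups cannot break this test.
def test_total_behavior():
    calc = PriceCalculator(lambda region: 0.20)
    assert calc.total(100.0, "EU") == 120.0

# Implementation-coupled: asserts the lookup was called exactly once
# with specific arguments. A refactoring that batches or caches
# lookups fails this test even though total() still returns the
# correct value.
def test_total_implementation():
    lookup = MagicMock(return_value=0.20)
    calc = PriceCalculator(lookup)
    calc.total(100.0, "EU")
    lookup.assert_called_once_with("EU")
```

Both tests pass today; only the second turns a harmless refactoring into a red build.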

Tightly Coupled Monolith

When components lack clear interfaces, tests reach into the internals of other modules. A refactoring in module A breaks tests for module B - not because B’s behavior changed, but because B’s tests were calling A’s internal methods directly. Without well-defined boundaries, every internal change ripples across the test suite.

Read more: Tightly Coupled Monolith

How to narrow it down

  1. Do the broken tests assert on internal method calls, mock interactions, or DOM structure? If yes, the tests are coupled to implementation rather than behavior. This is a test design issue - start with Inverted Test Pyramid for guidance on building a behavior-focused test suite.
  2. Are the broken tests end-to-end or UI tests that fail because of layout or selector changes? If yes, you have too many tests at the wrong level of the pyramid. Start with Inverted Test Pyramid.
  3. Do the broken tests span multiple modules - testing code in one area but breaking because of changes in another? If yes, the problem is missing boundaries between components. Start with Tightly Coupled Monolith.

Ready to fix this? The most common cause is Inverted Test Pyramid. Start with its How to Fix It section for week-by-week steps.


1.6 - Test Environments Take Too Long to Reset Between Runs

The team cannot run the full regression suite on every change because resetting the test environment and database takes too long.

What you are seeing

The team has a regression test suite that covers critical business flows. Running the tests themselves takes twenty minutes. Resetting the test environment - restoring the database to a known state, restarting services, clearing caches, reloading reference data - takes another forty minutes. The total cycle is an hour. With multiple teams queuing for the same environment, a developer might wait half a day to get feedback on a single change.

The team makes a practical decision: run the full regression suite nightly, or before a release, but not on every change. Individual changes get a subset of tests against a partially reset environment. Bugs that depend on data state - stale records, unexpected reference data, leftover test artifacts - slip through because the partial reset does not catch them. The full suite catches them later, but by then several changes have been merged and isolating which one introduced the regression takes a multi-person investigation.

Some teams stop running the full suite entirely. The reset time is so long that the suite becomes a release gate rather than a development tool. Developers lose confidence in the suite because they rarely see it run and the failures they do see are often environment artifacts rather than real bugs.

Common causes

Shared Test Environments

When multiple teams share a single test environment, the environment is never in a clean state. One team’s tests leave data behind. Another team’s tests depend on data that was just deleted. Resetting the environment means restoring it to a state that works for all teams, which requires coordination and takes longer than resetting a single-team environment.

The shared environment also creates queuing. Only one test run can use the environment at a time. Each team waits for the previous run to finish and the environment to reset before starting their own.

Read more: Shared Test Environments

Manual Regression Testing Gates

When the regression suite is treated as a manual checkpoint rather than an automated pipeline stage, the environment setup is also manual or semi-automated. Scripts that restore the database, restart services, and verify the environment is ready have accumulated over time without being optimized. Nobody has invested in making the reset fast because the suite was never intended to run on every change.

Read more: Manual Regression Testing Gates

Too Many Hard Dependencies in the Test Suite

When tests require live databases, running services, and real network connections for every assertion, the environment reset is slow because every dependency must be restored to a known state. A test that validates billing logic should not need a running payment gateway. A test that checks order validation should not need a populated product catalog database.

The fix is to match each test to the right layer. Functional tests that verify business rules use in-memory databases or controlled fixtures - no environment reset needed. Contract tests verify service boundaries with virtual services instead of live instances. Only a small number of end-to-end tests need the fully assembled environment, and those run outside the pipeline’s critical path. When the pipeline’s critical path depends on heavyweight integration for every assertion, the reset time is a direct consequence of testing at the wrong layer.

Read more: Inverted Test Pyramid
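As an illustrative sketch of testing at the right layer (the schema and rule here are hypothetical), a functional test of a business rule can own its data and run against an in-memory database. There is nothing to reset between runs because each test builds exactly the rows it uses.

```python
import sqlite3

# Hypothetical order-validation rule, tested without any shared
# environment: the database is created fresh, in memory, per test.
def order_is_valid(conn, product_id, quantity):
    row = conn.execute(
        "SELECT stock FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    return row is not None and 0 < quantity <= row[0]

def test_order_validation():
    conn = sqlite3.connect(":memory:")  # fresh state, no reset step
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, stock INTEGER)")
    conn.execute("INSERT INTO products VALUES (1, 5)")
    assert order_is_valid(conn, 1, 3)        # within stock
    assert not order_is_valid(conn, 1, 6)    # exceeds stock
    assert not order_is_valid(conn, 2, 1)    # unknown product
    conn.close()
```

Tests like this run in milliseconds and in parallel; only the handful of remaining end-to-end tests still need the fully assembled environment.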

Testing Only at the End

When testing is deferred to a late stage - after development, after integration, before release - the tests assume a fully assembled system with a production-like database. Resetting that system is inherently slow because it involves restoring a large database, restarting multiple services, and verifying cross-service connectivity. The tests were designed for a heavyweight environment because they run at a heavyweight stage.

Tests designed to run early - functional tests with controlled data, contract tests between services - do not need environment resets. They run in isolation with their own data fixtures.

Read more: Testing Only at the End

How to narrow it down

  1. Is the environment shared across multiple teams or test suites? If teams queue for a single environment, the reset time is compounded by coordination. Start with Shared Test Environments.
  2. Does the reset process involve restoring a large database from backup? If the database restore is the bottleneck, the tests depend on global data state rather than controlling their own data. Start with Manual Regression Testing Gates and refactor tests to use isolated data fixtures.
  3. Do most tests require live databases, running services, or network connections? If the majority of tests need the fully assembled environment, the suite is testing at the wrong layer. Functional tests with in-memory databases and virtual services for external dependencies would eliminate the reset bottleneck for most assertions. Start with Inverted Test Pyramid.
  4. Does the full suite only run before releases, not on every change? If the suite is a release gate rather than a pipeline stage, it was designed for a different feedback loop. Start with Testing Only at the End and move tests earlier in the pipeline.

Ready to fix this? The most common cause is Shared Test Environments. Start with its How to Fix It section for week-by-week steps.

1.7 - Test Suite Is Too Slow to Run

The test suite takes 30 minutes or more. Developers stop running it locally and push without verifying.

What you are seeing

The full test suite takes 30 minutes, an hour, or longer. Developers do not run it locally because they cannot afford to wait. Instead, they push their changes and let CI run the tests. Feedback arrives long after the developer has moved on. If a test fails, the developer must context-switch back, recall what they were doing, and debug the failure.

Some developers run only a subset of tests locally (the ones for their module) and skip the rest. This catches some issues but misses integration problems between modules. Others skip local testing entirely and treat the CI pipeline as their test runner, which overloads the shared pipeline and increases wait times for everyone.

The team has discussed parallelizing the tests, splitting the suite, or adding more CI capacity. These discussions stall because the root cause is not infrastructure. It is the shape of the test suite itself.

Common causes

Inverted Test Pyramid

When the majority of tests are end-to-end or integration tests, the suite is inherently slow. E2E tests launch browsers, start services, make network calls, and wait for responses. Each test takes seconds or minutes instead of milliseconds. A suite of 500 E2E tests will always be slower than a suite of 5,000 unit tests that verify the same logic at a lower level. The fix is not faster hardware. It is moving test coverage down the pyramid.

Read more: Inverted Test Pyramid

Tightly Coupled Monolith

When the codebase has no clear module boundaries, tests cannot be scoped to individual components. A test for one feature must set up the entire application because the feature depends on everything. Test setup and teardown dominate execution time because there is no way to isolate the system under test.

Read more: Tightly Coupled Monolith

Manual Testing Only

Sometimes the test suite is slow because the team added automated tests as an afterthought, using E2E tests to backfill coverage for code that was not designed for unit testing. The resulting suite is a collection of heavyweight tests that exercise the full stack for every scenario because the code provides no lower-level testing seams.

Read more: Manual Testing Only

How to narrow it down

  1. What is the ratio of unit tests to E2E/integration tests? If E2E tests outnumber unit tests, the test pyramid is inverted and the suite is slow by design. Start with Inverted Test Pyramid.
  2. Can tests be run for a single module in isolation? If running one module’s tests requires starting the entire application, the architecture prevents test isolation. Start with Tightly Coupled Monolith.
  3. Were the automated tests added retroactively to a codebase with no testing seams? If tests were bolted on after the fact using E2E tests because the code cannot be unit-tested, the codebase needs refactoring for testability. Start with Manual Testing Only.

Ready to fix this? The most common cause is Inverted Test Pyramid. Start with its How to Fix It section for week-by-week steps.

1.8 - Tests Interfere with Each Other Through Shared Data

Tests share mutable state in a common database. Results vary by run order, making failures unreliable signals of real bugs.

What you are seeing

Your test suite is technically running, but the results are a coin flip. A test that passed yesterday fails today because another test ran first and left dirty data in the shared database. You spend thirty minutes debugging a failure only to find the root cause was a record inserted by an unrelated test two hours ago. When you rerun the suite in isolation, everything passes. When you run it in CI with the full suite, it fails at random.

Shared database state is the source of the chaos. The database schema and seed data were set up once, years ago, by someone who has since left. Nobody is sure what state the database is supposed to be in before any given test. Some tests clean up after themselves; most do not. Some tests depend on records created by other tests. The execution order matters, but nobody explicitly controls it - so the suite is fragile by construction.

The downstream effect is that your team has stopped trusting test failures. When a red build appears, the first instinct is not “there is a bug” but “someone broke the test data again.” You rerun the build, it goes green, and you ship. Real bugs make it to production because the signal-to-noise ratio of your test suite has collapsed.

Common causes

Manual testing only

Teams that have relied on manual testing tend to reach for a shared database as the natural extension of how testers have always worked - against a shared test environment. When automated tests are added later, they inherit the same model: one environment, one database, shared by everyone. Nobody designed a data strategy; it evolved from how the team already worked.

When teams shift to isolated test data - each test owns and tears down its own data - interference disappears. Tests become deterministic. A failing test means code is broken, not the environment.

Read more: Manual testing only
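One way to sketch the isolated-data pattern (the table and values are illustrative): each test builds exactly the state it needs in setup and discards it in teardown, so no test can see another test's leftovers and run order cannot matter.

```python
import sqlite3
import unittest

# Each test owns its data: a fresh in-memory database is created in
# setUp and discarded in tearDown, so every run starts from a known,
# blank state regardless of what any other test did.
class AccountBalanceTest(unittest.TestCase):
    def setUp(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE accounts (id INTEGER, balance INTEGER)")
        self.conn.execute("INSERT INTO accounts VALUES (1, 100)")

    def tearDown(self):
        self.conn.close()  # nothing leaks into the next test

    def test_withdrawal(self):
        self.conn.execute(
            "UPDATE accounts SET balance = balance - 30 WHERE id = 1"
        )
        (balance,) = self.conn.execute(
            "SELECT balance FROM accounts WHERE id = 1"
        ).fetchone()
        self.assertEqual(balance, 70)
```

The same lifecycle applies against a real database: create rows with unique keys in setup, delete them in teardown, and never depend on data another test inserted.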

Inverted test pyramid

When most automated tests are end-to-end or integration tests that exercise a real database, test data problems compound. Each test requires realistic, complex data to be in place. The more tests that depend on a shared database, the more opportunities for interference and the harder it becomes to manage the data lifecycle.

Shifting toward a pyramid with a large base of unit tests reduces database dependency dramatically. Unit tests run against in-memory structures and do not touch shared state. The integration and end-to-end tests that remain can be designed more carefully with isolated, purpose-built datasets. With fewer tests competing for shared database rows, the random CI failures that triggered “just rerun it” reflexes become rare, and a red build is a signal worth investigating.

Read more: Inverted test pyramid

Snowflake environments

When test environments are hand-crafted and not reproducible from code, database state drifts over time. Schema migrations get applied inconsistently. Seed data scripts run at different times in different environments. Each environment develops its own data personality, and tests written against one environment fail on another.

Reproducible environments - created from code on demand and destroyed after use - eliminate drift. When the database is provisioned fresh from a migration script and a known seed set for each test run, the starting state is always predictable. Tests that produced different results on different machines or at different times start producing consistent results, and the team can stop dismissing CI failures as environment noise.

Read more: Snowflake environments

How to narrow it down

  1. Do tests pass when run individually but fail when run together? Mutual interference from shared mutable state is the most likely cause. Start with Inverted test pyramid.
  2. Does the test suite pass on one machine but fail in CI? The test environment differs from the developer’s local database. Start with Snowflake environments.
  3. Is there no documented strategy for setting up and tearing down test data? The team never established a data strategy. Start with Manual testing only.

Ready to fix this? The most common cause is Inverted test pyramid. Start with its How to Fix It section for week-by-week steps.

1.9 - Tests Randomly Pass or Fail

The pipeline fails, the developer reruns it without changing anything, and it passes.

What you are seeing

A developer pushes a change. The pipeline fails on a test they did not touch, in a module they did not change. They click rerun. It passes. They merge. This happens multiple times a day across the team. Nobody investigates failures on the first occurrence because the odds favor flakiness over a real problem.

The team has adapted: retry-until-green is a routine step, not an exception. Some pipelines are configured to automatically rerun failed tests. Tests are tagged as “known flaky” and skipped. Real regressions hide behind the noise because the team has been trained to ignore failures.

Common causes

Inverted Test Pyramid

When the test suite is dominated by end-to-end tests, flakiness is structural. E2E tests depend on network connectivity, shared test environments, external service availability, and browser rendering timing. Any of these can produce a different result on each run. A suite built mostly on E2E tests will always be flaky because it is built on non-deterministic foundations.

Replacing E2E tests with functional tests that use test doubles for external dependencies makes the suite deterministic by design. The test produces the same result every time because it controls all its inputs.
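A small example of the swap, using Python's standard `unittest.mock` as the test double; the order structure and rate service are illustrative stand-ins for whatever external dependency the real E2E test exercised.

```python
from unittest.mock import Mock

def checkout_total(order, rates_client):
    """Business logic under test: applies a live exchange rate to an order."""
    rate = rates_client.get_rate(order["currency"])  # a network call in production
    return round(order["amount"] * rate, 2)

def test_checkout_total_is_deterministic():
    # Replace the network-backed rate service with a test double.
    rates = Mock()
    rates.get_rate.return_value = 1.25  # the test controls all of its inputs
    total = checkout_total({"amount": 100.0, "currency": "EUR"}, rates)
    assert total == 125.0  # same result every run - no network, no flakiness
    rates.get_rate.assert_called_once_with("EUR")

test_checkout_total_is_deterministic()
```

The E2E test that hit a real rate service could fail on timeouts, rate-limit errors, or changing data; this test can only fail if the logic it verifies is wrong.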

Read more: Inverted Test Pyramid

Snowflake Environments

When the CI environment is configured differently from other environments - or drifts over time - tests pass locally but fail in CI, or pass in CI on Tuesday but fail on Wednesday. The inconsistency is not in the test or the code but in the environment the test runs in.

Tests that depend on specific environment configurations, installed packages, file system layout, or network access are vulnerable to environment drift. Infrastructure-as-code eliminates this class of flakiness by ensuring environments are identical and reproducible.

Read more: Snowflake Environments

Tightly Coupled Monolith

When components share mutable state - a database, a cache, a filesystem directory - tests that run concurrently or in a specific order can interfere with each other. Test A writes to a shared table. Test B reads from the same table and gets unexpected data. The tests pass individually but fail together, or pass in one order but fail in another.

Without clear component boundaries, tests cannot be isolated. The flakiness is a symptom of architectural coupling, not a testing problem.

Read more: Tightly Coupled Monolith

How to narrow it down

  1. Do the flaky tests hit real external services or shared environments? If yes, the tests are non-deterministic by design. Start with Inverted Test Pyramid and replace them with functional tests using test doubles.
  2. Do tests pass locally but fail in CI, or vice versa? If yes, the environments differ. Start with Snowflake Environments.
  3. Do tests pass individually but fail when run together, or fail in a different order? If yes, tests share mutable state. Start with Tightly Coupled Monolith for the architectural root cause, and isolate test data as an immediate fix.

Ready to fix this? The most common cause is Inverted Test Pyramid. Start with its How to Fix It section for week-by-week steps.

2 - Deployment and Release Problems

Symptoms related to deployment frequency, release risk, coordination overhead, and environment parity.

These symptoms indicate problems with your deployment and release process. When deploying is painful, teams deploy less often, which increases batch size and risk. Each page describes what you are seeing and links to the anti-patterns most likely causing it.

How to use this section

Start with the symptom that matches what your team experiences. Each symptom page explains what you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic questions to narrow down which cause applies to your situation. Follow the anti-pattern link to find concrete fix steps.

Related anti-pattern categories: Pipeline Anti-Patterns, Architecture Anti-Patterns

Related guides: Pipeline Architecture, Rollback, Small Batches

2.1 - API Changes Break Consumers Without Warning

Breaking API changes reach all consumers simultaneously. Teams are afraid to evolve APIs because they do not know who depends on them.

What you are seeing

The team renames a field in an API response and a half-dozen consuming services start failing within minutes of deployment. Some consumers had documentation saying the API might change. Most assumed stability because the API had not changed in two years. The team spends the afternoon rolling back, notifying downstream owners, and coordinating a migration plan that will take weeks.

The harder problem is that the team does not know who depends on their API. Internal consumers are spread across teams and may not have registered their dependency anywhere. External consumers may have been added by third-party integrators years ago. Changing the API requires identifying every consumer and coordinating their migration - a process so expensive that the team simply stops evolving the API. It calcifies around its original design.

This leads to two failure modes: teams break APIs and cause incidents because they underestimate consumer impact, or teams freeze APIs and accumulate technical debt because the coordination cost of changing anything is too high.

Common causes

Distributed monolith

When services that are nominally independent must be coordinated in practice, API changes require simultaneous updates across multiple services. The consuming service cannot be deployed until the providing service is deployed, which requires coordinating deployment timing, which turns an API change into a coordinated release event.

Services that are truly independent can manage API compatibility through versioning or parallel versions: the old endpoint stays available while consumers migrate to the new one at their own pace. Consumers stop breaking on deployment day because they were never forced to migrate simultaneously - they adopt the new interface on their own schedule.
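The parallel-versions idea can be sketched as two handlers served side by side; the routes, field names, and lookup function are all illustrative.

```python
def _lookup(user_id):
    """Stand-in for the real data access layer (illustrative)."""
    return ("Ada", "Lovelace")

def get_user_v1(user_id):
    """Original contract: a single 'name' field. Stays live until the last
    consumer has migrated off it."""
    first, last = _lookup(user_id)
    return {"id": user_id, "name": f"{first} {last}"}

def get_user_v2(user_id):
    """New contract: split name fields. New consumers start here."""
    first, last = _lookup(user_id)
    return {"id": user_id, "first_name": first, "last_name": last}

# Both versions are served simultaneously; nobody is forced to migrate on
# deployment day.
ROUTES = {"/v1/users": get_user_v1, "/v2/users": get_user_v2}

assert ROUTES["/v1/users"](1)["name"] == "Ada Lovelace"
assert ROUTES["/v2/users"](1)["first_name"] == "Ada"
```

The old endpoint is retired only after its consumer count reaches zero - which is exactly the measurement the consumer inventory discussed below makes possible.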

Read more: Distributed monolith

Tightly coupled monolith

Tightly coupled services share data structures and schemas in ways that make changing any shared interface expensive. A change to a shared type propagates through the codebase to every caller. There is no stable interface boundary; internal implementation details leak through the API surface.

Services with well-defined interface contracts - stable public APIs backed by flexible internal implementations - can evolve their internals without breaking consumers. The contract is the stable surface; everything behind it can change.

Read more: Tightly coupled monolith

Knowledge silos

When knowledge of who consumes which API lives in one person’s head or in nobody’s head, the team cannot assess the impact of a change. The inventory of consumers is a prerequisite for safe API evolution. Without it, every API change is a known unknown: the team cannot know what they are breaking until it is broken.

Maintaining a service catalog, using contract testing, or even an informal registry of consumer relationships gives the team the ability to evaluate change impact before deploying. The half-dozen services that used to fail within minutes of a deployment now have owners who were notified and prepared in advance - because the team finally knew they existed.
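A minimal sketch of a consumer contract check: each known consumer records the response fields it reads, and the provider verifies a proposed response shape against every contract before shipping. The service names and fields are illustrative; real contract-testing tools express the same idea with more fidelity.

```python
# Each consumer's contract: the response fields it actually depends on.
CONSUMER_CONTRACTS = {
    "billing-service": {"id", "amount", "currency"},
    "reporting-job": {"id", "amount", "created_at"},
}

def breaking_changes(response_fields):
    """Return the consumers whose contracts the proposed response would break."""
    return [name for name, needed in CONSUMER_CONTRACTS.items()
            if not needed <= response_fields]

current = {"id", "amount", "currency", "created_at"}
assert breaking_changes(current) == []

# Dropping 'currency' is flagged before deployment, naming the affected consumer.
assert breaking_changes(current - {"currency"}) == ["billing-service"]
```

The check turns "who depends on this field?" from an unanswerable question into a pipeline failure with a named owner attached.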

Read more: Knowledge silos

How to narrow it down

  1. Does the team know every consumer of their APIs? If consumer inventory is incomplete or unknown, any API change carries unknown risk. Start with Knowledge silos.
  2. Must consuming services be deployed at the same time as the providing service? If coordinated deployment is required, the services are not truly independent. Start with Distributed monolith.
  3. Do internal implementation changes frequently affect the public API surface? If internal refactoring breaks consumers, the interface boundary is not stable. Start with Tightly coupled monolith.

Ready to fix this? The most common cause is Distributed monolith. Start with its How to Fix It section for week-by-week steps.

2.2 - The Build Runs Again for Every Environment

Build outputs are discarded and rebuilt for each environment. Production is not running the artifact that was tested.

What you are seeing

The build runs in dev, produces an artifact, and tests run against it. Then the artifact is discarded and a new build runs for the staging branch. The staging artifact is tested, then discarded. A third build runs from the production branch. This is the artifact that gets deployed. The team has no way to verify that the artifact deployed to production is equivalent to the one that was tested in staging.

The problem is subtle until it causes an incident. A build that includes a library version cached in the dev builder but not in the staging builder. A build that captures a slightly different git state because a commit was made between the staging and production builds. An environment variable baked into the build artifact that differs between environments. These differences are usually invisible - until they cause a failure in production that cannot be reproduced anywhere else.

The team treats this as normal because “it has always worked this way.” The process was designed when builds were simple and deterministic. As dependencies, build tooling, and environment configurations have grown more complex, the assumption of build equivalence has become increasingly unreliable.

Common causes

Snowflake environments

When build environments differ between stages - different OS versions, cached dependency states, or tool versions - the same source code produces different artifacts in different environments. The “staging artifact” and the “production artifact” are built from nominally the same source but in environments with different characteristics.

Standardized build environments defined as code produce the same artifact from the same source, regardless of where the build runs. When the dev build, the staging build, and the production build all run in the same container with the same pinned dependencies, the team can verify that equivalence rather than assuming it. The production failure that could not be reproduced elsewhere becomes reproducible because the environments are no longer different in invisible ways.
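A toy model of the underlying problem: the artifact is a function of the source *and* the build environment, so two builders with different cached state produce different artifacts from identical source. The environment strings below are illustrative stand-ins for toolchain and dependency state.

```python
import hashlib

def build(source, toolchain):
    """Toy build: the artifact digest depends on source AND environment."""
    return hashlib.sha256(f"{source}|{toolchain}".encode()).hexdigest()

source = "app @ commit abc123"

# Snowflake builders: each stage's toolchain has drifted independently.
assert build(source, "gcc-12, libfoo 1.3 (cached)") != build(source, "gcc-12, libfoo 1.4")

# A pinned, code-defined builder: same inputs, same artifact - and the
# equivalence can be verified by comparing digests rather than assumed.
pinned = "gcc-12.2.0, libfoo 1.4.1"
assert build(source, pinned) == build(source, pinned)
```

Publishing the artifact digest alongside each build is a cheap first step: it does not fix drift, but it makes the question "is production running what we tested?" answerable.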

Read more: Snowflake environments

Missing deployment pipeline

A pipeline that promotes a single artifact through environments eliminates the per-environment rebuild entirely. The artifact is built once, assigned a version identifier, stored in an artifact registry, and deployed to each environment in sequence. The artifact that reaches production is exactly the artifact that was tested.
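The build-once, promote-everywhere flow can be sketched with a dict standing in for the artifact registry; the naming scheme is illustrative.

```python
import hashlib

REGISTRY = {}  # stand-in for an artifact registry

def build_once(source):
    """Build a single artifact and register it under a content-derived version."""
    artifact = f"compiled({source})"
    version = hashlib.sha256(artifact.encode()).hexdigest()[:12]
    REGISTRY[version] = artifact
    return version

def deploy(version, environment):
    """Every environment pulls the same registered artifact - never a rebuild."""
    return (environment, REGISTRY[version])

version = build_once("app @ commit abc123")
tested = deploy(version, "staging")
shipped = deploy(version, "production")
assert tested[1] == shipped[1]  # production runs exactly what was tested
```

The version identifier is the load-bearing piece: it is what lets anyone trace the bytes running in production back to the test run that validated them.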

Without a pipeline with artifact promotion, rebuilding per environment is the natural default. Each environment has its own build process, and the relationship between artifacts built for different environments is assumed rather than guaranteed.

Read more: Missing deployment pipeline

How to narrow it down

  1. Is a separate build triggered for each environment? If staging and production builds run independently, the artifacts are not guaranteed to be equivalent. Start with Missing deployment pipeline.
  2. Are the build environments for each stage identical? If dev, staging, and production builds run on different machines with different configurations, the same source will produce different artifacts. Start with Snowflake environments.
  3. Can the team identify the exact artifact version running in production and trace it back to a specific test run? If not, there is no artifact provenance and no guarantee of what was tested. Start with Missing deployment pipeline.

Ready to fix this? The most common cause is Missing deployment pipeline. Start with its How to Fix It section for week-by-week steps.

2.3 - Every Change Requires a Ticket and Approval Chain

Change management overhead is identical for a one-line fix and a major rewrite. The process creates a queue that delays all changes equally.

What you are seeing

The team has a change management process. Every production change requires a change ticket, an impact assessment, a rollback plan document, a peer review, and final approval from a change board. The process was designed with major infrastructure changes in mind. It is now applied uniformly to every change, including renaming a log message.

The change board meets once a week. If a change misses the cutoff, it waits until next week. Urgent changes require emergency approval, which means tracking down the right people and interrupting them at unpredictable hours. The overhead for a critical security patch is the same as for a feature release. The team has learned to batch changes together to amortize the approval cost, which makes each deployment larger and riskier.

The intent of change management - reducing the risk of production changes - is pursued here by slowing everything down rather than by increasing confidence in individual changes. The process treats all changes as equally risky regardless of their actual scope or the automated evidence available about their safety.

Common causes

CAB gates

Change advisory boards apply manual approval uniformly to all changes. The board reviews documentation rather than evidence from automated testing and deployment pipelines. This adds calendar time proportional to the board’s meeting cadence, not proportional to the risk of the change. A one-line fix and a major architectural change wait in the same queue.

Automated deployment systems with pipeline-generated evidence - test results, code coverage, artifact provenance - can satisfy the intent of change management without the calendar overhead. Low-risk changes pass automatically; high-risk changes get human review based on objective criteria rather than because everything gets reviewed.
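One way the risk classification can look, sketched as a pure function over pipeline-generated evidence. The signals and thresholds here are illustrative - each organization calibrates its own.

```python
def classify_change(evidence):
    """Return 'auto-approve' or 'human-review' from objective pipeline signals."""
    if not evidence["tests_passed"]:
        return "human-review"
    if evidence["touches_auth"] or evidence["schema_migration"]:
        return "human-review"      # genuinely high-risk: a person looks at it
    if evidence["lines_changed"] <= 50 and evidence["coverage"] >= 0.8:
        return "auto-approve"      # small and well-tested: no queue, no meeting
    return "human-review"

routine_fix = {"tests_passed": True, "touches_auth": False,
               "schema_migration": False, "lines_changed": 3, "coverage": 0.91}
auth_change = {"tests_passed": True, "touches_auth": True,
               "schema_migration": False, "lines_changed": 3, "coverage": 0.91}

assert classify_change(routine_fix) == "auto-approve"   # ships in minutes
assert classify_change(auth_change) == "human-review"   # gets real scrutiny
```

Because the criteria are explicit and versioned, they can themselves be reviewed and audited - unlike the judgment calls of a weekly board.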

Read more: CAB gates

Manual deployments

When deployments are manual, the change management process exists partly as a compensating control. Since the deployment itself is not automated or auditable, the team adds process before and after to create accountability. Manual processes require manual oversight.

Automated deployments with pipeline logs create a built-in audit trail: which artifact was deployed, which tests it passed, who triggered the deployment, and what the environment state was before and after. This evidence replaces the need for pre-approval documentation for routine changes.

Read more: Manual deployments

Missing deployment pipeline

A pipeline provides objective evidence that a change was tested and what those tests found. Test results, code coverage, dependency scans, and deployment logs are generated as a natural output of the pipeline. This evidence can satisfy auditors and change reviewers without requiring manual documentation.

Without a pipeline, teams substitute documentation for evidence. The change ticket describes what the developer intended to test. It cannot verify that the tests were actually run or that they passed. A pipeline generates verifiable evidence rather than requiring trust in self-reported documentation.

Read more: Missing deployment pipeline

How to narrow it down

  1. Does a committee approve individual production changes? Manual approval boards add calendar-driven delays independent of change risk. Start with CAB gates.
  2. Is the deployment process automated with pipeline-generated audit logs? If deployment requires manual documentation because there is no automated record, the pipeline is the missing foundation. Start with Missing deployment pipeline.
  3. Do small, low-risk changes go through the same process as major changes? If the process is uniform regardless of risk, the classification mechanism - not just the process - needs to change. Start with CAB gates.

Ready to fix this? The most common cause is CAB gates. Start with its How to Fix It section for week-by-week steps.

2.4 - Multiple Services Must Be Deployed Together

Changes cannot go to production until multiple services are deployed in a specific order during a coordinated release window.

What you are seeing

A developer finishes a change to one service. It is tested, reviewed, and ready to deploy. But it cannot go out alone. The change depends on a schema migration in a shared database, a new endpoint in another service, and a UI update in a third. All three teams coordinate a release window. Someone writes a deployment runbook with numbered steps. If step four fails, steps one through three need to be rolled back manually.

The team cannot deploy on a Tuesday afternoon because the other teams are not ready. The change sits in a branch (or merged to main but feature-flagged off) waiting for the coordinated release next Thursday. By then, more changes have accumulated, making the release larger and riskier.

Common causes

Tightly Coupled Architecture

When services share a database, call each other without versioned contracts, or depend on deployment order, they cannot be deployed independently. A change to Service A’s data model breaks Service B if Service B is not updated at the same time. The architecture forces coordination because the boundaries between services are not real boundaries. They are implementation details that leak across service lines.

Read more: Tightly Coupled Monolith

Distributed Monolith

The organization moved from a monolith to services, but the service boundaries are wrong. Services were decomposed along technical lines (a “database service,” an “auth service,” a “notification service”) rather than along domain lines. The result is services that cannot handle a business request on their own. Every user-facing operation requires a synchronous chain of calls across multiple services. If one service in the chain is unavailable or deploying, the entire operation fails.

This is a monolith distributed across the network. It has all the operational complexity of microservices (network latency, partial failures, distributed debugging) with none of the benefits (independent deployment, team autonomy, fault isolation). Deploying one service still requires deploying the others because the boundaries do not correspond to independent units of business functionality.

Read more: Distributed Monolith

Horizontal Slicing

When work for a feature is decomposed by service (“Team A builds the API, Team B updates the UI, Team C modifies the processor”), each team’s change is incomplete on its own. Nothing is deployable until all teams finish their part. The decomposition created the coordination requirement. Vertical slicing within each team’s domain, with stable contracts between services, allows each team to deploy when their slice is ready.

Read more: Horizontal Slicing

Undone Work

Sometimes the coordination requirement is artificial. The service could technically be deployed independently, but the team’s definition of done requires a cross-service integration test that only runs during the release window. Or deployment is gated on a manual approval from another team. The coordination is not forced by the architecture but by process decisions that bundle independent changes into a single release event.

Read more: Undone Work

How to narrow it down

  1. Do services share a database or call each other without versioned contracts? If yes, the architecture forces coordination. Changes to shared state or unversioned interfaces cannot be deployed independently. Start with Tightly Coupled Monolith.
  2. Does every user-facing request require a synchronous chain across multiple services? If a single business operation touches three or more services in sequence, the service boundaries were drawn in the wrong place. You have a distributed monolith. Start with Distributed Monolith.
  3. Was the feature decomposed by service or team rather than by behavior? If each team built their piece of the feature independently and now all pieces must go out together, the work was sliced horizontally. Start with Horizontal Slicing.
  4. Could each service technically be deployed on its own, but process or policy prevents it? If the coupling is in the release process (shared release window, cross-team sign-off, manual integration test gate) rather than in the code, the constraint is organizational. Start with Undone Work and examine whether the definition of done requires unnecessary coordination.

Ready to fix this? The most common cause is Tightly Coupled Monolith. Start with its How to Fix It section for week-by-week steps.


2.5 - Work Requires Sign-Off from Teams Not Involved in Delivery

Changes cannot ship without approval from architecture review boards, legal, compliance, or other teams that are not part of the delivery process and have their own schedules.

What you are seeing

A change is ready to ship. Before it can go to production, it requires sign-off from an architecture review board, a legal review for data handling, a compliance team for regulatory requirements, or some combination of these. Each reviewing team has its own meeting cadence. The architecture board meets every two weeks. Legal responds when they have capacity. Compliance has a queue.

The team submits the request and waits. In the meantime, the code sits in a branch or is merged behind a feature flag, accumulating risk as the codebase moves around it. When approval finally arrives, the original context has faded. If the reviewer requests changes, the wait restarts. The team learns to front-load reviews by submitting for approval before development is complete, but the timing never aligns perfectly and changes after approval trigger new review cycles.

Common causes

Compliance Interpreted as Manual Approval

Compliance requirements - security controls, audit trails, regulatory evidence - are real and necessary. The problem is when compliance is operationalized as manual sign-off rather than as automated verification. A control that requires a human to review and approve every change is a bottleneck by design. The same control expressed as an automated check in the pipeline is fast, consistent, and more reliable. Manual approval processes grow over time as new requirements are added and old ones are never removed.
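What "the same control expressed as an automated check" can look like, sketched as a gate function the pipeline runs on every change. The control fields here are illustrative examples, not a prescribed compliance model.

```python
def compliance_gate(change):
    """Return the list of control failures; an empty list means the change passes."""
    failures = []
    if not change.get("audit_record_id"):
        failures.append("missing audit trail entry")
    if change.get("dependency_scan") != "clean":
        failures.append("dependency scan not clean")
    if not change.get("reviewed_by"):
        failures.append("no recorded reviewer")
    return failures

ok = {"audit_record_id": "CHG-1042", "dependency_scan": "clean",
      "reviewed_by": "alice"}
bad = {"audit_record_id": None, "dependency_scan": "clean",
       "reviewed_by": "alice"}

assert compliance_gate(ok) == []                             # passes in seconds
assert compliance_gate(bad) == ["missing audit trail entry"]  # blocked, with a reason
```

The automated form is also stronger evidence than a sign-off: the check runs on every change without exception, and its results are recorded rather than remembered.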

Read more: Compliance Interpreted as Manual Approval

Separation of Duties as Separate Teams

Separation of duties is a legitimate control for high-risk changes. It becomes an anti-pattern when it is implemented as a structural requirement that every change go through a different team for approval, regardless of risk level. Low-risk routine changes get the same review overhead as high-risk changes. The review team becomes a bottleneck because they are reviewing everything rather than focusing on changes that actually warrant scrutiny.

Read more: Separation of Duties as Separate Teams

How to narrow it down

  1. Are approval gates mandatory regardless of change risk? If a trivial config change and a major architectural change go through the same review process, the gate is not calibrated to risk. Start with Separation of Duties as Separate Teams.
  2. Could the compliance requirement be expressed as an automated check? If the review consists of a human verifying something that a tool could verify faster and more consistently, the control should be automated. Start with Compliance Interpreted as Manual Approval.

Ready to fix this? The most common cause is Compliance Interpreted as Manual Approval. Start with its How to Fix It section for week-by-week steps.


2.6 - Database Migrations Block or Break Deployments

Schema changes require downtime, lock tables, or leave the database in an unknown state when they fail mid-run.

What you are seeing

Deploying a schema change is a stressful event. The team schedules a maintenance window, notifies users, and runs the migration hoping nothing goes wrong. Some migrations take minutes; others run for hours and lock tables the application needs. When a migration fails halfway through, the database is in an intermediate state that neither the old nor the new version of the application can handle correctly.

The team has developed rituals to cope. Migrations are reviewed by the entire team before running. Someone sits at the database console during the deployment ready to intervene. A migration runbook exists listing each migration and its estimated run time. New features requiring schema changes get batched with the migration to minimize the number of deployment events.

Feature development is constrained by when migrations can safely run. The team avoids schema changes when possible, leading to workarounds and accumulated schema debt. When a migration does run, it is a high-stakes event rather than a routine operation.

Common causes

Manual deployments

When deployments are manual, migration execution is manual too. There is no standardized approach to handling migration failures, rollback, or state verification. Each migration is a custom operation executed by whoever is available that day, following a procedure remembered from the last time rather than codified in an automated step.

Automated pipelines that run migrations as a defined step - with pre-migration backups, health checks after migration, and defined rollback procedures - replace the maintenance window ritual with a repeatable process. Failures trigger automated alerts rather than requiring someone to sit at the console. When migrations run the same way every time, the team stops batching them to minimize deployment events because each one is no longer a high-stakes manual operation.
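The defined migration step can be sketched as one function, with plain callables standing in for the real backup, migration, and health-check tooling - all illustrative:

```python
def run_migration(db, migrate, health_check, backup, restore):
    """One codified step: back up, migrate, verify, roll back on any failure."""
    snapshot = backup(db)
    try:
        migrate(db)
        if not health_check(db):
            raise RuntimeError("post-migration health check failed")
        return "migrated"
    except Exception:
        restore(db, snapshot)   # database returns to a known-good state
        return "rolled-back"    # and the pipeline alerts - nobody sits at a console

# A toy database and hooks to exercise both paths.
db = {"schema": 1}
snapshot_of = lambda d: dict(d)
restore_from = lambda d, s: (d.clear(), d.update(s))

ok = run_migration(db, lambda d: d.update(schema=2),
                   lambda d: d["schema"] == 2, snapshot_of, restore_from)
assert ok == "migrated" and db["schema"] == 2

bad = run_migration(db, lambda d: d.update(schema=3),
                    lambda d: False,  # simulate a failed health check
                    snapshot_of, restore_from)
assert bad == "rolled-back" and db["schema"] == 2  # back to the pre-migration state
```

Because failure handling is part of the step rather than improvised at the console, the failure path is as well-rehearsed as the success path.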

Read more: Manual deployments

Snowflake environments

When environments differ from production in undocumented ways, migrations that pass in staging fail in production. Data volumes are different. Index configurations were set differently. Existing data in production that was not in staging violates a constraint the migration adds. These differences are invisible until the migration runs against real data and fails.

Environments that match production in structure and configuration allow migrations to be validated before the maintenance window. When staging has production-like data volume and index configuration, a migration that completes without locking tables in staging will behave the same way in production. The team stops discovering migration failures for the first time during the deployment that users are waiting on.

Read more: Snowflake environments

Missing deployment pipeline

A pipeline can enforce migration ordering and safety practices as part of every deployment. Expand-contract patterns - adding new columns before removing old ones - can be built into the pipeline structure. Pre-migration schema checks and post-migration application health verification become automatic steps.
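The expand-contract sequence mentioned above can be sketched end to end, with in-memory SQLite standing in for a production database; the table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada Lovelace')")

# Expand: add the new column. Old code, which knows nothing about it, keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Migrate: backfill while the application stays live against the old column.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Deploy application versions that read and write display_name. Only after no
# code references the old column does a later deployment run the contract step:
#   ALTER TABLE users DROP COLUMN fullname

row = conn.execute("SELECT display_name FROM users WHERE id = 1").fetchone()
assert row[0] == "Ada Lovelace"  # data survived; no maintenance window was needed
```

Each step is individually safe and individually reversible, which is what lets the pipeline run them as routine deployments instead of a single high-stakes event.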

Without a pipeline, migration ordering is left to whoever is executing the deployment. The right sequence is known by the person who thought through the migration, but that knowledge is not enforced at deployment time - which is why the team schedules reviews and sits someone at the console. The pipeline encodes that knowledge so it runs correctly without anyone needing to supervise it.

Read more: Missing deployment pipeline

Tightly coupled monolith

When a large application shares a single database schema, any migration affects the entire system simultaneously. There is no safe way to migrate incrementally because all code runs against the same schema at the same time. A column rename requires updating every query in every module before the migration runs.

Decomposed services with separate databases can migrate their own schema independently. A migration to the payment service schema does not require coordinating with the user service, scheduling a shared maintenance window, or batching with unrelated changes to amortize the disruption. Each service manages its own schema on its own schedule.

Read more: Tightly coupled monolith

How to narrow it down

  1. Are migrations run manually during deployment? If someone executes migration scripts by hand, the process lacks the consistency and failure handling of automation. Start with Manual deployments.
  2. Do migrations behave differently in staging versus production? Environment differences - data volume, configuration, existing data - are the likely cause. Start with Snowflake environments.
  3. Does the deployment pipeline handle migration ordering and validation? If migrations run outside the pipeline, they lack the pipeline’s safety checks. Start with Missing deployment pipeline.
  4. Do schema changes require coordination across multiple teams or modules? If one migration touches code owned by many teams, the coupling is the root issue. Start with Tightly coupled monolith.

Ready to fix this? The most common cause is Manual deployments. Start with its How to Fix It section for week-by-week steps.

2.7 - Every Deployment Is Immediately Visible to All Users

There is no way to deploy code without activating it for users. All deployments are full releases with no controlled rollout.

What you are seeing

The team deploys and releases in a single step. When code reaches production, it is immediately live for every user. There is no mechanism to deploy an incomplete feature, route traffic to a new version gradually, or test new behavior in production before a full rollout.

This constraint shapes how the team works. Features must be fully complete before they can be deployed. Partially built functionality cannot live in production even in a dormant state. The team must complete entire features end to end before getting production feedback, which means feedback arrives only at the end of development - when changing course is most expensive.

For teams shipping to large user bases, the absence of controlled rollout means every deployment is an all-or-nothing event. An issue that affects 10% of users under specific conditions immediately affects 100% of users. The team cannot limit blast radius by controlling exposure, cannot validate behavior with a subset of real traffic, and cannot respond to emerging problems before they become full incidents.

Common causes

Monolithic work items

When work items are large, the absence of release separation matters more. A feature that takes one week to build can be deployed as a cohesive unit with acceptable risk. A feature that takes three months has accumulated enough scope and uncertainty that deploying it to all users simultaneously carries substantial risk. Large work items amplify the need for controlled rollout.

Decomposing work into smaller items reduces the blast radius of any individual deployment even without explicit release mechanisms. When each deployment contains a small, focused change, an issue that surfaces in production affects a narrow area. The team is no longer in the position where a single all-or-nothing deployment immediately affects every user with no ability to limit exposure.

Read more: Monolithic work items

Missing deployment pipeline

A pipeline that supports blue-green deployments, canary releases, or feature flag integration requires infrastructure that does not exist without deliberate investment. Traffic routing, percentage rollouts, and gradual exposure are capabilities built on top of a mature deployment pipeline. Without the pipeline foundation, these capabilities cannot be added.

A pipeline with deployment controls transforms release strategy from “deploy everything now” to “deploy to N percent of traffic, watch metrics, expand or roll back.” The team moves from all-or-nothing deployments that immediately expose every user to a new version, to controlled rollouts where a problem that would have affected 100% of users is caught when it affects 5%.
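
That control loop can be sketched in a few lines; `route_traffic` and `error_rate` are hypothetical hooks into the load balancer and the monitoring system:

```python
# Sketch of a canary rollout loop. route_traffic(version, percent) and
# error_rate(version) are hypothetical hooks; the stages and error budget
# are illustrative. A real rollout would also bake for several minutes at
# each stage before expanding.

STAGES = [5, 25, 50, 100]     # percent of traffic exposed at each stage
ERROR_BUDGET = 0.01           # abort if the canary error rate exceeds 1%

def canary_rollout(version, route_traffic, error_rate):
    for percent in STAGES:
        route_traffic(version, percent)
        if error_rate(version) > ERROR_BUDGET:
            route_traffic(version, 0)          # roll back: old version takes all traffic
            return f"rolled back at {percent}%"
    return "fully released"
```

A defect that would have reached every user is instead caught while only the first 5% slice of traffic is exposed.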

Read more: Missing deployment pipeline

Horizontal slicing

When stories are organized by technical layer rather than user-visible behavior, complete functionality requires all layers to be done before anything ships. An API endpoint with no UI and a UI component that calls no API are both non-functional in isolation. The team cannot deploy incrementally because nothing is usable until all layers are complete.

Vertical slices deliver thin but complete functionality - a user can accomplish something with each slice. These can be deployed as soon as they are done, independently of other slices. The team gets production feedback continuously rather than at the end of a large batch.

Read more: Horizontal slicing

How to narrow it down

  1. Can the team deploy code to production without immediately exposing it to users? If every deployment activates immediately for all users, deploy and release are coupled. Start with Missing deployment pipeline.
  2. How large are typical deployments? Large deployments have more surface area for problems. Start with Monolithic work items.
  3. Are features built as complete end-to-end slices or as technical layers? Layered development prevents incremental delivery. Start with Horizontal slicing.

Ready to fix this? The most common cause is Missing deployment pipeline. Start with its How to Fix It section for week-by-week steps.

2.8 - The Team Is Afraid to Deploy

Production deployments cause anxiety because they frequently fail. The team delays deployments, which increases batch size, which increases risk.

What you are seeing

Nobody wants to deploy on a Friday. Or a Thursday. Ideally, deployments happen early in the week when the team is available to respond to problems. The team has learned through experience that deployments break things, so they treat each deployment as a high-risk event requiring maximum staffing and attention.

Developers delay merging “risky” changes until after the next deploy so their code does not get caught in the blast radius. Release managers add buffer time between deploys. The team informally agrees on a deployment cadence (weekly, biweekly) that gives everyone time to recover between releases.

The fear is rational. Deployments do break things. But the team’s response (deploy less often, batch more changes, add more manual verification) makes each deployment larger, riskier, and more likely to fail. The fear becomes self-reinforcing.

Common causes

Manual Deployments

When deployment requires human execution of steps, each deployment carries human error risk. The team has experienced deployments where a step was missed, a script was run in the wrong order, or a configuration was set incorrectly. The fear is not of the code but of the deployment process itself. Automated deployments that execute the same steps identically every time eliminate the process-level risk.
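
A minimal sketch of that idea - the step scripts are placeholders - is a deployment reduced to a fixed, ordered list of commands that run identically every time and stop at the first failure:

```python
# Sketch: a deployment as a fixed, ordered list of steps executed identically
# on every run, aborting at the first failure instead of improvising. The
# step scripts are placeholders; a real pipeline stage would play this role.

import subprocess

STEPS = [
    ["./scripts/build_artifact.sh"],
    ["./scripts/run_migrations.sh"],
    ["./scripts/swap_traffic.sh"],
    ["./scripts/smoke_test.sh"],
]

def deploy(runner=subprocess.run):
    for step in STEPS:
        if runner(step).returncode != 0:
            raise RuntimeError(f"deploy aborted at {step[0]}")  # fail fast, no half-done state
    return "deployed"
```

Because the step list lives in code, every deployment runs the same steps in the same order - the variance that manual execution introduces is gone.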

Read more: Manual Deployments

Missing Deployment Pipeline

When there is no automated path from commit to production, the team has no confidence that the deployed artifact has been properly built and tested. Did someone run the tests? Are we deploying the right version? Is this the same artifact that was tested in staging? Without a pipeline that enforces these checks, every deployment requires the team to manually verify the prerequisites.

Read more: Missing Deployment Pipeline

Blind Operations

When the team cannot observe production health after a deployment, they have no way to know quickly whether the deploy succeeded or failed. The fear is not just that something will break but that they will not know it broke until a customer reports it. Monitoring and automated health checks transform deployment from “deploy and hope” to “deploy and verify.”
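
A minimal sketch of "deploy and verify", with the health probe injected as a callable so the example stays generic - a real probe might be an HTTP GET against a `/healthz` endpoint:

```python
# Sketch of post-deploy verification: poll a health probe for a bounded time
# and treat sustained failure as a failed deployment. probe() is any callable
# returning True when the service reports healthy.

import time

def verify_deploy(probe, attempts=10, delay=3):
    for _ in range(attempts):
        if probe():
            return True           # deploy verified healthy
        time.sleep(delay)
    return False                  # treat as a failed deploy: alert or roll back
```

Wiring this into the pipeline means a failed deploy is known within a minute of shipping, not when the first customer complains.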

Read more: Blind Operations

Manual Testing Only

When the team has no automated tests, they have no confidence that the code works before deploying it. Manual testing provides some coverage, but it is never exhaustive, and the team knows it. Every deployment carries the risk that an untested code path will fail in production. A comprehensive automated test suite gives the team evidence that the code works, replacing hope with confidence.

Read more: Manual Testing Only

Monolithic Work Items

When changes are large, each deployment carries more risk simply because more code is changing at once. A deployment with 200 lines changed across 3 files is easy to reason about and easy to roll back. A deployment with 5,000 lines changed across 40 files is unpredictable. Small, frequent deployments reduce risk per deployment rather than accumulating it.

Read more: Monolithic Work Items

How to narrow it down

  1. Is the deployment process automated? If a human runs the deployment, the fear may be of the process, not the code. Start with Manual Deployments.
  2. Does the team have an automated pipeline from commit to production? If not, there is no systematic guarantee that the right artifact with the right tests reaches production. Start with Missing Deployment Pipeline.
  3. Can the team verify production health within minutes of deploying? If not, the fear includes not knowing whether the deploy worked. Start with Blind Operations.
  4. Does the team have automated tests that provide confidence before deploying? If not, the fear is that untested code will break. Start with Manual Testing Only.
  5. How many changes are in a typical deployment? If deployments are large batches, the risk per deployment is high by construction. Start with Monolithic Work Items.

Ready to fix this? The most common cause is Manual Deployments. Start with its How to Fix It section for week-by-week steps.

2.9 - Hardening Sprints Are Needed Before Every Release

The team dedicates one or more sprints after “feature complete” to stabilize code before it can be released.

What you are seeing

After the team finishes building features, nothing is ready to ship. A “hardening sprint” is scheduled: one or more sprints dedicated to bug fixing, stabilization, and integration testing. No new features are built during this period. The team knows from experience that the code is not production-ready when development ends.

The hardening sprint finds bugs that were invisible during development. Integration issues surface because components were built in isolation. Performance problems appear under realistic load. Edge cases that nobody tested during development cause failures. The hardening sprint is not optional because skipping it means shipping broken software.

The team treats this as normal. Planning includes hardening time by default. A project that takes four sprints to build is planned as six: four for features, two for stabilization.

Common causes

Manual Testing Only

When the team has no automated test suite, quality verification happens manually at the end. The hardening sprint is where manual testers find the defects that automated tests would have caught during development. Without automated regression testing, every release requires a full manual pass to verify nothing is broken.

Read more: Manual Testing Only

Inverted Test Pyramid

When most tests are slow end-to-end tests and few are unit tests, defects in business logic go undetected until integration testing. The E2E tests are too slow to run continuously, so they run at the end. The hardening sprint is when the team finally discovers what was broken all along.

Read more: Inverted Test Pyramid

Undone Work

When the team’s definition of done does not include deployment and verification, stories are marked complete while hidden work remains. Testing, validation, and integration happen after the story is “done.” The hardening sprint is where all that undone work gets finished.

Read more: Undone Work

Monolithic Work Items

When features are built as large, indivisible units, integration risk accumulates silently. Each large feature is developed in relative isolation for weeks. The hardening sprint is the first time all the pieces come together, and the integration pain is proportional to the batch size.

Read more: Monolithic Work Items

Pressure to Skip Testing

When management pressures the team to maximize feature output, testing is deferred to “later.” The hardening sprint is that “later.” Testing was not skipped; it was moved to the end where it is less effective, more expensive, and blocks the release.

Read more: Pressure to Skip Testing

How to narrow it down

  1. Does the team have automated tests that run on every commit? If not, the hardening sprint is compensating for the lack of continuous quality verification. Start with Manual Testing Only.
  2. Are most automated tests end-to-end or UI tests? If the test suite is slow and top-heavy, defects are caught late because fast unit tests are missing. Start with Inverted Test Pyramid.
  3. Does the team’s definition of done include deployment and verification? If stories are “done” before they are tested and deployed, the hardening sprint finishes what “done” should have included. Start with Undone Work.
  4. How large are the typical work items? If features take weeks and integrate at the end, the batch size creates the integration risk. Start with Monolithic Work Items.
  5. Is there pressure to prioritize features over testing? If testing is consistently deferred to hit deadlines, the hardening sprint absorbs the cost. Start with Pressure to Skip Testing.

Ready to fix this? The most common cause is Manual Testing Only. Start with its How to Fix It section for week-by-week steps.

2.10 - Releases Are Infrequent and Painful

Deploying happens monthly, quarterly, or less. Each release is a large, risky event that requires war rooms and weekend work.

What you are seeing

The team deploys once a month, once a quarter, or on some irregular cadence that nobody can predict. Each release is a significant event. There is a release planning meeting, a deployment runbook, a designated release manager, and often a war room during the actual deploy. People cancel plans for release weekends.

Between releases, changes pile up. By the time the release goes out, it contains dozens or hundreds of changes from multiple developers. Nobody can confidently say what is in the release without checking a spreadsheet or release notes document. When something breaks in production, the team spends hours narrowing down which of the many changes caused the problem.

The team wants to release more often but feels trapped. Each release is so painful that adding more releases feels like adding more pain.

Common causes

Manual Deployments

When deployment requires a human to execute steps (SSH into servers, run scripts, click through a console), the process is slow, error-prone, and dependent on specific people being available. The cost of each deployment is high enough that the team batches changes to amortize it. The batch grows, the risk grows, and the release becomes an event rather than a routine.

Read more: Manual Deployments

Missing Deployment Pipeline

When there is no automated path from commit to production, every release requires manual coordination of builds, tests, and deployments. Without a pipeline, the team cannot deploy on demand because the process itself does not exist in a repeatable form.

Read more: Missing Deployment Pipeline

CAB Gates

When every production change requires committee approval, the approval cadence sets the release cadence. If the Change Advisory Board meets weekly, releases happen weekly at best. If the meeting is biweekly, releases are biweekly. The team cannot deploy faster than the approval process allows, regardless of technical capability.

Read more: CAB Gates

Monolithic Work Items

When work is not decomposed into small, independently deployable increments, each “feature” is a large batch of changes that takes weeks to complete. The team cannot release until the feature is done, and the feature is never done quickly because it was scoped too large. Small batches enable frequent releases. Large batches force infrequent ones.

Read more: Monolithic Work Items

Manual Regression Testing Gates

When every release requires a manual test pass that takes days or weeks, the testing cadence limits the release cadence. The team cannot release until QA finishes, and QA cannot finish faster because the test suite is manual and grows with every feature.

Read more: Manual Regression Testing Gates

How to narrow it down

  1. Is the deployment process automated? If deploying requires human steps beyond pressing a button, the process itself is the bottleneck. Start with Manual Deployments.
  2. Does a pipeline exist that can take code from commit to production? If not, the team cannot release on demand because the infrastructure does not exist. Start with Missing Deployment Pipeline.
  3. Does a committee or approval board gate production changes? If releases wait for scheduled approval meetings, the approval cadence is the constraint. Start with CAB Gates.
  4. How large is the typical work item? If features take weeks and are delivered as single units, the batch size is the constraint. Start with Monolithic Work Items.
  5. Does a manual test pass gate every release? If QA takes days per release, the testing process is the constraint. Start with Manual Regression Testing Gates.

Ready to fix this? The most common cause is Manual Deployments. Start with its How to Fix It section for week-by-week steps.

2.11 - Merge Freezes Before Deployments

Developers announce merge freezes because the integration process is fragile. Deploying requires coordination in chat.

What you are seeing

A message appears in the team chat: “Please don’t merge to main, I’m about to deploy.” The deployment process requires the main branch to be stable and unchanged for the duration of the deploy. Any merge during that window could invalidate the tested artifact, break the build, or create an inconsistent state between what was tested and what ships.

Other developers queue up their PRs and wait. If the deployment hits a problem, the freeze extends. Sometimes the freeze lasts hours. In the worst cases, the team informally agrees on “deployment windows” where merging is allowed at certain times and deployments happen at others.

The merge freeze is a coordination tax. Every deployment interrupts the entire team’s workflow. Developers learn to time their merges around deploy schedules, adding mental overhead to routine work.

Common causes

Manual Deployments

When deployment is a manual process (running scripts, clicking through UIs, executing a runbook), the person deploying needs the environment to hold still. Any change to main during the deployment window could mean the deployed artifact does not match what was tested. Automated deployments that build, test, and deploy atomically eliminate this window because the pipeline handles the full sequence without requiring a stable pause.

Read more: Manual Deployments

Integration Deferred

When the team does not have a reliable CI process, merging to main is itself risky. If the build breaks after a merge, the deployment is blocked. The team freezes merges not just to protect the deployment but because they lack confidence that any given merge will keep main green. If CI were reliable, merging and deploying could happen concurrently because main would always be deployable.

Read more: Integration Deferred

Missing Deployment Pipeline

When there is no pipeline that takes a specific commit through build, test, and deploy as a single atomic operation, the team must manually coordinate which commit gets deployed. A pipeline pins the deployment to a specific artifact built from a specific commit. Without it, the team must freeze merges to prevent the target from moving while they deploy.
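
The pinning idea can be sketched with an in-memory registry standing in for a real artifact store - all names here are illustrative:

```python
# Sketch of artifact pinning: one immutable artifact per commit, and every
# later stage deploys that exact id, so merges to main during a deploy
# cannot change what ships. The dict stands in for a real registry.

import hashlib

REGISTRY = {}

def build(commit_sha, source_bytes):
    artifact_id = f"app-{commit_sha[:12]}"
    REGISTRY[artifact_id] = {
        "commit": commit_sha,
        "digest": hashlib.sha256(source_bytes).hexdigest(),
    }
    return artifact_id            # this id - not "latest" - flows to test and deploy

def deploy(artifact_id):
    artifact = REGISTRY[artifact_id]          # exactly what was built and tested
    return f"deployed {artifact_id} from commit {artifact['commit'][:7]}"
```

Because the deploy stage takes an artifact id rather than a branch name, the target cannot move while the deployment is in flight - which is precisely what the merge freeze was compensating for.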

Read more: Missing Deployment Pipeline

How to narrow it down

  1. Is the deployment process automated end-to-end? If a human executes deployment steps, the freeze protects against variance in the manual process. Start with Manual Deployments.
  2. Does the team trust that main is always deployable? If merges to main sometimes break the build, the freeze protects against unreliable integration. Start with Integration Deferred.
  3. Does the pipeline deploy a specific artifact from a specific commit? If there is no pipeline that pins the deployment to an immutable artifact, the team must manually ensure the target does not move. Start with Missing Deployment Pipeline.

Ready to fix this? The most common cause is Manual Deployments. Start with its How to Fix It section for week-by-week steps.

2.12 - No Evidence of What Was Deployed or When

The team cannot prove what version is running in production, who deployed it, or what tests it passed.

What you are seeing

An auditor asks a simple question: what version of the payment service is currently running in production, when was it deployed, who authorized it, and what tests did it pass? The team opens a spreadsheet, checks Slack history, and pieces together an answer from memory and partial records. The spreadsheet was last updated two months ago. The Slack message that mentioned the deployment contains a commit hash but not a build number. The CI system shows jobs that ran, but the logs have been pruned.

Each deployment was treated as a one-time event. Records were not kept because nobody expected to need them. The process that makes deployments auditable is the same process that makes them reliable: a pipeline that creates a versioned artifact, records its provenance, and logs each promotion through environments.

Outside of formal audit requirements, the same problem shows up as operational confusion. The team is not sure what is running in production because deployments happen at different times by different people without a centralized record. Debugging a production issue requires determining which version introduced the behavior, which requires reconstructing the deployment history from whatever partial records exist.

Common causes

Manual deployments

Manual deployments leave no systematic record. Who ran them, what they ran, and when are questions whose answers depend on the discipline of individual operators. Some engineers write Slack messages when they deploy; others do not. Some keep notes; most do not. The audit trail is as complete as the most diligent person’s habits.

Automated deployments with pipeline logs create an audit trail as a side effect of execution. The pipeline records every run: who triggered it, what artifact was deployed, which tests passed, and what the deployment target was. This information exists without anyone having to remember to record it.
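
A sketch of that side-effect audit trail - the field names and in-memory log are illustrative:

```python
# Sketch: an audit entry written by the pipeline on every deployment, so
# "what is running where, who deployed it, and what tests passed" becomes
# a query rather than archaeology. Fields and storage are illustrative.

import datetime

AUDIT_LOG = []                    # stand-in for durable, append-only storage

def record_deployment(artifact, environment, triggered_by, tests_passed):
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "artifact": artifact,
        "environment": environment,
        "triggered_by": triggered_by,
        "tests_passed": tests_passed,
    }
    AUDIT_LOG.append(entry)
    return entry

def currently_deployed(environment):
    """The auditor's question, answered from the log instead of from memory."""
    matches = [e for e in AUDIT_LOG if e["environment"] == environment]
    return matches[-1] if matches else None
```

Nobody has to remember to write a Slack message: the record exists because the deployment ran.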

Read more: Manual deployments

Missing deployment pipeline

A pipeline produces structured, queryable records of every deployment. Which artifact, which environment, which tests passed, which user triggered the run - all of this is captured automatically. Without a pipeline, audit evidence must be manufactured from logs, Slack messages, and memory rather than extracted from the deployment process itself.

When auditors require evidence of deployment controls, a pipeline makes compliance straightforward. The pipeline log is the compliance record. Without a pipeline, compliance documentation is a manual reporting exercise conducted after the fact.

Read more: Missing deployment pipeline

Snowflake environments

When environments are hand-configured, the concept of “what version is deployed” becomes ambiguous. A snowflake environment may have been modified in place after the last deployment - a config file edited directly, a package updated on the server, a manual hotfix applied. The artifact version in the deployment log may not accurately reflect the current state of the environment.

Environments defined as code have their state recorded in version control. The current state of an environment is the current state of the infrastructure code that defines it. When the auditor asks whether production was modified since the last deployment, the answer is in the git log - not in a manual check of whether someone may have edited a config file on the server.

Read more: Snowflake environments

How to narrow it down

  1. Can the team identify the exact artifact version currently in production? If not, there is no artifact tracking. Start with Missing deployment pipeline.
  2. Is there a complete log of who deployed what and when? If deployment records depend on engineers remembering to write Slack messages, the record will have gaps. Start with Manual deployments.
  3. Could the environment have been modified since the last deployment? If production servers can be changed outside the deployment process, the deployment log does not represent the current state. Start with Snowflake environments.

Ready to fix this? The most common cause is Manual deployments. Start with its How to Fix It section for week-by-week steps.

2.13 - Deployments Are One-Way Doors

If a deployment breaks production, the only option is a forward fix under pressure. Rolling back has never been practiced or tested.

What you are seeing

When something breaks in production, the only option is a forward fix. Rolling back has never been practiced and there is no defined procedure for it. The previous version artifacts may not exist. Nobody is sure of the exact steps. The unspoken understanding is that deployments only go forward.

Nothing about the process was designed to run in reverse. Database migrations run during deployment but rollback migrations were never written. The build server from the previous deployment was recycled. Configuration was updated in place. Even if someone wanted to roll back, they would need to reconstruct the previous state from memory - and that assumes the database is in a compatible state, which it often is not.

The team compensates by delaying deployments, adding more manual verification before each one, and batching changes so there are fewer releases. Each adaptation makes deployments larger and riskier - exactly the opposite of what would reduce the risk.

Common causes

Manual deployments

When deployment is a manual process, there is no corresponding automated rollback procedure. The operator who ran the deployment must figure out how to reverse each step under pressure, without having practiced the reversal. The steps that were run forward must be recalled and undone in the right order, often by someone who was not the original operator.

With automated deployments, rollback is the same procedure as a deployment - just pointed at the previous artifact. The team practices rollback every time they deploy, so when they need it, the steps are known and the process works. There is no scramble to reconstruct what the previous state was.
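
A sketch of rollback as "the same procedure, older artifact", with an in-memory history standing in for the artifact registry:

```python
# Sketch: when the registry keeps prior artifacts and every deploy runs the
# same automated procedure, rollback is just "deploy the previous artifact" -
# a path exercised implicitly on every release. Names are illustrative.

HISTORY = []                      # artifact ids in deployment order

def deploy(artifact_id):
    HISTORY.append(artifact_id)
    return f"running {artifact_id}"

def rollback():
    if len(HISTORY) < 2:
        raise RuntimeError("no previous artifact to roll back to")
    return deploy(HISTORY[-2])    # same procedure, pointed at the older artifact
```

The design choice worth noting: rollback reuses `deploy` rather than having its own code path, so it cannot rot unexercised.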

Read more: Manual deployments

Missing deployment pipeline

A pipeline creates a versioned artifact from a specific commit and promotes it through environments. That artifact can be redeployed to roll back. Without a pipeline, there is no defined artifact to restore, no promotion history to reverse, and no guarantee that a previous build can be reproduced.

When the pipeline exists, every previous artifact is stored and addressable. Rolling back means redeploying a known artifact through the same automated process used to deploy new versions. The team no longer faces the situation of needing to reconstruct a previous state from memory under pressure.

Read more: Missing deployment pipeline

Blind operations

If the team cannot detect a bad deployment within minutes, they face a choice: roll back something that might be fine, or wait until the damage is certain. When detection takes hours, forward state has accumulated - new database writes, customer actions, downstream events - to the point where rollback is impractical even if someone wanted to do it.

Fast detection changes the math. When the team knows within five minutes that a deployment caused a spike in errors, rollback is still a viable option. The window for clean rollback is open. Monitoring and health checks that fire immediately after deployment keep that window open long enough to use.
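
One sketch of such a check: compare the post-deploy error rate against the pre-deploy baseline and recommend rollback while the window is still open (the tolerated multiplier is illustrative):

```python
# Sketch: an automated check minutes after a deploy that compares the error
# rate against the pre-deploy baseline and recommends rollback while it is
# still cheap. The multiplier and zero-baseline guard are illustrative.

def post_deploy_verdict(baseline_errors_per_min, current_errors_per_min,
                        tolerated_multiplier=2.0):
    baseline = max(baseline_errors_per_min, 0.1)    # guard against zero baselines
    if current_errors_per_min / baseline > tolerated_multiplier:
        return "roll back"        # caught inside the clean-rollback window
    return "keep"
```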

Read more: Blind operations

Snowflake environments

When production is a hand-configured environment, “previous state” is not a well-defined concept. There is no snapshot to restore, no configuration-as-code to check out at a previous revision. Rolling back would require manually reconstructing the previous configuration from memory.

Environments defined as code have a previous state by definition: the previous commit to the infrastructure repository. Rolling back the environment means checking out that commit and applying it. The team no longer faces the situation where “previous state” is something they would have to reconstruct from memory - it is in version control and can be restored.
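
Assuming a Terraform-style workflow - the repo path, commit ref, and commands below are illustrative - an environment rollback reduces to a revert plus a re-apply:

```python
# Sketch: with the environment defined in a git repository, "previous state"
# is the previous commit, and rolling back is a revert plus a re-apply. The
# repo path, commit ref, and Terraform-style apply are illustrative.

def environment_rollback_commands(infra_repo, bad_commit):
    return [
        f"git -C {infra_repo} revert --no-edit {bad_commit}",  # undo as a new commit
        f"git -C {infra_repo} push origin main",
        f"cd {infra_repo} && terraform apply -auto-approve",   # re-converge to the reverted definition
    ]
```

Reverting rather than resetting keeps the rollback itself in the audit trail: the history shows both the bad change and its reversal.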

Read more: Snowflake environments

How to narrow it down

  1. Is the deployment process automated? If not, rollback requires the same manual execution under pressure - without practice. Start with Manual deployments.
  2. Does the team have an artifact registry retaining previous versions? If not, even attempting rollback requires reconstructing a previous build. Start with Missing deployment pipeline.
  3. How quickly does the team detect deployment problems? If detection takes more than 30 minutes, rollback is often impractical by the time it is considered. Start with Blind operations.
  4. Can the team recreate a previous environment state from code? If environments are hand-configured, there is no defined previous state to return to. Start with Snowflake environments.

Ready to fix this? The most common cause is Manual deployments. Start with its How to Fix It section for week-by-week steps.

2.14 - Teams Cannot Change Their Own Pipeline Without Another Team

Adding a build step, updating a deployment config, or changing an environment variable requires filing a ticket with a platform or DevOps team and waiting.

What you are seeing

A developer needs to add a security scan to the pipeline. They open the pipeline configuration and find it lives in a repository they do not have write access to, managed by the platform team. They file a ticket describing the change. The platform team reviews it, asks clarifying questions, schedules it for next sprint. The change ships two weeks later.

The same pattern repeats for every pipeline modification: adding a new test stage, updating a deployment timeout, rotating a secret, enabling a feature flag in the pipeline. Each change is a ticket, a queue, a wait. Teams learn to live with suboptimal pipeline configurations rather than pay the cost of requesting every improvement. The pipeline calcifies - nobody changes it because changing it is expensive, so problems accumulate and are worked around rather than fixed.

Common causes

Separate Ops/Release Team

When a dedicated team owns the pipeline infrastructure, delivery teams have no path to change it themselves. The platform team controls who can modify pipeline definitions, which environments are available, and how deployments are structured. This separation was often put in place for consistency or security reasons, but the effect is that the teams doing the work cannot improve the process supporting that work. Every pipeline improvement requires cross-team coordination, which means most improvements never happen.

Read more: Separate Ops/Release Team

Pipeline Definitions Not in Version Control

When pipeline configurations are managed through a GUI, a proprietary tool, or some other mechanism outside version control, delivery teams cannot own them in the same way they own their application code. There is no pull request process for pipeline changes, no way to review or roll back, and no natural path for the delivery team to make changes. The configuration lives in a system controlled by whoever administers the pipeline tool, which is typically not the delivery team.

Read more: Pipeline Definitions Not in Version Control

No Infrastructure as Code

When infrastructure is configured manually rather than defined as code, changes require access to systems and knowledge that delivery teams typically do not have. A delivery team cannot self-service a new environment or update a deployment target without someone who has access to the infrastructure tooling. Infrastructure as code puts the configuration in files the delivery team can read, propose changes to, and own, removing the dependency on the platform team for every modification.

Read more: No Infrastructure as Code

How to narrow it down

  1. Do delivery teams have write access to their own pipeline configuration? If the pipeline lives in a repository or system the team cannot modify, they cannot own their delivery process. Start with Separate Ops/Release Team.
  2. Is the pipeline defined in version-controlled files? If pipeline configuration lives in a GUI or proprietary system rather than code, there is no natural path for team ownership. Start with Pipeline Definitions Not in Version Control.
  3. Is infrastructure defined as code that the delivery team can read and propose changes to? If infrastructure is managed manually by another team, self-service is not possible. Start with No Infrastructure as Code.

Ready to fix this? The most common cause is Separate Ops/Release Team. Start with its How to Fix It section for week-by-week steps.


2.15 - New Releases Introduce Regressions in Previously Working Functionality

Something that worked before the release is broken after it. The team spends time after every release chasing down what changed and why.

What you are seeing

The release goes out. Within hours, bug reports arrive for behavior that was working before the release. A calculation that was correct is now wrong. A form submission that was completing now errors. A feature that was visible is now missing. The team starts bisecting the release, searching through a large set of changes to find which one caused the regression.

Post-mortems for regressions tend to follow the same pattern: the change that caused the problem looked safe in isolation, but it interacted with another change in an unexpected way. Or the code path that broke was not covered by any automated test, so nobody saw the breakage until a user reported it. Or a configuration value changed alongside the code change, and the combination behaved differently than either change alone.

Regressions erode trust in the team’s ability to release safely. The team responds by adding more manual checks before releases, which slows the release cycle, which increases batch size, which increases the surface area for the next regression.

Common causes

Large Release Batches

When releases contain many changes - dozens of commits, multiple features, several bug fixes - the surface area for regressions grows with the batch size. Each change is a potential source of breakage. Changes that are individually safe can interact in unexpected ways when they ship together. Diagnosing which change caused the regression requires searching through a large set of candidates. Small, frequent releases make regressions rare because each release contains few changes, and when one does occur, the cause is obvious.

Read more: Infrequent, Painful Releases

Testing Only at the End

When tests run only immediately before a release rather than continuously throughout development, regressions accumulate silently between test runs. A change that breaks existing behavior is not detected until the pre-release test cycle, by which time more code has been built on top of the broken behavior. The longer the gap between when the regression was introduced and when it is found, the more expensive it is to fix.

Read more: Testing Only at the End

Long-Lived Feature Branches

When developers work on branches that diverge from the main codebase for days or weeks, merging creates interactions that were never tested. Each branch was developed and tested independently. When they merge, the combined code behaves differently than either branch alone. The larger the divergence, the more likely the merge produces unexpected behavior that manifests as a regression in previously working functionality.

Read more: Long-Lived Feature Branches

Fixes Applied to the Release Branch but Not to Trunk

When a defect is found in a released version, the team branches from the release tag and applies a fix to that branch to ship a patch quickly. If the fix is never ported back to trunk, the next release from trunk still contains the defect. The patch branch and trunk have diverged: the patch has the fix, trunk does not.

The correct sequence is to fix trunk first, then cherry-pick the fix to the release branch. This guarantees trunk always contains the fix and subsequent releases from trunk are not affected.

Two diagrams comparing hotfix approaches. Anti-pattern: release branch branched from v1.0, fix applied to release branch only, porting back to trunk is extra work easily forgotten after the emergency, defect persists in future trunk releases. Correct: fix applied to trunk first, then cherry-picked to the release branch, all future releases from trunk include the fix.
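The invariant is easy to state as code. A toy model, assuming a release simply ships whatever commits are on the branch it is cut from:

```python
def next_trunk_release(trunk):
    """The next regular release ships whatever is on trunk."""
    return set(trunk)

# Anti-pattern: the fix lands only on the release branch.
trunk = ["feat-a", "feat-b"]
patch_branch = ["feat-a", "fix-123"]   # v1.0.1 ships with the fix...
defect_returns = "fix-123" not in next_trunk_release(trunk)

# Correct: fix trunk first, then cherry-pick to the release branch.
trunk = ["feat-a", "feat-b", "fix-123"]
patch_branch = ["feat-a", "fix-123"]
fix_persists = "fix-123" in next_trunk_release(trunk)

print(defect_returns, fix_persists)    # True True: v1.1 regresses in the first
                                       # case and keeps the fix in the second
```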

Read more: Release Branches with Extensive Backporting

How to narrow it down

  1. How many changes does a typical release contain? If a release contains more than a handful of commits, the batch size is a risk factor. Releasing more frequently, with fewer changes per release, reduces the chance of interactions and makes regressions easier to diagnose. Start with Infrequent, Painful Releases.
  2. Do tests run on every commit or only before a release? If the team discovers regressions at release time, the feedback loop is too long. Tests should catch breakage within minutes of the change being pushed. Start with Testing Only at the End.
  3. Are developers working on branches that diverge from the main codebase for more than a day? If yes, untested merge interactions are a likely source of regressions. Start with Long-Lived Feature Branches.
  4. Does the same regression appear in multiple releases? If a bug that was fixed in a patch release keeps coming back, the fix was applied to the release branch but never merged to trunk. Start with Release Branches with Extensive Backporting.

Ready to fix this? The most common cause is Testing Only at the End. Start with its How to Fix It section for week-by-week steps.


2.16 - Releases Depend on One Person

A single person coordinates and executes all production releases. Deployments stop when that person is unavailable.

What you are seeing

Deployments stop when one person is unavailable. The team has a release manager - or someone who has informally become one - who holds the institutional knowledge of how deployments work. They know which config values need to be updated, which services need to restart in which order, which monitoring dashboards to watch, and what warning signs of a bad deploy look like. When they go on vacation, the team either waits for them to return or attempts a deployment with noticeably less confidence.

The release manager’s calendar becomes a constraint on when the team can ship. Releases are scheduled around their availability. On-call engineers will not deploy without them present because the process is too opaque to navigate alone. When a production incident requires a hotfix, the first step is “find that person” rather than “follow the rollback procedure.”

The bottleneck is rarely a single person’s fault. It reflects a deployment process that was never made systematic or automated. Knowledge accumulated in one person because the process was never documented in a way that made it executable without that person. The team worked around the complexity rather than removing it.

Common causes

Manual deployments

Manual deployments require human expertise. When the steps are not automated, a deployment is only as reliable as the person executing it. Over time, the most experienced person becomes the de-facto release manager by default - not because anyone decided this, but because they have done it the most times and accumulated the most context.

Automated deployments remove the dependency on individual skill. The pipeline executes the same steps identically every time, regardless of who triggers it. Any team member can initiate a deployment by running the pipeline; the expertise is encoded in the automation rather than in a person.
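The difference shows up even in a trivial sketch: once the steps are data that a pipeline executes, every run is identical by construction. The step names below are illustrative:

```python
# The deployment procedure as code, not as one person's memory.
DEPLOY_STEPS = [
    "apply config changes",
    "run database migrations",
    "restart services in dependency order",
    "verify health endpoints",
]

def deploy(run_step):
    """Executes the same steps in the same order, no matter who triggers it."""
    return [run_step(step) for step in DEPLOY_STEPS]

# Two engineers triggering the pipeline produce identical runs;
# the expertise lives in DEPLOY_STEPS, not in either person.
run_by_alice = deploy(lambda s: f"ok: {s}")
run_by_bob = deploy(lambda s: f"ok: {s}")
print(run_by_alice == run_by_bob)  # True
```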

Read more: Manual deployments

Knowledge silos

The deployment process knowledge is not written down or codified. It lives in one person’s head. When that person leaves or is unavailable, the knowledge gap is immediately felt. The team discovers gaps in their collective knowledge only when the person who filled those gaps is not present.

Externalizing deployment knowledge into runbooks, pipeline definitions, and infrastructure code means the on-call engineer can deploy without finding the one person who knows the steps. The pipeline definition is readable by any engineer. When a production incident requires a hotfix, the first step is “follow the procedure” rather than “find that person.”

Read more: Knowledge silos

Snowflake environments

When environments are hand-configured and differ from each other in undocumented ways, releases require someone who has memorized those differences. The person who configured the environment knows which server needs the manual step and which config file is different from the others. Without that person, the deployment is a minefield of undocumented quirks.

Environments defined as code have their differences captured in the code. Any engineer reading the infrastructure definition can understand what is deployed where and why. The deployment procedure is the same regardless of which environment is the target.
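A sketch of what "differences captured in the code" buys you: when environments are plain data, drift is a one-line diff rather than institutional memory. The keys and values here are illustrative assumptions:

```python
ENVIRONMENTS = {
    "staging":    {"app_version": "2.4.1", "db_pool_size": 10, "tls": True},
    "production": {"app_version": "2.4.1", "db_pool_size": 50, "tls": True},
}

def diff_environments(a, b):
    """Every difference between two environments, readable by any engineer."""
    ea, eb = ENVIRONMENTS[a], ENVIRONMENTS[b]
    return {k: (ea.get(k), eb.get(k))
            for k in set(ea) | set(eb)
            if ea.get(k) != eb.get(k)}

print(diff_environments("staging", "production"))
# {'db_pool_size': (10, 50)} - the only intentional difference, in writing
```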

Read more: Snowflake environments

Missing deployment pipeline

A pipeline codifies deployment knowledge as executable code. Every step is documented, versioned, and runnable by any team member. The pipeline is the answer to “how do we deploy” - not a person, not a wiki page, but an automated procedure that the team maintains together.

Without a pipeline, the knowledge of how to deploy stays in the people who have done it. The release manager’s calendar remains a constraint on when the team can ship because no executable procedure exists that someone else could follow in their place. Any engineer can trigger the pipeline; no one can trigger another person’s institutional memory.

Read more: Missing deployment pipeline

How to narrow it down

  1. Can any engineer on the team deploy to production without help? If not, the deployment process has concentrations of required knowledge. Start with Knowledge silos.
  2. Is the deployment process automated end to end? If a human runs deployment steps manually, expertise concentrates by default. Start with Manual deployments.
  3. Do environments have undocumented configuration differences? If different environments require different steps known only to certain people, the environments are the knowledge trap. Start with Snowflake environments.
  4. Does a written pipeline definition exist in version control? If not, the team has no shared, authoritative record of the deployment process. Start with Missing deployment pipeline.

Ready to fix this? The most common cause is Manual deployments. Start with its How to Fix It section for week-by-week steps.

2.17 - Security Review Is a Gate, Not a Guardrail

Changes queue for weeks waiting for central security review. Security slows delivery rather than enabling it.

What you are seeing

The queue for security review is weeks long. Changes that are otherwise ready to deploy sit waiting while the central security team works through backlog from across the organization. When security review finally happens, it is often a cursory check because the backlog pressure is too high for thorough review.

Security reviews happen late in the development cycle, after development is complete and the team has moved on to new work. When the security team identifies a real issue, it requires context-switching back to code written weeks ago. Developers have forgotten the details. The fix takes longer than it would have if the security issue had been caught during development.

The security team does not scale with development velocity. As the organization ships more, the security queue grows. The team has learned to front-load reviews for “obviously security-sensitive” changes and skip or rush reviews for everything else - exactly the wrong approach. The changes that seem routine are often where vulnerabilities hide.

Common causes

Missing deployment pipeline

Security tools can be integrated directly into the pipeline: dependency scanning, static analysis, secret detection, container image scanning. When these checks run automatically on every commit, they catch issues immediately - while the developer still has the code in mind and fixing is fast. The central security team can focus on policy and architecture rather than reviewing individual changes.

A pipeline with automated security gates provides continuous, scalable security coverage. The coverage is consistent because it runs on every change, not just the ones that reach the security team’s queue. Issues are caught in minutes rather than weeks.
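As a taste of what such a pipeline stage looks like, here is a minimal secret-detection check. The patterns are illustrative, not a real ruleset:

```python
import re

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                 # AWS access key ID shape
    re.compile(r"(?i)password\s*=\s*['\"][^'\"]+"),  # hardcoded password
]

def scan(lines):
    """Return (line_number, line) pairs that look like committed secrets."""
    return [(n, line) for n, line in enumerate(lines, start=1)
            if any(p.search(line) for p in SECRET_PATTERNS)]

findings = scan([
    "db_host = 'localhost'",
    "password = 'hunter2'",
])
print(findings)   # the stage fails the build when this list is non-empty
```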

Read more: Missing deployment pipeline

CAB gates

The same dynamics that make change advisory boards a bottleneck for general changes apply to security review gates. Manual approval at the end of the process creates a queue. The queue grows when the team ships more than the reviewers can process. Calendar-driven release cycles create bursts of review requests at predictable times.

Moving security left - into development tooling and pipeline gates rather than release gates - eliminates the end-of-process queue entirely. Security feedback during development is faster and cheaper than security review after development.

Read more: CAB gates

Manual regression testing gates

When security review is one of several manual gates a change must pass, the waits compound. A change waiting for regression testing cannot enter the security review queue. A change completing security review cannot go to production until the regression window opens. Each gate adds its own queue, and together they stretch the total lead time for a change.
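A back-of-envelope model of the compounding, with illustrative numbers rather than measurements:

```python
# Sequential manual gates: a change cannot enter the next queue
# until it clears the previous one, so the waits add up.
waits_in_days = {
    "wait for regression window": 5.0,
    "manual regression pass":     2.0,
    "wait in security queue":    10.0,
    "security review":            1.0,
}
dev_time = 0.5   # the change itself took half a day to write

total_wait = sum(waits_in_days.values())
lead_time = dev_time + total_wait
print(f"lead time: {lead_time:.1f} days, {total_wait / lead_time:.0%} of it waiting")
```

Under these assumptions the change spends 97% of its lead time waiting in queues; removing any one gate shortens every change's path, not just the security-sensitive ones.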

Automated testing eliminates the regression testing gate, which reduces how many changes are stacked up waiting for security review at any given time. A change that exits automated testing immediately enters the security queue rather than waiting for a regression window to open. Shrinking the queue makes each security review faster and more thorough - which is what was lost when backlog pressure turned reviews into cursory checks.

Read more: Manual regression testing gates

How to narrow it down

  1. Does the team have automated security scanning in the CI pipeline? If not, security coverage depends on the central security team’s capacity, which does not scale. Start with Missing deployment pipeline.
  2. Is security review a manual approval gate before every production deployment? If changes cannot deploy without explicit security approval, the gate is the constraint. Start with CAB gates.
  3. Do changes queue for multiple manual approvals in sequence? If security review is one of several sequential gates, reducing other gates will also reduce security review pressure. Start with Manual regression testing gates.

Ready to fix this? The most common cause is Missing deployment pipeline. Start with its How to Fix It section for week-by-week steps.

2.18 - Services Reach Production with No Health Checks or Alerting

No criteria exist for what a service needs before going live. New services deploy to production with no observability in place.

What you are seeing

A new service ships and the team moves on. Three weeks later, an on-call engineer is paged for a production incident involving that service. They open the monitoring dashboard and find nothing. No metrics, no alerts, no logs aggregation, no health endpoint. The service has been running in production for three weeks without anyone being able to tell whether it was healthy.

The problem is not that engineers forgot. It is that nothing prevented shipping without it. “Ready to deploy” means the feature is complete and tests pass. It does not mean the service exposes a health endpoint, publishes metrics to the monitoring system, has alerts configured for error rate and latency, or appears in the on-call runbook. These are treated as optional improvements to add later, and later rarely comes.

As the team owns more services, the operational burden grows unevenly. Some services have mature observability built over years of incidents. Others are invisible. On-call engineers learn which services are opaque and dread incidents that involve them. The services most likely to cause undiscovered problems are exactly the ones hardest to observe when problems occur.

Common causes

Blind operations

When observability is not a team-wide practice and value, it does not get built into new services by default. Services are built to the standard in place when they were written. If the team did not have a culture of shipping with health checks and alerting, early services were shipped without them. Each new service follows the existing pattern.

Establishing observability as a first-class delivery requirement - part of the definition of done for any service - ensures that new services ship with production readiness built in rather than bolted on after the first incident. The situation where a service runs unmonitored in production for weeks stops occurring because no service can reach production without meeting the standard.

Read more: Blind operations

Missing deployment pipeline

A pipeline can enforce deployment standards as a condition of promotion to production. A pipeline stage that checks for a functioning health endpoint, at least one defined alert, and the service appearing in the runbook prevents services from bypassing the standard. When the check fails, the deployment fails, and the engineer must add the missing observability before proceeding.

Without this gate in the pipeline, observability requirements are advisory. Engineers who are under deadline pressure deploy without meeting them. The standard becomes aspirational rather than enforced.
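A sketch of such a gate as a pipeline step. The service metadata shape is an assumption for illustration; a real check would probe the live health endpoint and query the alerting system:

```python
def readiness_failures(service):
    """Checks the production-readiness standard; an empty list means promote."""
    failures = []
    if not service.get("health_endpoint"):
        failures.append("no health endpoint")
    if not service.get("alerts"):
        failures.append("no alerts defined")
    if not service.get("runbook_entry"):
        failures.append("not in the on-call runbook")
    return failures

ready = {"health_endpoint": "/healthz", "alerts": ["error-rate"], "runbook_entry": True}
not_ready = {"health_endpoint": "/healthz", "alerts": [], "runbook_entry": False}

print(readiness_failures(ready))      # [] -> promotion proceeds
print(readiness_failures(not_ready))  # gate fails the deployment, with reasons
```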

Read more: Missing deployment pipeline

How to narrow it down

  1. Does the deployment pipeline check for a functioning health endpoint before production deployment? If not, services can ship without health checks and nobody will know until an incident. Start with Missing deployment pipeline.
  2. Does the team have an explicit standard for what a service needs before it goes to production? If the standard does not exist or is not enforced, services will reflect individual engineer habits rather than a team baseline. Start with Blind operations.
  3. Are there services in production with no associated alerts? If yes, those services will cause incidents that the team discovers from user reports rather than monitoring. Start with Blind operations.

Ready to fix this? The most common cause is Blind operations. Start with its How to Fix It section for week-by-week steps.

2.19 - Staging Passes but Production Fails

Deployments pass every pre-production check but break when they reach production.

What you are seeing

Code passes tests, QA signs off, staging looks fine. Then the release hits production and something breaks: a feature behaves differently, a dependent service times out, or data that never appeared in staging triggers an unhandled edge case.

The team scrambles to roll back or hotfix. Confidence in the pipeline drops. People start adding more manual verification steps, which slows delivery without actually preventing the next surprise.

Common causes

Snowflake Environments

When each environment is configured by hand (or was set up once and has drifted since), staging and production are never truly the same. Different library versions, different environment variables, different network configurations. Code that works in one context silently fails in another because the environments are only superficially similar.

Read more: Snowflake Environments

Blind Operations

Sometimes the problem is not that staging passes and production fails. It is that production failures go undetected until a customer reports them. Without monitoring and alerting, the team has no way to verify production health after a deploy. “It works in staging” becomes the only signal, and production problems surface hours or days late.

Read more: Blind Operations

Tightly Coupled Monolith

Hidden dependencies between components mean that a change in one area affects behavior in another. In staging, these interactions may behave differently because the data is smaller, the load is lighter, or a dependent service is stubbed. In production, the full weight of real usage exposes coupling the team did not know existed.

Read more: Tightly Coupled Monolith

Manual Deployments

When deployment involves human steps (running scripts by hand, clicking through a console, copying files), the process is never identical twice. A step skipped in staging, an extra configuration applied in production, a different order of operations. The deployment itself becomes a source of variance between environments.

Read more: Manual Deployments

How to narrow it down

  1. Are your environments provisioned from the same infrastructure code? If not, or if you are not sure, start with Snowflake Environments.
  2. How did you discover the production failure? If a customer or support team reported it rather than an automated alert, start with Blind Operations.
  3. Does the failure involve a different service or module than the one you changed? If yes, the issue is likely hidden coupling. Start with Tightly Coupled Monolith.
  4. Is the deployment process identical and automated across all environments? If not, start with Manual Deployments.

Ready to fix this? The most common cause is Snowflake Environments. Start with its How to Fix It section for week-by-week steps.


2.20 - Deploying Stateful Services Causes Outages

Services holding in-memory state drop connections, lose sessions, or cause cache invalidation spikes on every redeployment.

What you are seeing

Deploying the session service drops active user sessions. Deploying the WebSocket server disconnects every connected client. Deploying the in-memory cache causes a cold-start period where every request misses cache for the next thirty minutes. The team knows which services are stateful and has developed rituals around deploying them: off-peak deployment windows, user notifications, manual drain procedures, runbooks specifying exact steps.

The rituals work until they do not. Someone deploys without the drain procedure because it was not enforced. A hotfix has to go out on a Tuesday afternoon because a security vulnerability was disclosed. The “we only deploy stateful services on weekends” policy conflicts with “we need to fix this now.” Users notice.

The underlying issue is that the deployment process does not account for the service’s stateful nature. There is no automated drain, no graceful shutdown that allows in-flight requests to complete, no mechanism for the new instance to warm up before the old one is terminated. The service was designed and deployed with no thought given to how it would be upgraded without interruption.

Common causes

Manual deployments

Stateful service deployments require precise sequencing: drain connections, allow in-flight requests to complete, terminate the old instance, start the new one, allow it to warm up before accepting traffic. Manual deployments rely on humans executing this sequence correctly under time pressure, from memory, without making mistakes.

Automated deployment pipelines that include graceful shutdown hooks, configurable drain timeouts, and health check gates before traffic routing eliminate the human sequencing requirement. The procedure is defined once, tested in lower environments, and executed consistently in production. Deployments that previously caused dropped sessions or cold-start spikes complete without service interruption because the sequencing is never skipped.
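The sequencing the pipeline encodes can be sketched in a few lines. The instance interface (drain, in_flight, healthy, stop) is an assumption for illustration:

```python
import time

def rolling_replace(old, new, drain_timeout=30.0, poll=0.01):
    """Drain the old instance, gate on the new one's health, then cut over."""
    old.drain()                                    # stop accepting new connections
    deadline = time.monotonic() + drain_timeout
    while old.in_flight() and time.monotonic() < deadline:
        time.sleep(poll)                           # let in-flight requests finish
    if not new.healthy():                          # warm-up gate before traffic
        raise RuntimeError("new instance unhealthy; old instance keeps serving")
    old.stop()
    return "traffic routed to new instance"

class StubInstance:                                # minimal stand-in for a service
    def __init__(self, requests=0, ok=True):
        self.requests, self.ok, self.stopped = requests, ok, False
    def drain(self): pass
    def in_flight(self):
        self.requests = max(0, self.requests - 1)  # requests complete over time
        return self.requests
    def healthy(self): return self.ok
    def stop(self): self.stopped = True

old_inst, new_inst = StubInstance(requests=3), StubInstance(ok=True)
print(rolling_replace(old_inst, new_inst))
```

Note the failure path: if the new instance never becomes healthy, the old one is never terminated, which is the property the manual runbook cannot guarantee under pressure.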

Read more: Manual deployments

Missing deployment pipeline

A pipeline can enforce graceful shutdown logic, connection drain periods, and health check gates as part of every deployment. Blue-green deployments - starting the new instance alongside the old one, waiting for it to become healthy, then shifting traffic - eliminate the downtime window entirely for stateless services and reduce it dramatically for stateful ones.

Without a pipeline, each deployment is a custom procedure executed by the operator on duty. The procedure may exist in a runbook, but runbooks are not enforced - they are consulted selectively and executed inconsistently.

Read more: Missing deployment pipeline

Snowflake environments

When staging environments do not replicate the stateful characteristics of production - connection volumes, session counts, cache sizes, WebSocket concurrency - the drain procedure validated in staging does not reliably translate to production behavior. A drain that completes in 30 seconds in staging may take 10 minutes in production under load.

Environments that match production in scale and configuration allow stateful deployment procedures to be validated with confidence. The drain timing is calibrated to real traffic patterns, so the procedure that completes cleanly in staging also completes cleanly in production - and deployments stop causing outages that only surface under real load.

Read more: Snowflake environments

How to narrow it down

  1. Is there an automated drain and graceful shutdown procedure for stateful services? If drain is manual or undocumented, the deployment will cause interruptions whenever the procedure is not followed perfectly. Start with Manual deployments.
  2. Does the pipeline include health check gates before routing traffic to the new instance? If traffic switches before the new instance is healthy, users hit the new instance while it is still warming up. Start with Missing deployment pipeline.
  3. Do staging environments match production in connection volume and load characteristics? If not, drain timing and warm-up behavior validated in staging will not generalize. Start with Snowflake environments.

Ready to fix this? The most common cause is Manual deployments. Start with its How to Fix It section for week-by-week steps.

2.21 - Features Must Wait for a Separate QA Team Before Shipping

Work is complete from the development team’s perspective but cannot ship until a separate QA team tests and approves it. QA has its own queue and schedule.

What you are seeing

Development marks a story done. It moves to a “ready for QA” column and waits. The QA team has its own sprint, its own backlog, and its own capacity constraints. The feature sits for three days before a QA engineer picks it up. Testing takes another two days. Feedback arrives a week after development completed. The developer has moved on to other work and has to reload context to address the comments.

Near release time, QA becomes a bottleneck. Many features arrive at once, QA capacity cannot absorb them all, and some features are held over to the next release. Defects found late in QA are more expensive to fix because other work has been built on top of the untested code. The team’s release dates become determined by QA queue depth, not by development completion.

Common causes

Siloed QA Team

When quality assurance is a separate team rather than a shared practice embedded in development, testing becomes a handoff rather than a continuous activity. Developers write code and hand it to QA. QA tests it and hands defects back. The two teams operate on different cadences. Because quality is seen as QA’s responsibility, developers write less thorough tests of their own - why duplicate the effort? The siloed structure makes late testing the structural default rather than an avoidable outcome.

Read more: Siloed QA Team

QA Signoff as a Release Gate

When QA sign-off is a formal gate that must be passed before any release, the gate creates a queue. Features arrive at the gate in batches. QA must process all of them before anything ships. If QA finds a defect, the release waits while it is fixed and retested. The gate structure means quality problems are found late, in large batches, making them expensive to fix and disruptive to release schedules.

Read more: QA Signoff as a Release Gate

How to narrow it down

  1. Is there a “waiting for QA” column on the board, and do items spend days there? If work regularly accumulates waiting for QA to pick it up, the team has a handoff bottleneck rather than a continuous quality practice. Start with Siloed QA Team.
  2. Can the team deploy without QA sign-off? If QA approval is a required step before any production release, the gate creates batch testing and late defect discovery. Start with QA Signoff as a Release Gate.

Ready to fix this? The most common cause is Siloed QA Team. Start with its How to Fix It section for week-by-week steps.


3 - Integration and Feedback Problems

Symptoms related to work-in-progress, integration pain, review bottlenecks, and feedback speed.

These symptoms indicate problems with how work flows through your team. When integration is deferred, feedback is slow, or work piles up, the team stays busy without finishing things. Each page describes what you are seeing and links to the anti-patterns most likely causing it.

Browse by category

How to use this section

Start with the symptom that matches what your team experiences. Each symptom page explains what you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic questions to narrow down which cause applies to your situation. Follow the anti-pattern link to find concrete fix steps.

Related anti-pattern categories: Team Workflow Anti-Patterns, Branching and Integration Anti-Patterns

Related guides: Trunk-Based Development, Work Decomposition, Limiting WIP

3.1 - Integration and Pipeline Problems

Code integration, merging, pipeline speed, and feedback loop problems.

Symptoms related to how code gets integrated, how the pipeline processes changes, and how fast the team gets feedback.

3.1.1 - Every Change Rebuilds the Entire Repository

A single repository with multiple applications and no selective build tooling. Any commit triggers a full rebuild of everything.

What you are seeing

The CI build takes 45 minutes for every commit because the pipeline rebuilds every application and runs every test regardless of what changed. The team chose a monorepo for good reasons - code sharing is simpler, cross-cutting changes are atomic, and dependency management is more coherent - but the pipeline has no awareness of what actually changed. Changing a comment in Service A triggers a full rebuild of Services B, C, D, and E.

Developers have adapted by batching changes to reduce the number of CI runs they wait through - one CI run per hour instead of one per commit. The batching reintroduces the integration problems the monorepo was supposed to solve: when multiple changes are combined in a single commit, the team can no longer bisect a failure to any individual change.

The build system treats the entire repository as a single unit. Service owners have added scripts to skip unmodified services, but the scripts are fragile and not consistently maintained. The CI system was not designed for selective builds, so every workaround is an unsupported hack on top of an ill-fitting tool.

Common causes

Missing deployment pipeline

Pipelines that understand which services changed - using build tools that model the dependency graph or change detection based on file paths - can selectively build and test only what was affected by a commit. Without this investment, pipelines treat the monorepo as a single unit and rebuild everything.

Tools like Nx, Bazel, or Turborepo provide dependency graph awareness for monorepos. A pipeline built on these tools builds only what needs to be rebuilt and runs only the tests that could be affected by the change. Feedback loops shorten from 45 minutes to 5.
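The core of change detection is small. A minimal sketch that maps changed file paths to their owning project and rebuilds only that project plus its direct dependents (real tools like Nx and Bazel also walk transitive dependencies):

```python
# Illustrative dependency graph: which project depends on which.
DEPENDS_ON = {
    "service-a": set(),
    "service-b": {"lib-core"},
    "service-c": {"lib-core", "lib-auth"},
}

def owning_project(path):
    return path.split("/", 1)[0]   # "lib-core/src/cache.py" -> "lib-core"

def affected(changed_files):
    changed = {owning_project(p) for p in changed_files}
    dependents = {proj for proj, deps in DEPENDS_ON.items() if deps & changed}
    return changed | dependents

print(sorted(affected(["lib-core/src/cache.py"])))
# ['lib-core', 'service-b', 'service-c'] - service-a is not rebuilt
```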

Read more: Missing deployment pipeline

Manual deployments

When deployment is manual, there is no automated mechanism to determine which services changed and which need to be deployed. Manual review determines what to deploy, which is slow and inconsistent. Inconsistency leads to either over-deploying (deploying everything to be safe) or under-deploying (missing services that changed).

Automated deployment pipelines with change detection deploy exactly the services that changed, with evidence of what changed and why.

Read more: Manual deployments

How to narrow it down

  1. Does the pipeline build and test only the services affected by a change? If every commit triggers a full rebuild, change detection is not implemented. Start with Missing deployment pipeline.
  2. How long does a typical CI run take? If it takes more than 10 minutes regardless of what changed, the pipeline is not leveraging the monorepo’s dependency information. Start with Missing deployment pipeline.
  3. Can the team deploy a single service from the monorepo without triggering deployments of all services? If not, deployment automation does not understand the monorepo structure. Start with Manual deployments.

Ready to fix this? The most common cause is Missing deployment pipeline. Start with its How to Fix It section for week-by-week steps.

3.1.2 - Feedback Takes Hours Instead of Minutes

The time from making a change to knowing whether it works is measured in hours, not minutes. Developers batch changes to avoid waiting.

What you are seeing

A developer makes a change and wants to know if it works. They push to CI and wait 45 minutes for the pipeline. Or they open a PR and wait two days for a review. Or they deploy to staging and wait for a manual QA pass that happens next week. By the time feedback arrives, the developer has moved on to something else.

The slow feedback changes developer behavior. They batch multiple changes into a single commit to avoid waiting multiple times. They skip local verification and push larger, less certain changes. They start new work before the previous change is validated, juggling multiple incomplete tasks.

When feedback finally arrives and something is wrong, the developer must context-switch back. The mental model from the original change has faded. Debugging takes longer because the developer is working from memory rather than from active context. If multiple changes were batched, the developer must untangle which one caused the failure.

Common causes

Inverted Test Pyramid

When most tests are slow E2E tests, the test feedback loop is measured in tens of minutes rather than seconds. Unit tests provide feedback in seconds. E2E tests take minutes or hours. A team with a fast unit test suite can verify a change in under a minute. A team whose testing relies on E2E tests cannot get feedback faster than those tests can run.

Read more: Inverted Test Pyramid

Integration Deferred

When the team does not integrate frequently (at least daily), the feedback loop for integration problems is as long as the branch lifetime. A developer working on a two-week branch does not discover integration conflicts until they merge. Daily integration catches conflicts within hours. Continuous integration catches them within minutes.

Read more: Integration Deferred

Manual Testing Only

When there are no automated tests, the only feedback comes from manual verification. A developer makes a change and must either test it manually themselves (slow) or wait for someone else to test it (slower). Automated tests provide feedback in the pipeline without requiring human effort or scheduling.

Read more: Manual Testing Only

Long-Lived Feature Branches

When pull requests wait days for review, the code review feedback loop dominates total cycle time. A developer finishes a change in two hours, then waits two days for review. The review feedback loop is 24 times longer than the development time. Long-lived branches produce large PRs, and large PRs take longer to review. Fast feedback requires fast reviews, which requires small PRs, which requires short-lived branches.

Read more: Long-Lived Feature Branches

Manual Regression Testing Gates

When every change must pass through a manual QA gate, the feedback loop includes human scheduling. The QA team has a queue. The change waits in line. When the tester gets to it, days have passed. Automated testing in the pipeline replaces this queue with instant feedback.

Read more: Manual Regression Testing Gates

How to narrow it down

  1. How fast can the developer verify a change locally? If the local test suite takes more than a few minutes, the test strategy is the bottleneck. Start with Inverted Test Pyramid.
  2. How frequently does the team integrate to main? If developers work on branches for days before integrating, the integration feedback loop is the bottleneck. Start with Integration Deferred.
  3. Are there automated tests at all? If the only feedback is manual testing, the lack of automation is the bottleneck. Start with Manual Testing Only.
  4. How long do PRs wait for review? If review turnaround is measured in days, the review process is the bottleneck. Start with Long-Lived Feature Branches.
  5. Is there a manual QA gate in the pipeline? If changes wait in a QA queue, the manual gate is the bottleneck. Start with Manual Regression Testing Gates.

Ready to fix this? The most common cause is Inverted Test Pyramid. Start with its How to Fix It section for week-by-week steps.


3.1.3 - Merging Is Painful and Time-Consuming

Integration is a dreaded, multi-day event. Teams delay merging because it is painful, which makes the next merge even worse.

What you are seeing

A developer has been working on a feature branch for two weeks. They open a pull request and discover dozens of conflicts across multiple files. Other developers have changed the same areas of the codebase. Resolving the conflicts takes a full day. Some conflicts are straightforward (two people edited adjacent lines), but others are semantic (two people changed the same function’s behavior in different ways). The developer must understand both changes to merge correctly.

After resolving conflicts, the tests fail. The merged code compiles but does not work because the two changes are logically incompatible. The developer spends another half-day debugging the interaction. By the time the branch is merged, the developer has spent more time integrating than they spent building the feature.

The team knows merging is painful, so they delay it. The delay makes the next merge worse because more code has diverged. The cycle repeats until someone declares a “merge day” and the team spends an entire day resolving accumulated drift.

Common causes

Long-Lived Feature Branches

When branches live for weeks or months, they accumulate divergence from the main line. The longer the branch lives, the more changes happen on main that the branch does not include. At merge time, all of that divergence must be reconciled at once. A branch that is one day old has almost no conflicts. A branch that is two weeks old may have dozens.

Read more: Long-Lived Feature Branches

Integration Deferred

When the team does not practice continuous integration (integrating to main at least daily), each developer’s work diverges independently. The build may be green on each branch but broken when branches combine. CI means integrating continuously, not running a build server. Without frequent integration, merge pain is inevitable.

Read more: Integration Deferred

Monolithic Work Items

When work items are too large to complete in a day or two, developers must stay on a branch for the duration. A story that takes a week forces a week-long branch. Breaking work into smaller increments that can be integrated daily eliminates the divergence window that causes painful merges.

Read more: Monolithic Work Items

How to narrow it down

  1. How long do branches typically live before merging? If branches live longer than two days, the branch lifetime is the primary driver of merge pain. Start with Long-Lived Feature Branches.
  2. Does the team integrate to main at least once per day? If developers work in isolation for days before integrating, they are not practicing continuous integration regardless of whether a CI server exists. Start with Integration Deferred.
  3. How large are the typical work items? If stories take a week or more, the work decomposition forces long branches. Start with Monolithic Work Items.

Ready to fix this? The most common cause is Long-Lived Feature Branches. Start with its How to Fix It section for week-by-week steps.

3.1.4 - Each Language Has Its Own Ad Hoc Pipeline

Services in five languages with five build tools and no shared pipeline patterns. Each service is a unique operational snowflake.

What you are seeing

The Java service has a Jenkins pipeline set up four years ago. The Python service has a GitHub Actions workflow written by a consultant. The Go service has a Makefile. The Node.js service deploys from a developer’s laptop. The Ruby service has no deployment automation at all. Each deployment is a discipline unto itself, maintained by whoever last touched it.

Onboarding a new engineer requires learning five different deployment systems. Fixing a security vulnerability in the dependency scanning step requires five separate changes across five pipeline definitions, each with different syntax. A compliance requirement that all services log deployment events requires five separate implementations, each time reinventing the pattern.

The team knows consolidation would help but cannot agree on a standard. The Java developers prefer their workflow. The Python developers prefer theirs. The effort to migrate any service to a common pattern feels risky because the current approach, however ad hoc, is known to work.

Common causes

Missing deployment pipeline

Without an organizational standard for pipeline design, each team or individual who sets up a service makes an independent choice based on personal familiarity. Establishing a standard pipeline pattern - even a minimal one - gives new services a starting point and gives existing services a target to migrate toward. Each service that adopts the standard is one fewer ad hoc pipeline to maintain separately.
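A standard pattern does not require every service to share a build tool - only a shared sequence of stages, with each service supplying its own commands. A minimal sketch of that idea (the stage names, service names, and commands below are invented for illustration):

```python
# Sketch of a standard pipeline pattern: one fixed stage sequence shared by
# all services, with only the build/test commands varying per service.
# Service entries and default commands are invented examples.

STANDARD_STAGES = ["build", "test", "scan", "deploy"]

SERVICES = {
    "java-api":  {"build": "mvn package", "test": "mvn test"},
    "py-worker": {"build": "pip install .", "test": "pytest"},
}

def pipeline_for(service: str) -> list[str]:
    """Expand the shared stage list into concrete steps for one service."""
    cmds = SERVICES[service]
    # Stages every service runs identically come from the shared pattern;
    # only build and test vary per service.
    defaults = {"scan": "run-dependency-scan", "deploy": "deploy-via-standard-tool"}
    return [cmds.get(stage, defaults.get(stage, "")) for stage in STANDARD_STAGES]

print(pipeline_for("py-worker"))
# → ['pip install .', 'pytest', 'run-dependency-scan', 'deploy-via-standard-tool']
```

The payoff is in the shared stages: fixing the dependency-scanning step now means changing `defaults` once, not editing five pipeline definitions with five syntaxes.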

Read more: Missing deployment pipeline

Knowledge silos

Each pipeline is understood only by the person who built it. Changes require that person. Debugging requires that person. When that person leaves, the pipeline becomes a black box that nobody wants to touch. The knowledge of “how the Ruby service deploys” is not shared across the team.

When pipeline patterns are standardized and documented, any team member can understand, debug, and improve any service’s pipeline. The knowledge is in the pattern, not in the person.

Read more: Knowledge silos

Manual deployments

Services that start with manual deployment accumulate automation piecemeal, in whatever form the person adding automation prefers. Without a standard, each automation effort produces a different result. The accumulation of five different automation approaches is harder to maintain than one standard approach applied to five services.

Read more: Manual deployments

How to narrow it down

  1. Does the team have a standard pipeline pattern that all services follow? If each service has a unique pipeline structure, start with establishing the standard. Start with Missing deployment pipeline.
  2. Can any engineer on the team deploy any service? If deploying a specific service requires the person who set it up, the pipeline knowledge is siloed. Start with Knowledge silos.
  3. Are there services with no deployment automation at all? Start with those services. Start with Manual deployments.

Ready to fix this? The most common cause is Missing deployment pipeline. Start with its How to Fix It section for week-by-week steps.

3.1.5 - Pull Requests Sit for Days Waiting for Review

Pull requests queue up and wait. Authors have moved on by the time feedback arrives.

What you are seeing

A developer opens a pull request and waits. Hours pass. A day passes. They ping someone in chat. Eventually, comments arrive, but the author has moved on to something else and has to reload context to respond. Another round of comments. Another wait. The PR finally merges two or three days after it was opened.

The team has five or more open PRs at any time. Some are days old. Developers start new work while they wait, which creates more PRs, which creates more review load, which slows reviews further.

Common causes

Long-Lived Feature Branches

When developers work on branches for days, the resulting PRs are large. Large PRs take longer to review because reviewers need more time to understand the scope of the change. A 300-line PR is daunting. A 50-line PR takes 10 minutes. The branch length drives the PR size, which drives the review delay.

Read more: Long-Lived Feature Branches

Knowledge Silos

When only specific individuals can review certain areas of the codebase, those individuals become bottlenecks. Their review queue grows while other team members who could review are not considered qualified. The constraint is not review capacity in general but review capacity for specific code areas concentrated in too few people.

Read more: Knowledge Silos

Push-Based Work Assignment

When work is assigned to individuals, reviewing someone else’s code feels like a distraction from “my work.” Every developer has their own assigned stories to protect. Helping a teammate finish their work by reviewing their PR competes with the developer’s own assignments. The incentive structure deprioritizes collaboration.

Read more: Push-Based Work Assignment

How to narrow it down

  1. Are PRs larger than 200 lines on average? If yes, the reviews are slow because the changes are too large to review quickly. Start with Long-Lived Feature Branches and the work decomposition that feeds them.
  2. Are reviews waiting on specific individuals? If most PRs are assigned to or waiting on one or two people, the team has a knowledge bottleneck. Start with Knowledge Silos.
  3. Do developers treat review as lower priority than their own coding work? If yes, the team’s norms do not treat review as a first-class activity. Start with Push-Based Work Assignment and establish a team working agreement that reviews happen before starting new work.

Ready to fix this? The most common cause is Long-Lived Feature Branches. Start with its How to Fix It section for week-by-week steps.

3.1.6 - The Team Resists Merging to the Main Branch

Developers feel unsafe committing to trunk. Feature branches persist for days or weeks before merge.

What you are seeing

Everyone still has long-lived feature branches. The team agreed to try trunk-based development, but three sprints later “merge to trunk when the feature is done” is the informal rule. Branches live for days or weeks. When developers finally merge, there are conflicts. The conflicts take hours to resolve. Everyone agrees this is a problem but nobody knows how to break the cycle.

The core objection is safety: “I’m not going to push half-finished code to main.” This is a reasonable concern in the current environment. The main branch has no automated test suite that would catch regressions quickly. There is no feature flag infrastructure to let partially built features live in production in a dormant state. Trunk-based development feels reckless because the prerequisites for it are not in place.

The team is not wrong to feel unsafe. They are wrong to believe long-lived branches are safer. The longer a branch lives, the larger the eventual merge, the more conflicts, and the more risk concentrated into the merge event. The fear of merging to trunk is rational, but the response makes the underlying problem worse.

Common causes

Manual testing only

Without a fast automated test suite, merging to trunk means accepting unknown risk. Developers protect themselves by deferring the merge until they have done sufficient manual verification - which takes days. Teams with a fast automated suite that runs in minutes find the resistance dissolves. When a broken commit is caught in five minutes, committing to trunk stops feeling reckless and starts feeling like the obvious way to work.

Read more: Manual testing only

Manual regression testing gates

When a manual QA phase gates each release, trunk is never truly releasable. Merging to trunk does not mean the code is production-ready - it still has to pass manual testing. This reduces the psychological pressure to keep trunk releasable. The team does not feel the cost of a broken trunk immediately because it is not the signal they monitor.

When trunk is the thing that gates production, a broken trunk is a fire drill - every minute it is broken is a minute the team cannot ship. That urgency is what makes developers take frequent integration seriously. Without it, the resistance to committing to trunk has no natural counter-pressure.

Read more: Manual regression testing gates

Long-lived feature branches

Feature branch habits are self-reinforcing. Teams with ingrained feature branch practices have calibrated their workflows, tools, and feedback loops to the batching model. Switching to trunk-based development requires changing all of those workflows simultaneously, which is disorienting.

The habits that make long-lived branches feel safe - waiting to merge until the feature is complete, doing final testing on the branch, getting full review before touching trunk - are the same habits that keep the resistance alive. Small, deliberate workflow changes - reviewing smaller units, integrating while work is in progress, getting feedback from the pipeline rather than a gated review - reduce the resistance step by step rather than requiring an all-at-once mindset shift.

Read more: Long-lived feature branches

Monolithic work items

Large work items cannot be integrated to trunk incrementally without deliberate design. A story that takes three weeks requires either keeping a branch for three weeks, or learning to hide in-progress work behind feature flags, dark launch patterns, or abstraction layers. Without those techniques, large items force long-lived branches.

Decomposing work into smaller items that can be integrated to trunk in a day or two makes trunk-based development natural rather than effortful.
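The feature-flag technique mentioned above can be sketched in a few lines: unfinished code merges to trunk daily but stays dormant until the flag is switched on. (The flag name, storage, and checkout functions here are illustrative assumptions - a real system would read flags from configuration or a flag service.)

```python
# Minimal feature-flag sketch: in-progress code lives on trunk but is
# unreachable in production until the flag is enabled. Flag storage is a
# plain dict here; real systems read from config or a flag service.

FLAGS = {"new_checkout_flow": False}  # off in production while in progress

def is_enabled(flag: str) -> bool:
    return FLAGS.get(flag, False)

def checkout(cart):
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)   # in-progress code, safe to merge daily
    return legacy_checkout(cart)    # current behavior, unchanged for users

def legacy_checkout(cart):
    return f"legacy:{len(cart)}"

def new_checkout(cart):
    return f"new:{len(cart)}"

print(checkout(["book", "pen"]))  # → legacy:2 while the flag is off
```

Because the dormant path is never executed in production, the developer can integrate the half-finished `new_checkout` to trunk every day without exposing users to it.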

Read more: Monolithic work items

How to narrow it down

  1. Does the team have an automated test suite that runs in under 10 minutes? If not, the feedback loop needed to make frequent trunk commits safe does not exist. Start with Manual testing only.
  2. Is trunk always releasable? If releases require a manual QA phase regardless of trunk state, there is no incentive to keep trunk releasable. Start with Manual regression testing gates.
  3. Do work items typically take more than two days to complete? If items take longer than two days, integrating to trunk daily requires techniques for hiding in-progress work. Start with Monolithic work items.

Ready to fix this? The most common cause is Long-lived feature branches. Start with its How to Fix It section for week-by-week steps.

3.1.7 - Pipelines Take Too Long

Pipelines take 30 minutes or more. Developers stop waiting and lose the feedback loop.

What you are seeing

A developer pushes a commit and waits. Thirty minutes pass. An hour. The pipeline is still running. The developer context-switches to another task, and by the time the pipeline finishes (or fails), they have moved on mentally. If the build fails, they must reload context, figure out what went wrong, fix it, push again, and wait another 30 minutes.

Developers stop running the full test suite locally because it takes too long. They push and hope. Some developers batch multiple changes into a single push to avoid waiting multiple times, which makes failures harder to diagnose. Others skip the pipeline entirely for small changes and merge with only local verification.

The pipeline was supposed to provide fast feedback. Instead, it provides slow feedback that developers work around rather than rely on.

Common causes

Inverted Test Pyramid

When most of the test suite consists of end-to-end or integration tests rather than unit tests, the pipeline is dominated by slow, resource-intensive test execution. E2E tests launch browsers, spin up services, and wait for network responses. A test suite with thousands of unit tests (that run in seconds) and a small number of targeted E2E tests is fast. A suite with hundreds of E2E tests and few unit tests is slow by construction.
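The arithmetic behind “slow by construction” is worth seeing directly. The test counts and per-test durations below are illustrative assumptions, not measurements from any real suite:

```python
# Illustrative comparison of serial pipeline time under two test
# distributions. All counts and per-test durations are assumptions
# chosen for the example.

def suite_minutes(unit_tests, unit_secs, e2e_tests, e2e_secs):
    """Total wall-clock minutes to run a suite serially."""
    return (unit_tests * unit_secs + e2e_tests * e2e_secs) / 60

# Healthy pyramid: thousands of fast unit tests, a handful of E2E journeys.
pyramid = suite_minutes(unit_tests=3000, unit_secs=0.01, e2e_tests=20, e2e_secs=30)

# Inverted pyramid: few unit tests, hundreds of slow E2E tests.
inverted = suite_minutes(unit_tests=200, unit_secs=0.01, e2e_tests=400, e2e_secs=30)

print(f"pyramid:  {pyramid:.1f} min")   # → pyramid:  10.5 min
print(f"inverted: {inverted:.1f} min")  # → inverted: 200.0 min
```

Under these assumptions the E2E tests dominate both suites, which is why shifting coverage down the pyramid - not buying faster build agents - is the lever that changes the total.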

Read more: Inverted Test Pyramid

Snowflake Environments

When pipeline environments are not standardized or reproducible, builds include extra time for environment setup, dependency installation, and configuration. Caching is unreliable because the environment state is unpredictable. A pipeline that spends 15 minutes downloading dependencies because there is no reliable cache layer is slow for infrastructure reasons, not test reasons.

Read more: Snowflake Environments

Tightly Coupled Monolith

When the codebase has no clear module boundaries, every change triggers a full rebuild and a full test run. The pipeline cannot selectively build or test only the affected components because the dependency graph is tangled. A change to one module might affect any other module, so the pipeline must verify everything.
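With clear module boundaries, the pipeline can compute which modules a change could affect and test only those. A sketch of that computation (the module graph below is an invented example):

```python
# Sketch: given a module dependency graph, find every module that could be
# affected by a change and therefore needs rebuilding and retesting.
# The graph is an invented example.

# Edges point from a module to the modules that depend on it.
DEPENDENTS = {
    "auth":    ["api", "admin"],
    "api":     ["web"],
    "admin":   [],
    "web":     [],
    "billing": ["web"],
}

def affected_by(changed: set[str]) -> set[str]:
    """Changed modules plus everything that transitively depends on them."""
    result, stack = set(changed), list(changed)
    while stack:
        for dep in DEPENDENTS.get(stack.pop(), []):
            if dep not in result:
                result.add(dep)
                stack.append(dep)
    return result

print(sorted(affected_by({"billing"})))  # → ['billing', 'web']
print(sorted(affected_by({"auth"})))     # → ['admin', 'api', 'auth', 'web']
```

In a tightly coupled monolith the graph degenerates: every module transitively depends on every other, so `affected_by` always returns everything and selective testing buys nothing. That is the architectural reason every change triggers a full run.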

Read more: Tightly Coupled Monolith

Manual Regression Testing Gates

When the pipeline includes a manual testing phase, the wall-clock time from push to green includes human wait time. A pipeline that takes 10 minutes to build and test but then waits two days for manual sign-off is not a 10-minute pipeline. It is a two-day pipeline with a 10-minute automated prefix.

Read more: Manual Regression Testing Gates

How to narrow it down

  1. What percentage of pipeline time is spent running tests? If test execution dominates and most tests are E2E or integration tests, the test strategy is the bottleneck. Start with Inverted Test Pyramid.
  2. How much time is spent on environment setup and dependency installation? If the pipeline spends significant time on infrastructure before any tests run, the build environment is the bottleneck. Start with Snowflake Environments.
  3. Can the pipeline build and test only the changed components? If every change triggers a full rebuild, the architecture prevents selective testing. Start with Tightly Coupled Monolith.
  4. Does the pipeline include any manual steps? If a human must approve or act before the pipeline completes, the human is the bottleneck. Start with Manual Regression Testing Gates.

Ready to fix this? The most common cause is Inverted Test Pyramid. Start with its How to Fix It section for week-by-week steps.

3.1.8 - The Team Is Caught Between Shipping Fast and Not Breaking Things

A cultural split between shipping speed and production stability. Neither side sees how CD resolves the tension.

What you are seeing

The team is divided. Developers want to ship often and trust that fast feedback will catch problems. Operations and on-call engineers want stability and fewer changes to reason about during incidents. Both positions are defensible. The conflict is real and recurs in every conversation about deployment frequency, change windows, and testing requirements.

The team has reached an uncomfortable equilibrium. Developers batch changes to deploy less often, which partially satisfies the stability concern but creates larger, riskier releases. Operations accepts the change window constraints, which gives them predictability but means the team cannot respond quickly to urgent fixes. Nobody is getting what they actually want.

What neither side sees is that the conflict is a symptom of the current deployment system, not an inherent tradeoff. Deployments are risky because they are large and infrequent. They are large and infrequent because of the process and tooling around them. A system that makes deployments small, fast, automated, and reversible changes the equation: frequent small changes are less risky than infrequent large ones.

Common causes

Manual deployments

Manual deployments are slow and error-prone, which makes the stability concern rational. When deployments require hours of careful manual execution, limiting their frequency does reduce overall human error exposure. The stability faction’s instinct is correct given the current deployment mechanism.

Automated deployments that execute the same steps identically every time eliminate most human error from the deployment process. When the deployment mechanism is no longer a variable, the speed-vs-stability argument shifts from “how often should we deploy” to “how good is the code we are deploying” - a question both sides can agree on.

Read more: Manual deployments

Missing deployment pipeline

Without a pipeline with automated tests, health checks, and rollback capability, the stability concern is valid. Each deployment is a manual, unverified process that could go wrong in novel ways. A pipeline that enforces quality gates before production and detects problems immediately after deployment changes the risk profile of frequent deployments fundamentally.

When the team can deploy with high confidence and roll back automatically if something goes wrong, the frequency of deployments stops being a risk factor. The risk per deployment is low when each deployment is small, tested, and reversible.
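Mechanically, “reversible” can be as simple as: deploy the new version, run a health check, and restore the previous version if the check fails. A minimal sketch, where the deploy and health-check functions are stand-ins for real platform calls:

```python
# Sketch of deploy-then-verify-then-rollback. The deploy and health-check
# callables are stand-ins; a real pipeline would call the platform's API.

def deploy_with_rollback(deploy, healthy, current_version, new_version):
    """Deploy new_version; if the health check fails, restore current_version."""
    deploy(new_version)
    if healthy(new_version):
        return new_version           # small, tested change is now live
    deploy(current_version)          # automatic, immediate rollback
    return current_version

# Simulated infrastructure for the example:
live = []
result = deploy_with_rollback(
    deploy=live.append,
    healthy=lambda v: v != "v2-bad",  # pretend v2-bad fails its health check
    current_version="v1",
    new_version="v2-bad",
)
print(result, live)  # → v1 ['v2-bad', 'v1']  (rollback restored v1)
```

When this loop runs on every deployment, a bad change is live only for the duration of one health check - which is what makes frequent small deployments lower-risk than infrequent large ones.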

Read more: Missing deployment pipeline

Pressure to skip testing

When testing is perceived as an obstacle to shipping speed, teams cut tests to go faster. This worsens stability, which intensifies the stability faction’s resistance to more frequent deployments. The speed-vs-stability tension is partly created by the belief that quality and speed are in opposition - a belief reinforced by the experience of shipping faster by skipping tests and then dealing with the resulting production incidents.

Read more: Pressure to skip testing

Deadline-driven development

When velocity is measured by features shipped to a deadline, every hour spent on test infrastructure, deployment automation, or operational excellence is an hour not spent on the deadline. The incentive structure creates the tension by rewarding speed while penalizing the investment that would make speed safe.

Read more: Deadline-driven development

How to narrow it down

  1. Is the deployment process automated and consistent? If deployments are manual and variable, the stability concern is about process risk, not just code risk. Start with Manual deployments.
  2. Does the team have automated testing and fast rollback? Without these, deploying frequently is genuinely riskier than deploying infrequently. Start with Missing deployment pipeline.
  3. Does management pressure the team to ship faster by cutting testing? If yes, the tension is being created from above rather than within the team. Start with Pressure to skip testing.

Ready to fix this? The most common cause is Manual deployments. Start with its How to Fix It section for week-by-week steps.

3.2 - Work Management and Flow Problems

WIP overload, cycle time, planning bottlenecks, and dependency coordination problems.

Symptoms related to how work is planned, prioritized, and moved through the delivery process.

3.2.1 - Blocked Work Sits Idle Instead of Being Picked Up

When a developer is stuck, the item waits with them rather than being picked up by someone else. The team has no mechanism for redistributing blocked work.

What you are seeing

A developer opens a ticket on Monday and hits a blocker by Tuesday - a missing dependency, an unclear requirement, an area of the codebase they don’t understand well. They flag it in standup. The item sits in “in progress” for two more days while they work around the blocker or wait for it to resolve. Nobody picks it up.

The board shows items stuck in the same column for days. Blockers get noted but rarely acted on by other team members. At sprint review, several items are “almost done” but not finished - each stalled at a different blocker that a teammate could have resolved quickly.

Common causes

Push-Based Work Assignment

When work belongs to an assigned individual, nobody else feels authorized to touch it. Other team members see the blocked item but do not pick it up because it is “someone else’s story.” The assigned developer is expected to resolve their own blockers, even when a teammate could clear the issue in minutes. The team’s norm is individual ownership, so swarming - the highest-value response to a blocker - never happens.

Read more: Push-Based Work Assignment

Knowledge Silos

When only the assigned developer understands the relevant area of the codebase, other team members cannot help even when they want to. The blocker persists until the assigned person resolves it because nobody else has the context to take over. Swarming is not possible because the knowledge needed to continue the work lives in one person.

Read more: Knowledge Silos

How to narrow it down

  1. Does the blocked item sit with the assigned developer rather than being picked up by someone else? If teammates see the blocker flagged in standup and do not act on it, the norm of individual ownership is preventing swarming. Start with Push-Based Work Assignment.
  2. Could a teammate help if they had more context about that area of the codebase? If knowledge is too concentrated to allow handoff, silos are compounding the problem. Start with Knowledge Silos.

Ready to fix this? The most common cause is Push-Based Work Assignment. Start with its How to Fix It section for week-by-week steps.


3.2.2 - Completed Stories Don't Match What Was Needed

Stories are marked done but rejected at review. The developer built what the ticket described, not what the business needed.

What you are seeing

A developer finishes a story and moves it to done. The product owner reviews it and sends it back: “This isn’t quite what I meant.” The implementation is technically correct - it satisfies the acceptance criteria as written - but it misses the point of the work. The story re-enters the sprint as rework, consuming time that was not planned for.

This happens repeatedly with the same pattern: the developer built exactly what was described in the ticket, but the ticket did not capture the underlying need. Stories that seemed clearly defined come back with significant revisions. The team’s velocity looks reasonable but a meaningful fraction of that work is being done twice.

Common causes

Push-Based Work Assignment

When work is assigned rather than pulled, the developer receives a ticket without the context behind it. They were not in the conversation where the need was identified, the priority was established, or the trade-offs were discussed. They implement the ticket as written and deliver something that satisfies the description but not the intent.

In a pull system, developers engage with the backlog before picking up work. Refinement discussions and Three Amigos sessions happen with the people who will actually do the work, not with whoever happens to be assigned later. The developer who pulls a story understands why it is at the top of the backlog and what outcome it is trying to achieve.

Read more: Push-Based Work Assignment

Ambiguous Requirements

When acceptance criteria are written as checklists rather than as descriptions of user outcomes, they can be satisfied without delivering value. A story that specifies “add a confirmation dialog” can be implemented in a way that technically adds the dialog but makes it unusable. Requirements that do not express the user’s goal leave room for implementations that miss the point.

Read more: Work Decomposition

How to narrow it down

  1. Did the developer have any interaction with the product owner or user before starting the story? If the developer received only a ticket with no conversation about context or intent, the assignment model is isolating them from the information they need. Start with Push-Based Work Assignment.
  2. Are the acceptance criteria expressed as user outcomes or as implementation checklists? If criteria describe what to build rather than what the user should be able to do, the requirements do not encode intent. Start with Work Decomposition and look at how stories are written and refined.

Ready to fix this? The most common cause is Push-Based Work Assignment. Start with its How to Fix It section for week-by-week steps.


3.2.3 - Stakeholders See Working Software Only at Release Time

There is no cadence for incremental demos. Feedback on what was built arrives months after decisions were made.

What you are seeing

Stakeholders do not see working software until a feature is finished. The team works for six weeks on a new feature, demonstrates it at the sprint review, and the response is: “This is good, but what we actually needed was slightly different. Can we change the navigation so it does X? And actually, we do not need this section at all.” Six weeks of work needs significant rethinking. The changes are scoped as follow-on work for the next planning cycle.

The problem is not that stakeholders gave bad requirements. It is that requirements look different when demonstrated as working software rather than described in user stories. Stakeholders genuinely did not know what they wanted until they saw what they said they wanted. This is normal and expected. The system that would make this feedback cheap - frequent demonstrations of small working increments - is not in place.

When stakeholder feedback arrives months after decisions, course corrections are expensive. Architecture that needs to change has been built on top of for months. The initial decisions have become load-bearing walls. Rework is disproportionate to the insight that triggered it.

Common causes

Monolithic work items

Large work items are not demonstrable until they are complete. A feature that takes six weeks cannot be shown incrementally because it is not useful in partial form. Stakeholders see nothing for six weeks and then see everything at once.

Small vertical slices can be demonstrated as soon as they are done - sometimes multiple times per week. Each slice is a unit of working, demonstrable software that stakeholders can evaluate and respond to while the team is still in the context of that work.

Read more: Monolithic work items

Horizontal slicing

When work is organized by technical layer, nothing is demonstrable until all layers are complete. An API layer with no UI and a UI component that calls no API are both invisible to stakeholders. The feature exists in pieces that stakeholders cannot evaluate individually.

Vertical slices deliver thin but complete functionality that stakeholders can actually use. Each slice has a visible outcome rather than a technical contribution to a future visible outcome.

Read more: Horizontal slicing

Undone work

When the definition of “done” does not include deployed and available for stakeholder review, work piles up as “done but not shown.” The sprint review demonstrates a batch of completed work rather than continuously integrated increments. The delay between completion and review is the source of the feedback lag.

When done means deployed - and the team can demonstrate software in a production-like environment at any sprint review - the feedback loop tightens to the sprint cadence rather than the release cadence.

Read more: Undone work

Deadline-driven development

When delivery is organized around fixed dates rather than continuous value delivery, stakeholder checkpoints are scheduled at release boundaries. The mid-quarter check-in is a status update, not a demonstration of working software. Stakeholders’ ability to redirect the team’s work is limited to the brief window around each release.

Read more: Deadline-driven development

How to narrow it down

  1. Can the team demonstrate working software every sprint, not just at release? If demos require a release, work is batched too long. Start with Undone work.
  2. Do stories regularly take more than one sprint to complete? If features are too large to show incrementally, start with Monolithic work items.
  3. Are stories organized by technical layer? If the UI team and the API team must both finish before anything can be demonstrated, start with Horizontal slicing.

Ready to fix this? The most common cause is Monolithic work items. Start with its How to Fix It section for week-by-week steps.

3.2.4 - Sprint Planning Is Dominated by Dependency Negotiation

Teams can’t start work until another team finishes something. Planning sessions map dependencies rather than commit to work.

What you are seeing

Sprint planning takes hours. Half the time is spent mapping dependencies: Team A cannot start story X until Team B delivers API Y. Team B cannot deliver that until Team C finishes infrastructure work Z. The board fills with items in “blocked” status before the sprint begins. Developers spend Monday through Wednesday waiting for upstream deliverables and then rush everything on Thursday and Friday.

The dependency graph is not stable. It changes every sprint as new work surfaces new cross-team requirements. Planning sessions produce a list of items the team hopes to complete, contingent on factors outside their control. Commitments are made with invisible asterisks. When something slips - and something always slips - the team negotiates whether the miss was their fault or the fault of a dependency.

The structural problem is that teams are organized around technical components or layers rather than around end-to-end capabilities. A feature that delivers value to a user requires work from three teams because no single team owns the full stack for that capability. The teams are coupled by the feature, even if the architecture nominally separates them.

Common causes

Tightly coupled monolith

When services or components are tightly coupled, changes to one require coordinated changes in others. A change to the data model requires the API team to update their queries, which requires the frontend team to update their calls. Teams working on different parts of a tightly coupled system cannot proceed independently because the code does not allow it.

Decomposed systems with stable interfaces allow teams to work against contracts rather than against each other’s code. When an interface is stable, the consuming team can proceed without waiting for the providing team to finish. The items that spent a sprint sitting in “blocked” status start moving again because the code no longer requires the other team to act first.

Read more: Tightly coupled monolith

Distributed monolith

Services that are nominally independent but require coordinated deployment create the same dependency patterns as a monolith. Teams that own different services in a distributed monolith cannot ship independently. Every feature delivery is a joint operation involving multiple teams whose services must change and deploy together.

Services that are genuinely independent can be changed, tested, and deployed without coordination. True service independence is a prerequisite for team independence. Sprint planning stops being a dependency negotiation session when each team’s services can ship without waiting on another team’s deployment schedule.

Read more: Distributed monolith

Horizontal slicing

When teams are organized by technical layer - front end, back end, database - every user-facing feature requires coordination across all teams. The frontend team needs the API before they can build the UI. The API team needs the database schema before they can write the queries. No team can deliver a complete feature independently.

Organizing teams around vertical slices of capability - a team that owns the full stack for a specific domain - eliminates most cross-team dependencies. The team that owns the feature can deliver it without waiting on other teams.

Read more: Horizontal slicing

Monolithic work items

Large work items have more opportunities to intersect with other teams’ work. A story that takes one week and touches the data layer, the API layer, and the UI layer requires coordination with three teams at three different times. Smaller items scoped to a single layer or component can often be completed within one team without external dependencies.

Decomposing large items into smaller, more self-contained pieces reduces the surface area of cross-team interaction. Even when teams remain organized by layer, smaller items spend less time in blocked states.

Read more: Monolithic work items

How to narrow it down

  1. Does changing one team’s service require changing another team’s service? If interface changes cascade across teams, the services are coupled. Start with Tightly coupled monolith.
  2. Must multiple services deploy simultaneously to deliver a feature? If services cannot be deployed independently, the architecture is the constraint. Start with Distributed monolith.
  3. Does each team own only one technical layer? If no team can deliver end-to-end functionality, the organizational structure creates dependencies. Start with Horizontal slicing.
  4. Are work items frequently blocked waiting on another team’s deliverable? If items spend more time blocked than in progress, decompose items to reduce cross-team surface area. Start with Monolithic work items.

Ready to fix this? The most common cause is Tightly coupled monolith. Start with its How to Fix It section for week-by-week steps.

3.2.5 - Everything Started, Nothing Finished

The board shows many items in progress but few reaching done. The team is busy but not delivering.

What you are seeing

Open the team’s board on any given day. Count the items in progress. Count the team members. If the first number is significantly higher than the second, the team has a WIP problem. Every developer is working on a different story. Eight items in progress, zero done. Nothing gets the focused attention needed to finish.

At the end of the sprint, there is a scramble to close anything. Stories that were “almost done” for days finally get pushed through. Cycle time is long and unpredictable. The team is busy all the time but finishes very little.

Common causes

Push-Based Work Assignment

When managers assign work to individuals rather than letting the team pull from a prioritized backlog, each person ends up with their own queue of assigned items. WIP grows because work is distributed across individuals rather than flowing through the team. Nobody swarms on blocked items because everyone is busy with “their” assigned work.

Read more: Push-Based Work Assignment

Horizontal Slicing

When work is split by technical layer (“build the database schema,” “build the API,” “build the UI”), each layer must be completed before anything is deployable. Multiple developers work on different layers of the same feature simultaneously, all “in progress,” none independently done. WIP is high because the decomposition prevents any single item from reaching completion quickly.

Read more: Horizontal Slicing

Unbounded WIP

When the team has no explicit constraint on how many items can be in progress simultaneously, there is nothing to prevent WIP from growing. Developers start new work whenever they are blocked, waiting for review, or between tasks. Without a limit, the natural tendency is to stay busy by starting things rather than finishing them.

Read more: Unbounded WIP

How to narrow it down

  1. Does each developer have their own assigned backlog of work? If yes, the assignment model prevents swarming and drives individual queues. Start with Push-Based Work Assignment.
  2. Are work items split by technical layer rather than by user-visible behavior? If yes, items cannot be completed independently. Start with Horizontal Slicing.
  3. Is there any explicit limit on how many items can be in progress at once? If no, the team has no mechanism to stop starting and start finishing. Start with Unbounded WIP.

Ready to fix this? The most common cause is Push-Based Work Assignment. Start with its How to Fix It section for week-by-week steps.

3.2.6 - Vendor Release Cycles Constrain the Team's Deployment Frequency

Upstream systems deploy quarterly or downstream consumers require advance notice. External constraints set the team’s release schedule.

What you are seeing

The team is ready to deploy. But the upstream payment provider releases their API once a quarter and the new version the team depends on is not live yet. Or the downstream enterprise consumer the team integrates with requires 30 days’ advance notice before any API change goes live. The team’s own deployment readiness is irrelevant - external constraints set the schedule.

The team adapts by aligning their release cadence with their most constraining external dependency. If one vendor deploys quarterly, the team deploys quarterly. Every advance the team makes in internal deployment speed is nullified by the external constraint. The most sophisticated internal pipeline in the world still produces a team that ships four times per year.

Some external constraints are genuinely fixed. A payment network’s settlement schedule, regulatory reporting requirements, hardware firmware update cycles - these cannot be accelerated. But many “external” constraints turn out to be negotiable, possible to work around through abstraction, or simply assumed to be fixed without ever being tested.

Common causes

Tightly coupled monolith

When the team’s system is tightly coupled to third-party systems at the technical level, any change to either side requires coordinated deployment. The integration code is tightly bound to specific vendor API versions, specific response shapes, specific timing assumptions. Wrapping third-party integrations in adapter layers creates the abstraction needed to deploy the team’s side independently.

An adapter that isolates the team’s code from vendor-specific details can handle multiple API versions simultaneously. The team can deploy their adapter update, leaving the old vendor path active until the vendor’s new version is available, then switch.
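As a minimal sketch of what such an adapter can look like - all names here (`PaymentVendorAdapter`, `Charge`, the endpoint paths, and the payload shapes) are hypothetical, not any specific vendor’s API:

```python
from dataclasses import dataclass


@dataclass
class Charge:
    # The shape the rest of the codebase depends on -
    # deliberately independent of any vendor payload.
    charge_id: str
    amount_cents: int
    status: str


class PaymentVendorAdapter:
    """Isolates the team's code from vendor-specific response shapes.

    Both API versions are supported at once, so the adapter can be
    deployed before the vendor's v2 is live and switched over later.
    """

    def __init__(self, client, use_v2: bool = False):
        self._client = client  # hypothetical vendor HTTP client
        self._use_v2 = use_v2  # flipped once the vendor's v2 is available

    def charge(self, amount_cents: int) -> Charge:
        if self._use_v2:
            raw = self._client.post("/v2/charges", {"amount": amount_cents})
            return Charge(raw["id"], raw["amount"], raw["state"])
        raw = self._client.post("/v1/charge", {"amount_cents": amount_cents})
        return Charge(raw["charge_id"], raw["amount_cents"], raw["status"])
```

The rest of the codebase only ever sees `Charge`, so switching vendor versions is a change to one class rather than to every call site.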

Read more: Tightly coupled monolith

Distributed monolith

When the team’s services must be deployed in coordination with other systems - whether internal or external - the coupling forces joint releases. Each deployment event becomes a multi-party coordination exercise. The team cannot ship independently because their services are not actually independent.

Services that expose stable interfaces and handle both old and new protocol versions simultaneously can be deployed and upgraded without coordinating with consumers. That interface stability is what removes the external constraint: the team can ship on their own schedule because changing one side no longer requires the other side to change at the same time.
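One common way to achieve that on the providing side is a tolerant request handler that accepts both payload versions during the transition. A small sketch, using a hypothetical shipment service and made-up field names:

```python
def parse_shipment_request(payload: dict) -> dict:
    """Accept both the old flat schema and the new nested schema.

    Supporting both shapes simultaneously lets this service deploy on
    its own schedule while each consumer migrates when it is ready.
    """
    if "address" in payload:  # new (v2) nested shape
        addr = payload["address"]
        return {"street": addr["street"], "city": addr["city"]}
    # old (v1) flat shape, kept alive until all consumers have migrated
    return {"street": payload["street"], "city": payload["city"]}
```

Once telemetry shows no v1 traffic remains, the old branch is deleted in a separate, trivially reviewable change.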

Read more: Distributed monolith

Missing deployment pipeline

Without a pipeline, there is no mechanism for gradual migrations - running old and new integration paths simultaneously during a transition period. Switching to a new vendor API requires deploying new code that breaks old behavior unless both paths are maintained in parallel.

A pipeline with feature flag support can activate the new vendor integration for a subset of traffic, validate it against real load, and then complete the migration when confidence is established. This decouples the team’s deployment from the vendor’s release schedule.
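A sketch of how such a percentage-based flag might route traffic - the flag name, clients, and `fetch_rates` function are illustrative assumptions, not a specific flag library’s API:

```python
import hashlib


def rollout_bucket(key: str) -> int:
    """Deterministic 0-99 bucket so a given entity always takes the same path."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % 100


def fetch_rates(order_id: str, flags: dict, old_client, new_client):
    """Route a percentage of traffic to the new vendor integration.

    `flags["new_vendor_pct"]` is raised from 0 toward 100 across deploys;
    both integration paths stay live until the migration completes.
    """
    if rollout_bucket(order_id) < flags.get("new_vendor_pct", 0):
        return new_client.rates(order_id)  # new vendor integration
    return old_client.rates(order_id)      # existing path, still active
```

Because the bucket is derived from a stable key rather than a random draw, raising the percentage only ever moves entities from the old path to the new one, never back and forth.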

Read more: Missing deployment pipeline

How to narrow it down

  1. Is the team’s code tightly bound to specific vendor API versions? If the integration cannot handle multiple vendor versions simultaneously, every vendor change requires a coordinated deployment. Start with Tightly coupled monolith.
  2. Must the team coordinate deployment timing with external parties? If yes, the interfaces between systems do not support independent deployment. Start with Distributed monolith.
  3. Can the team run old and new integration paths simultaneously? If switching to a new vendor version is a hard cutover, the pipeline does not support gradual migration. Start with Missing deployment pipeline.

Ready to fix this? The most common cause is Tightly coupled monolith. Start with its How to Fix It section for week-by-week steps.

3.2.7 - Services in the Same Portfolio Have Wildly Different Maturity Levels

Some services have full pipelines and coverage. Others have no tests and are deployed manually. No consistent baseline exists.

What you are seeing

Some services have full pipelines, comprehensive test coverage, automated deployment, and monitoring dashboards. Others have no tests, no pipeline, and are deployed by copying files onto a server. Both sit in the same team’s portfolio. The team’s CD practices apply to the modern ones. The legacy ones exist outside them.

Improving the legacy services feels impossible to prioritize. They are not blocking any immediate feature work. The incidents they cause are infrequent enough to accept. Adding tests, setting up a pipeline, and improving the deployment process are multi-week investments with no immediate visible output. They compete for sprint capacity against features that have product owners and deadlines.

The maturity gap widens over time. The modern services get more capable as the team’s CD practices improve. The legacy ones stay frozen. Eventually they represent a liability: they cannot benefit from any of the team’s improved practices, they are too risky to touch, and they handle increasingly critical functionality as other services are modernized around them.

Common causes

Missing deployment pipeline

Services without pipelines cannot participate in the team’s CD practices. The pipeline is the foundation on which automated testing, deployment automation, and observability build. A service with no pipeline is a service that will always require manual attention for every change.

Establishing a minimal viable pipeline for every service - even if it just runs existing tests and provides a deployment command - closes the gap between the modern services and the legacy ones. A service with even a basic pipeline can participate in the team’s practices and improve from there; a service with no pipeline cannot improve at all.

Read more: Missing deployment pipeline

Thin-spread teams

Teams spread across too many services and responsibilities cannot allocate the focused investment needed to bring lower-maturity services up to standard. Each sprint, the urgency of visible work displaces the sustained effort that improvement requires: investment in a legacy service delivers nothing demonstrable for weeks, so it keeps losing the prioritization contest.

Teams with appropriate scope relative to capacity can allocate improvement time in each sprint. A team that owns two services instead of six can invest in both. A team that owns six has to accept that four will be neglected.

Read more: Thin-spread teams

How to narrow it down

  1. Does every service in the team’s portfolio have an automated deployment pipeline? If not, identify which services lack pipelines and why. Start with Missing deployment pipeline.
  2. Does the team have time to improve services that are not actively producing incidents? If improvement work is always displaced by feature or incident work, the team is spread too thin. Start with Thin-spread teams.
  3. Are there services the team owns but is afraid to touch? Fear of touching a service is a strong indicator that the service lacks the safety nets (tests, pipeline, documentation) needed for safe modification.

Ready to fix this? The most common cause is Missing deployment pipeline. Start with its How to Fix It section for week-by-week steps.

3.2.8 - Some Developers Are Overloaded While Others Wait for Work

Work is distributed unevenly across the team. Some developers are chronically overloaded while others finish early and wait for new assignments.

What you are seeing

Sprint planning ends with everyone assigned roughly the same number of story points. By midweek, two developers have finished their work and are waiting for something new, while three others are behind and working evenings to catch up. The imbalance repeats every sprint, but the people who are overloaded shift unpredictably.

At standup, some developers report being blocked or overwhelmed while others report nothing to do. Managers respond by reassigning work in flight, which disrupts both the giver and the receiver. The team’s throughput is limited by the most overloaded members even when others have capacity.

Common causes

Push-Based Work Assignment

When managers distribute work at sprint planning, they are estimating in advance how long each item will take and who is the right person for it. Those estimates are routinely wrong. Some items take twice as long as expected; others finish in half the time. Because work was pre-assigned, there is no mechanism for the team to self-balance. Fast finishers wait for new assignments while slow finishers fall behind, regardless of available team capacity.

In a pull system, workloads balance automatically: whoever finishes first pulls the next highest-priority item. No manager needs to predict durations or redistribute work mid-sprint.

Read more: Push-Based Work Assignment

Thin-Spread Teams

When a team is responsible for too many products or codebases, workload spikes in one area cannot be absorbed by people working in another. Each developer is already committed to their domain. The team cannot rebalance because work is siloed by system ownership rather than flowing to whoever has capacity.

Read more: Thin-Spread Teams

How to narrow it down

  1. Does work get assigned at sprint planning and rarely change hands afterward? If assignments are fixed at the start of the sprint and the team has no mechanism for rebalancing mid-sprint, the assignment model is the root cause. Start with Push-Based Work Assignment.
  2. Are developers unable to help with overloaded areas because they don’t know the codebase? If the team cannot rebalance because knowledge is siloed, people are locked into their assigned domain even when they have capacity. Start with Thin-Spread Teams and Knowledge Silos.

Ready to fix this? The most common cause is Push-Based Work Assignment. Start with its How to Fix It section for week-by-week steps.


3.2.9 - Work Stalls Waiting for the Platform or Infrastructure Team

Teams cannot provision environments, update configurations, or access infrastructure without filing a ticket and waiting for a separate platform or ops team to act.

What you are seeing

A team needs a new environment for testing, a configuration value updated, a database instance provisioned, or a new service account created. They file a ticket. The platform team has its own backlog and prioritization process. The ticket sits for two days, then a week. The team’s sprint work is blocked until it is resolved. When the platform team delivers, there is a round of back-and-forth because the request was not specific enough, and the team waits again.

This happens repeatedly across different types of requests: compute resources, network access, environment variables, secrets, certificates, DNS entries. Each one is a separate ticket, a separate queue, a separate wait. Developers learn to front-load requests at the beginning of sprints to get ahead of the lead time, but the lead times shift and the requests still arrive too late.

Common causes

Separate Ops/Release Team

When infrastructure and platform work is owned by a separate team, developers have no path to self-service. Every infrastructure need becomes a cross-team request. The platform team is optimizing its own backlog, which may not align with the delivery team’s priorities. The structural separation means that the team doing the work and the team enabling the work have different schedules, different priorities, and different definitions of urgency.

Read more: Separate Ops/Release Team

No On-Call or Operational Ownership

When delivery teams do not own their infrastructure and operational concerns, they have no incentive or capability to build self-service tooling. The platform team owns the infrastructure and therefore controls access to it. Teams that own their own operations build automation and self-service interfaces because the cost of tickets falls on them. Teams that don’t own operations accept the ticket queue because there is no alternative.

Read more: No On-Call or Operational Ownership

How to narrow it down

  1. Does the team file tickets for infrastructure changes that should take minutes? If provisioning a test environment or updating a config value requires a cross-team request and a multi-day wait, the team lacks self-service capability. Start with Separate Ops/Release Team.
  2. Does the team own the operational concerns of what they build? If another team manages production, monitoring, and infrastructure for the delivery team’s services, the delivery team has no path to self-service. Start with No On-Call or Operational Ownership.

Ready to fix this? The most common cause is Separate Ops/Release Team. Start with its How to Fix It section for week-by-week steps.


3.2.10 - Work Items Take Days or Weeks to Complete

Stories regularly take more than a week from start to done. Developers go days without integrating.

What you are seeing

A developer picks up a work item on Monday. By Wednesday, they are still working on it. By Friday, it is “almost done.” The following Monday, they are fixing edge cases. The item finally moves to review mid-week as a 300-line pull request that the reviewer does not have time to look at carefully.

Cycle time is measured in weeks, not days. The team commits to work at the start of the sprint and scrambles at the end. Estimates are off by a factor of two because large items hide unknowns that only surface mid-implementation.

Common causes

Horizontal Slicing

When work is split by technical layer rather than by user-visible behavior, each item spans an entire layer and takes days to complete. “Build the database schema,” “build the API,” “build the UI” are each multi-day items. Nothing is deployable until all layers are done. Vertical slicing (cutting thin slices through all layers to deliver complete functionality) produces items that can be finished in one to two days.

Read more: Horizontal Slicing

Monolithic Work Items

When the team takes requirements as they arrive without breaking them into smaller pieces, work items are as large as the feature they describe. A ticket titled “Add user profile page” hides a login form, avatar upload, email verification, notification preferences, and password reset. Without a decomposition practice during refinement, items arrive at planning already too large to flow.

Read more: Monolithic Work Items

Long-Lived Feature Branches

When developers work on branches for days or weeks, the branch and the work item are the same size: large. The branching model reinforces large items because there is no integration pressure to finish quickly. Trunk-based development creates natural pressure to keep items small enough to integrate daily.

Read more: Long-Lived Feature Branches

Push-Based Work Assignment

When work is assigned to individuals, swarming is not possible. If the assigned developer hits a blocker - a dependency, an unclear requirement, a missing skill - they work around it alone rather than asking for help. Asking for help means pulling a teammate away from their own assigned work, so developers hesitate. Items sit idle while the assigned person waits or context-switches rather than the team collectively resolving the blocker.

Read more: Push-Based Work Assignment

How to narrow it down

  1. Are work items split by technical layer? If the board shows items like “backend for feature X” and “frontend for feature X,” the decomposition is horizontal. Start with Horizontal Slicing.
  2. Do items arrive at planning without being broken down? If items go from “product owner describes a feature” to “developer starts coding” without a decomposition step, start with Monolithic Work Items.
  3. Do developers work on branches for more than a day? If yes, the branching model allows and encourages large items. Start with Long-Lived Feature Branches.
  4. Do blocked items sit idle rather than getting picked up by another team member? If work stalls because it “belongs to” the assigned person and nobody else touches it, the assignment model is preventing swarming. Start with Push-Based Work Assignment.

Ready to fix this? The most common cause is Monolithic Work Items. Start with its How to Fix It section for week-by-week steps.


3.3 - Developer Experience Problems

Tooling friction, environment setup, local development, and codebase maintainability problems.

Symptoms related to the tools, environments, and codebase conditions that slow developers down day to day.

3.3.1 - AI Tooling Slows You Down Instead of Speeding You Up

It takes longer to explain the task to the AI, review the output, and fix the mistakes than it would to write the code directly.

What you are seeing

A developer opens an AI chat window to implement a function. They spend ten minutes writing a prompt that describes the requirements, the constraints, the existing patterns in the codebase, and the edge cases. The AI generates code. The developer reads through it line by line because they have no acceptance criteria to verify against. They spot that it uses a different pattern than the rest of the codebase and misses a constraint they mentioned. They refine the prompt. The AI produces a second version. It is better but still wrong in a subtle way. The developer fixes it by hand. Total time: forty minutes. Writing it themselves would have taken fifteen.

This is not a one-time learning curve. It happens repeatedly, on different tasks, across the team. Developers report that AI tools help with boilerplate and unfamiliar syntax but actively slow them down on tasks that require domain knowledge, codebase-specific patterns, or non-obvious constraints. The promise of “10x productivity” collides with the reality that without clear acceptance criteria, reviewing AI output means auditing the implementation detail by detail - which is often harder than writing the code from scratch.

Common causes

Skipping Specification and Prompting Directly

The most common cause of AI slowdown is jumping straight to code generation without defining what the change should do. Instead of writing an intent description, BDD scenarios, and acceptance criteria first, the developer writes a long prompt that mixes requirements, constraints, and implementation hints into a single message. The AI guesses at the scope. The developer reviews line by line because they have no checklist of expected behaviors. The prompt-review-fix cycle repeats until the output is close enough.

The specification workflow from the Agent Delivery Contract exists to prevent this. When the developer defines the intent (what the change should accomplish), the BDD scenarios (observable behaviors), and the acceptance criteria (how to verify correctness) before generating code, the AI has a constrained target and the developer has a checklist. If the specification for a single change takes more than fifteen minutes, the change is too large - split it.

Agents can help with specification itself. The agent-assisted specification workflow uses agents to find gaps in your intent, draft BDD scenarios, and surface edge cases - all before any code is generated. This front-loads the work where it is cheapest: in conversation, not in implementation review.
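One lightweight way to make the checklist concrete is to write the scenarios as executable tests before prompting. Everything below is hypothetical (`normalize_username` is a stand-in for whatever the agent is asked to implement); the point is that the tests exist first, and the generated output is accepted when they pass rather than audited line by line:

```python
# Scenarios written BEFORE any code is generated. Together they form
# the checklist the agent's output is verified against.

def normalize_username(raw: str) -> str:
    # Stand-in for the agent-generated implementation.
    cleaned = raw.strip().lower()
    if not cleaned:
        raise ValueError("username must not be empty")
    return cleaned


def test_lowercases_and_strips_whitespace():
    # Given a username typed with stray casing and spaces
    # When it is normalized
    # Then the canonical form is lowercase with no surrounding whitespace
    assert normalize_username("  Ada.Lovelace ") == "ada.lovelace"


def test_rejects_blank_input():
    # Given a whitespace-only username
    # Then normalization fails loudly instead of returning ""
    try:
        normalize_username("   ")
        raised = False
    except ValueError:
        raised = True
    assert raised
```

If writing scenarios like these for a single change takes more than a few minutes, that is the signal the change is too large to be a single generation target.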

Read more: Agent-Assisted Specification

Missing Working Agreements on AI Usage

When the team has no shared understanding of which tasks benefit from AI and which do not, developers default to using AI on everything. Some tasks - writing a parser for a well-defined format, generating test fixtures, scaffolding boilerplate - are good AI targets. Other tasks - implementing complex business rules, debugging production issues, refactoring code with implicit constraints - are poor AI targets because the context transfer cost exceeds the implementation cost.

Without a shared agreement, each developer discovers this boundary independently through wasted time.

Read more: No Shared Workflow Expectations

Knowledge Silos

When domain knowledge is concentrated in a few people, the acceptance criteria for domain-heavy work exist only in those people’s heads. They can implement the feature faster than they can articulate the criteria for an AI prompt. For developers who do not have the domain knowledge, using AI is equally slow because they lack the criteria to validate the output against. Both situations produce slowdowns for different reasons - and both trace back to domain knowledge that has not been made explicit.

Read more: Knowledge Silos

How to narrow it down

  1. Are developers jumping straight to code generation without defining intent, scenarios, and acceptance criteria first? If the prompting-reviewing-fixing cycle consistently takes longer than direct implementation, the problem is usually skipped specification, not the AI tool. Start with Agent-Assisted Specification to define what the change should do before generating code.
  2. Does the team have a shared understanding of which tasks are good AI targets? If individual developers are discovering this through trial and error, the team needs working agreements. Start with the AI Adoption Roadmap to identify appropriate use cases.
  3. Are the slowest AI interactions on tasks that require deep domain knowledge? If AI struggles most where implicit business rules govern the implementation, the problem is not the AI tool but the knowledge distribution. Start with Knowledge Silos.

Ready to fix this? Start with Agent-Assisted Specification to learn the specification workflow that front-loads clarity before code generation.

3.3.2 - AI Is Generating Technical Debt Faster Than the Team Can Absorb It

AI tools produce working code quickly, but the codebase is accumulating duplication, inconsistent patterns, and structural problems faster than the team can address them.

What you are seeing

The team adopted AI coding tools six months ago. Feature velocity increased. But the codebase is getting harder to work in. Each AI-assisted session produces code that works - it passes tests, it satisfies the acceptance criteria - but it does not account for what already exists. The AI generates a new utility function that duplicates one three files away. It introduces a third pattern for error handling in a module that already has two. It copies a data access approach that the team decided to move away from last quarter.

Nobody catches these issues in review because the review standard is “does it do what it should and how do we validate it” - which is the right standard for correctness, but it does not address structural fitness. The acceptance criteria say what the change should do. They do not say “and it should use the existing error handling pattern” or “and it should not duplicate the date formatting utility.”

The debt is invisible in metrics. Test coverage is stable or improving. Change failure rate is flat. But development cycle time is creeping up because every new change must navigate around the inconsistencies the previous changes introduced. Refactoring is harder because the AI generated code in patterns the team did not choose and would not have written.

Common causes

No Scheduled Refactoring Sessions

AI generates code faster than humans refactor it. Without deliberate maintenance sessions scoped to cleaning up recently touched files, the codebase drifts toward entropy faster than it would with human-paced development. The team treats refactoring as something that happens organically during feature work, but AI-assisted feature sessions are scoped to their acceptance criteria and do not include cleanup.

The fix is not to allow AI to refactor during feature sessions - that mixes concerns and makes commits unreviewable. It is to schedule explicit refactoring sessions with their own intent, constraints, and acceptance criteria (all existing tests still pass, no behavior changes).

Read more: Pitfalls and Metrics - Schedule refactoring as explicit sessions

No Review Gate for Structural Quality

The team’s review process validates correctness (does it satisfy acceptance criteria?) and security (does it introduce vulnerabilities?) but not structural fitness (does it fit the existing codebase?). Standard review agents check for logic errors, security defects, and performance issues. None of them check whether the change duplicates existing code, introduces a third pattern where one already exists, or violates the team’s architectural decisions.

Automating structural quality checks requires two layers in the pre-commit gate sequence.

Layer 1: Deterministic tools

Deterministic tools run before any AI review and catch mechanical structural problems without token cost. These run in milliseconds and cannot be confused by plausible-looking but incorrect code. Add them to the pre-commit hook sequence alongside lint and type checking:

  • Duplication detection (e.g., jscpd) - flags when the same code block already exists elsewhere in the codebase. When AI generates a utility that already exists three files away, this catches it before review.
  • Complexity thresholds (e.g., ESLint complexity rule, lizard) - flags functions that exceed a cyclomatic complexity limit. AI-generated code tends toward deeply nested conditionals when the prompt does not specify a complexity budget.
  • Dependency and architecture rules (e.g., dependency-cruiser, ArchUnit) - encode module boundary constraints as code. When the team decided to move away from a direct database access pattern, architecture rules make violations a build failure rather than a code review comment.

These tools encode decisions the team has already made. Each one removes a category of structural drift from the review queue entirely.
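As a sketch of how the hook sequence can be orchestrated, the following Python runner executes each deterministic check in order and stops at the first failure. The tool commands named in the comment are illustrative - substitute the team's actual lint, duplication, and architecture tools:

```python
import subprocess
import sys

def run_gate(steps):
    """Run each (name, command) check in order; stop at the first failure.

    Returns the name of the failing step, or None if every check passed.
    """
    for name, command in steps:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"pre-commit gate failed at '{name}'")
            print(result.stdout + result.stderr, end="")
            return name
    return None

# In a real hook the commands are the team's actual tools, for example
# ("duplication", ["npx", "jscpd", "--min-tokens", "50", "src"]) or
# ("architecture", ["npx", "dependency-cruiser", "--validate", "src"]).
# Harmless stub commands stand in here so the sketch runs anywhere.
gate = [
    ("lint", [sys.executable, "-c", "pass"]),
    ("duplication", [sys.executable, "-c", "pass"]),
]
print("first failure:", run_gate(gate))  # first failure: None
```

A git pre-commit hook can then invoke this runner so that no commit bypasses the sequence.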

Layer 2: Semantic review agent with architectural constraints

The semantic review agent can catch structural drift that deterministic tools cannot detect - like a third error-handling approach in a module that already has two - but only if the feature description includes architectural constraints. If the feature description covers only functional requirements, the agent has no basis for evaluating structural fit.

Add a constraints section to the feature description for every change:

  • “Use the existing UserRepository pattern - do not introduce new data access approaches”
  • “Error handling in this module follows the Result type pattern - do not introduce exceptions”
  • “New utilities belong in the shared/utils directory - do not create module-local utilities”

When generated code violates a stated constraint, the semantic review agent flags it. Without stated constraints, the agent cannot distinguish deliberate new patterns from drift.

The two layers are complementary. Deterministic tools handle mechanical violations fast and cheaply. The semantic review agent handles intent alignment and pattern consistency, but only where the feature description defines what those patterns are.

Read more: Coding and Review Agent Configuration - Semantic Review Agent

Rubber-Stamping AI-Generated Code

When developers do not own the change - cannot articulate what it does, what criteria they verified, or how they would detect a failure - they also do not evaluate whether the change fits the codebase. Structural quality requires someone to notice that the AI reinvented something that already exists. That noticing only happens when a human is engaged enough with the change to compare it against their knowledge of the existing system.

Read more: Rubber-Stamping AI-Generated Code

How to narrow it down

  1. Does the pre-commit gate include duplication detection, complexity limits, and architecture rules? If the only automated structural check is lint, the gate catches style violations but not structural drift. Add deterministic structural tools to the hook sequence described in Coding and Review Agent Configuration.
  2. Do feature descriptions include architectural constraints, not just functional requirements? If the feature description only says what the change should do but not how it should fit structurally, the semantic review agent has no basis for checking pattern conformance. Start by adding constraints to the Agent Delivery Contract.
  3. Is the team scheduling explicit refactoring sessions after feature work? If cleanup only happens incidentally during feature sessions, debt accumulates with every AI-assisted change. Start with the Pitfalls and Metrics guidance on scheduling maintenance sessions after every three to five feature sessions.
  4. Can developers identify where a new change duplicates existing code? If nobody in the review process is comparing the AI’s output against existing utilities and patterns, the team is not engaged enough with the change to catch structural drift. Start with Rubber-Stamping AI-Generated Code.

Ready to fix this? Start with the pre-commit gate. Add duplication detection and architecture rules to the hook sequence from Coding and Review Agent Configuration, then add architectural constraints to your feature description template. These two changes automate detection of the most common structural drift patterns on every change.

3.3.3 - Data Pipelines and ML Models Have No Deployment Automation

Application code has a CI/CD pipeline, but ML models and data pipelines are deployed manually or on an ad hoc schedule.

What you are seeing

ML models and data pipelines are deployed manually while application code has a full CI/CD pipeline. When a developer pushes a change to the application, tests run, an artifact is built, and deployment promotes automatically through environments. But the ML model that drives the product’s recommendations was trained two months ago and deployed by a data scientist who ran a Python script from their laptop. Nobody knows which version of the model is in production or what training data it was built on.

Data pipelines have a similar problem. The ETL job that populates the feature store was written in a Jupyter notebook, runs on a schedule via a cron job on a single server, and is updated by manually copying a new version to the server when it changes. There is no version control for the notebook, no automated tests for the pipeline logic, and no staging environment where the pipeline can be validated before it runs against production data.

Common causes

Missing deployment pipeline

The pipeline infrastructure that handles application deployments was not extended to cover model artifacts and data pipelines. Extending it requires ML-aware tooling - model registries, data versioning, training pipelines - that must be built or configured separately from standard application pipeline tools.

Establishing basic practices first - version control for pipeline code, a model registry with version tracking, automated tests for pipeline logic - creates the foundation. A minimal pipeline that validates data pipeline changes before production deployment closes the gap between how application code and model artifacts are treated, removing the dual delivery standard.
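A minimal sketch of the version-tracking foundation, assuming an in-memory registry for illustration - a real registry would persist records and integrate with the training pipeline:

```python
from dataclasses import dataclass
import hashlib

@dataclass
class ModelRecord:
    """One registered model version: enough metadata to answer
    'what is in production and what was it trained on?'"""
    version: str
    training_data_hash: str
    deployed: bool = False

class ModelRegistry:
    def __init__(self):
        self._records = {}

    def register(self, version, training_data: bytes):
        # Hash the training data so the exact inputs are auditable later.
        digest = hashlib.sha256(training_data).hexdigest()
        self._records[version] = ModelRecord(version, digest)

    def promote(self, version):
        # Exactly one version is marked as the production model.
        for record in self._records.values():
            record.deployed = record.version == version

    def production_version(self):
        for record in self._records.values():
            if record.deployed:
                return record.version
        return None

registry = ModelRegistry()
registry.register("2024-05-01", b"training snapshot A")
registry.register("2024-07-01", b"training snapshot B")
registry.promote("2024-07-01")
print(registry.production_version())  # 2024-07-01
```

Even this much answers the question nobody could answer before: which model version is live, and what data was it built on.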

Read more: Missing deployment pipeline

Manual deployments

The default for ML work is manual because the discipline of ML operations is younger than software deployment automation. Without deliberate investment, that default persists: a data scientist deploys a model by running a script, updating a config file, or copying files to a server.

Applying the same deployment automation principles to model deployment - versioned artifacts, automated promotion, health checks after deployment - closes the gap between ML and application delivery standards.
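The promotion-with-health-check flow can be sketched as follows. The deploy, health-check, and rollback callables are stand-ins for whatever platform tooling the team actually uses; only the control flow is the point:

```python
def deploy_with_health_check(version, deploy, health_check, rollback):
    """Deploy a versioned artifact, verify it, roll back automatically on failure."""
    deploy(version)
    if health_check():
        return "deployed"
    rollback()
    return "rolled back"

# Simulated platform hooks so the sketch runs anywhere.
state = {"live": "model-v1"}

def deploy(version):
    state["live"] = version

def healthy():
    # Stand-in for a real post-deployment check (serving traffic,
    # prediction latency, accuracy against a holdout sample).
    return state["live"] == "model-v2"

def rollback():
    state["live"] = "model-v1"

print(deploy_with_health_check("model-v2", deploy, healthy, rollback))  # deployed
```

The same script serves both promotion and rollback, which is exactly what a manual process cannot guarantee under pressure.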

Read more: Manual deployments

Knowledge silos

Model deployment and data pipeline operations often live with specific individuals who have the expertise and the access to execute them. When those people are unavailable, model retraining, pipeline updates, and deployment operations cannot happen. The knowledge of how the ML infrastructure works is not distributed.

Documenting deployment procedures, building runbooks for model rollback, and cross-training team members on data infrastructure operations distributes the knowledge before automation is in place.

Read more: Knowledge silos

How to narrow it down

  1. Is the currently deployed model version tracked in version control with a record of when it was deployed? If not, there is no audit trail for model deployments. Start with Missing deployment pipeline.
  2. Can any engineer deploy an updated model or data pipeline, or does it require a specific person? If specific expertise is required, the knowledge is siloed. Start with Knowledge silos.
  3. Are data pipeline changes validated in a non-production environment before running against production data? If not, data pipeline changes go directly to production without validation. Start with Manual deployments.

Ready to fix this? The most common cause is Missing deployment pipeline. Start with its How to Fix It section for week-by-week steps.

3.3.4 - The Codebase No Longer Reflects the Business Domain

Business terms are used inconsistently. Domain rules are duplicated, contradicted, or implicit. No one can explain all the invariants the system is supposed to enforce.

What you are seeing

The same business concept goes by three different names in three different modules. A rule about how orders are validated exists in the API layer, partially in a service, and also in the database - with slight differences between them. A developer making a change to the payments flow discovers undocumented assumptions mid-implementation and is not sure whether they are intentional constraints or historical accidents.

New developers cannot form a coherent mental model of the domain from the code alone. They learn by asking colleagues, but colleagues often disagree or are uncertain. The system works, mostly, but nobody can fully explain why it is structured the way it is or what would break if a particular constraint were removed.

Common causes

Thin-Spread Teams

When engineers rotate through a domain without staying long enough to understand its business rules deeply, each rotation leaves its own layer of interpretation on the codebase. One team names a concept one way. The next team introduces a parallel concept with a different name because they did not recognize the existing one. A third team adds a validation rule without knowing an equivalent rule already existed elsewhere. Over time the code reflects the sequence of teams that worked in it rather than the business domain it is supposed to model.

Read more: Thin-Spread Teams

Knowledge Silos

When the canonical understanding of the domain lives in a few individuals, the code drifts from that understanding whenever those individuals are not involved in a change. Developers without deep domain knowledge make reasonable-seeming implementation choices that violate rules they were never told about. The gap between what the domain expert knows and what the code expresses widens with each change made without them.

Read more: Knowledge Silos

How to narrow it down

  1. Are the same business concepts named differently in different parts of the codebase? If a developer must learn multiple synonyms for the same thing to navigate the code, the domain model has been interpreted independently by multiple teams. Start with Thin-Spread Teams.
  2. Can team members explain all the validation rules the system enforces, and do their explanations agree? If there is disagreement or uncertainty, domain knowledge is not shared or externalized. Start with Knowledge Silos.

Ready to fix this? The most common cause is Knowledge Silos. Start with its How to Fix It section for week-by-week steps.


3.3.5 - The Development Workflow Has Friction at Every Step

Slow CI servers, poor CLI tools, and no IDE integration. Every step in the development process takes longer than it should.

What you are seeing

The CI servers are slow. A build that should take 5 minutes takes 25 because the agents are undersized and the queue is long. The IDE has no integration with the team’s testing framework, so running a specific test requires dropping to the command line and remembering the exact invocation syntax. The deployment CLI has no tab completion and cryptic error messages. The local development environment requires a 12-step ritual to restart after any configuration change.

Individual friction points seem minor in isolation. A 20-second wait is a slight inconvenience. A missing IDE shortcut is a small annoyance. But friction compounds. A developer who waits 20 seconds, remembers a command, waits 20 more seconds, then navigates an opaque error message has spent a minute on a task that should take 5 seconds. Across ten such interactions per day, across an entire team, this is a meaningful tax on throughput.

The larger cost is attentional, not temporal. Friction interrupts flow. When a developer has to stop thinking about the problem they are solving to remember a command syntax, context-switch to a different tool, or wait for an operation to complete, they lose the thread. Flow states that make complex problems tractable are incompatible with constant context switches caused by tooling friction.

Common causes

Missing deployment pipeline

Investment in pipeline tooling - build caching, parallelized test execution, automated deployment scripts with good error messages - directly reduces the friction of getting changes to production. Teams without this investment accumulate tooling debt. Each year that passes without improving the pipeline leaves a more elaborate set of workarounds in place.

A team that treats the pipeline as a first-class product, maintained and improved the same way they maintain production code, eliminates friction points incrementally. The slow CI queue, the missing IDE integration, the opaque deployment errors - each one is a bug in the pipeline product, and bugs get fixed when someone owns the product.

Read more: Missing deployment pipeline

Manual deployments

When the deployment process is manual, there is no pressure to make the tooling ergonomic. The person doing the deployment learns the steps and adapts. Automation forces the deployment process to be scripted, which creates an interface that can be improved, tested, and measured. A deployment script with good error messages and clear output is a better tool than a deployment runbook, and it can be improved as a piece of software.
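A sketch of what "a better tool" looks like in practice: the deploy entrypoint validates its preconditions up front and reports every problem in actionable terms. The config keys and target names here are illustrative:

```python
def check_preconditions(config):
    """Validate a deploy request; return actionable error messages, empty if OK.

    A real script would also check credentials, artifact existence,
    and target health - the point is that every failure tells the
    operator what to do next.
    """
    errors = []
    if not config.get("artifact"):
        errors.append("no artifact specified - build one first, then pass its path")
    target = config.get("target")
    if target not in {"staging", "production"}:
        errors.append(f"unknown target {target!r} - expected 'staging' or 'production'")
    return errors

for message in check_preconditions({"artifact": "", "target": "prod"}):
    print("deploy blocked:", message)
```

Each message is a candidate for improvement in code review, which is how the deployment tool gets better as a piece of software.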

Read more: Manual deployments

How to narrow it down

  1. How long does a full pipeline run take? If builds take more than 10 minutes, build caching and parallelization are likely available but not implemented. Start with Missing deployment pipeline.
  2. Can a developer deploy with a single command that provides clear output? If deployment requires multiple manual steps with opaque error messages, the tooling has not been invested in. Start with Manual deployments.
  3. Are builds getting faster over time? If build time is stable or increasing, nobody is actively working on pipeline performance. Start with Missing deployment pipeline.

Ready to fix this? The most common cause is Missing deployment pipeline. Start with its How to Fix It section for week-by-week steps.

3.3.6 - Getting a Test Environment Requires Filing a Ticket

Test environments are a scarce, contended resource. Provisioning takes days and requires another team’s involvement.

What you are seeing

A developer needs a clean environment to reproduce a bug. They file a ticket with the infrastructure team requesting environment access. The ticket enters a queue. Two days later, the environment is provisioned. By that time the developer has moved on to other work, the context for the bug is cold, and the urgency has faded.

Test environments are scarce because they are expensive to create manually. The infrastructure team provisions each one by hand: configuring servers, installing dependencies, seeding databases, updating DNS. The process takes hours of skilled work. Because it takes hours, environments are treated as long-lived shared resources rather than disposable per-task resources. Multiple teams share the same staging environment, which creates contention, coordination overhead, and mysterious failures when two teams’ work interacts unexpectedly.

The team has adapted by scheduling environment usage in advance and batching testing work. These adaptations work until there is a deadline, at which point contention over shared environments becomes a delivery risk.

Common causes

Snowflake environments

When environments are configured by hand, they cannot be created on demand. The cost of creating a new environment is the same as the cost of the initial configuration: hours of skilled work. This cost makes environments permanent rather than ephemeral. Infrastructure as code and containerization make environment creation a fast, automated operation that any team member can trigger.

When environments can be created in minutes from code, they stop being scarce. A developer who needs an environment can create one, use it, and destroy it. Two teams working on conflicting features each have their own environment. Contention disappears.
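The create-use-destroy lifecycle can be sketched as a context manager that guarantees teardown. The provision and destroy callables stand in for the real infrastructure-as-code tooling (for example 'terraform apply' and 'terraform destroy'); set operations keep the sketch self-contained:

```python
import contextlib
import uuid

@contextlib.contextmanager
def ephemeral_environment(provision, destroy):
    """Create an isolated environment, hand it to the caller, always tear it down."""
    env_id = f"env-{uuid.uuid4().hex[:8]}"
    provision(env_id)
    try:
        yield env_id
    finally:
        # Teardown runs even if the work inside the block fails,
        # so environments never leak into permanence.
        destroy(env_id)

active = set()
with ephemeral_environment(active.add, active.discard) as env:
    assert env in active  # the environment exists only for the duration of the work
print("environments left running:", len(active))  # environments left running: 0
```

Because every caller gets a unique identifier, two teams testing conflicting features never touch the same environment.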

Read more: Snowflake environments

Missing deployment pipeline

Pipelines that include environment provisioning steps can spin up, run tests against, and tear down ephemeral environments as part of every run. The environment is created fresh for each test run and destroyed when the run completes. Without this capability, environments are managed manually outside the pipeline and must be shared.

A pipeline with environment provisioning gives every commit its own isolated environment. There is no ticket to file, no queue to wait in, no contention with other teams - the environment exists for the duration of the run and is gone when the run completes.

Read more: Missing deployment pipeline

Knowledge silos

The knowledge of how to provision an environment lives in the infrastructure team. Until that knowledge is codified as scripts or infrastructure code, environment creation requires a human from that team. The infrastructure team becomes a bottleneck even when they are working as fast as they can.

Externalizing environment provisioning knowledge into code - reproducible, runnable by anyone - removes the dependency on the infrastructure team for routine environment needs.

Read more: Knowledge silos

How to narrow it down

  1. Can a developer create a new isolated test environment without filing a ticket? If not, environment creation is not self-service. Start with Snowflake environments.
  2. Do multiple teams share a single staging environment? Shared environments create contention and interference. Start with Missing deployment pipeline.
  3. Is environment provisioning knowledge documented as runnable code? If provisioning requires knowing undocumented manual steps, the knowledge is siloed. Start with Knowledge silos.

Ready to fix this? The most common cause is Snowflake environments. Start with its How to Fix It section for week-by-week steps.

3.3.7 - The Deployment Target Does Not Support Modern CI/CD Tooling

Mainframes or proprietary platforms require custom integration or manual steps. CD practices stop at the boundary of the legacy stack.

What you are seeing

The deployment target is a z/OS mainframe, an AS/400, an embedded device firmware platform, or a proprietary industrial control system. The standard CI/CD tools the rest of the organization uses do not support this target. The vendor’s deployment tooling is command-line based, requires a licensed runtime, and was designed around a workflow that predates modern software delivery practices.

The team’s modern application code lives in a standard git repository with a standard pipeline for the web tier. But the batch processing layer, the financial calculation engine, or the device firmware is deployed through a completely separate process involving FTP, JCL job cards, and a deployment checklist that exists as a Word document on a shared drive.

The organization’s CD practices stop at the boundary of the modern stack. The legacy platform exists in a different operational world with different tooling, different skills, different deployment cadence, and different risk models. Bridging the two worlds requires custom integration work that is unglamorous, expensive, and consistently deprioritized.

Common causes

Manual deployments

Legacy platform deployments are almost always manual. The platform predates modern deployment automation. The deployment procedure exists in documentation and in the heads of the people who have done it. Without investment in custom tooling, mainframe deployments remain manual indefinitely.

Building automation for a mainframe or proprietary platform requires understanding both the platform’s native tools and modern automation principles. The result may not look like a standard pipeline, but it can provide the same benefits: consistent, repeatable, auditable deployments that do not require a specific person.

Read more: Manual deployments

Missing deployment pipeline

A pipeline that covers the full deployment surface - modern application code, database changes, and legacy platform components - requires platform-specific extensions. Standard pipeline tools do not ship with mainframe support, but they can be extended with custom steps that invoke platform-native tools. Without this investment, the pipeline covers only the modern stack.

Building coverage incrementally - wrapping the most common deployment operations first, then expanding - is more achievable than trying to fully automate a complex legacy deployment in one effort.

Read more: Missing deployment pipeline

Knowledge silos

Mainframe and proprietary platform skills are rare and concentrated in a shrinking pool of people. Teams typically have one or two people who understand the platform deeply. When those people leave, the deployment process becomes opaque to everyone remaining. The knowledge that enables manual deployments is not distributed and not documented in a form anyone else can use.

Deliberately distributing platform knowledge - pair deployments, written procedures, runbooks that reflect the actual current process - reduces single-person dependency even before automation is available.

Read more: Knowledge silos

How to narrow it down

  1. Can anyone on the team beyond one or two people deploy to the legacy platform? If not, knowledge concentration is the immediate risk. Start with Knowledge silos.
  2. Is the legacy platform deployment automated in any way? If completely manual, automation of even one step is a starting point. Start with Manual deployments.
  3. Is the legacy platform deployment included in the same pipeline as modern services? If it is managed outside the pipeline, it lacks all the pipeline’s safety properties. Start with Missing deployment pipeline.

Ready to fix this? The most common cause is Manual deployments. Start with its How to Fix It section for week-by-week steps.

3.3.8 - Developers Cannot Run the Pipeline Locally

The only way to know if a change passes CI is to push it and wait. Broken builds are discovered after commit, not before.

What you are seeing

A developer makes a change, commits, and pushes to CI. Thirty minutes later, the build is red. A linting rule was violated. Or a test file was missing from the commit. Or the build script uses a different version of a dependency than the developer’s local machine. The developer fixes the issue and pushes again. Another wait. Another failure - this time a test that only runs in CI and not in the local test suite.

This cycle destroys focus. The developer cannot stay in flow waiting for CI results. They switch to something else, then switch back when the notification arrives. Each context switch adds recovery time. A change that took thirty minutes to write takes two hours from first commit to green build, and the developer was not thinking about it for most of that time.

The deeper issue is that CI and local development are different environments. Tests that pass locally fail in CI because of dependency version differences, missing environment variables, or test execution order differences. The developer cannot reproduce CI failures locally, which makes them much harder to debug and creates a pattern of “push and hope” rather than “validate locally and push with confidence.”

Common causes

Missing deployment pipeline

Pipelines designed for cloud-only execution - pulling from private artifact repositories, requiring CI-specific secrets, using platform-specific compute resources - cannot run locally by construction. The pipeline was designed for the CI environment and only the CI environment.

Pipelines designed with local execution in mind use tools that run identically in any environment: containerized build steps, locally runnable test commands, shared dependency resolution. A developer running the same commands locally that the pipeline runs in CI gets the same results. The feedback loop shrinks from 30 minutes to seconds.
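One way to guarantee that local runs and CI runs execute the same thing is a single task table that both resolve their commands from. A minimal sketch, with placeholder commands:

```python
# One task table consumed by both entry points: a developer resolves
# a command from it locally, and the CI job resolves the same table,
# so the two environments cannot drift apart in what they execute.
# The commands are placeholders for the team's real build steps.
TASKS = {
    "lint": "ruff check .",
    "test": "pytest -q",
    "build": "docker build -t app .",
}

def command_for(task):
    """Resolve a task name to its single, shared command line."""
    if task not in TASKS:
        raise KeyError(f"unknown task {task!r}; known tasks: {sorted(TASKS)}")
    return TASKS[task]

print(command_for("test"))  # pytest -q
```

When the CI configuration contains nothing but calls into this table, a green local run predicts a green CI run.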

Read more: Missing deployment pipeline

Snowflake environments

When the CI environment differs from the developer’s local environment in ways that affect test outcomes, local and CI results diverge. Different OS versions, different dependency caches, different environment variables, different file system behaviors - any of these can cause tests to pass locally and fail in CI.

Standardized, code-defined environments that run identically locally and in CI eliminate the divergence. If the build step runs inside the same container image locally and in CI, the results are the same.

Read more: Snowflake environments

How to narrow it down

  1. Can a developer run every pipeline step locally? If any step requires CI-specific infrastructure, secrets, or platform features, that step cannot be validated before pushing. Start with Missing deployment pipeline.
  2. Do tests produce different results locally versus in CI? If yes, the environments differ in ways that affect test outcomes. Start with Snowflake environments.
  3. How long does a developer wait between push and feedback? If feedback takes more than a few minutes, the incentive is to batch pushes and work on something else while waiting. Start with Missing deployment pipeline.

Ready to fix this? The most common cause is Missing deployment pipeline. Start with its How to Fix It section for week-by-week steps.

3.3.9 - Setting Up a Development Environment Takes Days

New team members are unproductive for their first week. The setup guide is 50 steps long and always out of date.

What you are seeing

A new developer spends two days troubleshooting before the system runs locally. The wiki setup page was last updated 18 months ago. Step 7 refers to a tool that has been replaced. Step 12 requires access to a system that needs a separate ticket to provision. Step 19 assumes an operating system version that is three versions behind. Getting unstuck requires finding a teammate who has memorized the real procedure from experience.

The setup problem is not just a new-hire experience. It affects the entire team whenever someone gets a new machine, switches between projects, or tries to set up a second environment for a specific debugging purpose. The environment is fragile because it was assembled by hand and the assembly process was never made reproducible.

The business cost is usually invisible. Two days of new-hire setup is charged to onboarding. Senior engineers spending half a day unblocking new hires is charged to sprint work. The time developers lose avoiding new environment setups and working around the problem is charged to general productivity. None of these costs appear on a dashboard that anyone monitors.

Common causes

Snowflake environments

When development environments are not reproducible from code, the assembly process exists only in documentation (which drifts) and in the heads of people who have done it before (who are not always available). Each environment is assembled slightly differently, which means the “how to set up a development environment” question has as many answers as there are developers on the team.

When the environment definition is versioned alongside the code, setup becomes a single command. A new developer who runs that command gets the same working environment as everyone else on the team - no 18-month-old wiki page, no tribal knowledge required, no two-day troubleshooting session. When the code changes in ways that require environment changes, the environment definition is updated at the same time.
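A small piece of that codified setup is a verification script that tells a new developer exactly what their machine is missing before setup runs. A sketch, with an illustrative tool list - a real script would read the list from a file committed alongside the code:

```python
import shutil

# Mirrors the versioned environment definition; names are illustrative.
REQUIRED_TOOLS = ["git", "docker"]

def missing_tools(required):
    """Return the tools a fresh machine still needs before setup can run."""
    return [tool for tool in required if shutil.which(tool) is None]

for tool in missing_tools(REQUIRED_TOOLS):
    print(f"missing: {tool} - install it, then re-run setup")
```

Running this as the first step of the setup command turns a two-day troubleshooting session into an explicit checklist.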

Read more: Snowflake environments

Knowledge silos

The real setup procedure exists in the heads of specific team members who have run it enough times to know which steps to skip and which to do differently on which operating systems. When those people are unavailable, setup fails. The knowledge gap is only visible when someone needs it.

When environment setup is codified as runnable scripts and containers, the knowledge is distributed to everyone who can read the code. A new developer no longer has to find the one person who remembers which steps to skip - they run the script, and it works.

Read more: Knowledge silos

Tightly coupled monolith

When running any part of the application requires the full monolith running - including all its dependencies, services, and backing infrastructure - local setup is inherently complex. A developer who only needs to work on the notification service must stand up the entire application, all its databases, and all the services the notification service depends on, which is everything.

Decomposed services with stable interfaces can be developed in isolation. A developer working on the notification service stubs the services it calls and focuses on the piece they are changing. Setup is proportional to scope.
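Stubbing a dependency behind a stable interface can be as small as the following sketch; the service and directory names are hypothetical:

```python
from typing import Protocol

class UserDirectory(Protocol):
    """The stable interface the notification service depends on."""
    def email_for(self, user_id: str) -> str: ...

class NotificationService:
    # The piece under development - it needs only the interface,
    # not the rest of the monolith behind it.
    def __init__(self, directory: UserDirectory):
        self.directory = directory

    def notify(self, user_id, message):
        return f"to {self.directory.email_for(user_id)}: {message}"

class StubDirectory:
    # Stands in for the real user service during local development.
    def email_for(self, user_id):
        return f"{user_id}@example.test"

service = NotificationService(StubDirectory())
print(service.notify("u42", "build passed"))  # to u42@example.test: build passed
```

Local setup for this service is now one stub class, not the full application and its databases.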

Read more: Tightly coupled monolith

How to narrow it down

  1. Can a new team member set up a working development environment without help? If not, the setup process is not self-contained. Start with Snowflake environments.
  2. Does setup require tribal knowledge that is not captured in the documented procedure? If team members need to “fill in the gaps” from memory, that knowledge needs to be externalized. Start with Knowledge silos.
  3. Does running a single service require running the entire application? If so, local development is inherently complex. Start with Tightly coupled monolith.

Ready to fix this? The most common cause is Snowflake environments. Start with its How to Fix It section for week-by-week steps.

3.3.10 - Bugs in Familiar Areas Take Disproportionately Long to Fix

Defects that should be straightforward take days to resolve because the people debugging them are learning the domain as they go. Fixes sometimes introduce new bugs in the same area.

What you are seeing

A bug is filed against the billing module. It looks simple from the outside - a calculation is off by a percentage in certain conditions. The developer assigned to it spends a day reading code before they can even reproduce the problem reliably. The fix takes another day. Two weeks later, a related bug appears: the fix was correct for the case it addressed but violated an assumption elsewhere in the module that nobody told the developer about.

Defect resolution time in specific areas of the system is consistently longer than in others. Post-mortems note that the fix was made by someone unfamiliar with the domain. Bugs cluster in the same modules, with fixes that address the symptom rather than the underlying rule that was violated.

Common causes

Knowledge Silos

When only a few people understand a domain deeply, defects in that domain can only be resolved quickly by those people. When they are unavailable - on leave, on another team, or gone - the bug sits or gets assigned to someone who must reconstruct context before they can make progress. The reconstruction is slow, incomplete, and prone to introducing new violations of rules the developer discovers only after the fact.

Read more: Knowledge Silos

Thin-Spread Teams

When engineers are rotated through a domain based on capacity, the person available to fix a bug is often not the person who knows the domain. They are familiar with the tech stack but not with the business rules, edge cases, and historical decisions that make the module behave the way it does. Debugging becomes an exercise in reverse-engineering domain knowledge from code that may not accurately reflect the original intent.

Read more: Thin-Spread Teams

How to narrow it down

  1. Are defect resolution times consistently longer in specific modules than in others? If certain areas of the system take significantly longer to debug regardless of defect severity, those areas have a knowledge concentration problem. Start with Knowledge Silos.
  2. Do fixes in certain areas frequently introduce new bugs in the same area? If corrections create new violations, the developer fixing the bug lacks the domain knowledge to understand the full set of constraints they are working within. Start with Thin-Spread Teams.

Ready to fix this? The most common cause is Knowledge Silos. Start with its How to Fix It section for week-by-week steps.


3.4 - Team and Knowledge Problems

Team stability, knowledge transfer, collaboration, and shared practices problems.

Symptoms related to team composition, knowledge distribution, and how team members work together.

3.4.1 - The Team Has No Shared Working Hours Across Time Zones

Code reviews wait overnight. Questions block for 12+ hours. Async handoffs replace collaboration.

What you are seeing

A developer in London finishes a piece of work at 5 PM and creates a pull request. The reviewer in San Francisco is starting their day but has morning meetings and gets to the review at 2 PM Pacific - 10 PM in London, where the author is already offline. The reviewer leaves comments. The author responds the following morning. The review cycle takes four days for a change that would have taken 20 minutes with any overlap.

Integration conflicts sit unresolved for hours. The developer who could resolve the conflict is asleep when it is discovered. By the time they wake up, the main branch has moved further. Resolving the conflict now requires understanding changes made by multiple people across multiple time zones, none of whom are available simultaneously to sort it out.

The team has adapted with async-first practices: detailed PR descriptions, recorded demos, comprehensive written documentation. These adaptations reduce the cost of asynchrony but do not eliminate it. The team’s throughput is bounded by communication latency, and the work items that require back-and-forth are the most expensive.

Common causes

Long-lived feature branches

Long-lived branches mean that integration conflicts are larger and more complex when they finally surface. Resolving a small conflict asynchronously is tolerable. Resolving a three-day branch merge asynchronously is genuinely difficult - the changes are large, the context for each change is spread across people in different time zones, and the resolution requires understanding decisions made by people who are not available.

Frequent, small integrations to trunk reduce conflict size. A conflict that would have been 500 lines with a week-old branch is 30 lines when branches are integrated daily.

Read more: Long-lived feature branches

Monolithic work items

Large items create larger diffs, more complex reviews, and more integration conflicts. In a distributed team, the time cost of large items is amplified by communication overhead. A review that requires one round of comments takes one day in a distributed team. A review that requires three rounds takes three days. Large items that require extensive review are expensive by construction.

Small items have small diffs. Small diffs require fewer review rounds. Fewer review rounds means faster cycle time even with the communication latency of a distributed team.

Read more: Monolithic work items

Knowledge silos

When critical knowledge lives in one person and that person is in a different time zone, questions block for 12 or more hours. The developer in Singapore who needs to ask the database expert in London waits overnight for each exchange. Externalizing knowledge into documentation, tests, and code comments reduces the per-question communication overhead.

When the answer to a common question is in a runbook, a developer does not need to wait for the one person who knows. The knowledge is available regardless of time zone.

Read more: Knowledge silos

How to narrow it down

  1. What is the average number of review round-trips for a pull request? Each round-trip adds approximately one day of latency in a distributed team. Reducing item size reduces review complexity. Start with Monolithic work items.
  2. How often do integration conflicts require synchronous discussion to resolve? If conflicts regularly need a real-time conversation, they are large enough that asynchronous resolution is impractical. Start with Long-lived feature branches.
  3. Do developers regularly wait overnight for answers to questions? If yes, the knowledge needed for daily work is not accessible without specific people. Start with Knowledge silos.

Ready to fix this? The most common cause is Long-lived feature branches. Start with its How to Fix It section for week-by-week steps.

3.4.2 - Retrospectives Produce No Real Change

The same problems surface every sprint. Action items are never completed. The team has stopped believing improvement is possible.

What you are seeing

The same themes come up every sprint: too much interruption, unclear requirements, flaky tests, blocked items. The retrospective runs every two weeks. Action items are assigned. Two weeks later, none of them were completed because sprint work took priority. The same themes come up again. Someone adds them to the growing backlog of process improvements.

The team goes through the motions because the meeting is scheduled, not because they believe it will produce change. Participation is minimal. The facilitator works harder each time to generate engagement. The conversation stays surface-level because raising real problems feels pointless - nothing changes anyway.

The dysfunction runs deeper than meeting format. There is no capacity allocated for improvement work. Every sprint is 100% allocated to feature delivery. Action items that require real investment - automated deployment, test infrastructure, architectural cleanup - compete for time against items with committed due dates. The outcome is predetermined: features win.

Common causes

Unbounded WIP

When the team has more work in progress than capacity, no sprint has any slack. Action items from retrospectives require slack to complete. Without slack, improvement work is always displaced by feature work. The team is too busy to get less busy.

Creating and protecting capacity for improvement work is the prerequisite for retrospectives to produce change. Teams that allocate a fixed percentage of each sprint to improvement work - and defend it against feature pressure - actually complete their retrospective action items.

Read more: Unbounded WIP

Push-based work assignment

When work is assigned to the team from outside, the team has no authority over their own capacity allocation. They cannot protect time for improvement work because the queue is filled by someone else. Even if the team agrees in the retrospective that test automation is the priority, the next sprint’s work arrives already planned with no room for it.

Teams that pull work from a prioritized backlog and control their own capacity can make and honor commitments to improvement work. The retrospective can produce action items that the team has the authority to complete.

Read more: Push-based work assignment

Deadline-driven development

When management drives to fixed deadlines, all available capacity goes toward meeting the deadline. Improvement work that does not advance the deadline has no chance. The retrospective can surface the same problems indefinitely, but if the team has no capacity to address them and no organizational support to get that capacity, improvement is structurally impossible.

Read more: Deadline-driven development

How to narrow it down

  1. Are retrospective action items ever completed? If not, capacity is the first issue to examine. Start with Unbounded WIP.
  2. Does the team control how their sprint capacity is allocated? If improvement work must compete against externally assigned feature work, the team lacks the authority to act on retrospective outcomes. Start with Push-based work assignment.
  3. Is the team under sustained deadline pressure with no slack? If the team is always in crunch, improvement work has no room regardless of capacity or authority. Start with Deadline-driven development.

Ready to fix this? The most common cause is Unbounded WIP. Start with its How to Fix It section for week-by-week steps.

3.4.3 - The Team Has No Shared Agreements About How to Work

No explicit agreements on branch lifetime, review turnaround, WIP limits, or coding standards. Everyone does their own thing.

What you are seeing

Half the team uses feature branches; half commits directly to main. Some developers expect code reviews to happen within a few hours; others consider three days fast. Some engineers put every change through a full review; others self-merge small fixes. The WIP limit is nominally three items per person, but nobody enforces it and most people carry five or six.

These inconsistencies create friction that is hard to name. Pull requests sit because there is no shared expectation for turnaround. Work items age because there is no agreement about WIP limits. Code quality varies because there is no agreement about review standards. The team functions, but at a lower level of coordination than it could with explicit norms.

The problem compounds as the team grows or becomes more distributed. A two-person co-located team can operate on implicit norms that emerge from constant communication. A six-person distributed team cannot. Without explicit agreements, each person operates on different mental models formed by prior team experiences.

Common causes

Push-based work assignment

When work is assigned to individuals by a manager or lead, team members operate as independent contributors rather than as a team managing flow together. Shared workflow norms only emerge meaningfully when the team experiences work as a shared responsibility - when they pull from a common queue, track shared flow metrics, and collectively own the delivery outcome.

Teams that pull work from a shared backlog develop shared norms because they need those norms to function - without agreement on review turnaround and WIP limits, pulling from the same queue becomes chaotic. When work is individually assigned, each person optimizes for their assigned items, not for team flow, and the shared agreements never form.

Read more: Push-based work assignment

Unbounded WIP

When there are no WIP limits, every norm around flow is implicitly optional. If work can always be added without limit, discipline around individual items erodes. “I’ll review that PR later” is always a reasonable response when there is always more work competing for attention.

WIP limits create the conditions where norms matter. When the team is committed to a WIP limit, review turnaround, merge cadence, and integration frequency become practical necessities rather than theoretical preferences.

Read more: Unbounded WIP

Thin-spread teams

Teams spread across many responsibilities often lack the continuous interaction needed to develop and maintain shared norms. Each member is operating in a different context, interacting with different parts of the codebase, working with different constraints. Common ground for shared agreements is harder to establish when everyone’s daily experience is different.

Read more: Thin-spread teams

How to narrow it down

  1. Does the team have written working agreements that everyone follows? If agreements are verbal or assumed, they will diverge under pressure. The absence of written agreements is the starting point.
  2. Do team members pull from a shared queue or receive individual assignments? Individual assignment reduces team-level flow ownership. Start with Push-based work assignment.
  3. Does the team enforce WIP limits? Without enforced limits, work accumulates until norms break down. Start with Unbounded WIP.

Ready to fix this? The most common cause is Push-based work assignment. Start with its How to Fix It section for week-by-week steps.

3.4.4 - The Same Mistakes Happen in the Same Domain Repeatedly

Post-mortems and retrospectives show the same root causes appearing in the same areas. Each new team makes decisions that previous teams already tried and abandoned.

What you are seeing

A post-mortem reveals that the payments module failed in the same way it failed eighteen months ago. The fix applied then was not documented, and the developer who applied it is no longer on the team. A retrospective surfaces a proposal to split the monolith into services - a direction the team two rotations ago evaluated and rejected for reasons nobody on the current team knows.

The same conversations happen repeatedly. The same edge cases get missed. The same architectural directions get proposed, piloted, and quietly abandoned without any record of why. Each new group treats the domain as a fresh problem rather than building on what was learned before.

Common causes

Thin-Spread Teams

When engineers are rotated through a domain based on capacity rather than staying long enough to build expertise, institutional memory does not accumulate. The decisions, experiments, and hard lessons from previous rotations leave with those developers. The next group inherits the code but not the understanding of why it is structured the way it is, what was tried before, or what the failure modes are. They are likely to repeat the same exploration, reach the same dead ends, and make the same mistakes.

Read more: Thin-Spread Teams

Knowledge Silos

When knowledge about a domain lives only in specific individuals, it evaporates when they leave. Architectural decision records, runbooks, and documented post-mortem outcomes are the externalized forms of that knowledge. Without them, every departure is a partial reset. The remaining team cannot distinguish between “we haven’t tried that” and “we tried that and here is what happened.”

Read more: Knowledge Silos

How to narrow it down

  1. Do post-mortems show the same root causes in the same areas of the system? If recurring incidents map to the same modules and the fixes do not persist, the team is not accumulating learning. Start with Thin-Spread Teams.
  2. Are architectural proposals evaluated without knowledge of what was tried before? If the team cannot answer “was this approach considered previously, and what happened,” decisions are being made without institutional memory. Start with Knowledge Silos.

Ready to fix this? The most common cause is Knowledge Silos. Start with its How to Fix It section for week-by-week steps.


3.4.5 - Delivery Slows Every Time the Team Rotates

A new developer joins or is flexed in and delivery slows for weeks while they learn the domain. The pattern repeats with every rotation.

What you are seeing

A developer is moved onto the team because the team needs capacity and they know the tech stack. For the first two to three weeks, velocity drops. Simple changes take longer than expected because the new person is learning the domain while doing the work. They ask questions that previous team members would have answered instantly. They make safe, conservative choices to avoid breaking something they don't fully understand.

Then the rotation ends or another team member is pulled away, and the cycle starts again. The team never fully recovers its pre-rotation pace before the next disruption. Velocity measured across a quarter looks flat even though the team is working as hard as ever.

Common causes

Thin-Spread Teams

When engineers are treated as interchangeable capacity and moved to where utilization is needed, the team never develops stable domain expertise. Each rotation brings someone who knows the technology but not the business rules, the data model quirks, the historical decisions, or the failure modes that prior members learned through experience. The knowledge required to deliver quickly in a domain cannot be acquired in days. It accumulates over months of working in it.

Read more: Thin-Spread Teams

Knowledge Silos

When domain knowledge lives in individuals rather than in documentation, runbooks, and code structure, it is not available to the next person who joins. The new team member must reconstruct understanding that the previous person carried in their head. Every rotation restarts that reconstruction from scratch.

Read more: Knowledge Silos

How to narrow it down

  1. Does velocity measurably drop for several weeks after a team change? If the pattern is consistent and repeatable, the team’s delivery speed depends on individual domain knowledge rather than shared, documented understanding. Start with Thin-Spread Teams.
  2. Is domain knowledge written down or does it live in specific people? If new team members learn by asking colleagues rather than reading documentation, the knowledge is not externalized. Start with Knowledge Silos.

Ready to fix this? The most common cause is Thin-Spread Teams. Start with its How to Fix It section for week-by-week steps.


3.4.6 - Team Membership Changes Constantly

Members are frequently reassigned to other projects. There are no stable working agreements or shared context.

What you are seeing

The team roster changes every quarter. Engineers are pulled to other projects because they have relevant expertise, or they move to new teams as part of organizational restructuring. New members join but onboarding is informal - there is no written record of how the team works, what decisions were made and why, or what the technical context is.

The CD migration effort restarts with every significant roster change. New members bring different mental models and prior experiences. Practices the team adopted with care - trunk-based development, WIP limits, short-lived branches - get questioned by each new cohort who did not experience the problems those practices were designed to solve. The team keeps relitigating settled decisions instead of making progress.

The organizational pattern treats individual contributors as interchangeable resources. An engineer with payment domain expertise can be moved to the infrastructure team because the headcount numbers work out. The cost of that move - lost context, restarted relationships, degraded team performance for months - is invisible to the planning process that made the decision.

Common causes

Knowledge silos

When knowledge lives in individuals rather than in team practices, documentation, and code, departures create immediate gaps. The cost of reassignment is higher when the departing person carries critical knowledge that was never externalized. Losing one person does not just reduce capacity by one; it can reduce effective capability by much more if that person was the only one who understood a critical system or practice.

Teams that externalize knowledge into runbooks, architectural decision records, and documented practices distribute the cost of any individual departure. No single person’s absence leaves a critical gap. When a new cohort joins, the documented decisions and rationale are already there - the team stops relitigating trunk-based development and WIP limits because the record of why those choices were made is readable, not verbal.

Read more: Knowledge silos

Unbounded WIP

Teams with too much in progress are more likely to have members pulled to other projects, because they appear to have capacity even when they are spread thin. If a developer is working on five things simultaneously, moving them to another project looks like it frees up a resource. The depth of their contribution to each item is invisible to the person making the assignment decision.

WIP limits make the team’s actual capacity visible. When each person is focused on one or two things, it is clear that they are fully engaged and that removing them would directly impact those items. The reassignments that have been disrupting the team’s CD progress become less frequent because the real cost is finally visible to whoever is making the staffing decision.
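Making WIP visible can start with something as small as counting in-progress items per person from a board export. A sketch, assuming a hypothetical export of `(assignee, status)` pairs and a team agreement of two concurrent items per person - both the limit and the data shape are assumptions:

```python
from collections import Counter

WIP_LIMIT = 2  # assumption: the team's agreed limit of concurrent items per person

def over_limit(items, limit=WIP_LIMIT):
    """items: (assignee, status) pairs from a board export.

    Returns a mapping of each person over the limit to their in-progress count,
    making the team's real load visible to whoever makes staffing decisions.
    """
    counts = Counter(assignee for assignee, status in items if status == "in_progress")
    return {person: n for person, n in counts.items() if n > limit}
```

A person carrying five items shows up immediately; the "apparent capacity" that invites reassignment disappears from the report.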

Read more: Unbounded WIP

Thin-spread teams

When a team’s members are already distributed across many responsibilities, any departure creates disproportionate impact. Thin-spread teams have no redundancy to absorb turnover. Each person’s departure leaves a hole in a different area of the team’s responsibility surface.

Teams with focused, overlapping responsibilities can absorb turnover because multiple people share each area of responsibility. Redundancy is built in rather than assumed to exist. When a member is reassigned, the team’s work continues without a collapse in that area - the constant restart cycle that has been stalling the CD migration does not recur with every roster change.

Read more: Thin-spread teams

Push-Based Work Assignment

When work is assigned by specialty - “you’re the database person, so you take the database stories” - knowledge concentrates in individuals rather than spreading across the team. The same person always works the same area, so only they understand it deeply. When that person is reassigned or leaves, no one else can continue their work without starting over. Push-based assignment continuously deepens the knowledge silos that make every roster change more disruptive.

Read more: Push-Based Work Assignment

How to narrow it down

  1. Is critical system knowledge documented or does it live in specific individuals? If departures create knowledge gaps, the team has knowledge silos regardless of who leaves. Start with Knowledge silos.
  2. Does the team appear to have capacity because members are spread across many items? High WIP makes team members look available for reassignment. Start with Unbounded WIP.
  3. Is each team member the sole owner of a distinct area of the team’s work? If so, any departure leaves an area with no owner. Start with Thin-spread teams.
  4. Is work assigned by specialty so the same person always works the same area? If departures leave knowledge gaps in specific parts of the system, assignment by specialty is reinforcing the silos. Start with Push-Based Work Assignment.

Ready to fix this? The most common cause is Knowledge silos. Start with its How to Fix It section for week-by-week steps.

4 - Production Visibility and Team Health

Symptoms related to production observability, incident detection, environment parity, and team sustainability.

These symptoms indicate problems with how your team sees and responds to production issues. When problems are invisible until customers report them, or when the team is burning out from process overhead, the delivery system is working against the people in it. Each page describes what you are seeing and links to the anti-patterns most likely causing it.

How to use this section

Start with the symptom that matches what your team experiences. Each symptom page explains what you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic questions to narrow down which cause applies to your situation. Follow the anti-pattern link to find concrete fix steps.

Related anti-pattern categories: Monitoring and Observability Anti-Patterns, Organizational and Cultural Anti-Patterns

Related guides: Progressive Rollout, Working Agreements, Metrics-Driven Improvement

4.1 - The Team Ignores Alerts Because There Are Too Many

Alert volume is so high that pages fire for non-issues. Real problems are lost in the noise.

What you are seeing

The on-call phone went off fourteen times this week. Eight of the pages were non-issues that resolved on their own. Three were false positives from a known monitoring misconfiguration that nobody has prioritized fixing. One was a real problem. The on-call engineer, conditioned by a week of false positives, dismissed the real page as another false alarm. The real problem went unaddressed for four hours.

The team has more alerts than they can respond to meaningfully. Every metric has an alert. The thresholds were set during a brief period when everything was running smoothly and nobody has touched them since. When a database is slow, thirty alerts fire simultaneously for every downstream metric that depends on database performance. The alert storm is worse than the underlying problem.

Alert fatigue develops slowly. It starts with a few noisy alerts that are tolerated because fixing them is less urgent than current work. Each new service adds more alerts calibrated optimistically. Over time, the signal disappears in the noise, and the on-call rotation becomes a form of learned helplessness. Real incidents are discovered by users before they are discovered by the team.

Common causes

Blind operations

Teams that have not developed observability as a discipline often configure alerts as an afterthought. Every metric gets an alert, thresholds are guessed rather than calibrated, and alert correlation - multiple alerts from one underlying cause - is never considered. This approach produces alert storms, not actionable signals.

Good alerting requires deliberate design: alerts should be tied to user-visible symptoms rather than internal metrics, thresholds should be calibrated to real traffic patterns, and correlated alerts should suppress to a single notification. This design requires treating observability as a continuous practice rather than a one-time setup.
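The two design rules - page on symptoms, suppress correlated alerts - can be sketched in a few lines. This is an illustrative simplification, not a real alerting engine; the threshold, alert shape, and `caused_by` field are all assumptions:

```python
def should_page(error_rate, threshold=0.01):
    """Page on the user-visible symptom (request error rate), not internal metrics.

    The threshold is an assumed SLO-derived value, calibrated to real traffic,
    not guessed once at initial deployment.
    """
    return error_rate > threshold

def suppress_correlated(alerts, firing_root_causes):
    """Collapse an alert storm: drop alerts attributed to an already-firing root cause.

    alerts: dicts with a 'name' and an optional 'caused_by' dependency link.
    """
    return [a for a in alerts if a.get("caused_by") not in firing_root_causes]
```

With this shape, a slow database produces one page for the database, not thirty pages for every downstream metric that depends on it.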

Read more: Blind operations

Missing deployment pipeline

A pipeline provides a natural checkpoint for validating monitoring configuration as part of each deployment. Without a pipeline, monitoring is configured manually at deployment time and never revisited in a structured way. Alert thresholds set at initial deployment are never recalibrated as traffic patterns change.

A pipeline that includes monitoring configuration as code - alert thresholds defined alongside the service code they monitor - makes alert configuration a versioned, reviewable artifact rather than a manual configuration that drifts.
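Thresholds-as-code can be as plain as a data structure committed next to the service, with a pipeline step that rejects incomplete definitions. A hedged sketch - the alert names, fields, and values are hypothetical:

```python
# Hypothetical alert definitions, versioned alongside the service they monitor.
# Changing a threshold is now a reviewed commit, not an unrecorded UI edit.
ALERTS = [
    {"name": "checkout_error_rate", "threshold": 0.01, "window_minutes": 5},
    {"name": "checkout_p99_latency_ms", "threshold": 800, "window_minutes": 10},
]

def validate(alerts):
    """Pipeline step: flag alert configs that are structurally incomplete.

    Returns the names of invalid alerts; an empty list means the config
    is deployable.
    """
    required = {"name", "threshold", "window_minutes"}
    return [a.get("name", "<unnamed>") for a in alerts if not required <= a.keys()]
```

Because the definitions live in version control and are validated on every deployment, recalibrating a threshold leaves a reviewable history instead of silent drift.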

Read more: Missing deployment pipeline

How to narrow it down

  1. What percentage of pages this week required action? If less than half required action, the alert signal-to-noise ratio is too low. Start with Blind operations.
  2. Are alert thresholds defined as code or set manually in a UI? Manual threshold configuration drifts and is never revisited. Start with Missing deployment pipeline.
  3. Do alerts fire at the symptom level (user-visible problems) or the metric level (internal system measurements)? Metric-level alerts create alert storms when one root cause affects many metrics. Start with Blind operations.

Ready to fix this? The most common cause is Blind operations. Start with its How to Fix It section for week-by-week steps.

4.2 - Team Burnout and Unsustainable Pace

The team is exhausted. Every sprint is a crunch sprint. There is no time for learning, improvement, or recovery.

What you are seeing

The team is always behind. Sprint commitments are missed or met only through overtime. Developers work evenings and weekends to hit deadlines, then start the next sprint already tired. There is no buffer for unplanned work, so every production incident or stakeholder escalation blows up the plan.

Nobody has time for learning, experimentation, or process improvement. Suggestions like “let’s improve our test suite” or “let’s automate that deployment” are met with “we don’t have time.” The irony is that the manual work those improvements would eliminate is part of what keeps the team too busy.

Attrition risk is high. The most experienced developers leave first because they have options. Their departure increases the load on whoever remains, accelerating the cycle.

Common causes

Thin-Spread Teams

When a small team owns too many products, every developer is stretched across multiple codebases. Context switching consumes 20 to 40 percent of their capacity. The team looks fully utilized but delivers less than a focused team half its size. The utilization trap (“keep everyone busy”) masks the real problem: the team has more responsibilities than it can sustain.

Read more: Thin-Spread Teams

Deadline-Driven Development

When every sprint is driven by an arbitrary deadline, the team never operates at a sustainable pace. There is no recovery period after a crunch because the next deadline starts immediately. Quality is the first casualty, which creates rework, which consumes future capacity, which makes the next deadline even harder to meet. The cycle accelerates until the team collapses.

Read more: Deadline-Driven Development

Unbounded WIP

When there is no limit on work in progress, the team starts many things and finishes few. Every developer juggles multiple items, each getting fragmented attention. The sensation of being constantly busy but never finishing anything is a direct contributor to burnout. The team is working hard on everything and completing nothing.

Read more: Unbounded WIP

Push-Based Work Assignment

When work is assigned to individuals, asking for help carries a cost: it pulls a teammate away from their own assigned stories. So developers struggle alone rather than swarming. Workloads are also uneven because managers cannot precisely predict how long work will take at assignment time. Some people finish early and wait for reassignment; others are chronically overloaded. The overloaded developers cannot refuse new assignments without appearing unproductive, so the pace becomes unsustainable for the people carrying the heaviest loads.

Read more: Push-Based Work Assignment

Velocity as Individual Metric

When individual story points are tracked, developers cannot afford to help each other, take time to learn, or invest in quality. Every hour must produce measurable output. The pressure to perform individually eliminates the slack that teams need to stay healthy. Helping a teammate, mentoring a junior developer, or improving a build script all become career risks because they do not produce points.

Read more: Velocity as Individual Metric

How to narrow it down

  1. Is the team responsible for more products than it can sustain? If developers are spread across many products with constant context switching, the workload exceeds what the team structure can handle. Start with Thin-Spread Teams.
  2. Is every sprint driven by an external deadline? If the team has not had a sprint without deadline pressure in months, the pace is unsustainable by design. Start with Deadline-Driven Development.
  3. Does the team have more items in progress than team members? If WIP is unbounded and developers juggle multiple items, the team is thrashing rather than delivering. Start with Unbounded WIP.
  4. Are individuals measured by story points or velocity? If developers feel pressure to maximize personal output at the expense of collaboration and sustainability, the measurement system is contributing to burnout. Start with Velocity as Individual Metric.
  5. Are workloads distributed unevenly, with some people chronically overloaded while others wait for new assignments? If the team cannot self-balance because work is assigned rather than pulled, the assignment model is driving the unsustainable pace. Start with Push-Based Work Assignment.

Ready to fix this? The most common cause is Thin-Spread Teams. Start with its How to Fix It section for week-by-week steps.


4.3 - When Something Breaks, Nobody Knows What to Do

There are no documented response procedures. Critical knowledge lives in one person’s head. Incidents are improvised every time.

What you are seeing

An alert fires at 2 AM. The on-call engineer looks at the dashboard and sees something is wrong with the payment service, but they have never been involved in a payment service incident before. They know the service is critical. They do not know the recovery procedure, the escalation path, the safe restart sequence, or the architectural context needed to diagnose the problem.

The on-call engineer wakes up the one person who knows the payment service. That person is on vacation in a different time zone. They respond and start walking through the recovery steps over a video call, explaining the system while simultaneously trying to diagnose the problem. The incident takes four hours to resolve, two of which are spent on knowledge transfer that should have been documented.

The team conducts a post-mortem. The action item is “document the payment service runbook.” The action item is added to the backlog. It does not get prioritized. Three months later, there is another 2 AM incident and the same knowledge transfer happens again.

Common causes

Knowledge silos

When system knowledge is not externalized into runbooks, architectural documentation, and operational procedures, it disappears when the person who holds it is unavailable. Incident response is the most time-pressured context in which to rediscover missing knowledge. The gap between “what we know collectively” and “what is documented” only becomes visible when the person who fills that gap is not present.

Teams that treat runbook maintenance as part of incident response - updating documentation immediately after resolving an incident, while the context is fresh - gradually close the gap. The runbook improves with every incident rather than remaining stale between rare documentation efforts.

Read more: Knowledge silos

Blind operations

Without adequate observability, diagnosing the cause of an incident requires deep system knowledge rather than reading dashboards. An on-call engineer with good observability can often identify the root cause of an incident from metrics, logs, and traces without needing the one person who understands the system internals. An on-call engineer without observability is flying blind, dependent on tribal knowledge.

Good observability turns incident response from an expert-only activity into something any trained engineer can do from a dashboard. The runbook points at the right metrics; the metrics tell the story.

Read more: Blind operations

Manual deployments

Systems deployed manually often have complex, undocumented operational characteristics. The manual deployment knowledge and the incident response knowledge are often held by the same person - because the person who knows how to deploy a service also knows how it behaves and how to recover it. This concentration of knowledge is a single point of failure.

Read more: Manual deployments

How to narrow it down

  1. Does every service have a runbook that an on-call engineer unfamiliar with the service could follow? If not, incident response requires specific people. Start with Knowledge silos.
  2. Can the on-call engineer determine the likely cause of an incident from dashboards alone? If diagnosing incidents requires deep system knowledge, observability is insufficient. Start with Blind operations.
  3. Is there a single person whose absence would make incident response significantly harder for multiple services? That person is a single point of failure. Start with Knowledge silos.

Ready to fix this? The most common cause is Knowledge silos. Start with its How to Fix It section for week-by-week steps.

4.4 - Production Issues Discovered by Customers

The team finds out about production problems from support tickets, not alerts.

What you are seeing

The team deploys a change. Someone asks “is it working?” Nobody knows. There is no dashboard to check. There are no metrics to compare before and after. The team waits. If nobody complains within an hour, they assume the deployment was successful.

When something does go wrong, the team finds out from a customer support ticket, a Slack message from another team, or an executive asking why the site is slow. The investigation starts with SSH-ing into a server and reading raw log files. Hours pass before anyone understands what happened, what caused it, or how many users were affected.

Common causes

Blind Operations

The team has no application-level metrics, no centralized logging, and no alerting. The infrastructure may report that servers are running, but nobody can tell whether the application is actually working correctly. Without instrumentation, the only way to discover a problem is to wait for someone to experience it and report it.

Read more: Blind Operations

Manual Deployments

When deployments involve human steps (running scripts by hand, clicking through a console), there is no automated verification step. The deployment process ends when the human finishes the steps, not when the system confirms it is healthy. Without an automated pipeline that checks health metrics after deploying, verification falls to manual spot-checking or waiting for complaints.

Read more: Manual Deployments

Missing Deployment Pipeline

When there is no automated path from commit to production, there is nowhere to integrate automated health checks. A deployment pipeline can include post-deploy verification that compares metrics before and after. Without a pipeline, verification is entirely manual and usually skipped under time pressure.
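The post-deploy verification described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation - the error-rate numbers would come from whatever metrics store is in place, and `verify` assumes it runs as the final pipeline stage:

```python
import sys
import time

def healthy(baseline_error_rate, current_error_rate, threshold=1.5):
    """A deploy passes verification when the post-deploy error rate is
    no more than `threshold` times the pre-deploy baseline."""
    return current_error_rate <= baseline_error_rate * threshold

def verify(fetch_error_rate, baseline, wait_seconds=300):
    """Run as the final pipeline stage: wait for post-deploy traffic to
    accumulate, re-measure, and fail the build (non-zero exit) if the
    error rate regressed. `fetch_error_rate` is a hypothetical callable
    that queries your monitoring system."""
    time.sleep(wait_seconds)
    if not healthy(baseline, fetch_error_rate()):
        sys.exit(1)
```

The important property is that the pipeline, not a human, decides whether the deployment is healthy - and a failed check fails the build visibly rather than waiting for a complaint.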

Read more: Missing Deployment Pipeline

How to narrow it down

  1. Does the team have application-level metrics and alerts? If no, the team has no way to detect problems automatically. Start with Blind Operations.
  2. Is the deployment process automated with health checks? If deployments are manual or automated without post-deploy verification, problems go undetected until users report them. Start with Manual Deployments or Missing Deployment Pipeline.
  3. Does the team check a dashboard after every deployment? If the answer is “sometimes” or “we click through the app manually,” the verification step is unreliable. Start with Blind Operations to build automated verification.

Ready to fix this? The most common cause is Blind Operations. Start with its How to Fix It section for week-by-week steps.

4.5 - Logs Exist but Cannot Be Searched or Correlated

Every service writes logs, but they are not aggregated or queryable. Debugging requires SSH access to individual servers.

What you are seeing

Debugging a production problem requires SSH access to individual servers and manual correlation across log files. An engineer SSHes into the production server, navigates to the log directory, and greps through gigabytes of log files looking for error messages. The logs from three services involved in the failing request are on three different servers with three different log formats. Correlating events into a coherent timeline requires copying relevant lines into a document and sorting by timestamp manually.

Log rotation has already pruned most of the entries from two weeks ago, when the issue likely started. The logs that remain are unstructured text mixed with stack traces. Field names differ between services: one logs user_id, another logs userId, a third logs uid. A query to find all errors from a specific user in the past hour would take thirty minutes to run manually across all servers.

The team knows this is a problem but treats it as “we need to add a log aggregation system eventually.” Eventually has not arrived. In the meantime, debugging production issues is slow, often incomplete, and dependent on whoever has the institutional knowledge to navigate the logging infrastructure.

Common causes

Blind operations

Unstructured, unaggregated logs are one form of not having instrumented a system for observability. Logs that cannot be searched or correlated are only marginally more useful than no logs at all. Observability requires structured logs with consistent field names, aggregated into a searchable store, with the ability to correlate log events across services by request ID or trace context.

Structured logging requires deliberate adoption: a standard log format, consistent field names, correlation identifiers on every log entry. When these are in place, a query that previously required thirty minutes of manual grepping across servers runs in seconds from a single interface.
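A minimal sketch of the idea using Python's standard logging module - the service name and the field names (`request_id`, `ts`) are illustrative conventions, not a standard:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object with consistent field
    names, so an aggregator can index and query every entry."""
    def format(self, record):
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": "checkout",  # illustrative service name
            "message": record.getMessage(),
            # correlation id attached via the `extra` kwarg below
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Every entry carries the same request_id, so an aggregator can stitch
# together one request's events across services.
logger.info("payment authorized", extra={"request_id": "req-7f3a"})
```

Once every service emits entries shaped like this, "all errors for user X in the past hour" is one query against the aggregated store instead of a grep session across servers.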

Read more: Blind operations

Knowledge silos

Understanding how to navigate the logging infrastructure - which servers hold which logs, what the rotation schedule is, which grep patterns produce useful results - is knowledge that concentrates in the people who have done enough debugging to learn it. New team members cannot effectively debug production issues independently because they do not know the informal map of where things are.

When logs are aggregated into a centralized, searchable system, the knowledge of where to look is built into the tooling. Any team member can write a query without knowing the physical location of log files.

Read more: Knowledge silos

How to narrow it down

  1. Can the team search logs across all services from a single interface? If debugging requires SSH access to individual servers, logs are not aggregated. Start with Blind operations.
  2. Can the team trace a single request across multiple services using a shared correlation ID? If not, distributed debugging is manual assembly work. Start with Blind operations.
  3. Can new team members debug production issues independently, without help from senior engineers? If debugging requires knowing the informal map of log locations and formats, the knowledge is siloed. Start with Knowledge silos.

Ready to fix this? The most common cause is Blind operations. Start with its How to Fix It section for week-by-week steps.

4.6 - Leadership Sees CD as a Technical Nice-to-Have

Management does not understand why CD matters. No budget for tooling. No time allocated for improvement.

What you are seeing

Pipeline improvement work loses to feature delivery every sprint. The team wants to invest in deployment automation, test infrastructure, and pipeline improvements. The engineering manager supports this in principle. But every sprint, when capacity is allocated, the product backlog wins. There are features to ship, commitments to keep, a roadmap to deliver against. Pipeline improvements are real work - weeks of investment - but they do not appear on any roadmap and do not map to revenue-generating features.

When the team escalates to leadership, the response is supportive but non-committal: “Yes, we need to do that. Find a way to fit it in.” The team tries to fit it in - at the margins, in slack time, adjacent to feature work. The improvement work is slow, fragmented, and regularly displaced. Three years in, the pipeline is incrementally better, but the fundamental problems remain.

What is missing is organizational priority. CD adoption requires sustained investment - not a one-time sprint but ongoing capacity allocated to improving the delivery system. Without a sponsor who can protect that capacity from feature demand, improvement work will always lose to delivery pressure.

Common causes

Velocity as individual metric

When management measures progress by story points or feature delivery rate, investment in pipeline infrastructure looks like a reduction in output. A sprint where half the team works on deployment automation produces fewer feature story points than a sprint where everyone delivers features. Leaders optimizing for short-term throughput will consistently deprioritize pipeline investment.

When lead time and deployment frequency are tracked alongside feature delivery, pipeline investment has a visible ROI. Leadership can see the case for it in the same dashboard they use for feature delivery - and pipeline work stops competing invisibly against features that do show up on a scoreboard.
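Tracking these numbers does not require a platform. A sketch of computing mean lead time and deployment frequency from a list of deploy records - the record shape (commit timestamp, production timestamp) is an assumption about what your deploy tooling can export:

```python
from datetime import datetime
from statistics import mean

# Hypothetical deploy records: when the change was committed and when
# it reached production.
deploys = [
    {"committed": datetime(2024, 5, 1, 9, 0), "deployed": datetime(2024, 5, 3, 14, 0)},
    {"committed": datetime(2024, 5, 6, 10, 0), "deployed": datetime(2024, 5, 7, 11, 0)},
    {"committed": datetime(2024, 5, 8, 16, 0), "deployed": datetime(2024, 5, 10, 9, 0)},
]

def lead_time_hours(records):
    """Mean hours from commit to production across the sample."""
    return mean((r["deployed"] - r["committed"]).total_seconds() / 3600
                for r in records)

def deploys_per_week(records, days_in_window):
    """Deployment frequency over the observation window."""
    return len(records) / (days_in_window / 7)
```

Even a spreadsheet-grade version of these two numbers, reported next to story points, is enough to make pipeline investment show up on the same scoreboard as features.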

Read more: Velocity as individual metric

Missing product ownership

Without a product owner who understands that delivery capability is itself a product attribute, pipeline work has no advocate in planning. Features with product owners get prioritized. Infrastructure work without sponsors does not. The team needs someone with organizational standing who can represent improvement work as a priority in the same planning conversation as feature work.

Read more: Missing product ownership

Deadline-driven development

When the organization is organized around fixed delivery dates, any work that does not directly advance the date looks like overhead. CD adoption requires investing in the delivery system itself, which competes with delivering to the schedule. Until management understands that delivery capability is what makes future schedules achievable, the investment will not be protected.

Read more: Deadline-driven development

How to narrow it down

  1. Does management measure and track delivery lead time, deployment frequency, and change fail rate? If not, the measurement system does not reward CD investment. Start with Velocity as individual metric.
  2. Is there an organizational sponsor who advocates for delivery capability improvements in planning? If improvement work has no sponsor, it will always lose to features with sponsors. Start with Missing product ownership.
  3. Is delivery organized around fixed commitment dates? If yes, anything not tied to the date is implicitly deprioritized. Start with Deadline-driven development.

Ready to fix this? The most common cause is Velocity as individual metric. Start with its How to Fix It section for week-by-week steps.

4.7 - Runbooks and Architecture Docs Are Years Out of Date

Deployment procedures, architecture diagrams, and operational runbooks describe a system that no longer matches reality.

What you are seeing

The runbook for the API service describes a deployment process involving a tool the team migrated away from two years ago. The architecture diagram shows four services; there are now eleven. The “how to add a new service” guide assumes a project structure that was refactored in the last rewrite. The documents are not wrong - they were accurate when written - but nobody updated them as the system evolved.

The team has learned to use documentation as a rough starting point and rely on tribal knowledge for the details. Senior engineers know which documents are outdated and which are still accurate. Newer team members cannot make this distinction and waste time following outdated procedures. Incidents that could be resolved in minutes take hours because the runbook does not match the system the on-call engineer is looking at.

The documentation gap compounds over time. Each change that is not documented increases the gap between documentation and reality. Eventually the gap is so large that nobody trusts any documentation, and all knowledge defaults to person-to-person transfer.

Common causes

Knowledge silos

When documentation is the only path from tribal knowledge to shared knowledge, and the team does not value documentation as a practice, knowledge accumulates in people rather than in records. The runbook written under pressure during an incident is the only runbook that gets written. Day-to-day changes that affect operations never get documented because the documentation habit is not part of the development workflow.

Teams that treat documentation as part of the definition of done - the change is not done until it is documented - produce documentation that stays current. Each change author updates the relevant runbooks and architectural records as part of completing the work.

Read more: Knowledge silos

Manual deployments

Systems deployed manually have deployment procedures that are highly contextual, learned by doing, and resistant to documentation. The deployment is a craft practice: the person executing it knows which steps to skip in which situations, which warnings to ignore, and which undocumented behaviors to watch for. Documenting this craft knowledge is difficult because it is tacit.

Automating the deployment process forces documentation into code. The pipeline definition is the authoritative deployment procedure. When the deployment changes, the pipeline definition changes. The code is always current because the code is the process.

Read more: Manual deployments

Snowflake environments

When environments evolve by hand, the gap between documented architecture and the actual running architecture grows with every undocumented change. An architecture diagram drawn at the last major redesign does not show the database added directly to production for a performance fix, the caching layer added informally, or the service split that happened in a hackathon. Infrastructure as code makes the infrastructure itself the documentation.

Read more: Snowflake environments

How to narrow it down

  1. Can the on-call engineer follow the runbook for a critical service without help from someone who knows the service? If not, the runbook is out of date. Start with Knowledge silos.
  2. Is the deployment procedure defined as pipeline code or as written documentation? Written documentation drifts; pipeline code is the process itself. Start with Manual deployments.
  3. Does the architecture documentation match the current production system? If the diagram and the reality diverge, the environments were changed without corresponding documentation. Start with Snowflake environments.

Ready to fix this? The most common cause is Knowledge silos. Start with its How to Fix It section for week-by-week steps.

4.8 - Production Problems Are Discovered Hours or Days Late

Issues in production are not discovered until users report them. There is no automated detection or alerting.

What you are seeing

A deployment goes out on Tuesday. On Thursday, a support ticket comes in: a feature is broken for a subset of users. The team investigates and discovers the problem was introduced in Tuesday’s deploy. For two days, users experienced the issue while the team had no idea.

Or a performance degradation appears gradually. Response times creep up over a week. Nobody notices until a customer complains or a business metric drops. The team checks the dashboards and sees the degradation started after a specific deploy, but the deploy was days ago and the trail is cold.

The team deploys carefully and then “watches for a while.” Watching means checking a few URLs manually or refreshing a dashboard for 15 minutes. If nothing obviously breaks in that window, the deployment is declared successful. Problems that manifest slowly, affect a subset of users, or appear under specific conditions go undetected.

Common causes

Blind Operations

When the team has no monitoring, no alerting, and no aggregated logging, production is a black box. The only signal that something is wrong comes from users, support staff, or business reports. The team cannot detect problems because they have no instruments to detect them with. Adding observability (metrics, structured logging, distributed tracing, alerting) gives the team eyes on production.
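A deliberately tiny sketch of the kind of signal such a team is missing: an in-process error-rate monitor with a sliding window and an alert threshold. A real system would use a metrics stack (Prometheus, Datadog, and the like) rather than hand-rolled code, but the underlying logic is the same:

```python
import time
from collections import deque

class ErrorRateMonitor:
    """Track request outcomes in a sliding time window and flag when
    the error rate crosses an alert threshold."""
    def __init__(self, window_seconds=300, threshold=0.05):
        self.window = window_seconds
        self.threshold = threshold
        self.events = deque()  # (timestamp, ok: bool)

    def record(self, ok, now=None):
        now = time.time() if now is None else now
        self.events.append((now, ok))
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()

    def should_alert(self):
        if not self.events:
            return False
        errors = sum(1 for _, ok in self.events if not ok)
        return errors / len(self.events) > self.threshold
```

The point is that the alert fires from the instrument, minutes after the regression - not from a support ticket two days later.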

Read more: Blind Operations

Undone Work

When the team’s definition of done does not include post-deployment verification, nobody is responsible for confirming that the deployment is healthy. The story is “done” when the code is merged or deployed, not when it is verified in production. Health checks, smoke tests, and canary analysis are not part of the workflow because the workflow ends before production.

Read more: Undone Work

Manual Deployments

When deployments are manual, there is no automated post-deploy verification step. An automated pipeline can include health checks, smoke tests, and rollback triggers as part of the deployment sequence. A manual deployment ends when the human finishes the runbook. Whether the deployment is actually healthy is a separate question that may or may not get answered.

Read more: Manual Deployments

How to narrow it down

  1. Does the team have production monitoring with alerting thresholds? If not, the team cannot detect problems that users do not report. Start with Blind Operations.
  2. Does the team’s definition of done include post-deploy verification? If stories are closed before production health is confirmed, nobody owns the detection step. Start with Undone Work.
  3. Does the deployment process include automated health checks? If deployments end when the human finishes the script, there is no automated verification. Start with Manual Deployments.

Ready to fix this? The most common cause is Blind Operations. Start with its How to Fix It section for week-by-week steps.


4.9 - It Works on My Machine

Code that works in one developer’s environment fails in another, in CI, or in production. Environment differences make results unreproducible.

What you are seeing

A developer runs the application locally and everything works. They push to CI and the build fails. Or a teammate pulls the same branch and gets a different result. Or a bug report comes in that nobody can reproduce locally.

The team spends hours debugging only to discover the issue is environmental: a different Node version, a missing system library, a different database encoding, or a service running on the developer’s machine that is not available in CI. The code is correct. The environments are different.

New team members experience this acutely. Setting up a development environment takes days of following an outdated wiki page, asking teammates for help, and discovering undocumented dependencies. Every developer’s machine accumulates unique configuration over time, making “works on my machine” a common refrain and a useless debugging signal.

Common causes

Snowflake Environments

When development environments are set up manually and maintained individually, each developer’s machine becomes unique. One developer installed Python 3.9, another has 3.11. One has PostgreSQL 14, another has 15. These differences are invisible until someone hits a version-specific behavior. Reproducible, containerized development environments eliminate the variance by ensuring every developer works in an identical setup.
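Containerized environments remove the variance at the source. As an interim step, even a preflight check that fails fast on drift is an improvement over discovering a version mismatch mid-debug. A sketch - the pinned versions and the tool list are illustrative:

```python
import subprocess

# Pinned toolchain versions shared by the whole team (illustrative).
PINNED = {"python": "3.11", "node": "20"}

def local_version(command):
    """Ask a tool for its version string, e.g. `node --version`."""
    out = subprocess.run([command, "--version"],
                         capture_output=True, text=True, check=True)
    return (out.stdout or out.stderr).strip().lstrip("v")

def drifted(pinned, versions):
    """Return the tools whose local version diverges from the pin."""
    return [tool for tool, want in pinned.items()
            if not versions.get(tool, "").startswith(want)]

# A preflight script would gather {tool: local_version(tool)} and exit
# non-zero when drifted() is non-empty, before any work starts.
```

Checked into the repository and run at setup time, this turns silent divergence into an explicit, fixable error message.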

Read more: Snowflake Environments

Manual Deployments

When environment setup is a manual process documented in a wiki or README, it is never followed identically. Each developer interprets the instructions slightly differently, installs a slightly different version, or skips a step that seems optional. The manual process guarantees divergence over time. Infrastructure as code and automated setup scripts ensure consistency.

Read more: Manual Deployments

Tightly Coupled Monolith

When the application has implicit dependencies on its environment (specific file paths, locally running services, system-level configuration), it is inherently sensitive to environmental differences. Well-designed code with explicit, declared dependencies works the same way everywhere. Code that reaches into its runtime environment for undeclared dependencies works only where those dependencies happen to exist.
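The difference is visible at the code level. A Python sketch contrasting an implicit environmental dependency with an explicitly declared one - the paths and variable names are illustrative:

```python
import os
from dataclasses import dataclass

# Implicit dependency: the code reaches into its environment for
# things that only happen to exist on one machine.
def load_config_implicit():
    with open("/Users/alice/projects/app/config.json") as f:  # local path
        return f.read()

# Explicit dependency: everything the code needs is declared at the
# boundary and injected, so it behaves the same everywhere.
@dataclass
class Settings:
    config_path: str
    db_url: str

def settings_from_env(env=os.environ):
    """Fail loudly at startup if a required setting is missing,
    instead of failing mysteriously at first use."""
    return Settings(config_path=env["APP_CONFIG_PATH"],
                    db_url=env["APP_DB_URL"])
```

The explicit version works identically on a laptop, in CI, and in production, because every environment must supply the same declared inputs - and a missing input fails immediately with a clear error.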

Read more: Tightly Coupled Monolith

How to narrow it down

  1. Do all developers use the same OS, runtime versions, and dependency versions? If not, environment divergence is the most likely cause. Start with Snowflake Environments.
  2. Is the development environment setup automated or manual? If it is a wiki page that takes a day to follow, the manual process creates the divergence. Start with Manual Deployments.
  3. Does the application depend on local services, file paths, or system configuration that is not declared in the codebase? If the application has implicit environmental dependencies, it will behave differently wherever those dependencies differ. Start with Tightly Coupled Monolith.

Ready to fix this? The most common cause is Snowflake Environments. Start with its How to Fix It section for week-by-week steps.

5 - Symptoms for Developers

Dysfunction symptoms grouped by the friction developers and tech leads experience - from daily coding pain to team-level delivery patterns.

These are the symptoms you experience while writing, testing, and shipping code. Some you feel personally. Others you see as patterns across the team. If something on this list sounds familiar, follow the link to find what is causing it and how to fix it.

Pushing code and getting feedback

Tests getting in the way

  • Tests Randomly Pass or Fail - You click rerun without investigating because flaky failures are so common. The team ignores failures by default, which masks real regressions.
  • Refactoring Breaks Tests - You rename a method or restructure a class and 15 tests fail, even though the behavior is correct. Technical debt accumulates because cleanup is too expensive.
  • Test Suite Is Too Slow to Run - Running tests locally is so slow that you skip it and push to CI instead, trading fast feedback for a longer loop.
  • High Coverage but Tests Miss Defects - Coverage is above 80% but bugs still make it to production. The tests check that code runs, not that it works correctly.

Integrating and merging

Deploying and releasing

Environment and production surprises

See Learning Paths for a structured reading sequence if you want a guided path through diagnosis and fixes.

6 - Symptoms for Managers

Dysfunction symptoms grouped by business impact - unpredictable delivery, quality, and team health.

These are the symptoms that show up in sprint reviews, quarterly planning, and 1-on-1s. They manifest as missed commitments, quality problems, and retention risk.

Unpredictable delivery

Quality reaching customers

Coordination overhead

Team health and retention

See Learning Paths for a structured path from diagnosis to building a case for change.

What to do next

If these symptoms sound familiar, these resources can help you build a case for change and find a starting point: