Integration and Feedback Problems
Symptoms related to work-in-progress, integration pain, review bottlenecks, and feedback speed.
These symptoms indicate problems with how work flows through your team. When integration is
deferred, feedback is slow, or work piles up, the team stays busy without finishing things.
Each page describes what you are seeing and links to the anti-patterns most likely causing it.
Browse by category
- Integration and Pipeline - Painful merges, slow pipelines, PR bottlenecks, feedback delays
- Work Management and Flow - Too much WIP, long cycle times, blocked work, dependency coordination
- Developer Experience - Painful setup, inadequate tooling, environment friction
- Team and Knowledge - Team instability, knowledge silos, missing shared practices
How to use this section
Start with the symptom that matches what your team experiences. Each symptom page explains what
you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic
questions to narrow down which cause applies to your situation. Follow the anti-pattern link to
find concrete fix steps.
Related anti-pattern categories: Team Workflow Anti-Patterns, Branching and Integration Anti-Patterns
Related guides: Trunk-Based Development, Work Decomposition, Limiting WIP
1 - Integration and Pipeline Problems
Code integration, merging, pipeline speed, and feedback loop problems.
Symptoms related to how code gets integrated, how the pipeline processes changes, and how
fast the team gets feedback.
1.1 - Every Change Rebuilds the Entire Repository
A single repository with multiple applications and no selective build tooling. Any commit triggers a full rebuild of everything.
What you are seeing
The CI build takes 45 minutes for every commit because the pipeline rebuilds every application and runs every test regardless of what changed. The team chose a monorepo for good reasons - code sharing is simpler, cross-cutting changes are atomic, and dependency management is more coherent - but the pipeline has no awareness of what actually changed. Changing a comment in Service A triggers a full rebuild of Services B, C, D, and E.
Developers have adapted by batching changes to reduce the number of CI runs they wait through - one CI run per hour instead of one per commit. The batching reintroduces the integration problems the monorepo was supposed to solve: when multiple changes are combined in a single commit, a failure can no longer be bisected to any individual change.
The build system treats the entire repository as a single unit. Service owners have added scripts to skip unmodified services, but the scripts are fragile and not consistently maintained. The CI system was not designed for selective builds, so every workaround is an unsupported hack on top of an ill-fitting tool.
Common causes
Missing deployment pipeline
Pipelines that understand which services changed - using build tools that model the dependency graph or change detection based on file paths - can selectively build and test only what was affected by a commit. Without this investment, pipelines treat the monorepo as a single unit and rebuild everything.
Tools like Nx, Bazel, or Turborepo provide dependency graph awareness for monorepos. A pipeline built on these tools builds only what needs to be rebuilt and runs only the tests that could be affected by the change. Feedback loops shorten from 45 minutes to 5.
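The core of path-based change detection can be sketched in a few lines. This is a minimal illustration, not any real tool's API: the service map is hand-maintained here, standing in for the dependency graph that tools like Nx or Bazel compute automatically, and all names and paths are invented.

```python
from pathlib import PurePosixPath

# Hand-maintained map of service -> source directory and direct dependents.
# A stand-in for the dependency graph that monorepo build tools compute
# automatically; all names and paths are illustrative.
SERVICES = {
    "service-a": {"path": "services/a", "dependents": []},
    "service-b": {"path": "services/b", "dependents": []},
    "shared-lib": {"path": "libs/shared", "dependents": ["service-a", "service-b"]},
}

def affected_services(changed_files):
    """Return the set of services to rebuild for a list of changed paths.

    In CI, changed_files would come from something like:
        git diff --name-only origin/main...HEAD
    """
    affected = set()
    # Seed with services whose own files changed.
    for path in changed_files:
        for name, svc in SERVICES.items():
            if PurePosixPath(path).is_relative_to(svc["path"]):
                affected.add(name)
    # Expand to everything that depends on an affected service, transitively.
    frontier = list(affected)
    while frontier:
        for dependent in SERVICES[frontier.pop()]["dependents"]:
            if dependent not in affected:
                affected.add(dependent)
                frontier.append(dependent)
    return affected
```

With this in place, a commit touching only services/b rebuilds service-b alone, while a change to the shared library rebuilds it plus everything that depends on it.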
Read more: Missing deployment pipeline
Manual deployments
When deployment is manual, there is no automated mechanism to determine which services changed and which need to be deployed. Manual review determines what to deploy, which is slow and inconsistent. Inconsistency leads to either over-deploying (deploying everything to be safe) or under-deploying (missing services that changed).
Automated deployment pipelines with change detection deploy exactly the services that changed, with evidence of what changed and why.
Read more: Manual deployments
How to narrow it down
- Does the pipeline build and test only the services affected by a change? If every commit triggers a full rebuild, change detection is not implemented. Start with Missing deployment pipeline.
- How long does a typical CI run take? If it takes more than 10 minutes regardless of what changed, the pipeline is not leveraging the monorepo’s dependency information. Start with Missing deployment pipeline.
- Can the team deploy a single service from the monorepo without triggering deployments of all services? If not, deployment automation does not understand the monorepo structure. Start with Manual deployments.
Ready to fix this? The most common cause is Missing deployment pipeline. Start with its How to Fix It section for week-by-week steps.
1.2 - Feedback Takes Hours Instead of Minutes
The time from making a change to knowing whether it works is measured in hours, not minutes. Developers batch changes to avoid waiting.
What you are seeing
A developer makes a change and wants to know if it works. They push to CI and wait 45 minutes for
the pipeline. Or they open a PR and wait two days for a review. Or they deploy to staging and wait
for a manual QA pass that happens next week. By the time feedback arrives, the developer has moved
on to something else.
The slow feedback changes developer behavior. They batch multiple changes into a single commit to
avoid waiting multiple times. They skip local verification and push larger, less certain changes.
They start new work before the previous change is validated, juggling multiple incomplete tasks.
When feedback finally arrives and something is wrong, the developer must context-switch back. The
mental model from the original change has faded. Debugging takes longer because the developer is
working from memory rather than from active context. If multiple changes were batched, the
developer must untangle which one caused the failure.
Common causes
Inverted Test Pyramid
When most tests are slow E2E tests, the test feedback loop is measured in tens of minutes rather
than seconds. Unit tests provide feedback in seconds. E2E tests take minutes or hours. A team with
a fast unit test suite can verify a change in under a minute. A team whose testing relies on E2E
tests cannot get feedback faster than those tests can run.
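The shift is usually mechanical: logic that is only exercised through the UI gets extracted into pure functions that unit tests can reach directly. A minimal sketch, with a hypothetical pricing rule (the function and its behavior are invented for illustration):

```python
# Hypothetical pricing rule, extracted into a pure function so the behavior
# can be verified without deploying the app or driving a browser.
def discounted_total(subtotal, loyalty_years):
    """Apply a 5% discount per loyalty year, capped at 20%."""
    discount = min(0.05 * loyalty_years, 0.20)
    return round(subtotal * (1 - discount), 2)

# Unit tests like these run in milliseconds; an E2E test of the same rule
# would need a deployed service, a seeded database, and a browser session.
def test_discount_accumulates_per_year():
    assert discounted_total(100.0, 2) == 90.0

def test_discount_is_capped():
    assert discounted_total(100.0, 10) == 80.0
```

A handful of E2E tests still verify that the pieces are wired together; the business rules themselves get their feedback in seconds.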
Read more: Inverted Test Pyramid
Integration Deferred
When the team does not integrate frequently (at least daily), the feedback loop for integration
problems is as long as the branch lifetime. A developer working on a two-week branch does not
discover integration conflicts until they merge. Daily integration catches conflicts within hours.
Continuous integration catches them within minutes.
Read more: Integration Deferred
Manual Testing Only
When there are no automated tests, the only feedback comes from manual verification. A developer
makes a change and must either test it manually themselves (slow) or wait for someone else to test
it (slower). Automated tests provide feedback in the pipeline without requiring human effort or
scheduling.
Read more: Manual Testing Only
Long-Lived Feature Branches
When pull requests wait days for review, the code review feedback loop dominates total cycle time.
A developer finishes a change in two hours, then waits two days for review. The review feedback
loop is 24 times longer than the development time. Long-lived branches produce large PRs, and
large PRs take longer to review. Fast feedback requires fast reviews, which requires small PRs,
which requires short-lived branches.
Read more: Long-Lived Feature Branches
Manual Regression Testing Gates
When every change must pass through a manual QA gate, the feedback loop includes human scheduling.
The QA team has a queue. The change waits in line. When the tester gets to it, days have passed.
Automated testing in the pipeline replaces this queue with instant feedback.
Read more: Manual Regression Testing Gates
How to narrow it down
- How fast can the developer verify a change locally? If the local test suite takes more than
a few minutes, the test strategy is the bottleneck. Start with
Inverted Test Pyramid.
- How frequently does the team integrate to main? If developers work on branches for days
before integrating, the integration feedback loop is the bottleneck. Start with
Integration Deferred.
- Are there automated tests at all? If the only feedback is manual testing, the lack of
automation is the bottleneck. Start with
Manual Testing Only.
- How long do PRs wait for review? If review turnaround is measured in days, the review
process is the bottleneck. Start with
Long-Lived Feature Branches.
- Is there a manual QA gate in the pipeline? If changes wait in a QA queue, the manual gate
is the bottleneck. Start with
Manual Regression Testing Gates.
Ready to fix this? The most common cause is Inverted Test Pyramid. Start with its How to Fix It section for week-by-week steps.
1.3 - Merging Is Painful and Time-Consuming
Integration is a dreaded, multi-day event. Teams delay merging because it is painful, which makes the next merge even worse.
What you are seeing
A developer has been working on a feature branch for two weeks. They open a pull request and
discover dozens of conflicts across multiple files. Other developers have changed the same areas
of the codebase. Resolving the conflicts takes a full day. Some conflicts are straightforward
(two people edited adjacent lines), but others are semantic (two people changed the same
function’s behavior in different ways). The developer must understand both changes to merge
correctly.
After resolving conflicts, the tests fail. The merged code compiles but does not work because the
two changes are logically incompatible. The developer spends another half-day debugging the
interaction. By the time the branch is merged, the developer has spent more time integrating than
they spent building the feature.
The team knows merging is painful, so they delay it. The delay makes the next merge worse because
more code has diverged. The cycle repeats until someone declares a “merge day” and the team spends
an entire day resolving accumulated drift.
Common causes
Long-Lived Feature Branches
When branches live for weeks or months, they accumulate divergence from the main line. The longer
the branch lives, the more changes happen on main that the branch does not include. At merge time,
all of that divergence must be reconciled at once. A branch that is one day old has almost no
conflicts. A branch that is two weeks old may have dozens.
Read more: Long-Lived Feature Branches
Integration Deferred
When the team does not practice continuous integration (integrating to main at least daily), each
developer’s work diverges independently. The build may be green on each branch but broken when
branches combine. CI means integrating continuously, not running a build server. Without frequent
integration, merge pain is inevitable.
Read more: Integration Deferred
Monolithic Work Items
When work items are too large to complete in a day or two, developers must stay on a branch for
the duration. A story that takes a week forces a week-long branch. Breaking work into smaller
increments that can be integrated daily eliminates the divergence window that causes painful
merges.
Read more: Monolithic Work Items
How to narrow it down
- How long do branches typically live before merging? If branches live longer than two days,
the branch lifetime is the primary driver of merge pain. Start with
Long-Lived Feature Branches.
- Does the team integrate to main at least once per day? If developers work in isolation for
days before integrating, they are not practicing continuous integration regardless of whether a
CI server exists. Start with
Integration Deferred.
- How large are the typical work items? If stories take a week or more, the work
decomposition forces long branches. Start with
Monolithic Work Items.
Ready to fix this? The most common cause is Long-Lived Feature Branches. Start with its How to Fix It section for week-by-week steps.
1.4 - Each Language Has Its Own Ad Hoc Pipeline
Services in five languages with five build tools and no shared pipeline patterns. Each service is a unique operational snowflake.
What you are seeing
The Java service has a Jenkins pipeline set up four years ago. The Python service has a GitHub Actions workflow written by a consultant. The Go service has a Makefile. The Node.js service deploys from a developer’s laptop. The Ruby service has no deployment automation at all. Each pipeline is a one-off, maintained by whoever last touched it.
Onboarding a new engineer requires learning five different deployment systems. Fixing a security vulnerability in the dependency scanning step requires five separate changes across five pipeline definitions, each with different syntax. A compliance requirement that all services log deployment events requires five separate implementations, each time reinventing the pattern.
The team knows consolidation would help but cannot agree on a standard. The Java developers prefer their workflow. The Python developers prefer theirs. The effort to migrate any service to a common pattern feels risky because the current approach, however ad hoc, is known to work.
Common causes
Missing deployment pipeline
Without an organizational standard for pipeline design, each team or individual who sets up a service makes an independent choice based on personal familiarity. Establishing a standard pipeline pattern - even a minimal one - gives new services a starting point and gives existing services a target to migrate toward. Each service that adopts the standard is one fewer ad hoc pipeline to maintain separately.
Read more: Missing deployment pipeline
Knowledge silos
Each pipeline is understood only by the person who built it. Changes require that person. Debugging requires that person. When that person leaves, the pipeline becomes a black box that nobody wants to touch. The knowledge of “how the Ruby service deploys” is not shared across the team.
When pipeline patterns are standardized and documented, any team member can understand, debug, and improve any service’s pipeline. The knowledge is in the pattern, not in the person.
Read more: Knowledge silos
Manual deployments
Services that start with manual deployment accumulate automation piecemeal, in whatever form the person adding automation prefers. Without a standard, each automation effort produces a different result. The accumulation of five different automation approaches is harder to maintain than one standard approach applied to five services.
Read more: Manual deployments
How to narrow it down
- Does the team have a standard pipeline pattern that all services follow? If each service has a unique pipeline structure, start with establishing the standard. Start with Missing deployment pipeline.
- Can any engineer on the team deploy any service? If deploying a specific service requires the person who set it up, the pipeline knowledge is siloed. Start with Knowledge silos.
- Are there services with no deployment automation at all? Start with those services. Start with Manual deployments.
Ready to fix this? The most common cause is Missing deployment pipeline. Start with its How to Fix It section for week-by-week steps.
1.5 - Pull Requests Sit for Days Waiting for Review
Pull requests queue up and wait. Authors have moved on by the time feedback arrives.
What you are seeing
A developer opens a pull request and waits. Hours pass. A day passes. They ping someone in chat.
Eventually, comments arrive, but the author has moved on to something else and has to reload
context to respond. Another round of comments. Another wait. The PR finally merges two or three
days after it was opened.
The team has five or more open PRs at any time. Some are days old. Developers start new work
while they wait, which creates more PRs, which creates more review load, which slows reviews
further.
Common causes
Long-Lived Feature Branches
When developers work on branches for days, the resulting PRs are large. Large PRs take longer to
review because reviewers need more time to understand the scope of the change. A 300-line PR is
daunting. A 50-line PR takes 10 minutes. The branch length drives the PR size, which drives the
review delay.
Read more: Long-Lived Feature Branches
Knowledge Silos
When only specific individuals can review certain areas of the codebase, those individuals become
bottlenecks. Their review queue grows while other team members who could review are not
considered qualified. The constraint is not review capacity in general but review capacity for
specific code areas concentrated in too few people.
Read more: Knowledge Silos
Push-Based Work Assignment
When work is assigned to individuals, reviewing someone else’s code feels like a distraction
from “my work.” Every developer has their own assigned stories to protect. Helping a teammate
finish their work by reviewing their PR competes with the developer’s own assignments. The
incentive structure deprioritizes collaboration.
Read more: Push-Based Work Assignment
How to narrow it down
- Are PRs larger than 200 lines on average? If yes, the reviews are slow because the
changes are too large to review quickly. Start with
Long-Lived Feature Branches
and the work decomposition that feeds them.
- Are reviews waiting on specific individuals? If most PRs are assigned to or waiting on
one or two people, the team has a knowledge bottleneck. Start with
Knowledge Silos.
- Do developers treat review as lower priority than their own coding work? If yes, the
team’s norms do not treat review as a first-class activity. Start with
Push-Based Work Assignment and
establish a team working agreement that reviews happen before starting new work.
Ready to fix this? The most common cause is Long-Lived Feature Branches. Start with its How to Fix It section for week-by-week steps.
1.6 - The Team Resists Merging to the Main Branch
Developers feel unsafe committing to trunk. Feature branches persist for days or weeks before merge.
What you are seeing
Everyone still has long-lived feature branches. The team agreed to try trunk-based development, but three sprints later the informal rule is still “merge to trunk when the feature is done.” Branches live for days or weeks. When developers finally merge, there are conflicts. The conflicts take hours to resolve. Everyone agrees this is a problem but nobody knows how to break the cycle.
The core objection is safety: “I’m not going to push half-finished code to main.” This is a reasonable concern in the current environment. The main branch has no automated test suite that would catch regressions quickly. There is no feature flag infrastructure to let partially built features live in production in a dormant state. Trunk-based development feels reckless because the prerequisites for it are not in place.
The team is not wrong to feel unsafe. They are wrong to believe long-lived branches are safer. The longer a branch lives, the larger the eventual merge, the more conflicts, and the more risk concentrated into the merge event. The fear of merging to trunk is rational, but the response makes the underlying problem worse.
Common causes
Manual testing only
Without a fast automated test suite, merging to trunk means accepting unknown risk. Developers protect themselves by deferring the merge until they have done sufficient manual verification - which takes days. Teams with a fast automated suite that runs in minutes find the resistance dissolves. When a broken commit is caught in five minutes, committing to trunk stops feeling reckless and starts feeling like the obvious way to work.
Read more: Manual testing only
Manual regression testing gates
When a manual QA phase gates each release, trunk is never truly releasable. Merging to trunk does not mean the code is production-ready - it still has to pass manual testing. This reduces the psychological pressure to keep trunk releasable. The team does not feel the cost of a broken trunk immediately because it is not the signal they monitor.
When trunk is the thing that gates production, a broken trunk is a fire drill - every minute it is broken is a minute the team cannot ship. That urgency is what makes developers take frequent integration seriously. Without it, the resistance to committing to trunk has no natural counter-pressure.
Read more: Manual regression testing gates
Long-lived feature branches
Feature branch habits are self-reinforcing. Teams with ingrained feature branch practices have calibrated their workflows, tools, and feedback loops to the batching model. Switching to trunk-based development requires changing all of those workflows simultaneously, which is disorienting.
The habits that make long-lived branches feel safe - waiting to merge until the feature is complete, doing final testing on the branch, getting full review before touching trunk - are the same habits that keep the resistance alive. Small, deliberate workflow changes - reviewing smaller units, integrating while work is in progress, getting feedback from the pipeline rather than a gated review - reduce the resistance step by step rather than requiring an all-at-once mindset shift.
Read more: Long-lived feature branches
Monolithic work items
Large work items cannot be integrated to trunk incrementally without deliberate design. A story that takes three weeks requires either keeping a branch for three weeks or learning to hide in-progress work behind feature flags, dark launch patterns, or abstraction layers. Without those techniques, large items force long-lived branches.
Decomposing work into smaller items that can be integrated to trunk in a day or two makes trunk-based development natural rather than effortful.
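Hiding in-progress work behind a flag can be as small as a guarded branch in code. A minimal sketch, assuming an environment-variable flag (real teams typically use a flag service; all function and flag names here are illustrative):

```python
import os

# Minimal feature-flag check. Real systems usually consult a flag service;
# an environment variable is enough to illustrate the pattern.
def flag_enabled(name, default=False):
    value = os.environ.get(f"FLAG_{name.upper()}", "")
    return value == "1" if value else default

def checkout(cart):
    # The half-built express path ships to production with every integration,
    # but stays dormant until the flag is turned on.
    if flag_enabled("express_checkout"):
        return express_checkout(cart)   # in-progress work, merged daily
    return standard_checkout(cart)      # existing behavior, unchanged

def standard_checkout(cart):
    return {"flow": "standard", "items": len(cart)}

def express_checkout(cart):
    return {"flow": "express", "items": len(cart)}
```

Because the dormant path is merged daily, it stays integrated with everyone else's changes; the flag flip at the end is a configuration change, not a merge event.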
Read more: Monolithic work items
How to narrow it down
- Does the team have an automated test suite that runs in under 10 minutes? If not, the feedback loop needed to make frequent trunk commits safe does not exist. Start with Manual testing only.
- Is trunk always releasable? If releases require a manual QA phase regardless of trunk state, there is no incentive to keep trunk releasable. Start with Manual regression testing gates.
- Do work items typically take more than two days to complete? If items take longer than two days, integrating to trunk daily requires techniques for hiding in-progress work. Start with Monolithic work items.
Ready to fix this? The most common cause is Long-lived feature branches. Start with its How to Fix It section for week-by-week steps.
1.7 - Pipelines Take Too Long
Pipelines take 30 minutes or more. Developers stop waiting and lose the feedback loop.
What you are seeing
A developer pushes a commit and waits. Thirty minutes pass. An hour. The pipeline is still
running. The developer context-switches to another task, and by the time the pipeline finishes
(or fails), they have moved on mentally. If the build fails, they must reload context, figure out
what went wrong, fix it, push again, and wait another 30 minutes.
Developers stop running the full test suite locally because it takes too long. They push and hope.
Some developers batch multiple changes into a single push to avoid waiting multiple times, which
makes failures harder to diagnose. Others skip the pipeline entirely for small changes and merge
with only local verification.
The pipeline was supposed to provide fast feedback. Instead, it provides slow feedback that
developers work around rather than rely on.
Common causes
Inverted Test Pyramid
When most of the test suite consists of end-to-end or integration tests rather than unit tests,
the pipeline is dominated by slow, resource-intensive test execution. E2E tests launch browsers,
spin up services, and wait for network responses. A test suite with thousands of unit tests (that
run in seconds) and a small number of targeted E2E tests is fast. A suite with hundreds of E2E
tests and few unit tests is slow by construction.
Read more: Inverted Test Pyramid
Snowflake Environments
When pipeline environments are not standardized or reproducible, builds include extra time for
environment setup, dependency installation, and configuration. Caching is unreliable because the
environment state is unpredictable. A pipeline that spends 15 minutes downloading dependencies
because there is no reliable cache layer is slow for infrastructure reasons, not test reasons.
Read more: Snowflake Environments
Tightly Coupled Monolith
When the codebase has no clear module boundaries, every change triggers a full rebuild and a full
test run. The pipeline cannot selectively build or test only the affected components because the
dependency graph is tangled. A change to one module might affect any other module, so the pipeline
must verify everything.
Read more: Tightly Coupled Monolith
Manual Regression Testing Gates
When the pipeline includes a manual testing phase, the wall-clock time from push to green
includes human wait time. A pipeline that takes 10 minutes to build and test but then waits two
days for manual sign-off is not a 10-minute pipeline. It is a two-day pipeline with a 10-minute
automated prefix.
Read more: Manual Regression Testing Gates
How to narrow it down
- What percentage of pipeline time is spent running tests? If test execution dominates and
most tests are E2E or integration tests, the test strategy is the bottleneck. Start with
Inverted Test Pyramid.
- How much time is spent on environment setup and dependency installation? If the pipeline
spends significant time on infrastructure before any tests run, the build environment is the
bottleneck. Start with
Snowflake Environments.
- Can the pipeline build and test only the changed components? If every change triggers a
full rebuild, the architecture prevents selective testing. Start with
Tightly Coupled Monolith.
- Does the pipeline include any manual steps? If a human must approve or act before the
pipeline completes, the human is the bottleneck. Start with
Manual Regression Testing Gates.
Ready to fix this? The most common cause is Inverted Test Pyramid. Start with its How to Fix It section for week-by-week steps.
1.8 - The Team Is Caught Between Shipping Fast and Not Breaking Things
A cultural split between shipping speed and production stability. Neither side sees how CD resolves the tension.
What you are seeing
The team is divided. Developers want to ship often and trust that fast feedback will catch problems. Operations and on-call engineers want stability and fewer changes to reason about during incidents. Both positions are defensible. The conflict is real and recurs in every conversation about deployment frequency, change windows, and testing requirements.
The team has reached an uncomfortable equilibrium. Developers batch changes to deploy less often, which partially satisfies the stability concern but creates larger, riskier releases. Operations accepts the change window constraints, which gives them predictability but means the team cannot respond quickly to urgent fixes. Nobody is getting what they actually want.
What neither side sees is that the conflict is a symptom of the current deployment system, not an inherent tradeoff. Deployments are risky because they are large and infrequent. They are large and infrequent because of the process and tooling around them. A system that makes deployments small, fast, automated, and reversible changes the equation: frequent small changes are less risky than infrequent large ones.
Common causes
Manual deployments
Manual deployments are slow and error-prone, which makes the stability concern rational. When deployments require hours of careful manual execution, limiting their frequency does reduce overall human error exposure. The stability faction’s instinct is correct given the current deployment mechanism.
Automated deployments that execute the same steps identically every time eliminate most human error from the deployment process. When the deployment mechanism is no longer a variable, the speed-vs-stability argument shifts from “how often should we deploy” to “how good is the code we are deploying” - a question both sides can agree on.
Read more: Manual deployments
Missing deployment pipeline
Without a pipeline with automated tests, health checks, and rollback capability, the stability concern is valid. Each deployment is a manual, unverified process that could go wrong in novel ways. A pipeline that enforces quality gates before production and detects problems immediately after deployment changes the risk profile of frequent deployments fundamentally.
When the team can deploy with high confidence and roll back automatically if something goes wrong, the frequency of deployments stops being a risk factor. The risk per deployment is low when each deployment is small, tested, and reversible.
Read more: Missing deployment pipeline
Pressure to skip testing
When testing is perceived as an obstacle to shipping speed, teams cut tests to go faster. This worsens stability, which intensifies the stability faction’s resistance to more frequent deployments. The speed-vs-stability tension is partly created by the belief that quality and speed are in opposition - a belief reinforced by the experience of shipping faster by skipping tests and then dealing with the resulting production incidents.
Read more: Pressure to skip testing
Deadline-driven development
When velocity is measured by features shipped to a deadline, every hour spent on test infrastructure, deployment automation, or operational excellence is an hour not spent on the deadline. The incentive structure creates the tension by rewarding speed while penalizing the investment that would make speed safe.
Read more: Deadline-driven development
How to narrow it down
- Is the deployment process automated and consistent? If deployments are manual and variable, the stability concern is about process risk, not just code risk. Start with Manual deployments.
- Does the team have automated testing and fast rollback? Without these, deploying frequently is genuinely riskier than deploying infrequently. Start with Missing deployment pipeline.
- Does management pressure the team to ship faster by cutting testing? If yes, the tension is being created from above rather than within the team. Start with Pressure to skip testing.
Ready to fix this? The most common cause is Manual deployments. Start with its How to Fix It section for week-by-week steps.
2 - Work Management and Flow Problems
WIP overload, cycle time, planning bottlenecks, and dependency coordination problems.
Symptoms related to how work is planned, prioritized, and moved through the delivery process.
2.1 - Blocked Work Sits Idle Instead of Being Picked Up
When a developer is stuck, the item waits with them rather than being picked up by someone else. The team has no mechanism for redistributing blocked work.
What you are seeing
A developer opens a ticket on Monday and hits a blocker by Tuesday - a missing dependency, an
unclear requirement, an area of the codebase they don’t understand well. They flag it in standup.
The item sits in “in progress” for two more days while they work around the blocker or wait for
it to resolve. Nobody picks it up.
The board shows items stuck in the same column for days. Blockers get noted but rarely acted on
by other team members. At sprint review, several items are “almost done” but not finished - each
stalled at a different blocker that a teammate could have resolved quickly.
Common causes
Push-Based Work Assignment
When work belongs to an assigned individual, nobody else feels authorized to touch it. Other team
members see the blocked item but do not pick it up because it is “someone else’s story.” The
assigned developer is expected to resolve their own blockers, even when a teammate could clear
the issue in minutes. The team’s norm is individual ownership, so swarming - the highest-value
response to a blocker - never happens.
Read more: Push-Based Work Assignment
Knowledge Silos
When only the assigned developer understands the relevant area of the codebase, other team
members cannot help even when they want to. The blocker persists until the assigned person
resolves it because nobody else has the context to take over. Swarming is not possible because
the knowledge needed to continue the work lives in one person.
Read more: Knowledge Silos
How to narrow it down
- Does the blocked item sit with the assigned developer rather than being picked up by
someone else? If teammates see the blocker flagged in standup and do not act on it, the
norm of individual ownership is preventing swarming. Start with
Push-Based Work Assignment.
- Could a teammate help if they had more context about that area of the codebase? If
knowledge is too concentrated to allow handoff, silos are compounding the problem. Start with
Knowledge Silos.
Ready to fix this? The most common cause is Push-Based Work Assignment. Start with its How to Fix It section for week-by-week steps.
2.2 - Completed Stories Don't Match What Was Needed
Stories are marked done but rejected at review. The developer built what the ticket described, not what the business needed.
What you are seeing
A developer finishes a story and moves it to done. The product owner reviews it and sends it
back: “This isn’t quite what I meant.” The implementation is technically correct - it satisfies
the acceptance criteria as written - but it misses the point of the work. The story re-enters
the sprint as rework, consuming time that was not planned for.
This happens repeatedly with the same pattern: the developer built exactly what was described
in the ticket, but the ticket did not capture the underlying need. Stories that seemed clearly
defined come back with significant revisions. The team’s velocity looks reasonable but a
meaningful fraction of that work is being done twice.
Common causes
Push-Based Work Assignment
When work is assigned rather than pulled, the developer receives a ticket without the context
behind it. They were not in the conversation where the need was identified, the priority was
established, or the trade-offs were discussed. They implement the ticket as written and deliver
something that satisfies the description but not the intent.
In a pull system, developers engage with the backlog before picking up work. Refinement
discussions and Three Amigos sessions happen with the people who will actually do the work, not
with whoever happens to be assigned later. The developer who pulls a story understands why it is
at the top of the backlog and what outcome it is trying to achieve.
Read more: Push-Based Work Assignment
Ambiguous Requirements
When acceptance criteria are written as checklists rather than as descriptions of user outcomes,
they can be satisfied without delivering value. A story that specifies “add a confirmation dialog”
can be implemented in a way that technically adds the dialog but makes it unusable. Requirements
that do not express the user’s goal leave room for implementations that miss the point.
Read more: Work Decomposition
How to narrow it down
- Did the developer have any interaction with the product owner or user before starting the
story? If the developer received only a ticket with no conversation about context or intent,
the assignment model is isolating them from the information they need. Start with
Push-Based Work Assignment.
- Are the acceptance criteria expressed as user outcomes or as implementation checklists?
If criteria describe what to build rather than what the user should be able to do, the
requirements do not encode intent. Start with
Work Decomposition and
look at how stories are written and refined.
Ready to fix this? The most common cause is Push-Based Work Assignment. Start with its How to Fix It section for week-by-week steps.
2.3 - Stakeholders See Working Software Only at Release Time
There is no cadence for incremental demos. Feedback on what was built arrives months after decisions were made.
What you are seeing
Stakeholders do not see working software until a feature is finished. The team works for six weeks on a new feature, demonstrates it at the sprint review, and the response is: “This is good, but what we actually needed was slightly different. Can we change the navigation so it does X? And actually, we do not need this section at all.” Six weeks of work needs significant rethinking. The changes are scoped as follow-on work for the next planning cycle.
The problem is not that stakeholders gave bad requirements. It is that requirements look different when demonstrated as working software rather than described in user stories. Stakeholders genuinely did not know what they wanted until they saw what they said they wanted. This is normal and expected. The system that would make this feedback cheap - frequent demonstrations of small working increments - is not in place.
When stakeholder feedback arrives months after decisions, course corrections are expensive. Architecture that needs to change has had months of work built on top of it. The initial decisions have become load-bearing walls. Rework is disproportionate to the insight that triggered it.
Common causes
Monolithic work items
Large work items are not demonstrable until they are complete. A feature that takes six weeks cannot be shown incrementally because it is not useful in partial form. Stakeholders see nothing for six weeks and then see everything at once.
Small vertical slices can be demonstrated as soon as they are done - sometimes multiple times per week. Each slice is a unit of working, demonstrable software that stakeholders can evaluate and respond to while the team is still in the context of that work.
Read more: Monolithic work items
Horizontal slicing
When work is organized by technical layer, nothing is demonstrable until all layers are complete. An API layer with no UI and a UI component that calls no API are both invisible to stakeholders. The feature exists in pieces that stakeholders cannot evaluate individually.
Vertical slices deliver thin but complete functionality that stakeholders can actually use. Each slice has a visible outcome rather than a technical contribution to a future visible outcome.
Read more: Horizontal slicing
Undone work
When the definition of “done” does not include deployed and available for stakeholder review, work piles up as “done but not shown.” The sprint review demonstrates a batch of completed work rather than continuously integrated increments. The delay between completion and review is the source of the feedback lag.
When done means deployed - and the team can demonstrate software in a production-like environment at any sprint review - the feedback loop tightens to the sprint cadence rather than the release cadence.
Read more: Undone work
Deadline-driven development
When delivery is organized around fixed dates rather than continuous value delivery, stakeholder checkpoints are scheduled at release boundaries. The mid-quarter check-in is a status update, not a demonstration of working software. Stakeholders’ ability to redirect the team’s work is limited to the brief window around each release.
Read more: Deadline-driven development
How to narrow it down
- Can the team demonstrate working software every sprint, not just at release? If demos require a release, work is batched too long. Start with Undone work.
- Do stories regularly take more than one sprint to complete? If features are too large to show incrementally, start with Monolithic work items.
- Are stories organized by technical layer? If the UI team and the API team must both finish before anything can be demonstrated, start with Horizontal slicing.
Ready to fix this? The most common cause is Monolithic work items. Start with its How to Fix It section for week-by-week steps.
2.4 - Sprint Planning Is Dominated by Dependency Negotiation
Teams can’t start work until another team finishes something. Planning sessions map dependencies rather than commit to work.
What you are seeing
Sprint planning takes hours. Half the time is spent mapping dependencies: Team A cannot start story X until Team B delivers API Y. Team B cannot deliver that until Team C finishes infrastructure work Z. The board fills with items in “blocked” status before the sprint begins. Developers spend Monday through Wednesday waiting for upstream deliverables and then rush everything on Thursday and Friday.
The dependency graph is not stable. It changes every sprint as new work surfaces new cross-team requirements. Planning sessions produce a list of items the team hopes to complete, contingent on factors outside their control. Commitments are made with invisible asterisks. When something slips - and something always slips - the team negotiates whether the miss was their fault or the fault of a dependency.
The structural problem is that teams are organized around technical components or layers rather than around end-to-end capabilities. A feature that delivers value to a user requires work from three teams because no single team owns the full stack for that capability. The teams are coupled by the feature, even if the architecture nominally separates them.
Common causes
Tightly coupled monolith
When services or components are tightly coupled, changes to one require coordinated changes in others. A change to the data model requires the API team to update their queries, which requires the frontend team to update their calls. Teams working on different parts of a tightly coupled system cannot proceed independently because the code does not allow it.
Decomposed systems with stable interfaces allow teams to work against contracts rather than against each other’s code. When an interface is stable, the consuming team can proceed without waiting for the providing team to finish. The items that spent a sprint sitting in “blocked” status start moving again because the code no longer requires the other team to act first.
Read more: Tightly coupled monolith
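As a rough illustration of working against a contract rather than another team's code, the consuming team can define the agreed interface and develop against a stub until the real service ships. The `InventoryApi` name and its method are invented for this sketch, not taken from any real system:

```python
from typing import Protocol


class InventoryApi(Protocol):
    """The agreed contract: the providing team implements this interface."""

    def stock_level(self, sku: str) -> int: ...


class StubInventoryApi:
    """Stand-in the consuming team codes against until the real client exists."""

    def __init__(self, levels: dict[str, int]):
        self._levels = levels

    def stock_level(self, sku: str) -> int:
        return self._levels.get(sku, 0)


def can_fulfil(order_qty: int, sku: str, api: InventoryApi) -> bool:
    """Consumer logic written purely against the contract."""
    return api.stock_level(sku) >= order_qty


# The consuming team tests their logic today and swaps in the real client
# later without changing can_fulfil.
stub = StubInventoryApi({"sku-123": 5})
print(can_fulfil(3, "sku-123", stub))  # True
print(can_fulfil(9, "sku-123", stub))  # False
```

Because `can_fulfil` depends only on the interface, the providing team's delivery date stops gating the consuming team's sprint.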
Distributed monolith
Services that are nominally independent but require coordinated deployment create the same dependency patterns as a monolith. Teams that own different services in a distributed monolith cannot ship independently. Every feature delivery is a joint operation involving multiple teams whose services must change and deploy together.
Services that are genuinely independent can be changed, tested, and deployed without coordination. True service independence is a prerequisite for team independence. Sprint planning stops being a dependency negotiation session when each team’s services can ship without waiting on another team’s deployment schedule.
Read more: Distributed monolith
Horizontal slicing
When teams are organized by technical layer - front end, back end, database - every user-facing feature requires coordination across all teams. The frontend team needs the API before they can build the UI. The API team needs the database schema before they can write the queries. No team can deliver a complete feature independently.
Organizing teams around vertical slices of capability - a team that owns the full stack for a specific domain - eliminates most cross-team dependencies. The team that owns the feature can deliver it without waiting on other teams.
Read more: Horizontal slicing
Monolithic work items
Large work items have more opportunities to intersect with other teams’ work. A story that takes one week and touches the data layer, the API layer, and the UI layer requires coordination with three teams at three different times. Smaller items scoped to a single layer or component can often be completed within one team without external dependencies.
Decomposing large items into smaller, more self-contained pieces reduces the surface area of cross-team interaction. Even when teams remain organized by layer, smaller items spend less time in blocked states.
Read more: Monolithic work items
How to narrow it down
- Does changing one team’s service require changing another team’s service? If interface changes cascade across teams, the services are coupled. Start with Tightly coupled monolith.
- Must multiple services deploy simultaneously to deliver a feature? If services cannot be deployed independently, the architecture is the constraint. Start with Distributed monolith.
- Does each team own only one technical layer? If no team can deliver end-to-end functionality, the organizational structure creates dependencies. Start with Horizontal slicing.
- Are work items frequently blocked waiting on another team’s deliverable? If items spend more time blocked than in progress, decompose items to reduce cross-team surface area. Start with Monolithic work items.
Ready to fix this? The most common cause is Tightly coupled monolith. Start with its How to Fix It section for week-by-week steps.
2.5 - Everything Started, Nothing Finished
The board shows many items in progress but few reaching done. The team is busy but not delivering.
What you are seeing
Open the team’s board on any given day. Count the items in progress. Count the team members. If
the first number is significantly higher than the second, the team has a WIP problem. Every
developer is working on a different story. Eight items in progress, zero done. Nothing gets the
focused attention needed to finish.
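The count described above is simple enough to automate against a board export. This sketch assumes a hypothetical export where each item carries a `status` field; the field names are illustrative, not any particular tool's schema:

```python
def wip_exceeds_team(board: list[dict], team_size: int) -> bool:
    """True when in-progress items outnumber team members - the quick WIP smell test."""
    wip = sum(1 for item in board if item["status"] == "in_progress")
    return wip > team_size


# Eight items in progress, one done, four developers: a WIP problem.
board = [{"id": i, "status": "in_progress"} for i in range(8)]
board.append({"id": 99, "status": "done"})
print(wip_exceeds_team(board, team_size=4))  # True
```

A check like this run daily makes the WIP trend visible before the end-of-sprint scramble does.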
At the end of the sprint, there is a scramble to close anything. Stories that were “almost done”
for days finally get pushed through. Cycle time is long and unpredictable. The team is busy all
the time but finishes very little.
Common causes
Push-Based Work Assignment
When managers assign work to individuals rather than letting the team pull from a prioritized
backlog, each person ends up with their own queue of assigned items. WIP grows because work is
distributed across individuals rather than flowing through the team. Nobody swarms on blocked
items because everyone is busy with “their” assigned work.
Read more: Push-Based Work Assignment
Horizontal Slicing
When work is split by technical layer (“build the database schema,” “build the API,” “build the
UI”), each layer must be completed before anything is deployable. Multiple developers work on
different layers of the same feature simultaneously, all “in progress,” none independently done.
WIP is high because the decomposition prevents any single item from reaching completion quickly.
Read more: Horizontal Slicing
Unbounded WIP
When the team has no explicit constraint on how many items can be in progress simultaneously,
there is nothing to prevent WIP from growing. Developers start new work whenever they are
blocked, waiting for review, or between tasks. Without a limit, the natural tendency is to stay
busy by starting things rather than finishing them.
Read more: Unbounded WIP
How to narrow it down
- Does each developer have their own assigned backlog of work? If yes, the assignment model
prevents swarming and drives individual queues. Start with
Push-Based Work Assignment.
- Are work items split by technical layer rather than by user-visible behavior? If yes,
items cannot be completed independently. Start with
Horizontal Slicing.
- Is there any explicit limit on how many items can be in progress at once? If no, the team
has no mechanism to stop starting and start finishing. Start with
Unbounded WIP.
Ready to fix this? The most common cause is Push-Based Work Assignment. Start with its How to Fix It section for week-by-week steps.
2.6 - Vendor Release Cycles Constrain the Team's Deployment Frequency
Upstream systems deploy quarterly or downstream consumers require advance notice. External constraints set the team’s release schedule.
What you are seeing
The team is ready to deploy. But the upstream payment provider releases their API once a quarter and the new version the team depends on is not live yet. Or the downstream enterprise consumer the team integrates with requires 30 days advance notice before any API change goes live. The team’s own deployment readiness is irrelevant - external constraints set the schedule.
The team adapts by aligning their release cadence with their most constraining external dependency. If one vendor deploys quarterly, the team deploys quarterly. Every advance the team makes in internal deployment speed is nullified by the external constraint. The most sophisticated internal pipeline in the world still produces a team that ships four times per year.
Some external constraints are genuinely fixed. A payment network’s settlement schedule, regulatory reporting requirements, hardware firmware update cycles - these cannot be accelerated. But many “external” constraints turn out to be negotiable, avoidable through an abstraction layer, or simply assumed to be fixed without ever being tested.
Common causes
Tightly coupled monolith
When the team’s system is tightly coupled to third-party systems at the technical level, any change to either side requires coordinated deployment. The integration code is tightly bound to specific vendor API versions, specific response shapes, specific timing assumptions. Wrapping third-party integrations in adapter layers creates the abstraction needed to deploy the team’s side independently.
An adapter that isolates the team’s code from vendor-specific details can handle multiple API versions simultaneously. The team can deploy their adapter update, leaving the old vendor path active until the vendor’s new version is available, then switch.
Read more: Tightly coupled monolith
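A minimal sketch of the adapter idea, with invented field names standing in for two vendor API response versions - no real vendor's schema is implied:

```python
def normalize_charge(payload: dict) -> dict:
    """Adapter: translate either vendor response version into our internal shape.

    All field names here are hypothetical examples.
    """
    if "amount_minor" in payload:  # v2 shape: integer minor units, short currency key
        return {"amount_cents": payload["amount_minor"], "currency": payload["ccy"]}
    # v1 shape: decimal major units
    return {
        "amount_cents": int(payload["amount"] * 100),
        "currency": payload["currency"],
    }


# Both versions normalize to the same internal shape, so the team can deploy
# the adapter before the vendor's v2 rollout and switch paths without a
# coordinated release.
print(normalize_charge({"amount": 12.5, "currency": "USD"}))
print(normalize_charge({"amount_minor": 1250, "ccy": "USD"}))
```

Everything downstream of the adapter sees one shape, which is what decouples the team's deploy schedule from the vendor's.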
Distributed monolith
When the team’s services must be deployed in coordination with other systems - whether internal or external - the coupling forces joint releases. Each deployment event becomes a multi-party coordination exercise. The team cannot ship independently because their services are not actually independent.
Services that expose stable interfaces and handle both old and new protocol versions simultaneously can be deployed and upgraded without coordinating with consumers. That interface stability is what removes the external constraint: the team can ship on their own schedule because changing one side no longer requires the other side to change at the same time.
Read more: Distributed monolith
Missing deployment pipeline
Without a pipeline, there is no mechanism for gradual migrations - running old and new integration paths simultaneously during a transition period. Switching to a new vendor API requires deploying new code that breaks old behavior unless both paths are maintained in parallel.
A pipeline with feature flag support can activate the new vendor integration for a subset of traffic, validate it against real load, and then complete the migration when confidence is established. This decouples the team’s deployment from the vendor’s release schedule.
Read more: Missing deployment pipeline
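One way such a gradual cutover might look, sketched with a deterministic hash-based flag. The mechanics are illustrative - real feature-flag systems provide the same primitive with better ergonomics:

```python
import hashlib


def use_new_integration(customer_id: str, rollout_pct: int) -> bool:
    """Deterministic percentage rollout: the same customer always lands in the
    same bucket, so behavior is stable while confidence builds."""
    bucket = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_pct


def charge(customer_id: str, rollout_pct: int) -> str:
    if use_new_integration(customer_id, rollout_pct):
        return "new-vendor-path"  # placeholder for the new vendor API call
    return "old-vendor-path"      # placeholder for the existing call


print(charge("cust-42", rollout_pct=0))    # old-vendor-path
print(charge("cust-42", rollout_pct=100))  # new-vendor-path
```

Raising `rollout_pct` in small steps lets the team validate the new integration against real traffic and complete the migration on their own schedule.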
How to narrow it down
- Is the team’s code tightly bound to specific vendor API versions? If the integration cannot handle multiple vendor versions simultaneously, every vendor change requires a coordinated deployment. Start with Tightly coupled monolith.
- Must the team coordinate deployment timing with external parties? If yes, the interfaces between systems do not support independent deployment. Start with Distributed monolith.
- Can the team run old and new integration paths simultaneously? If switching to a new vendor version is a hard cutover, the pipeline does not support gradual migration. Start with Missing deployment pipeline.
Ready to fix this? The most common cause is Tightly coupled monolith. Start with its How to Fix It section for week-by-week steps.
2.7 - Services in the Same Portfolio Have Wildly Different Maturity Levels
Some services have full pipelines and coverage. Others have no tests and are deployed manually. No consistent baseline exists.
What you are seeing
Some services have full pipelines, comprehensive test coverage, automated deployment, and monitoring dashboards. Others have no tests, no pipeline, and are deployed by copying files onto a server. Both sit in the same team’s portfolio. The team’s CD practices apply to the modern ones. The legacy ones exist outside them.
Improving the legacy services feels impossible to prioritize. They are not blocking any immediate feature work. The incidents they cause are infrequent enough to accept. Adding tests, setting up a pipeline, and improving the deployment process are multi-week investments with no immediate visible output. They compete for sprint capacity against features that have product owners and deadlines.
The maturity gap widens over time. The modern services get more capable as the team’s CD practices improve. The legacy ones stay frozen. Eventually they represent a liability: they cannot benefit from any of the team’s improved practices, they are too risky to touch, and they handle increasingly critical functionality as other services are modernized around them.
Common causes
Missing deployment pipeline
Services without pipelines cannot participate in the team’s CD practices. The pipeline is the foundation on which automated testing, deployment automation, and observability build. A service with no pipeline is a service that will always require manual attention for every change.
Establishing a minimal viable pipeline for every service - even if it just runs existing tests and provides a deployment command - closes the gap between the modern services and the legacy ones. A service with even a basic pipeline can participate in the team’s practices and improve from there; a service with no pipeline cannot improve at all.
Read more: Missing deployment pipeline
Thin-spread teams
Teams spread across too many services and responsibilities cannot allocate the focused investment needed to bring lower-maturity services up to standard. Each sprint, the urgency of visible work displaces the sustained effort that improvement requires. Investment in a legacy service delivers no value for weeks before the improvement becomes visible.
Teams with appropriate scope relative to capacity can allocate improvement time in each sprint. A team that owns two services instead of six can invest in both. A team that owns six has to accept that four will be neglected.
Read more: Thin-spread teams
How to narrow it down
- Does every service in the team’s portfolio have an automated deployment pipeline? If not, identify which services lack pipelines and why. Start with Missing deployment pipeline.
- Does the team have time to improve services that are not actively producing incidents? If improvement work is always displaced by feature or incident work, the team is spread too thin. Start with Thin-spread teams.
- Are there services the team owns but is afraid to touch? Fear of touching a service is a strong indicator that the service lacks the safety nets (tests, pipeline, documentation) needed for safe modification.
Ready to fix this? The most common cause is Missing deployment pipeline. Start with its How to Fix It section for week-by-week steps.
2.8 - Some Developers Are Overloaded While Others Wait for Work
Work is distributed unevenly across the team. Some developers are chronically overloaded while others finish early and wait for new assignments.
What you are seeing
Sprint planning ends with everyone assigned roughly the same number of story points. By midweek,
two developers have finished their work and are waiting for something new, while three others are
behind and working evenings to catch up. The imbalance repeats every sprint, but which
developers are overloaded shifts unpredictably.
At standup, some developers report being blocked or overwhelmed while others report nothing to
do. Managers respond by reassigning work in flight, which disrupts both the giver and the
receiver. The team’s throughput is limited by the most overloaded members even when others have
capacity.
Common causes
Push-Based Work Assignment
When managers distribute work at sprint planning, they are estimating in advance how long each
item will take and who is the right person for it. Those estimates are routinely wrong. Some
items take twice as long as expected; others finish in half the time. Because work was
pre-assigned, there is no mechanism for the team to self-balance. Fast finishers wait for new
assignments while slow finishers fall behind, regardless of available team capacity.
In a pull system, workloads balance automatically: whoever finishes first pulls the next
highest-priority item. No manager needs to predict durations or redistribute work mid-sprint.
Read more: Push-Based Work Assignment
Thin-Spread Teams
When a team is responsible for too many products or codebases, workload spikes in one area
cannot be absorbed by people working in another. Each developer is already committed to their
domain. The team cannot rebalance because work is siloed by system ownership rather than
flowing to whoever has capacity.
Read more: Thin-Spread Teams
How to narrow it down
- Does work get assigned at sprint planning and rarely change hands afterward? If
assignments are fixed at the start of the sprint and the team has no mechanism for
rebalancing mid-sprint, the assignment model is the root cause. Start with
Push-Based Work Assignment.
- Are developers unable to help with overloaded areas because they don’t know the codebase?
If the team cannot rebalance because knowledge is siloed, people are locked into their
assigned domain even when they have capacity. Start with
Thin-Spread Teams and
Knowledge Silos.
Ready to fix this? The most common cause is Push-Based Work Assignment. Start with its How to Fix It section for week-by-week steps.
2.9 - Work Stalls Waiting for the Platform or Infrastructure Team
Teams cannot provision environments, update configurations, or access infrastructure without filing a ticket and waiting for a separate platform or ops team to act.
What you are seeing
A team needs a new environment for testing, a configuration value updated, a database instance
provisioned, or a new service account created. They file a ticket. The platform team has its own
backlog and prioritization process. The ticket sits for two days, then a week. The team’s sprint
work is blocked until it is resolved. When the platform team delivers, there is a round of
back-and-forth because the request was not specific enough, and the team waits again.
This happens repeatedly across different types of requests: compute resources, network access,
environment variables, secrets, certificates, DNS entries. Each one is a separate ticket, a
separate queue, a separate wait. Developers learn to front-load requests at the beginning of
sprints to get ahead of the lead time, but the lead times shift and the requests still arrive
too late.
Common causes
Separate Ops/Release Team
When infrastructure and platform work is owned by a separate team, developers have no path to
self-service. Every infrastructure need becomes a cross-team request. The platform team is
optimizing its own backlog, which may not align with the delivery team’s priorities. The
structural separation means that the team doing the work and the team enabling the work have
different schedules, different priorities, and different definitions of urgency.
Read more: Separate Ops/Release Team
No On-Call or Operational Ownership
When delivery teams do not own their infrastructure and operational concerns, they have no
incentive or capability to build self-service tooling. The platform team owns the infrastructure
and therefore controls access to it. Teams that own their own operations build automation and
self-service interfaces because the cost of tickets falls on them. Teams that don’t own operations
accept the ticket queue because there is no alternative.
Read more: No On-Call or Operational Ownership
How to narrow it down
- Does the team file tickets for infrastructure changes that should take minutes? If
provisioning a test environment or updating a config value requires a cross-team request and
a multi-day wait, the team lacks self-service capability. Start with
Separate Ops/Release Team.
- Does the team own the operational concerns of what they build? If another team manages
production, monitoring, and infrastructure for the delivery team’s services, the delivery team
has no path to self-service. Start with
No On-Call or Operational Ownership.
Ready to fix this? The most common cause is Separate Ops/Release Team. Start with its How to Fix It section for week-by-week steps.
2.10 - Work Items Take Days or Weeks to Complete
Stories regularly take more than a week from start to done. Developers go days without integrating.
What you are seeing
A developer picks up a work item on Monday. By Wednesday, they are still working on it. By
Friday, it is “almost done.” The following Monday, they are fixing edge cases. The item finally
moves to review mid-week as a 300-line pull request that the reviewer does not have time to look
at carefully.
Cycle time is measured in weeks, not days. The team commits to work at the start of the sprint
and scrambles at the end. Estimates are off by a factor of two because large items hide unknowns
that only surface mid-implementation.
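Cycle time is easy to measure if the tracker exports start and done timestamps. A rough sketch, using hypothetical ISO-date exports:

```python
from datetime import datetime


def cycle_time_days(started: str, finished: str) -> float:
    """Days between start and done, from ISO-format timestamps."""
    delta = datetime.fromisoformat(finished) - datetime.fromisoformat(started)
    return delta.total_seconds() / 86400


# Invented sample items: (started, finished)
items = [("2024-03-04", "2024-03-13"), ("2024-03-05", "2024-03-19")]
times = [cycle_time_days(s, f) for s, f in items]
print(f"average cycle time: {sum(times) / len(times):.1f} days")  # 11.5 days
```

Tracking this number weekly turns "items feel slow" into a trend the team can act on when decomposing work.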
Common causes
Horizontal Slicing
When work is split by technical layer rather than by user-visible behavior, each item spans an
entire layer and takes days to complete. “Build the database schema,” “build the API,” “build the
UI” are each multi-day items. Nothing is deployable until all layers are done. Vertical slicing
(cutting thin slices through all layers to deliver complete functionality) produces items that
can be finished in one to two days.
Read more: Horizontal Slicing
Monolithic Work Items
When the team takes requirements as they arrive without breaking them into smaller pieces, work
items are as large as the feature they describe. A ticket titled “Add user profile page” hides
a login form, avatar upload, email verification, notification preferences, and password reset.
Without a decomposition practice during refinement, items arrive at planning already too large
to flow.
Read more: Monolithic Work Items
Long-Lived Feature Branches
When developers work on branches for days or weeks, the branch and the work item are the same
size: large. The branching model reinforces large items because there is no integration pressure
to finish quickly. Trunk-based development creates natural pressure to keep items small enough to
integrate daily.
Read more: Long-Lived Feature Branches
Push-Based Work Assignment
When work is assigned to individuals, swarming is not possible. If the assigned developer hits a
blocker - a dependency, an unclear requirement, a missing skill - they work around it alone rather
than asking for help. Asking for help means pulling a teammate away from their own assigned work,
so developers hesitate. Items sit idle while the assigned person waits or context-switches rather
than the team collectively resolving the blocker.
Read more: Push-Based Work Assignment
How to narrow it down
- Are work items split by technical layer? If the board shows items like “backend for
feature X” and “frontend for feature X,” the decomposition is horizontal. Start with
Horizontal Slicing.
- Do items arrive at planning without being broken down? If items go from “product owner
describes a feature” to “developer starts coding” without a decomposition step, start with
Monolithic Work Items.
- Do developers work on branches for more than a day? If yes, the branching model allows
and encourages large items. Start with
Long-Lived Feature Branches.
- Do blocked items sit idle rather than getting picked up by another team member? If work
stalls because it “belongs to” the assigned person and nobody else touches it, the assignment
model is preventing swarming. Start with
Push-Based Work Assignment.
Ready to fix this? The most common cause is Monolithic Work Items. Start with its How to Fix It section for week-by-week steps.
3 - Developer Experience Problems
Tooling friction, environment setup, local development, and codebase maintainability problems.
Symptoms related to the tools, environments, and codebase conditions that slow developers down
day to day.
3.1 - AI Tooling Slows You Down Instead of Speeding You Up
It takes longer to explain the task to the AI, review the output, and fix the mistakes than it would to write the code directly.
What you are seeing
A developer opens an AI chat window to implement a function. They spend ten minutes writing a
prompt that describes the requirements, the constraints, the existing patterns in the codebase,
and the edge cases. The AI generates code. The developer reads through it line by line because
they have no acceptance criteria to verify against. They spot that it uses a different pattern
than the rest of the codebase and misses a constraint they mentioned. They refine the prompt.
The AI produces a second version. It is better but still wrong in a subtle way. The developer
fixes it by hand. Total time: forty minutes. Writing it themselves would have taken fifteen.
This is not a one-time learning curve. It happens repeatedly, on different tasks, across the
team. Developers report that AI tools help with boilerplate and unfamiliar syntax but actively
slow them down on tasks that require domain knowledge, codebase-specific patterns, or
non-obvious constraints. The promise of “10x productivity” collides with the reality that
without clear acceptance criteria, reviewing AI output means auditing the implementation
detail by detail - which is often harder than writing the code from scratch.
Common causes
Skipping Specification and Prompting Directly
The most common cause of AI slowdown is jumping straight to code generation without
defining what the change should do. Instead of writing an intent description, BDD scenarios,
and acceptance criteria first, the developer writes a long prompt that mixes requirements,
constraints, and implementation hints into a single message. The AI guesses at the scope.
The developer reviews line by line because they have no checklist of expected behaviors. The
prompt-review-fix cycle repeats until the output is close enough.
The specification workflow from the
Agent Delivery Contract exists to
prevent this. When the developer defines the intent (what the change should accomplish), the
BDD scenarios (observable behaviors), and the acceptance criteria (how to verify correctness)
before generating code, the AI has a constrained target and the developer has a checklist.
If the specification for a single change takes more than fifteen minutes, the change is too
large - split it.
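As a sketch of what this looks like in practice, the BDD scenarios can be written as executable checks before any code is generated, so reviewing AI output means running them rather than auditing the implementation line by line. Everything below - the discount-code feature, the function name, and the rules - is invented for illustration:

```python
# Intent (written first): expired discount codes are rejected at checkout.
# The function body is what the agent would generate; the tests below are
# the specification and the review checklist. All names here are hypothetical.

def apply_discount(total, code, today):
    """Implementation target, generated against the scenarios below."""
    if code["expires"] < today:  # ISO dates compare correctly as strings
        raise ValueError("discount code expired")
    return total * (1 - code["percent"] / 100)

# BDD scenarios: one observable behavior per check.
def test_valid_code_reduces_total():
    code = {"percent": 50, "expires": "2025-12-31"}
    assert apply_discount(100.0, code, "2025-06-01") == 50.0

def test_expired_code_is_rejected():
    code = {"percent": 50, "expires": "2025-01-01"}
    try:
        apply_discount(100.0, code, "2025-06-01")
        assert False, "expected rejection"
    except ValueError:
        pass
```

With the scenarios in hand, the developer verifies AI output by running the checklist instead of re-deriving the requirements from the generated code.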
Agents can help with specification itself. The
agent-assisted specification
workflow uses agents to find gaps in your intent, draft BDD scenarios, and surface edge cases -
all before any code is generated. This front-loads the work where it is cheapest: in
conversation, not in implementation review.
Read more: Agent-Assisted Specification
Missing Working Agreements on AI Usage
When the team has no shared understanding of which tasks benefit from AI and which do not,
developers default to using AI on everything. Some tasks - writing a parser for a well-defined
format, generating test fixtures, scaffolding boilerplate - are good AI targets. Other tasks -
implementing complex business rules, debugging production issues, refactoring code with
implicit constraints - are poor AI targets because the context transfer cost exceeds the
implementation cost.
Without a shared agreement, each developer discovers this boundary independently through wasted
time.
Read more: No Shared Workflow Expectations
Knowledge Silos
When domain knowledge is concentrated in a few people, the acceptance criteria for domain-heavy
work exist only in those people’s heads. They can implement the feature faster than they can
articulate the criteria for an AI prompt. For developers who do not have the domain knowledge,
using AI is equally slow because they lack the criteria to validate the output against. Both
situations produce slowdowns for different reasons - and both trace back to domain knowledge
that has not been made explicit.
Read more: Knowledge Silos
How to narrow it down
- Are developers jumping straight to code generation without defining intent, scenarios, and
acceptance criteria first? If the prompting-reviewing-fixing cycle consistently takes
longer than direct implementation, the problem is usually skipped specification, not the AI
tool. Start with
Agent-Assisted Specification
to define what the change should do before generating code.
- Does the team have a shared understanding of which tasks are good AI targets? If
individual developers are discovering this through trial and error, the team needs working
agreements. Start with the
AI Adoption Roadmap to identify
appropriate use cases.
- Are the slowest AI interactions on tasks that require deep domain knowledge? If AI
struggles most where implicit business rules govern the implementation, the problem is
not the AI tool but the knowledge distribution. Start with
Knowledge Silos.
Ready to fix this? Start with Agent-Assisted Specification to learn the specification workflow that front-loads clarity before code generation.
3.2 - AI Is Generating Technical Debt Faster Than the Team Can Absorb It
AI tools produce working code quickly, but the codebase is accumulating duplication, inconsistent patterns, and structural problems faster than the team can address them.
What you are seeing
The team adopted AI coding tools six months ago. Feature velocity increased. But the codebase
is getting harder to work in. Each AI-assisted session produces code that works - it passes
tests, it satisfies the acceptance criteria - but it does not account for what already exists.
The AI generates a new utility function that duplicates one three files away. It introduces a
third pattern for error handling in a module that already has two. It copies a data access
approach that the team decided to move away from last quarter.
Nobody catches these issues in review because the review standard is “does it do what it
should and how do we validate it” - which is the right standard for correctness, but it does
not address structural fitness. The acceptance criteria say what the change should do. They do
not say “and it should use the existing error handling pattern” or “and it should not duplicate
the date formatting utility.”
The debt is invisible in metrics. Test coverage is stable or improving. Change failure rate is
flat. But development cycle time is creeping up because every new change must navigate around
the inconsistencies the previous changes introduced. Refactoring is harder because the AI
generated code in patterns the team did not choose and would not have written.
Common causes
No Scheduled Refactoring Sessions
AI generates code faster than humans refactor it. Without deliberate maintenance sessions
scoped to cleaning up recently touched files, the codebase drifts toward entropy faster than
it would with human-paced development. The team treats refactoring as something that happens
organically during feature work, but AI-assisted feature sessions are scoped to their
acceptance criteria and do not include cleanup.
The fix is not to allow AI to refactor during feature sessions - that mixes concerns and
makes commits unreviewable. It is to schedule explicit refactoring sessions with their own
intent, constraints, and acceptance criteria (all existing tests still pass, no behavior
changes).
Read more: Pitfalls and Metrics - Schedule refactoring as explicit sessions
No Review Gate for Structural Quality
The team’s review process validates correctness (does it satisfy acceptance criteria?) and
security (does it introduce vulnerabilities?) but not structural fitness (does it fit the
existing codebase?). Standard review agents check for logic errors, security defects, and
performance issues. None of them check whether the change duplicates existing code, introduces
a third pattern where one already exists, or violates the team’s architectural decisions.
Automating structural quality checks requires two layers in the pre-commit gate sequence.
Layer 1: Deterministic tools
Deterministic tools run before any AI review and catch mechanical structural problems without
token cost. These run in milliseconds and cannot be confused by plausible-looking but incorrect
code. Add them to the pre-commit hook sequence alongside lint and type checking:
- Duplication detection (e.g., jscpd) - flags when the same code block already exists
elsewhere in the codebase. When AI generates a utility that already exists three files away,
this catches it before review.
- Complexity thresholds (e.g., ESLint complexity rule, lizard) - flags functions that exceed
a cyclomatic complexity limit. AI-generated code tends toward deeply nested conditionals when
the prompt does not specify a complexity budget.
- Dependency and architecture rules (e.g., dependency-cruiser, ArchUnit) - encode module
boundary constraints as code. When the team decided to move away from a direct database access
pattern, architecture rules make violations a build failure rather than a code review comment.
These tools encode decisions the team has already made. Each one removes a category of
structural drift from the review queue entirely.
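A minimal sketch of this gate sequence as a pre-commit runner, failing fast at the first violation. The specific tool invocations and thresholds below are assumptions - substitute your project's actual jscpd, lizard, and dependency-cruiser commands:

```python
import subprocess

# Deterministic structural gates, run in order before any AI review.
# Commands and thresholds are illustrative assumptions, not a prescription.
GATES = [
    ("duplication", ["npx", "jscpd", "--threshold", "1", "src"]),
    ("complexity", ["lizard", "--CCN", "10", "src"]),
    ("architecture", ["npx", "depcruise", "--validate", ".dependency-cruiser.js", "src"]),
]

def run_gates(gates):
    """Run each gate in order and report the first failure.

    Failing fast keeps feedback tight for the common case and spends no
    review-agent tokens on mechanical violations.
    """
    for name, cmd in gates:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"gate '{name}' failed:\n{result.stdout}{result.stderr}")
            return name
    return None  # all gates passed
```

Wiring `run_gates(GATES)` into the pre-commit hook alongside lint and type checking makes each encoded decision a build failure rather than a review comment.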
Layer 2: Semantic review agent with architectural constraints
The semantic review agent can catch structural drift that deterministic tools cannot detect -
like a third error-handling approach in a module that already has two - but only if the feature
description includes architectural constraints. If the feature description covers only functional
requirements, the agent has no basis for evaluating structural fit.
Add a constraints section to the feature description for every change:
- “Use the existing UserRepository pattern - do not introduce new data access approaches”
- “Error handling in this module follows the Result type pattern - do not introduce exceptions”
- “New utilities belong in the shared/utils directory - do not create module-local utilities”
When the agent generates code that violates a stated constraint, the semantic review agent
flags it. Without stated constraints, the agent cannot distinguish deliberate new patterns
from drift.
The two layers are complementary. Deterministic tools handle mechanical violations fast and
cheaply. The semantic review agent handles intent alignment and pattern consistency, but only
where the feature description defines what those patterns are.
Read more: Coding and Review Agent Configuration - Semantic Review Agent
Rubber-Stamping AI-Generated Code
When developers do not own the change - cannot articulate what it does, what criteria they
verified, or how they would detect a failure - they also do not evaluate whether the change
fits the codebase. Structural quality requires someone to notice that the AI reinvented
something that already exists. That noticing only happens when a human is engaged enough with
the change to compare it against their knowledge of the existing system.
Read more: Rubber-Stamping AI-Generated Code
How to narrow it down
- Does the pre-commit gate include duplication detection, complexity limits, and
architecture rules? If the only automated structural check is lint, the gate catches
style violations but not structural drift. Add deterministic structural tools to the hook
sequence described in
Coding and Review Agent Configuration.
- Do feature descriptions include architectural constraints, not just functional
requirements? If the feature description only says what the change should do but not how
it should fit structurally, the semantic review agent has no basis for checking pattern
conformance. Start by adding constraints to the
Agent Delivery Contract.
- Is the team scheduling explicit refactoring sessions after feature work? If cleanup
only happens incidentally during feature sessions, debt accumulates with every AI-assisted
change. Start with the
Pitfalls and Metrics
guidance on scheduling maintenance sessions after every three to five feature sessions.
- Can developers identify where a new change duplicates existing code? If nobody in the
review process is comparing the AI’s output against existing utilities and patterns, the
team is not engaged enough with the change to catch structural drift. Start with
Rubber-Stamping AI-Generated Code.
Ready to fix this? Start with the pre-commit gate. Add duplication detection and architecture
rules to the hook sequence from Coding and Review Agent Configuration,
then add architectural constraints to your feature description template. These two changes automate
detection of the most common structural drift patterns on every change.
3.3 - Data Pipelines and ML Models Have No Deployment Automation
Application code has a CI/CD pipeline, but ML models and data pipelines are deployed manually or on an ad hoc schedule.
What you are seeing
ML models and data pipelines are deployed manually while application code has a full CI/CD pipeline. When a developer pushes a change to the application, tests run, an artifact is built, and deployment promotes automatically through environments. But the ML model that drives the product’s recommendations was trained two months ago and deployed by a data scientist who ran a Python script from their laptop. Nobody knows which version of the model is in production or what training data it was built on.
Data pipelines have a similar problem. The ETL job that populates the feature store was written in a Jupyter notebook, runs on a schedule via a cron job on a single server, and is updated by manually copying a new version to the server when it changes. There is no version control for the notebook, no automated tests for the pipeline logic, and no staging environment where the pipeline can be validated before it runs against production data.
Common causes
Missing deployment pipeline
The pipeline infrastructure that handles application deployments was not extended to cover model artifacts and data pipelines. Extending it requires ML-aware tooling - model registries, data versioning, training pipelines - that must be built or configured separately from standard application pipeline tools.
Establishing basic practices first - version control for pipeline code, a model registry with version tracking, automated tests for pipeline logic - creates the foundation. A minimal pipeline that validates data pipeline changes before production deployment closes the gap between how application code and model artifacts are treated, removing the dual delivery standard.
Read more: Missing deployment pipeline
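The version-tracking foundation can be sketched as a registry that records each model artifact alongside the inputs that produced it - answering "which model is in production and what was it trained on." The in-memory structure and field names below are assumptions; a real team would back this with a registry service (MLflow, for example) or versioned object storage:

```python
import hashlib
import datetime

class ModelRegistry:
    """Minimal sketch: track model versions and their provenance."""

    def __init__(self):
        self._entries = []

    def register(self, model_bytes, training_data_ref, code_commit):
        """Record a model artifact with the data and code that built it."""
        version = hashlib.sha256(model_bytes).hexdigest()[:12]
        self._entries.append({
            "version": version,
            "training_data": training_data_ref,  # which dataset built it
            "code_commit": code_commit,          # which pipeline code built it
            "registered_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        })
        return version

    def current(self):
        """The latest registered version - the deployment audit trail's head."""
        return self._entries[-1] if self._entries else None
```

Even this much closes the "nobody knows which model is in production" gap: every deployment records a content-addressed version with its training data reference.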
Manual deployments
The default for ML work is manual because the discipline of ML operations is younger than software deployment automation. Without deliberate investment in model deployment automation, manual remains the default: a data scientist deploys a model by running a script, updating a config file, or copying files to a server.
Applying the same deployment automation principles to model deployment - versioned artifacts, automated promotion, health checks after deployment - closes the gap between ML and application delivery standards.
Read more: Manual deployments
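The promotion step itself can be sketched as: deploy the new version, verify health, roll back automatically on failure. The `serve` and `health_check` callables stand in for your platform's deploy and probe mechanisms - both are assumptions:

```python
def promote(serve, health_check, new_version, current_version):
    """Deploy new_version; keep it only if the health check passes.

    serve(version) points traffic at a model version; health_check() probes
    the deployed model. Both are placeholders for real platform hooks.
    """
    serve(new_version)
    if health_check():
        return new_version          # promotion succeeded
    serve(current_version)          # automated rollback, no human in the loop
    return current_version
```

The design point is that rollback is part of the same automated step as deployment, so a failed health check never leaves a bad model serving traffic while someone finds the person who knows the manual procedure.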
Knowledge silos
Model deployment and data pipeline operations often live with specific individuals who have the expertise and the access to execute them. When those people are unavailable, model retraining, pipeline updates, and deployment operations cannot happen. The knowledge of how the ML infrastructure works is not distributed.
Documenting deployment procedures, building runbooks for model rollback, and cross-training team members on data infrastructure operations distributes the knowledge before automation is in place.
Read more: Knowledge silos
How to narrow it down
- Is the currently deployed model version tracked in version control with a record of when it was deployed? If not, there is no audit trail for model deployments. Start with Missing deployment pipeline.
- Can any engineer deploy an updated model or data pipeline, or does it require a specific person? If specific expertise is required, the knowledge is siloed. Start with Knowledge silos.
- Are data pipeline changes validated in a non-production environment before running against production data? If not, data pipeline changes go directly to production without validation. Start with Manual deployments.
Ready to fix this? The most common cause is Missing deployment pipeline. Start with its How to Fix It section for week-by-week steps.
3.4 - The Codebase No Longer Reflects the Business Domain
Business terms are used inconsistently. Domain rules are duplicated, contradicted, or implicit. No one can explain all the invariants the system is supposed to enforce.
What you are seeing
The same business concept goes by three different names in three different modules. A rule about
how orders are validated exists in the API layer, partially in a service, and also in the
database - with slight differences between them. A developer making a change to the payments flow
discovers undocumented assumptions mid-implementation and is not sure whether they are intentional
constraints or historical accidents.
New developers cannot form a coherent mental model of the domain from the code alone. They learn
by asking colleagues, but colleagues often disagree or are uncertain. The system works, mostly,
but nobody can fully explain why it is structured the way it is or what would break if a
particular constraint were removed.
Common causes
Thin-Spread Teams
When engineers rotate through a domain without staying long enough to understand its business
rules deeply, each rotation leaves its own layer of interpretation on the codebase. One team
names a concept one way. The next team introduces a parallel concept with a different name
because they did not recognize the existing one. A third team adds a validation rule without
knowing an equivalent rule already existed elsewhere. Over time the code reflects the sequence
of teams that worked in it rather than the business domain it is supposed to model.
Read more: Thin-Spread Teams
Knowledge Silos
When the canonical understanding of the domain lives in a few individuals, the code drifts from
that understanding whenever those individuals are not involved in a change. Developers without
deep domain knowledge make reasonable-seeming implementation choices that violate rules they were
never told about. The gap between what the domain expert knows and what the code expresses widens
with each change made without them.
Read more: Knowledge Silos
How to narrow it down
- Are the same business concepts named differently in different parts of the codebase? If
a developer must learn multiple synonyms for the same thing to navigate the code, the domain
model has been interpreted independently by multiple teams. Start with
Thin-Spread Teams.
- Can team members explain all the validation rules the system enforces, and do their
explanations agree? If there is disagreement or uncertainty, domain knowledge is not
shared or externalized. Start with
Knowledge Silos.
Ready to fix this? The most common cause is Knowledge Silos. Start with its How to Fix It section for week-by-week steps.
3.5 - The Development Workflow Has Friction at Every Step
Slow CI servers, poor CLI tools, and no IDE integration. Every step in the development process takes longer than it should.
What you are seeing
The CI servers are slow. A build that should take 5 minutes takes 25 because the agents are undersized and the queue is long. The IDE has no integration with the team’s testing framework, so running a specific test requires dropping to the command line and remembering the exact invocation syntax. The deployment CLI has no tab completion and cryptic error messages. The local development environment requires a 12-step ritual to restart after any configuration change.
Individual friction points seem minor in isolation. A 20-second wait is a slight inconvenience. A missing IDE shortcut is a small annoyance. But friction compounds. A developer who waits 20 seconds, remembers a command, waits 20 more seconds, then navigates an opaque error message has spent a minute on a task that should take 5 seconds. Across ten such interactions per day, across an entire team, this is a meaningful tax on throughput.
The larger cost is attentional, not temporal. Friction interrupts flow. When a developer has to stop thinking about the problem they are solving to remember a command syntax, context-switch to a different tool, or wait for an operation to complete, they lose the thread. Flow states that make complex problems tractable are incompatible with constant context switches caused by tooling friction.
Common causes
Missing deployment pipeline
Investment in pipeline tooling - build caching, parallelized test execution, automated deployment scripts with good error messages - directly reduces the friction of getting changes to production. Teams without this investment accumulate tooling debt. Each year that passes without improving the pipeline leaves a more elaborate set of workarounds in place.
A team that treats the pipeline as a first-class product, maintained and improved the same way they maintain production code, eliminates friction points incrementally. The slow CI queue, the missing IDE integration, the opaque deployment errors - each one is a bug in the pipeline product, and bugs get fixed when someone owns the product.
Read more: Missing deployment pipeline
Manual deployments
When the deployment process is manual, there is no pressure to make the tooling ergonomic. The person doing the deployment learns the steps and adapts. Automation forces the deployment process to be scripted, which creates an interface that can be improved, tested, and measured. A deployment script with good error messages and clear output is a better tool than a deployment runbook, and it can be improved as a piece of software.
Read more: Manual deployments
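A sketch of what "a deployment script with good error messages" means concretely: each step failure names the step, the command, and a next action - the ergonomics a runbook cannot provide. The step names and hint text are placeholders for your actual deployment:

```python
import subprocess
import sys

def run_step(name, cmd, hint):
    """Run one deployment step; on failure, explain what and what next.

    A failed step produces a message that names the step, shows the command
    and its output, and suggests a next action - improvable like any software.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        sys.exit(
            f"deploy failed at step '{name}'\n"
            f"  command: {' '.join(cmd)}\n"
            f"  output:  {result.stderr.strip() or result.stdout.strip()}\n"
            f"  hint:    {hint}"
        )
    print(f"ok: {name}")
```

Because the interface is code, a confusing failure message is a bug that gets fixed once, rather than tribal knowledge each deployer re-learns.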
How to narrow it down
- How long does a full pipeline run take? If builds take more than 10 minutes, build caching and parallelization are likely available but not implemented. Start with Missing deployment pipeline.
- Can a developer deploy with a single command that provides clear output? If deployment requires multiple manual steps with opaque error messages, the tooling has not been invested in. Start with Manual deployments.
- Are builds getting faster over time? If build time is stable or increasing, nobody is actively working on pipeline performance. Start with Missing deployment pipeline.
Ready to fix this? The most common cause is Missing deployment pipeline. Start with its How to Fix It section for week-by-week steps.
3.6 - Getting a Test Environment Requires Filing a Ticket
Test environments are a scarce, contended resource. Provisioning takes days and requires another team’s involvement.
What you are seeing
A developer needs a clean environment to reproduce a bug. They file a ticket with the infrastructure team requesting environment access. The ticket enters a queue. Two days later, the environment is provisioned. By that time the developer has moved on to other work, the context for the bug is cold, and the urgency has faded.
Test environments are scarce because they are expensive to create manually. The infrastructure team provisions each one by hand: configuring servers, installing dependencies, seeding databases, updating DNS. The process takes hours of skilled work. Because it takes hours, environments are treated as long-lived shared resources rather than disposable per-task resources. Multiple teams share the same staging environment, which creates contention, coordination overhead, and mysterious failures when two teams’ work interacts unexpectedly.
The team has adapted by scheduling environment usage in advance and batching testing work. These adaptations work until there is a deadline, at which point contention over shared environments becomes a delivery risk.
Common causes
Snowflake environments
When environments are configured by hand, they cannot be created on demand. The cost of creating a new environment is the same as the cost of the initial configuration: hours of skilled work. This cost makes environments permanent rather than ephemeral. Infrastructure as code and containerization make environment creation a fast, automated operation that any team member can trigger.
When environments can be created in minutes from code, they stop being scarce. A developer who needs an environment can create one, use it, and destroy it. Two teams working on conflicting features each have their own environment. Contention disappears.
Read more: Snowflake environments
Missing deployment pipeline
Pipelines that include environment provisioning steps can spin up, run tests against, and tear down ephemeral environments as part of every run. The environment is created fresh for each test run and destroyed when the run completes. Without this capability, environments are managed manually outside the pipeline and must be shared.
A pipeline with environment provisioning gives every commit its own isolated environment. There is no ticket to file, no queue to wait in, no contention with other teams - the environment exists for the duration of the run and is gone when the run completes.
Read more: Missing deployment pipeline
Knowledge silos
The knowledge of how to provision an environment lives in the infrastructure team. Until that knowledge is codified as scripts or infrastructure code, environment creation requires a human from that team. The infrastructure team becomes a bottleneck even when they are working as fast as they can.
Externalizing environment provisioning knowledge into code - reproducible, runnable by anyone - removes the dependency on the infrastructure team for routine environment needs.
Read more: Knowledge silos
How to narrow it down
- Can a developer create a new isolated test environment without filing a ticket? If not, environment creation is not self-service. Start with Snowflake environments.
- Do multiple teams share a single staging environment? Shared environments create contention and interference. Start with Missing deployment pipeline.
- Is environment provisioning knowledge documented as runnable code? If provisioning requires knowing undocumented manual steps, the knowledge is siloed. Start with Knowledge silos.
Ready to fix this? The most common cause is Snowflake environments. Start with its How to Fix It section for week-by-week steps.
3.7 - The Deployment Target Does Not Support Modern CI/CD Tooling
Mainframes or proprietary platforms require custom integration or manual steps. CD practices stop at the boundary of the legacy stack.
What you are seeing
The deployment target is a z/OS mainframe, an AS/400, an embedded device firmware platform, or a proprietary industrial control system. The standard CI/CD tools the rest of the organization uses do not support this target. The vendor’s deployment tooling is command-line based, requires a licensed runtime, and was designed around a workflow that predates modern software delivery practices.
The team’s modern application code lives in a standard git repository with a standard pipeline for the web tier. But the batch processing layer, the financial calculation engine, or the device firmware is deployed through a completely separate process involving FTP, JCL job cards, and a deployment checklist that exists as a Word document on a shared drive.
The organization’s CD practices stop at the boundary of the modern stack. The legacy platform exists in a different operational world with different tooling, different skills, different deployment cadence, and different risk models. Bridging the two worlds requires custom integration work that is unglamorous, expensive, and consistently deprioritized.
Common causes
Manual deployments
Legacy platform deployments are almost always manual. The platform predates modern deployment automation. The deployment procedure exists in documentation and in the heads of the people who have done it. Without investment in custom tooling, mainframe deployments remain manual indefinitely.
Building automation for a mainframe or proprietary platform requires understanding both the platform’s native tools and modern automation principles. The result may not look like a standard pipeline, but it can provide the same benefits: consistent, repeatable, auditable deployments that do not require a specific person.
Read more: Manual deployments
Missing deployment pipeline
A pipeline that covers the full deployment surface - modern application code, database changes, and legacy platform components - requires platform-specific extensions. Standard pipeline tools do not ship with mainframe support, but they can be extended with custom steps that invoke platform-native tools. Without this investment, the pipeline covers only the modern stack.
Building coverage incrementally - wrapping the most common deployment operations first, then expanding - is more achievable than trying to fully automate a complex legacy deployment in one effort.
Read more: Missing deployment pipeline
Knowledge silos
Mainframe and proprietary platform skills are rare and increasingly concentrated. Teams typically have one or two people who understand the platform deeply. When those people leave, the deployment process becomes opaque to everyone remaining. The knowledge that enables manual deployments is not distributed and not documented in a form anyone else can use.
Deliberately distributing platform knowledge - pair deployments, written procedures, runbooks that reflect the actual current process - reduces single-person dependency even before automation is available.
Read more: Knowledge silos
How to narrow it down
- Is there anyone on the team other than one or two people who can deploy to the legacy platform? If not, knowledge concentration is the immediate risk. Start with Knowledge silos.
- Is the legacy platform deployment automated in any way? If completely manual, automation of even one step is a starting point. Start with Manual deployments.
- Is the legacy platform deployment included in the same pipeline as modern services? If it is managed outside the pipeline, it lacks all the pipeline’s safety properties. Start with Missing deployment pipeline.
Ready to fix this? The most common cause is Manual deployments. Start with its How to Fix It section for week-by-week steps.
3.8 - Developers Cannot Run the Pipeline Locally
The only way to know if a change passes CI is to push it and wait. Broken builds are discovered after commit, not before.
What you are seeing
A developer makes a change, commits, and pushes to CI. Thirty minutes later, the build is red. A linting rule was violated. Or a test file was missing from the commit. Or the build script uses a different version of a dependency than the developer’s local machine. The developer fixes the issue and pushes again. Another wait. Another failure - this time a test that only runs in CI and not in the local test suite.
This cycle destroys focus. The developer cannot stay in flow waiting for CI results. They switch to something else, then switch back when the notification arrives. Each context switch adds recovery time. A change that took thirty minutes to write takes two hours from first commit to green build, and the developer was not thinking about it for most of that time.
The deeper issue is that CI and local development are different environments. Tests that pass locally fail in CI because of dependency version differences, missing environment variables, or test execution order differences. The developer cannot reproduce CI failures locally, which makes them much harder to debug and creates a pattern of “push and hope” rather than “validate locally and push with confidence.”
Common causes
Missing deployment pipeline
Pipelines designed for cloud-only execution - pulling from private artifact repositories, requiring CI-specific secrets, using platform-specific compute resources - cannot run locally by construction. The pipeline was designed for the CI environment and only the CI environment.
Pipelines designed with local execution in mind use tools that run identically in any environment: containerized build steps, locally runnable test commands, shared dependency resolution. A developer running the same commands locally that the pipeline runs in CI gets the same results. The feedback loop shrinks from 30 minutes to seconds.
Read more: Missing deployment pipeline
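The idea of one set of commands shared by CI and the laptop can be sketched in code. This is illustrative only - the `ci.py` entry point, the image name `myteam/build:latest`, and the step names are assumptions, not anything the pipeline tooling prescribes:

```python
"""Sketch of a pipeline runner invoked identically by CI and by developers.

Illustrative assumptions: a `ci.py` entry point committed to the repo, a
build image called `myteam/build:latest`, and three steps (lint, test, build).
"""
import subprocess

# Every step runs inside the same container image, so "works on my machine"
# and "works in CI" become the same claim.
IMAGE = "myteam/build:latest"

STEPS = {
    "lint": ["ruff", "check", "."],
    "test": ["pytest", "-q"],
    "build": ["python", "-m", "build"],
}

def command_for(step: str) -> list[str]:
    """Build the container invocation for a step. CI config and a developer's
    shell alias both call this function, so the commands cannot drift apart."""
    return ["docker", "run", "--rm", "-v", ".:/src", "-w", "/src",
            IMAGE, *STEPS[step]]

def run(step: str) -> int:
    """Execute one step; the exit code is the pipeline's verdict."""
    return subprocess.call(command_for(step))
```

A developer runs `run("test")` before pushing and gets the same result CI will report, because there is only one definition of what "test" means.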
Snowflake environments
When the CI environment differs from the developer’s local environment in ways that affect test outcomes, local and CI results diverge. Different OS versions, different dependency caches, different environment variables, different file system behaviors - any of these can cause tests to pass locally and fail in CI.
Standardized, code-defined environments that run identically locally and in CI eliminate the divergence. If the build step runs inside the same container image locally and in CI, the results are the same.
Read more: Snowflake environments
How to narrow it down
- Can a developer run every pipeline step locally? If any step requires CI-specific infrastructure, secrets, or platform features, that step cannot be validated before pushing. Start with Missing deployment pipeline.
- Do tests produce different results locally versus in CI? If yes, the environments differ in ways that affect test outcomes. Start with Snowflake environments.
- How long does a developer wait between push and feedback? If feedback takes more than a few minutes, the incentive is to batch pushes and work on something else while waiting. Start with Missing deployment pipeline.
Ready to fix this? The most common cause is Missing deployment pipeline. Start with its How to Fix It section for week-by-week steps.
3.9 - Setting Up a Development Environment Takes Days
New team members are unproductive for their first week. The setup guide is 50 steps long and always out of date.
What you are seeing
A new developer spends two days troubleshooting before the system runs locally. The wiki setup page was last updated 18 months ago. Step 7 refers to a tool that has been replaced. Step 12 requires access to a system that needs a separate ticket to provision. Step 19 assumes an operating system version that is three versions behind. Getting unstuck requires finding a teammate who has memorized the real procedure from experience.
The setup problem is not just a new-hire experience. It affects the entire team whenever someone gets a new machine, switches between projects, or tries to set up a second environment for a specific debugging purpose. The environment is fragile because it was assembled by hand and the assembly process was never made reproducible.
The business cost is usually invisible. Two days of new-hire setup is charged to onboarding. Senior engineers spending half a day helping unblock new hires is charged to sprint work. Developers who avoid setting up new environments and work around the problem are charged to productivity. None of these costs appear on a dashboard that anyone monitors.
Common causes
Snowflake environments
When development environments are not reproducible from code, the assembly process exists only in documentation (which drifts) and in the heads of people who have done it before (who are not always available). Each environment is assembled slightly differently, which means the “how to set up a development environment” question has as many answers as there are developers on the team.
When the environment definition is versioned alongside the code, setup becomes a single command. A new developer who runs that command gets the same working environment as everyone else on the team - no 18-month-old wiki page, no tribal knowledge required, no two-day troubleshooting session. When the code changes in ways that require environment changes, the environment definition is updated at the same time.
Read more: Snowflake environments
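A minimal sketch of what "environment definition versioned alongside the code" can look like. The file name, tool list, and service images here are invented for illustration; in a real repo the definition would be parsed from a committed file rather than inlined:

```python
"""Sketch of a one-command setup driven by a versioned environment definition.

Illustrative assumptions: a definition file (e.g. `dev-env.toml`) committed
next to the code, inlined here as a dict for brevity.
"""
import shutil

# Changing the environment means changing this definition - reviewed and
# versioned like any other code change, never a stale wiki page.
ENV_DEFINITION = {
    "tools": ["git", "docker", "make"],
    "services": ["postgres:16", "redis:7"],
}

def missing_tools(definition: dict) -> list[str]:
    """Report required tools not on PATH - the only manual step left
    for a new developer is installing these."""
    return [t for t in definition["tools"] if shutil.which(t) is None]

def compose_commands(definition: dict) -> list[list[str]]:
    """Everything else is derived from the definition, not from memory."""
    return [["docker", "run", "-d", image] for image in definition["services"]]
```

The point of the sketch is the shape, not the specifics: setup reads a single committed source of truth, so "how do I set up?" has exactly one answer.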
Knowledge silos
The real setup procedure exists in the heads of specific team members who have run it enough times to know which steps to skip and which to do differently on which operating systems. When those people are unavailable, setup fails. The knowledge gap is only visible when someone needs it.
When environment setup is codified as runnable scripts and containers, the knowledge is distributed to everyone who can read the code. A new developer no longer has to find the one person who remembers which steps to skip - they run the script, and it works.
Read more: Knowledge silos
Tightly coupled monolith
When running any part of the application requires the full monolith running - including all its dependencies, services, and backing infrastructure - local setup is inherently complex. A developer who only needs to work on the notification service must stand up the entire application, all its databases, and all the services the notification service depends on, which is everything.
Decomposed services with stable interfaces can be developed in isolation. A developer working on the notification service stubs the services it calls and focuses on the piece they are changing. Setup is proportional to scope.
Read more: Tightly coupled monolith
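Stubbing a dependency behind a stable interface can be sketched briefly. The names here (`PaymentsClient`, `NotificationService`) are hypothetical, chosen to match the notification-service example above:

```python
"""Sketch: developing one service in isolation by stubbing what it calls.

Illustrative assumptions: a notification service that depends on a payments
service only through a small, stable interface.
"""
from typing import Protocol

class PaymentsClient(Protocol):
    """The stable interface the notification service depends on."""
    def invoice_total(self, customer_id: str) -> int: ...

class StubPayments:
    """Local stand-in: no database, no network, no rest of the monolith."""
    def __init__(self, totals: dict[str, int]):
        self.totals = totals

    def invoice_total(self, customer_id: str) -> int:
        # Cents owed, or zero for unknown customers.
        return self.totals.get(customer_id, 0)

class NotificationService:
    def __init__(self, payments: PaymentsClient):
        self.payments = payments

    def overdue_message(self, customer_id: str) -> str:
        total = self.payments.invoice_total(customer_id)
        return f"Your outstanding balance is ${total / 100:.2f}"
```

With a stub like this, the developer changing notification logic runs and tests only the piece they are changing; setup is proportional to scope.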
How to narrow it down
- Can a new team member set up a working development environment without help? If not, the setup process is not self-contained. Start with Snowflake environments.
- Does setup require tribal knowledge that is not captured in the documented procedure? If team members need to “fill in the gaps” from memory, that knowledge needs to be externalized. Start with Knowledge silos.
- Does running a single service require running the entire application? If so, local development is inherently complex. Start with Tightly coupled monolith.
Ready to fix this? The most common cause is Snowflake environments. Start with its How to Fix It section for week-by-week steps.
3.10 - Bugs in Familiar Areas Take Disproportionately Long to Fix
Defects that should be straightforward take days to resolve because the people debugging them are learning the domain as they go. Fixes sometimes introduce new bugs in the same area.
What you are seeing
A bug is filed against the billing module. It looks simple from the outside - a calculation is
off by a percentage in certain conditions. The developer assigned to it spends a day reading code
before they can even reproduce the problem reliably. The fix takes another day. Two weeks later,
a related bug appears: the fix was correct for the case it addressed but violated an assumption
elsewhere in the module that nobody told the developer about.
Defect resolution time in specific areas of the system is consistently longer than in others.
Post-mortems note that the fix was made by someone unfamiliar with the domain. Bugs cluster in
the same modules, with fixes that address the symptom rather than the underlying rule that was
violated.
Common causes
Knowledge Silos
When only a few people understand a domain deeply, defects in that domain can only be resolved
quickly by those people. When they are unavailable - on leave, on another team, or gone - the
bug sits or gets assigned to someone who must reconstruct context before they can make progress.
The reconstruction is slow, incomplete, and prone to introducing new violations of rules the
developer discovers only after the fact.
Read more: Knowledge Silos
Thin-Spread Teams
When engineers are rotated through a domain based on capacity, the person available to fix a bug
is often not the person who knows the domain. They are familiar with the tech stack but not with
the business rules, edge cases, and historical decisions that make the module behave the way it
does. Debugging becomes an exercise in reverse-engineering domain knowledge from code that may
not accurately reflect the original intent.
Read more: Thin-Spread Teams
How to narrow it down
- Are defect resolution times consistently longer in specific modules than in others? If
certain areas of the system take significantly longer to debug regardless of defect severity,
those areas have a knowledge concentration problem. Start with
Knowledge Silos.
- Do fixes in certain areas frequently introduce new bugs in the same area? If corrections
create new violations, the developer fixing the bug lacks the domain knowledge to understand
the full set of constraints they are working within. Start with
Thin-Spread Teams.
Ready to fix this? The most common cause is Knowledge Silos. Start with its How to Fix It section for week-by-week steps.
4 - Team and Knowledge Problems
Team stability, knowledge transfer, collaboration, and shared practices problems.
Symptoms related to team composition, knowledge distribution, and how team members work
together.
4.1 - The Team Has No Shared Working Hours Across Time Zones
Code reviews wait overnight. Questions block for 12+ hours. Async handoffs replace collaboration.
What you are seeing
A developer in London finishes a piece of work at 5 PM and creates a pull request. The reviewer in San Francisco is starting their day but has morning meetings and gets to the review at 2 PM Pacific - which is 10 PM London time. The author is offline. The reviewer leaves comments. The author responds the following morning. The review cycle takes four days for a change that would have taken 20 minutes with any overlap.
Integration conflicts sit unresolved for hours. The developer who could resolve the conflict is asleep when it is discovered. By the time they wake up, the main branch has moved further. Resolving the conflict now requires understanding changes made by multiple people across multiple time zones, none of whom are available simultaneously to sort it out.
The team has adapted with async-first practices: detailed PR descriptions, recorded demos, comprehensive written documentation. These adaptations reduce the cost of asynchrony but do not eliminate it. The team’s throughput is bounded by communication latency, and the work items that require back-and-forth are the most expensive.
Common causes
Long-lived feature branches
Long-lived branches mean that integration conflicts are larger and more complex when they finally surface. Resolving a small conflict asynchronously is tolerable. Resolving a three-day branch merge asynchronously is genuinely difficult - the changes are large, the context for each change is spread across people in different time zones, and the resolution requires understanding decisions made by people who are not available.
Frequent, small integrations to trunk reduce conflict size. A conflict that would have been 500 lines with a week-old branch is 30 lines when branches are integrated daily.
Read more: Long-lived feature branches
Monolithic work items
Large items create larger diffs, more complex reviews, and more integration conflicts. In a distributed team, the time cost of large items is amplified by communication overhead. A review that requires one round of comments takes one day in a distributed team. A review that requires three rounds takes three days. Large items that require extensive review are expensive by construction.
Small items have small diffs. Small diffs require fewer review rounds. Fewer review rounds means faster cycle time even with the communication latency of a distributed team.
Read more: Monolithic work items
Knowledge silos
When critical knowledge lives in one person and that person is in a different time zone, questions block for 12 or more hours. The developer in Singapore who needs to ask the database expert in London waits overnight for each exchange. Externalizing knowledge into documentation, tests, and code comments reduces the per-question communication overhead.
When the answer to a common question is in a runbook, a developer does not need to wait for the one person who knows. The knowledge is available regardless of time zone.
Read more: Knowledge silos
How to narrow it down
- What is the average number of review round-trips for a pull request? Each round-trip adds approximately one day of latency in a distributed team. Reducing item size reduces review complexity. Start with Monolithic work items.
- How often do integration conflicts require synchronous discussion to resolve? If conflicts regularly need a real-time conversation, they are large enough that asynchronous resolution is impractical. Start with Long-lived feature branches.
- Do developers regularly wait overnight for answers to questions? If yes, the knowledge needed for daily work is not accessible without specific people. Start with Knowledge silos.
Ready to fix this? The most common cause is Long-lived feature branches. Start with its How to Fix It section for week-by-week steps.
4.2 - Retrospectives Produce No Real Change
The same problems surface every sprint. Action items are never completed. The team has stopped believing improvement is possible.
What you are seeing
The same themes come up every sprint: too much interruption, unclear requirements, flaky tests, blocked items. The retrospective runs every two weeks. Action items are assigned. Two weeks later, none of them have been completed because sprint work took priority. The same themes come up again. Someone adds them to the growing backlog of process improvements.
The team goes through the motions because the meeting is scheduled, not because they believe it will produce change. Participation is minimal. The facilitator works harder each time to generate engagement. The conversation stays surface-level because raising real problems feels pointless - nothing changes anyway.
The dysfunction runs deeper than meeting format. There is no capacity allocated for improvement work. Every sprint is 100% allocated to feature delivery. Action items that require real investment - automated deployment, test infrastructure, architectural cleanup - compete for time against items with committed due dates. The outcome is predetermined: features win.
Common causes
Unbounded WIP
When the team has more work in progress than capacity, no sprint has slack. Action items from retrospectives require slack to complete. Without slack, improvement work is always displaced by feature work. The team is too busy to get less busy.
Creating and protecting capacity for improvement work is the prerequisite for retrospectives to produce change. Teams that allocate a fixed percentage of each sprint to improvement work - and defend it against feature pressure - actually complete their retrospective action items.
Read more: Unbounded WIP
Push-based work assignment
When work is assigned to the team from outside, the team has no authority over their own capacity allocation. They cannot protect time for improvement work because the queue is filled by someone else. Even if the team agrees in the retrospective that test automation is the priority, the next sprint’s work arrives already planned with no room for it.
Teams that pull work from a prioritized backlog and control their own capacity can make and honor commitments to improvement work. The retrospective can produce action items that the team has the authority to complete.
Read more: Push-based work assignment
Deadline-driven development
When management drives to fixed deadlines, all available capacity goes toward meeting the deadline. Improvement work that does not advance the deadline has no chance. The retrospective can surface the same problems indefinitely, but if the team has no capacity to address them and no organizational support to get that capacity, improvement is structurally impossible.
Read more: Deadline-driven development
How to narrow it down
- Are retrospective action items ever completed? If not, capacity is the first issue to examine. Start with Unbounded WIP.
- Does the team control how their sprint capacity is allocated? If improvement work must compete against externally assigned feature work, the team lacks the authority to act on retrospective outcomes. Start with Push-based work assignment.
- Is the team under sustained deadline pressure with no slack? If the team is always in crunch, improvement work has no room regardless of capacity or authority. Start with Deadline-driven development.
Ready to fix this? The most common cause is Unbounded WIP. Start with its How to Fix It section for week-by-week steps.
4.3 - The Team Has No Shared Agreements About How to Work
No explicit agreements on branch lifetime, review turnaround, WIP limits, or coding standards. Everyone does their own thing.
What you are seeing
Half the team uses feature branches; half commits directly to main. Some developers expect code reviews to happen within a few hours; others consider three days fast. Some engineers put every change through a full review; others self-merge small fixes. The WIP limit is nominally three items per person, but nobody enforces it and most people carry five or six.
These inconsistencies create friction that is hard to name. Pull requests sit because there is no shared expectation for turnaround. Work items age because there is no agreement about WIP limits. Code quality varies because there is no agreement about review standards. The team functions, but at a lower level of coordination than it could with explicit norms.
The problem compounds as the team grows or becomes more distributed. A two-person co-located team can operate on implicit norms that emerge from constant communication. A six-person distributed team cannot. Without explicit agreements, each person operates on different mental models formed by prior team experiences.
Common causes
Push-based work assignment
When work is assigned to individuals by a manager or lead, team members operate as independent contributors rather than as a team managing flow together. Shared workflow norms only emerge meaningfully when the team experiences work as a shared responsibility - when they pull from a common queue, track shared flow metrics, and collectively own the delivery outcome.
Teams that pull work from a shared backlog develop shared norms because they need those norms to function - without agreement on review turnaround and WIP limits, pulling from the same queue becomes chaotic. When work is individually assigned, each person optimizes for their assigned items, not for team flow, and the shared agreements never form.
Read more: Push-based work assignment
Unbounded WIP
When there are no WIP limits, every norm around flow is implicitly optional. If work can always be added without limit, discipline around individual items erodes. “I’ll review that PR later” is always a reasonable response when there is always more work competing for attention.
WIP limits create the conditions where norms matter. When the team is committed to a WIP limit, review turnaround, merge cadence, and integration frequency become practical necessities rather than theoretical preferences.
Read more: Unbounded WIP
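The difference between a stated preference and an enforced rule can be made concrete. This `Board` class and the limit of two are illustrative, not a prescription:

```python
"""Sketch of a WIP limit as an enforced rule rather than a stated preference.

Illustrative assumptions: a simple team board with a hard per-team limit.
"""

class WipLimitExceeded(Exception):
    pass

class Board:
    def __init__(self, wip_limit: int):
        self.wip_limit = wip_limit
        self.in_progress: list[str] = []
        self.done: list[str] = []

    def pull(self, item: str) -> None:
        # Starting new work is only allowed when there is room - which makes
        # finishing or reviewing an existing item the fastest way to start
        # your own. The norm becomes a practical necessity.
        if len(self.in_progress) >= self.wip_limit:
            raise WipLimitExceeded(f"finish something first: {self.in_progress}")
        self.in_progress.append(item)

    def finish(self, item: str) -> None:
        self.in_progress.remove(item)
        self.done.append(item)
```

When the limit is enforced, "I'll review that PR later" stops being free: the blocked pull makes the cost of unfinished work visible immediately.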
Thin-spread teams
Teams spread across many responsibilities often lack the continuous interaction needed to develop and maintain shared norms. Each member is operating in a different context, interacting with different parts of the codebase, working with different constraints. Common ground for shared agreements is harder to establish when everyone’s daily experience is different.
Read more: Thin-spread teams
How to narrow it down
- Does the team have written working agreements that everyone follows? If agreements are verbal or assumed, they will diverge under pressure. The absence of written agreements is the starting point.
- Do team members pull from a shared queue or receive individual assignments? Individual assignment reduces team-level flow ownership. Start with Push-based work assignment.
- Does the team enforce WIP limits? Without enforced limits, work accumulates until norms break down. Start with Unbounded WIP.
Ready to fix this? The most common cause is Push-based work assignment. Start with its How to Fix It section for week-by-week steps.
4.4 - The Same Mistakes Happen in the Same Domain Repeatedly
Post-mortems and retrospectives show the same root causes appearing in the same areas. Each new team makes decisions that previous teams already tried and abandoned.
What you are seeing
A post-mortem reveals that the payments module failed in the same way it failed eighteen months
ago. The fix applied then was not documented, and the developer who applied it is no longer on
the team. A retrospective surfaces a proposal to split the monolith into services - a direction
the team two rotations ago evaluated and rejected for reasons nobody on the current team knows.
The same conversations happen repeatedly. The same edge cases get missed. The same architectural
directions get proposed, piloted, and quietly abandoned without any record of why. Each new group
treats the domain as a fresh problem rather than building on what was learned before.
Common causes
Thin-Spread Teams
When engineers are rotated through a domain based on capacity rather than staying long enough to
build expertise, institutional memory does not accumulate. The decisions, experiments, and hard
lessons from previous rotations leave with those developers. The next group inherits the code but
not the understanding of why it is structured the way it is, what was tried before, or what the
failure modes are. They are likely to repeat the same exploration, reach the same dead ends, and
make the same mistakes.
Read more: Thin-Spread Teams
Knowledge Silos
When knowledge about a domain lives only in specific individuals, it evaporates when they leave.
Architectural decision records, runbooks, and documented post-mortem outcomes are the
externalized forms of that knowledge. Without them, every departure is a partial reset. The
remaining team cannot distinguish between “we haven’t tried that” and “we tried that and here
is what happened.”
Read more: Knowledge Silos
How to narrow it down
- Do post-mortems show the same root causes in the same areas of the system? If recurring
incidents map to the same modules and the fixes do not persist, the team is not accumulating
learning. Start with Thin-Spread Teams.
- Are architectural proposals evaluated without knowledge of what was tried before? If
the team cannot answer “was this approach considered previously, and what happened,” decisions
are being made without institutional memory. Start with
Knowledge Silos.
Ready to fix this? The most common cause is Knowledge Silos. Start with its How to Fix It section for week-by-week steps.
4.5 - Delivery Slows Every Time the Team Rotates
A new developer joins or is flexed in and delivery slows for weeks while they learn the domain. The pattern repeats with every rotation.
What you are seeing
A developer is moved onto the team because there is capacity there and they know the tech stack.
For the first two to three weeks, velocity drops. Simple changes take longer than expected
because the new person is learning the domain while doing the work. They ask questions that
previous team members would have answered instantly. They make safe, conservative choices to
avoid breaking something they don’t fully understand.
Then the rotation ends or another team member is pulled away, and the cycle starts again. The
team never fully recovers its pre-rotation pace before the next disruption. Velocity measured
across a quarter looks flat even though the team is working as hard as ever.
Common causes
Thin-Spread Teams
When engineers are treated as interchangeable capacity and moved to where utilization is needed,
the team never develops stable domain expertise. Each rotation brings someone who knows the
technology but not the business rules, the data model quirks, the historical decisions, or the
failure modes that prior members learned through experience. The knowledge required to deliver
quickly in a domain cannot be acquired in days. It accumulates over months of working in it.
Read more: Thin-Spread Teams
Knowledge Silos
When domain knowledge lives in individuals rather than in documentation, runbooks, and code
structure, it is not available to the next person who joins. The new team member must reconstruct
understanding that the previous person carried in their head. Every rotation restarts that
reconstruction from scratch.
Read more: Knowledge Silos
How to narrow it down
- Does velocity measurably drop for several weeks after a team change? If the pattern is
consistent and repeatable, the team’s delivery speed depends on individual domain knowledge
rather than shared, documented understanding. Start with
Thin-Spread Teams.
- Is domain knowledge written down or does it live in specific people? If new team members
learn by asking colleagues rather than reading documentation, the knowledge is not externalized.
Start with Knowledge Silos.
Ready to fix this? The most common cause is Thin-Spread Teams. Start with its How to Fix It section for week-by-week steps.
4.6 - Team Membership Changes Constantly
Members are frequently reassigned to other projects. There are no stable working agreements or shared context.
What you are seeing
The team roster changes every quarter. Engineers are pulled to other projects because they have relevant expertise, or they move to new teams as part of organizational restructuring. New members join but onboarding is informal - there is no written record of how the team works, what decisions were made and why, or what the technical context is.
The CD migration effort restarts with every significant roster change. New members bring different mental models and prior experiences. Practices the team adopted with care - trunk-based development, WIP limits, short-lived branches - get questioned by each new cohort who did not experience the problems those practices were designed to solve. The team keeps relitigating settled decisions instead of making progress.
The organizational pattern treats individual contributors as interchangeable resources. An engineer with payment domain expertise can be moved to the infrastructure team because the headcount numbers work out. The cost of that move - lost context, restarted relationships, degraded team performance for months - is invisible to the planning process that made the decision.
Common causes
Knowledge silos
When knowledge lives in individuals rather than in team practices, documentation, and code, departures create immediate gaps. The cost of reassignment is higher when the departing person carries critical knowledge that was never externalized. Losing one person does not just reduce capacity by one; it can reduce effective capability by much more if that person was the only one who understood a critical system or practice.
Teams that externalize knowledge into runbooks, architectural decision records, and documented practices distribute the cost of any individual departure. No single person’s absence leaves a critical gap. When a new cohort joins, the documented decisions and rationale are already there - the team stops relitigating trunk-based development and WIP limits because the record of why those choices were made is readable, not verbal.
Read more: Knowledge silos
Unbounded WIP
Teams with too much in progress are more likely to have members pulled to other projects, because they appear to have capacity even when they are spread thin. If a developer is working on five things simultaneously, moving them to another project looks like it frees up a resource. The depth of their contribution to each item is invisible to the person making the assignment decision.
WIP limits make the team’s actual capacity visible. When each person is focused on one or two things, it is clear that they are fully engaged and that removing them would directly impact those items. The reassignments that have been disrupting the team’s CD progress become less frequent because the real cost is finally visible to whoever is making the staffing decision.
Read more: Unbounded WIP
Thin-spread teams
When a team’s members are already distributed across many responsibilities, any departure creates disproportionate impact. Thin-spread teams have no redundancy to absorb turnover. Each person’s departure leaves a hole in a different area of the team’s responsibility surface.
Teams with focused, overlapping responsibilities can absorb turnover because multiple people share each area of responsibility. Redundancy is built in rather than assumed to exist. When a member is reassigned, the team’s work continues without a collapse in that area - the constant restart cycle that has been stalling the CD migration does not recur with every roster change.
Read more: Thin-spread teams
Push-based work assignment
When work is assigned by specialty - “you’re the database person, so you take the database stories” - knowledge concentrates in individuals rather than spreading across the team. The same person always works the same area, so only they understand it deeply. When that person is reassigned or leaves, no one else can continue their work without starting over. Push-based assignment continuously deepens the knowledge silos that make every roster change more disruptive.
Read more: Push-based work assignment
How to narrow it down
- Is critical system knowledge documented or does it live in specific individuals? If departures create knowledge gaps, the team has knowledge silos regardless of who leaves. Start with Knowledge silos.
- Does the team appear to have capacity because members are spread across many items? High WIP makes team members look available for reassignment. Start with Unbounded WIP.
- Is each team member the sole owner of a distinct area of the team’s work? If so, any departure leaves an unowned responsibility. Start with Thin-spread teams.
- Is work assigned by specialty so the same person always works the same area? If departures leave knowledge gaps in specific parts of the system, assignment by specialty is reinforcing the silos. Start with Push-based work assignment.
Ready to fix this? The most common cause is Knowledge silos. Start with its How to Fix It section for week-by-week steps.