Test Suite Problems
Symptoms related to test reliability, coverage effectiveness, speed, and environment consistency.
These symptoms indicate problems with your testing strategy. Unreliable or slow tests erode
confidence and slow delivery. Each page describes what you are seeing and links to the
anti-patterns most likely causing it.
How to use this section
Start with the symptom that matches what your team experiences. Each symptom page explains what
you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic
questions to narrow down which cause applies to your situation. Follow the anti-pattern link to
find concrete fix steps.
Related anti-pattern categories: Testing Anti-Patterns, Pipeline Anti-Patterns
Related guide: Testing Fundamentals
1 - AI-Generated Code Ships Without Developer Understanding
Developers accept AI-generated code without verifying it against acceptance criteria, and functional bugs and security vulnerabilities reach production unchallenged.
What you are seeing
A developer asks an AI assistant to implement a feature. The generated code looks plausible.
The tests pass. The developer commits it. Two weeks later, a security review finds the code
accepts unsanitized input in a path nobody specified as an acceptance criterion. When asked
what the change was supposed to do, the developer says, “It implements the feature.” When
asked how they validated it, they say, “The tests passed.”
This is not an occasional gap. It is a pattern. Developers use AI to produce code faster, but
they do not define what “correct” means before generating code, verify the output against
specific acceptance criteria, or consider how they would detect a failure in production. The
code compiles. The tests pass. Nobody validated it against the actual requirements.
The symptoms compound over time. Defects appear in AI-generated code that the team cannot
diagnose quickly because nobody defined what the code was supposed to do beyond “implement
the feature.” Fixes are made by asking the AI to fix its own output without re-examining the
original acceptance criteria. Security vulnerabilities - injection flaws, broken access
controls, exposed credentials - ship because nobody asked “what are the security constraints
for this change?” before or after generation.
Common causes
Rubber-Stamping AI-Generated Code
When there is no expectation that developers own what a change does and how they validated it -
regardless of who or what wrote the code - AI output gets the same cursory glance as a trivial
formatting change. The team treats “AI wrote it and the tests pass” as sufficient evidence of
correctness. It is not. Passing tests prove the code satisfies the test cases. They do not
prove the code meets the actual requirements or handles the constraints the team cares about.
Read more: Rubber-Stamping AI-Generated Code
Missing Acceptance Criteria
When the work item lacks concrete acceptance criteria - specific inputs, expected outputs,
security constraints, edge cases - neither the developer nor the AI has a clear target. The AI
generates something that looks right. The developer has no checklist to verify it against. The
review is a subjective “does this seem okay?” rather than an objective “does this satisfy every
stated requirement?”
Read more: Monolithic Work Items
Inverted Test Pyramid
When the test suite relies heavily on end-to-end tests and lacks targeted unit and functional
tests, AI-generated code can pass the suite without its internal logic being verified. A
comprehensive functional test suite would catch the cases where the AI’s implementation
diverges from the domain rules. Without it, “tests pass” is a weak signal.
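To make this concrete, here is a minimal pytest sketch of a functional test that pins a domain
rule to specific inputs and outputs. The pricing rule and every name in it are hypothetical -
the point is that an AI-generated implementation either satisfies these cases or it does not:

```python
# Hypothetical domain rule: orders of 10 or more units get a 15% bulk
# discount, and a promotional price is never discounted further.
def unit_price(base_cents: int, quantity: int, promo_cents: int | None = None) -> int:
    if promo_cents is not None:
        return promo_cents
    if quantity >= 10:
        return round(base_cents * 0.85)
    return base_cents

def test_bulk_discount_applies_at_threshold():
    assert unit_price(10_000, 10) == 8_500

def test_no_discount_below_threshold():
    assert unit_price(10_000, 9) == 10_000

def test_promo_price_is_never_discounted_further():
    assert unit_price(10_000, 50, promo_cents=9_000) == 9_000
```

Tests like these double as the acceptance criteria the review can check against.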
Read more: Inverted Test Pyramid
How to narrow it down
- Can developers explain what their recent changes do and how they validated them? Pick
three recent AI-assisted commits at random and ask the committing developer: what does this
change accomplish, what acceptance criteria did you verify, and how would you detect if it
were wrong? If they cannot answer, the review process is not catching unexamined code.
Start with
Rubber-Stamping AI-Generated Code.
- Do your work items include specific, testable acceptance criteria before implementation
starts? If acceptance criteria are vague or added after the fact, neither the AI nor the
developer has a clear target. Start with
Monolithic Work Items.
- Does your test suite include functional tests that verify business rules with specific
inputs and outputs? If the suite is mostly end-to-end or integration tests, AI-generated
code can satisfy them without being correct at the rule level. Start with
Inverted Test Pyramid.
Ready to fix this? The most common cause is Rubber-Stamping AI-Generated Code. Start with its How to Fix It section for week-by-week steps.
2 - Tests Pass in One Environment but Fail in Another
Tests pass locally but fail in CI, or pass in CI but fail in staging. Environment differences cause unpredictable failures.
What you are seeing
A developer runs the tests locally and they pass. They push to CI and the same tests fail. Or the
CI pipeline is green but the tests fail in the staging environment. The failures are not caused by
a code defect. They are caused by differences between environments: a different OS version, a
different database version, a different timezone setting, a missing environment variable, or a
service that is available locally but not in CI.
The developer spends time debugging the failure and discovers the root cause is environmental, not
logical. They add a workaround (skip the test in CI, add an environment check, adjust a timeout)
and move on. These workarounds accumulate over time. The test suite becomes littered with
environment-specific conditionals and skipped tests.
The team loses confidence in the test suite because results depend on where the tests run rather
than whether the code is correct.
Common causes
Snowflake Environments
When each environment is configured by hand and maintained independently, they drift apart over
time. The developer’s laptop has one version of a database driver. The CI server has another. The
staging environment has a third. These differences are invisible until a test exercises a code
path that behaves differently across versions. The fix is not to harmonize configurations manually
(they will drift again) but to provision all environments from the same infrastructure code.
Read more: Snowflake Environments
Manual Deployments
When deployment and environment setup are manual processes, subtle differences creep in. One
developer installed a dependency a particular way. The CI server was configured by a different
person with slightly different settings. The staging environment was set up months ago and has not
been updated. Manual processes are never identical twice, and the variance causes
environment-dependent behavior.
Read more: Manual Deployments
Tightly Coupled Monolith
When the application has hidden dependencies on external state (filesystem paths, network
services, system configuration), tests that work in one environment fail in another because the
external state differs. Well-isolated code with explicit dependencies is portable across
environments. Tightly coupled code that reaches into its environment for implicit dependencies is
fragile.
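A minimal sketch of the difference, using pytest's built-in tmp_path fixture (the config loader
and its path are hypothetical):

```python
from pathlib import Path

# Fragile: reaches into its environment for an implicit dependency.
# This passes or fails depending on what happens to exist on the host.
def load_config_implicit() -> str:
    return Path("/etc/myapp/config.json").read_text()

# Portable: the dependency is explicit, so every environment - laptop,
# CI, staging - can supply its own.
def load_config(config_path: Path) -> str:
    return config_path.read_text()

def test_load_config(tmp_path: Path):
    cfg = tmp_path / "config.json"
    cfg.write_text('{"timeout": 30}')
    assert load_config(cfg) == '{"timeout": 30}'
```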
Read more: Tightly Coupled Monolith
How to narrow it down
- Are all environments provisioned from the same infrastructure code? If not, environment
drift is the most likely cause. Start with
Snowflake Environments.
- Are environment setup and configuration manual? If different people configured different
environments, the variance is a direct result of manual processes. Start with
Manual Deployments.
- Do the failing tests depend on external services, filesystem paths, or system
configuration? If tests assume specific external state rather than declaring explicit
dependencies, the code’s coupling to its environment is the issue. Start with
Tightly Coupled Monolith.
Ready to fix this? The most common cause is Snowflake Environments. Start with its How to Fix It section for week-by-week steps.
3 - High Coverage but Tests Miss Defects
Test coverage numbers look healthy but defects still reach production.
What you are seeing
Your dashboard shows 80% or 90% code coverage, but bugs keep getting through. Defects show up
in production that feel like they should have been caught. The team points to the coverage
number as proof that testing is solid, yet the results tell a different story.
People start losing trust in the test suite. Some developers stop running tests locally because
they do not believe the tests will catch anything useful. Others add more tests, pushing
coverage higher, without the defect rate improving.
Common causes
Inverted Test Pyramid
When most of your tests are end-to-end or integration tests, they exercise many code paths in a
single run - which inflates coverage numbers. But these tests often verify that a workflow
completes without errors, not that each piece of logic produces the correct result. A test that
clicks through a form and checks for a success message covers dozens of functions without
validating any of them in detail.
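A sketch of how this happens (all names are hypothetical): one workflow-level test executes every
line of three functions while validating none of their results:

```python
def validate(order: dict) -> None:
    if not order.get("items"):
        raise ValueError("empty order")

def price(order: dict) -> int:
    return sum(item["cents"] * item["qty"] for item in order["items"])

def persist(order: dict) -> None:
    order["saved"] = True  # stand-in for a database write

def checkout(order: dict) -> dict:
    validate(order)
    total = price(order)
    persist(order)
    return {"status": "ok", "total": total}

# Covers every line of validate, price, and persist - and would still
# pass if price() doubled every total.
def test_checkout_completes():
    order = {"items": [{"cents": 500, "qty": 2}]}
    assert checkout(order)["status"] == "ok"
```

A unit test asserting that price(order) equals 1000 would catch what this one cannot.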
Read more: Inverted Test Pyramid
Pressure to Skip Testing
When teams face pressure to hit a coverage target, testing becomes theater. Developers write
tests with trivial assertions - checking that a function returns without throwing, or that a
value is not null - just to get the number up. The coverage metric looks healthy, but the tests
do not actually verify behavior. They exist to satisfy a gate, not to catch defects.
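The pattern looks like this in miniature (the helper and its bug are hypothetical):

```python
from datetime import date

def days_until(deadline: date, today: date) -> int:
    # Hypothetical off-by-one bug: should be (deadline - today).days
    return (deadline - today).days + 1

# Theater: raises coverage, passes regardless of the bug.
def test_days_until_returns_value():
    assert days_until(date(2024, 1, 10), date(2024, 1, 1)) is not None

# Behavior: pins the expected outcome and exposes the bug.
def test_days_until_counts_days():
    assert days_until(date(2024, 1, 10), date(2024, 1, 1)) == 9
```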
Read more: Pressure to Skip Testing
Code Coverage Mandates
When the organization gates the pipeline on a coverage target, teams optimize for the number
rather than for defect detection. Developers write assertion-free tests, cover trivial code, or
add single integration tests that execute hundreds of lines without validating any of them. The
coverage metric rises while the tests remain unable to catch meaningful defects.
Read more: Code Coverage Mandates
Manual Testing Only
When test automation is absent or minimal, teams sometimes generate superficial tests or rely on
coverage from integration-level runs that touch many lines without asserting meaningful outcomes.
The coverage tool counts every line that executes, regardless of whether any test validates the
result.
Read more: Manual Testing Only
How to narrow it down
- Do most tests assert on behavior and expected outcomes, or do they just verify that code
runs without errors? If tests mostly check for no-exceptions or non-null returns, the
problem is testing theater - tests written to hit a number, not to catch defects. Start with
Pressure to Skip Testing.
- Are the majority of your tests end-to-end or integration tests? If most of the suite runs
through a browser, API, or multi-service flow rather than testing units of logic directly,
start with Inverted Test Pyramid.
- Does the pipeline gate on a specific coverage percentage? If the team writes tests
primarily to keep coverage above a mandated threshold, start with
Code Coverage Mandates.
- Were tests added retroactively to meet a coverage target? If the bulk of tests were
written after the code to satisfy a coverage gate rather than to verify design decisions,
start with
Pressure to Skip Testing.
Ready to fix this? The most common cause is Code Coverage Mandates. Start with its How to Fix It section for week-by-week steps.
4 - A Large Codebase Has No Automated Tests
Zero test coverage in a production system being actively modified. Nobody is confident enough to change the code safely.
What you are seeing
Every modification to this codebase is a gamble. The system has no automated tests. Changes are validated through manual testing, if they are validated at all. Developers work carefully but know that any change could trigger failures in code they did not touch, because the system has no seams and no isolation. The only way to know if a change works is to deploy it and observe what breaks.
Refactoring is effectively off the table. Improving the design of the code requires changing it in ways that should not alter behavior - but with no tests, there is no way to verify that behavior was preserved. Developers choose to add code around existing code rather than improve it, because change is unsafe. The codebase grows more complex with every feature because improving the underlying structure carries too much risk.
The team knows the situation is unsustainable but cannot see a path out. “We should write tests” appears in every retrospective. The problem is that adding tests to an untestable codebase requires refactoring first - and refactoring safely requires tests. The team is stuck in a loop with no obvious entry point.
Common causes
Manual Testing Only
The team has relied on manual testing as the primary quality gate. Automated tests were never required, never prioritized, and never resourced. The codebase was built without testability as a design constraint, which means the architecture does not accommodate automated testing without structural change.
Making the transition requires a deliberate commitment: new code is always written with tests, existing code gets tests when it is modified, and high-risk areas are prioritized for retrofitted coverage. Over months, the areas of the codebase where developers cannot safely make changes shrink, and the cycle of deploying to discover breakage is replaced by a test suite that catches failures before production.
Read more: Manual Testing Only
Tightly Coupled Monolith
Code without dependency injection, without interfaces, and without clear module boundaries cannot be tested without a major structural overhaul. Every function calls other functions directly. Every component reaches into every other component. Writing a test for one function requires instantiating the entire system.
Introducing seams - interfaces, dependency injection, module boundaries - makes code testable. This work is not glamorous and its value is invisible until tests start getting written. But it is the prerequisite for meaningful test coverage in a tightly coupled system. Once the seams exist, functions can be tested in isolation rather than requiring a full application instantiation - and developers stop needing to deploy to find out if a change is safe.
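As a sketch of what a seam looks like in practice - every name here is hypothetical - the
function below declares its collaborators as parameters, so a test can supply lightweight
doubles instead of booting the real database and mailer:

```python
from typing import Protocol

class OrderStore(Protocol):
    def mark_shipped(self, order_id: str) -> None: ...

class Notifier(Protocol):
    def send(self, recipient: str, message: str) -> None: ...

# The seam: collaborators are injected rather than constructed inside.
def ship_order(order_id: str, recipient: str, store: OrderStore, notifier: Notifier) -> None:
    store.mark_shipped(order_id)
    notifier.send(recipient, f"Order {order_id} has shipped")

class FakeStore:
    def __init__(self) -> None:
        self.shipped: list[str] = []
    def mark_shipped(self, order_id: str) -> None:
        self.shipped.append(order_id)

class FakeNotifier:
    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []
    def send(self, recipient: str, message: str) -> None:
        self.sent.append((recipient, message))

def test_ship_order_marks_and_notifies():
    store, notifier = FakeStore(), FakeNotifier()
    ship_order("A-42", "sam@example.com", store, notifier)
    assert store.shipped == ["A-42"]
    assert notifier.sent == [("sam@example.com", "Order A-42 has shipped")]
```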
Read more: Tightly Coupled Monolith
Pressure to Skip Testing
If management has historically prioritized features over tests, the codebase will reflect that history. Tests were deferred sprint by sprint. Technical debt accumulated. The team that exists today is inheriting the decisions of teams that operated under different constraints, but the codebase carries the record of every time testing lost to deadline pressure.
Reversing this requires organizational commitment to treat test coverage as a delivery requirement, not as optional work that gets squeezed out when time is short. Without that commitment, the same pressure that created the untested codebase will prevent escaping it - and developers will keep gambling on every deploy.
Read more: Pressure to Skip Testing
How to narrow it down
- Can any single function in the codebase be tested without instantiating the entire application? If not, the architecture does not have the seams needed for unit tests. Start with Tightly Coupled Monolith.
- Has the team ever had a sustained period of writing tests as part of normal development? If not, the practice was never established. Start with Manual Testing Only.
- Did historical management decisions consistently deprioritize testing? If test debt accumulated from external pressure, the organizational habit needs to change before the technical situation can improve. Start with Pressure to Skip Testing.
Ready to fix this? The most common cause is Manual Testing Only. Start with its How to Fix It section for week-by-week steps.
5 - Refactoring Breaks Tests
Internal code changes that do not alter behavior cause widespread test failures.
What you are seeing
A developer renames a method, extracts a class, or reorganizes modules - changes that should not
affect external behavior. But dozens of tests fail. The failures are not catching real bugs.
They are breaking because the tests depend on implementation details that changed.
Developers start avoiding refactoring because the cost of updating tests is too high. Code
quality degrades over time because cleanup work is too expensive. When someone does refactor,
they spend more time fixing tests than improving the code.
Common causes
Inverted Test Pyramid
When the test suite is dominated by end-to-end and integration tests, those tests tend to be
tightly coupled to implementation details - CSS selectors, API response shapes, DOM structure,
or specific sequences of internal calls. A refactoring that changes none of the observable
behavior still breaks these tests because they assert on how the system works rather than what
it does.
Unit tests focused on behavior (“given this input, expect this output”) survive refactoring.
Tests coupled to implementation (“this method was called with these arguments”) do not.
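A minimal contrast (the tax helper is hypothetical): the first test survives any refactoring
that preserves the result; the second breaks the moment the helper is renamed or inlined, even
though behavior is unchanged:

```python
from unittest.mock import patch

def compute_tax(subtotal_cents: int, rate: float) -> int:
    return round(subtotal_cents * rate)

def total_with_tax(subtotal_cents: int, rate: float) -> int:
    return subtotal_cents + compute_tax(subtotal_cents, rate)

# Behavior-focused: given this input, expect this output.
def test_total_with_tax():
    assert total_with_tax(1000, 0.1) == 1100

# Implementation-coupled: asserts how the work was done internally.
def test_total_delegates_to_tax_helper():
    with patch(f"{__name__}.compute_tax", return_value=100) as tax:
        assert total_with_tax(1000, 0.1) == 1100
        tax.assert_called_once_with(1000, 0.1)
```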
Read more: Inverted Test Pyramid
Tightly Coupled Monolith
When components lack clear interfaces, tests reach into the internals of other modules. A
refactoring in module A breaks tests for module B - not because B’s behavior changed, but
because B’s tests were calling A’s internal methods directly. Without well-defined boundaries,
every internal change ripples across the test suite.
Read more: Tightly Coupled Monolith
How to narrow it down
- Do the broken tests assert on internal method calls, mock interactions, or DOM structure?
If yes, the tests are coupled to implementation rather than behavior. This is a test design
issue - start with Inverted Test Pyramid for guidance
on building a behavior-focused test suite.
- Are the broken tests end-to-end or UI tests that fail because of layout or selector
changes? If yes, you have too many tests at the wrong level of the pyramid. Start with
Inverted Test Pyramid.
- Do the broken tests span multiple modules - testing code in one area but breaking because
of changes in another? If yes, the problem is missing boundaries between components. Start
with Tightly Coupled Monolith.
Ready to fix this? The most common cause is Inverted Test Pyramid. Start with its How to Fix It section for week-by-week steps.
6 - Test Environments Take Too Long to Reset Between Runs
The team cannot run the full regression suite on every change because resetting the test environment and database takes too long.
What you are seeing
The team has a regression test suite that covers critical business flows. Running the tests
themselves takes twenty minutes. Resetting the test environment - restoring the database to a
known state, restarting services, clearing caches, reloading reference data - takes another
forty minutes. The total cycle is an hour. With multiple teams queuing for the same environment,
a developer might wait half a day to get feedback on a single change.
The team makes a practical decision: run the full regression suite nightly, or before a release,
but not on every change. Individual changes get a subset of tests against a partially reset
environment. Bugs that depend on data state - stale records, unexpected reference data, leftover
test artifacts - slip through because the partial reset does not catch them. The full suite
catches them later, but by then several changes have been merged and isolating which one
introduced the regression takes a multi-person investigation.
Some teams stop running the full suite entirely. The reset time is so long that the suite
becomes a release gate rather than a development tool. Developers lose confidence in the
suite because they rarely see it run and the failures they do see are often environment
artifacts rather than real bugs.
Common causes
Shared Test Environments
When multiple teams share a single test environment, the environment is never in a clean state.
One team’s tests leave data behind. Another team’s tests depend on data that was just deleted.
Resetting the environment means restoring it to a state that works for all teams, which
requires coordination and takes longer than resetting a single-team environment.
The shared environment also creates queuing. Only one test run can use the environment at a
time. Each team waits for the previous run to finish and the environment to reset before
starting their own.
Read more: Shared Test Environments
Manual Regression Testing Gates
When the regression suite is treated as a manual checkpoint rather than an automated pipeline
stage, the environment setup is also manual or semi-automated. Scripts that restore the
database, restart services, and verify the environment is ready have accumulated over time
without being optimized. Nobody has invested in making the reset fast because the suite was
never intended to run on every change.
Read more: Manual Regression Testing Gates
Too Many Hard Dependencies in the Test Suite
When tests require live databases, running services, and real network connections for every
assertion, the environment reset is slow because every dependency must be restored to a known
state. A test that validates billing logic should not need a running payment gateway. A test
that checks order validation should not need a populated product catalog database.
The fix is to match each test to the right layer. Functional tests that verify business rules
use in-memory databases or controlled fixtures - no environment reset needed. Contract tests
verify service boundaries with virtual services instead of live instances. Only a small number
of end-to-end tests need the fully assembled environment, and those run outside the pipeline’s
critical path. When the pipeline’s critical path depends on heavyweight integration for every
assertion, the reset time is a direct consequence of testing at the wrong layer.
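A sketch of a functional test at the right layer, using Python's built-in sqlite3 as an
in-memory stand-in (the schema and rule are hypothetical): no live catalog database, no reset
step, nothing shared between runs:

```python
import sqlite3

def order_is_valid(conn: sqlite3.Connection, sku: str, qty: int) -> bool:
    # Hypothetical rule: the SKU must exist and the requested quantity
    # must not exceed stock on hand.
    row = conn.execute("SELECT stock FROM products WHERE sku = ?", (sku,)).fetchone()
    return row is not None and qty <= row[0]

def test_order_validation_needs_no_environment():
    conn = sqlite3.connect(":memory:")  # controlled fixture, created per test
    conn.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, stock INTEGER)")
    conn.execute("INSERT INTO products VALUES ('WIDGET-1', 5)")
    assert order_is_valid(conn, "WIDGET-1", 5)
    assert not order_is_valid(conn, "WIDGET-1", 6)
    assert not order_is_valid(conn, "UNKNOWN", 1)
```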
Read more: Inverted Test Pyramid
Testing Only at the End
When testing is deferred to a late stage - after development, after integration, before
release - the tests assume a fully assembled system with a production-like database. Resetting that
system is inherently slow because it involves restoring a large database, restarting multiple
services, and verifying cross-service connectivity. The tests were designed for a heavyweight
environment because they run at a heavyweight stage.
Tests designed to run early - functional tests with controlled data, contract tests between
services - do not need environment resets. They run in isolation with their own data fixtures.
Read more: Testing Only at the End
How to narrow it down
- Is the environment shared across multiple teams or test suites? If teams queue for a
single environment, the reset time is compounded by coordination. Start with
Shared Test Environments.
- Does the reset process involve restoring a large database from backup? If the database
restore is the bottleneck, the tests depend on global data state rather than controlling
their own data. Start with
Manual Regression Testing Gates
and refactor tests to use isolated data fixtures.
- Do most tests require live databases, running services, or network connections? If the
majority of tests need the fully assembled environment, the suite is testing at the wrong
layer. Functional tests with in-memory databases and virtual services for
external dependencies would eliminate the reset bottleneck for most assertions. Start with
Inverted Test Pyramid.
- Does the full suite only run before releases, not on every change? If the suite is a
release gate rather than a pipeline stage, it was designed for a different feedback loop.
Start with
Testing Only at the End and move
tests earlier in the pipeline.
Ready to fix this? The most common cause is Shared Test Environments. Start with its How to Fix It section for week-by-week steps.
7 - Test Suite Is Too Slow to Run
The test suite takes 30 minutes or more. Developers stop running it locally and push without verifying.
What you are seeing
The full test suite takes 30 minutes, an hour, or longer. Developers do not run it locally because
they cannot afford to wait. Instead, they push their changes and let CI run the tests. Feedback
arrives long after the developer has moved on. If a test fails, the developer must context-switch
back, recall what they were doing, and debug the failure.
Some developers run only a subset of tests locally (the ones for their module) and skip the rest.
This catches some issues but misses integration problems between modules. Others skip local testing
entirely and treat the CI pipeline as their test runner, which overloads the shared pipeline and
increases wait times for everyone.
The team has discussed parallelizing the tests, splitting the suite, or adding more CI capacity.
These discussions stall because the root cause is not infrastructure. It is the shape of the test
suite itself.
Common causes
Inverted Test Pyramid
When the majority of tests are end-to-end or integration tests, the suite is inherently slow. E2E
tests launch browsers, start services, make network calls, and wait for responses. Each test takes
seconds or minutes instead of milliseconds. A suite of 500 E2E tests will always be slower than a
suite of 5,000 unit tests that verify the same logic at a lower level. The fix is not faster
hardware. It is moving test coverage down the pyramid.
Read more: Inverted Test Pyramid
Tightly Coupled Monolith
When the codebase has no clear module boundaries, tests cannot be scoped to individual components.
A test for one feature must set up the entire application because the feature depends on
everything. Test setup and teardown dominate execution time because there is no way to isolate the
system under test.
Read more: Tightly Coupled Monolith
Manual Testing Only
Sometimes the test suite is slow because the team added automated tests as an afterthought, using
E2E tests to backfill coverage for code that was not designed for unit testing. The resulting suite
is a collection of heavyweight tests that exercise the full stack for every scenario because the
code provides no lower-level testing seams.
Read more: Manual Testing Only
How to narrow it down
- What is the ratio of unit tests to E2E/integration tests? If E2E tests outnumber unit
tests, the test pyramid is inverted and the suite is slow by design. Start with
Inverted Test Pyramid.
- Can tests be run for a single module in isolation? If running one module’s tests requires
starting the entire application, the architecture prevents test isolation. Start with
Tightly Coupled Monolith.
- Were the automated tests added retroactively to a codebase with no testing seams? If tests
were bolted on after the fact using E2E tests because the code cannot be unit-tested, the
codebase needs refactoring for testability. Start with
Manual Testing Only.
Ready to fix this? The most common cause is Inverted Test Pyramid. Start with its How to Fix It section for week-by-week steps.
8 - Tests Interfere with Each Other Through Shared Data
Tests share mutable state in a common database. Results vary by run order, making failures unreliable signals of real bugs.
What you are seeing
Your test suite is technically running, but the results are a coin flip. A test that passed yesterday fails today because another test ran first and left dirty data in the shared database. You spend thirty minutes debugging a failure only to find the root cause was a record inserted by an unrelated test two hours ago. When you rerun the suite in isolation, everything passes. When you run it in CI with the full suite, it fails at random.
Shared database state is the source of the chaos. The database schema and seed data were set up once, years ago, by someone who has since left. Nobody is sure what state the database is supposed to be in before any given test. Some tests clean up after themselves; most do not. Some tests depend on records created by other tests. The execution order matters, but nobody explicitly controls it - so the suite is fragile by construction.
The downstream effect is that your team has stopped trusting test failures. When a red build appears, the first instinct is not “there is a bug” but “someone broke the test data again.” You rerun the build, it goes green, and you ship. Real bugs make it to production because the signal-to-noise ratio of your test suite has collapsed.
Common causes
Manual Testing Only
Teams that have relied on manual testing tend to reach for a shared database as the natural extension of how testers have always worked - against a shared test environment. When automated tests are added later, they inherit the same model: one environment, one database, shared by everyone. Nobody designed a data strategy; it evolved from how the team already worked.
When teams shift to isolated test data - each test owns and tears down its own data - interference disappears. Tests become deterministic. A failing test means code is broken, not the environment.
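What “each test owns its own data” can look like, sketched with a pytest fixture and an
in-memory database (all names are hypothetical):

```python
import sqlite3
import pytest

@pytest.fixture
def db():
    # Every test gets a private database; creation and teardown are
    # part of the test's own lifecycle, so nothing leaks between tests.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
    yield conn
    conn.close()

def test_create_user(db):
    db.execute("INSERT INTO users (email) VALUES ('a@example.com')")
    count = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    assert count == 1  # cannot be polluted by another test's leftover rows
```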
Read more: Manual Testing Only
Inverted Test Pyramid
When most automated tests are end-to-end or integration tests that exercise a real database, test data problems compound. Each test requires realistic, complex data to be in place. The more tests that depend on a shared database, the more opportunities for interference and the harder it becomes to manage the data lifecycle.
Shifting toward a pyramid with a large base of unit tests reduces database dependency dramatically. Unit tests run against in-memory structures and do not touch shared state. The integration and end-to-end tests that remain can be designed more carefully with isolated, purpose-built datasets. With fewer tests competing for shared database rows, the random CI failures that triggered “just rerun it” reflexes become rare, and a red build is a signal worth investigating.
Read more: Inverted Test Pyramid
Snowflake Environments
When test environments are hand-crafted and not reproducible from code, database state drifts over time. Schema migrations get applied inconsistently. Seed data scripts run at different times in different environments. Each environment develops its own data personality, and tests written against one environment fail on another.
Reproducible environments - created from code on demand and destroyed after use - eliminate drift. When the database is provisioned fresh from a migration script and a known seed set for each test run, the starting state is always predictable. Tests that produced different results on different machines or at different times start producing consistent results, and the team can stop dismissing CI failures as environment noise.
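A toy version of the idea - migrations and seed data as code, applied fresh for every run (the
schema and values are hypothetical):

```python
import sqlite3

MIGRATIONS = [
    "CREATE TABLE plans (code TEXT PRIMARY KEY, monthly_cents INTEGER)",
]
SEED = [
    "INSERT INTO plans VALUES ('basic', 900)",
    "INSERT INTO plans VALUES ('pro', 2900)",
]

def provision() -> sqlite3.Connection:
    # The starting state is defined entirely by code under version
    # control, so it is identical on every machine, every run.
    conn = sqlite3.connect(":memory:")
    for stmt in MIGRATIONS + SEED:
        conn.execute(stmt)
    return conn

def test_starts_from_known_seed_state():
    conn = provision()
    rows = conn.execute("SELECT code FROM plans ORDER BY code").fetchall()
    assert rows == [("basic",), ("pro",)]
```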
Read more: Snowflake Environments
How to narrow it down
- Do tests pass when run individually but fail when run together? Mutual interference from shared mutable state is the most likely cause. Start with Inverted Test Pyramid.
- Does the test suite pass on one machine but fail in CI? The CI environment's database differs from the developer's local one. Start with Snowflake Environments.
- Is there no documented strategy for setting up and tearing down test data? The team never established a data strategy. Start with Manual Testing Only.
Ready to fix this? The most common cause is Inverted Test Pyramid. Start with its How to Fix It section for week-by-week steps.
9 - Tests Randomly Pass or Fail
The pipeline fails, the developer reruns it without changing anything, and it passes.
What you are seeing
A developer pushes a change. The pipeline fails on a test they did not touch, in a module they
did not change. They click rerun. It passes. They merge. This happens multiple times a day across
the team. Nobody investigates failures on the first occurrence because the odds favor flakiness
over a real problem.
The team has adapted: retry-until-green is a routine step, not an exception. Some pipelines are
configured to automatically rerun failed tests. Tests are tagged as “known flaky” and skipped.
Real regressions hide behind the noise because the team has been trained to ignore failures.
Common causes
Inverted Test Pyramid
When the test suite is dominated by end-to-end tests, flakiness is structural. E2E tests depend
on network connectivity, shared test environments, external service availability, and browser
rendering timing. Any of these can produce a different result on each run. A suite built mostly
on E2E tests will always be flaky because it is built on non-deterministic foundations.
Replacing E2E tests with functional tests that use test doubles for external dependencies makes
the suite deterministic by design. The test produces the same result every time because it
controls all its inputs.
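A sketch of the substitution (the rates client is hypothetical): a production implementation
would call a live API over the network; the test double returns fixed answers, so the test
cannot flake:

```python
class StubRates:
    # Test double: fixed answers, no network, same result on every run.
    def __init__(self, rates: dict[tuple[str, str], float]) -> None:
        self._rates = rates

    def rate(self, base: str, quote: str) -> float:
        return self._rates[(base, quote)]

def convert(amount_cents: int, base: str, quote: str, client) -> int:
    # The function receives its rate source, so the test controls the input.
    return round(amount_cents * client.rate(base, quote))

def test_convert_is_deterministic():
    client = StubRates({("USD", "EUR"): 0.9})
    assert convert(10_000, "USD", "EUR", client) == 9_000
```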
Read more: Inverted Test Pyramid
Snowflake Environments
When the CI environment is configured differently from other environments - or drifts over time -
tests pass locally but fail in CI, or pass in CI on Tuesday but fail on Wednesday. The
inconsistency is not in the test or the code but in the environment the test runs in.
Tests that depend on specific environment configurations, installed packages, file system layout,
or network access are vulnerable to environment drift. Infrastructure-as-code eliminates this
class of flakiness by ensuring environments are identical and reproducible.
Read more: Snowflake Environments
Tightly Coupled Monolith
When components share mutable state - a database, a cache, a filesystem directory - tests that
run concurrently or in a specific order can interfere with each other. Test A writes to a shared
table. Test B reads from the same table and gets unexpected data. The tests pass individually
but fail together, or pass in one order but fail in another.
Without clear component boundaries, tests cannot be isolated. The flakiness is a symptom of
architectural coupling, not a testing problem.
Read more: Tightly Coupled Monolith
How to narrow it down
- Do the flaky tests hit real external services or shared environments? If yes, the tests
are non-deterministic by design. Start with
Inverted Test Pyramid and replace them with
functional tests using test doubles.
- Do tests pass locally but fail in CI, or vice versa? If yes, the environments differ.
Start with Snowflake Environments.
- Do tests pass individually but fail when run together, or fail in a different order? If
yes, tests share mutable state. Start with
Tightly Coupled Monolith for the
architectural root cause, and isolate test data as an immediate fix.
Ready to fix this? The most common cause is Inverted Test Pyramid. Start with its How to Fix It section for week-by-week steps.