Test Suite Problems
Symptoms related to test reliability, coverage effectiveness, speed, and environment consistency.
These symptoms indicate problems with your testing strategy. Unreliable or slow tests erode
confidence and slow delivery. Each page describes what you are seeing and links to the
anti-patterns most likely causing it.
How to use this section
Start with the symptom that matches what your team experiences. Each symptom page explains what
you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic
questions to narrow down which cause applies to your situation. Follow the anti-pattern link to
find concrete fix steps.
Related anti-pattern categories: Testing Anti-Patterns, Pipeline Anti-Patterns
Related guide: Testing Fundamentals
1 - AI-Generated Code Ships Without Developer Understanding
Developers accept AI-generated code without verifying it against acceptance criteria, and functional bugs and security vulnerabilities reach production unchallenged.
What you are seeing
A developer asks an AI assistant to implement a feature. The generated code looks plausible.
The tests pass. The developer commits it. Two weeks later, a security review finds the code
accepts unsanitized input in a path nobody specified as an acceptance criterion. When asked
what the change was supposed to do, the developer says, “It implements the feature.” When
asked how they validated it, they say, “The tests passed.”
This is not an occasional gap. It is a pattern. Developers use AI to produce code faster, but
they do not define what “correct” means before generating code, verify the output against
specific acceptance criteria, or consider how they would detect a failure in production. The
code compiles. The tests pass. Nobody validated it against the actual requirements.
The symptoms compound over time. Defects appear in AI-generated code that the team cannot
diagnose quickly because nobody defined what the code was supposed to do beyond “implement
the feature.” Fixes are made by asking the AI to fix its own output without re-examining the
original acceptance criteria. Security vulnerabilities - injection flaws, broken access
controls, exposed credentials - ship because nobody asked “what are the security constraints
for this change?” before or after generation.
Common causes
Rubber-Stamping AI-Generated Code
When there is no expectation that developers own what a change does and how they validated it -
regardless of who or what wrote the code - AI output gets the same cursory glance as a trivial
formatting change. The team treats “AI wrote it and the tests pass” as sufficient evidence of
correctness. It is not. Passing tests prove the code satisfies the test cases. They do not
prove the code meets the actual requirements or handles the constraints the team cares about.
Read more: Rubber-Stamping AI-Generated Code
Missing Acceptance Criteria
When the work item lacks concrete acceptance criteria - specific inputs, expected outputs,
security constraints, edge cases - neither the developer nor the AI has a clear target. The AI
generates something that looks right. The developer has no checklist to verify it against. The
review is a subjective “does this seem okay?” rather than an objective “does this satisfy every
stated requirement?”
Read more: Monolithic Work Items
Inverted Test Pyramid
When the test suite relies heavily on end-to-end tests and lacks targeted unit and functional
tests, AI-generated code can pass the suite without its internal logic being verified. A
comprehensive functional test suite would catch the cases where the AI’s implementation
diverges from the domain rules. Without it, “tests pass” is a weak signal.
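To make this concrete, here is a minimal pytest sketch of a functional test that pins a domain
rule to specific inputs and outputs. The pricing rule and every name in it are hypothetical -
the point is that an AI-generated implementation either satisfies these cases or it does not:

```python
# Hypothetical domain rule: orders of 10 or more units get a 15% bulk
# discount, and a promotional price is never discounted further.
def unit_price(base_cents: int, quantity: int, promo_cents: int | None = None) -> int:
    if promo_cents is not None:
        return promo_cents
    if quantity >= 10:
        return round(base_cents * 0.85)
    return base_cents

def test_bulk_discount_applies_at_threshold():
    assert unit_price(10_000, 10) == 8_500

def test_no_discount_below_threshold():
    assert unit_price(10_000, 9) == 10_000

def test_promo_price_is_never_discounted_further():
    assert unit_price(10_000, 50, promo_cents=9_000) == 9_000
```

Tests like these double as the acceptance criteria the review can check against.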
Read more: Inverted Test Pyramid
How to narrow it down
- Can developers explain what their recent changes do and how they validated them? Pick
three recent AI-assisted commits at random and ask the committing developer: what does this
change accomplish, what acceptance criteria did you verify, and how would you detect if it
were wrong? If they cannot answer, the review process is not catching unexamined code.
Start with
Rubber-Stamping AI-Generated Code.
- Do your work items include specific, testable acceptance criteria before implementation
starts? If acceptance criteria are vague or added after the fact, neither the AI nor the
developer has a clear target. Start with
Monolithic Work Items.
- Does your test suite include functional tests that verify business rules with specific
inputs and outputs? If the suite is mostly end-to-end or integration tests, AI-generated
code can satisfy them without being correct at the rule level. Start with
Inverted Test Pyramid.
Ready to fix this? The most common cause is Rubber-Stamping AI-Generated Code. Start with its How to Fix It section for week-by-week steps.
2 - Tests Pass in One Environment but Fail in Another
Tests pass locally but fail in CI, or pass in CI but fail in staging. Environment differences cause unpredictable failures.
What you are seeing
A developer runs the tests locally and they pass. They push to CI and the same tests fail. Or the
CI pipeline is green but the tests fail in the staging environment. The failures are not caused by
a code defect. They are caused by differences between environments: a different OS version, a
different database version, a different timezone setting, a missing environment variable, or a
service that is available locally but not in CI.
The developer spends time debugging the failure and discovers the root cause is environmental, not
logical. They add a workaround (skip the test in CI, add an environment check, adjust a timeout)
and move on. These workarounds accumulate over time. The test suite becomes littered with
environment-specific conditionals and skipped tests.
The team loses confidence in the test suite because results depend on where the tests run rather
than whether the code is correct.
Common causes
Snowflake Environments
When each environment is configured by hand and maintained independently, they drift apart over
time. The developer’s laptop has one version of a database driver. The CI server has another. The
staging environment has a third. These differences are invisible until a test exercises a code
path that behaves differently across versions. The fix is not to harmonize configurations manually
(they will drift again) but to provision all environments from the same infrastructure code.
Read more: Snowflake Environments
Manual Deployments
When deployment and environment setup are manual processes, subtle differences creep in. One
developer installed a dependency a particular way. The CI server was configured by a different
person with slightly different settings. The staging environment was set up months ago and has not
been updated. Manual processes are never identical twice, and the variance causes
environment-dependent behavior.
Read more: Manual Deployments
Tightly Coupled Monolith
When the application has hidden dependencies on external state (filesystem paths, network
services, system configuration), tests that work in one environment fail in another because the
external state differs. Well-isolated code with explicit dependencies is portable across
environments. Tightly coupled code that reaches into its environment for implicit dependencies is
fragile.
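A minimal sketch of the difference, using pytest's built-in tmp_path fixture (the config loader
and its path are hypothetical):

```python
from pathlib import Path

# Fragile: reaches into its environment for an implicit dependency.
# This passes or fails depending on what happens to exist on the host.
def load_config_implicit() -> str:
    return Path("/etc/myapp/config.json").read_text()

# Portable: the dependency is explicit, so every environment - laptop,
# CI, staging - can supply its own.
def load_config(config_path: Path) -> str:
    return config_path.read_text()

def test_load_config(tmp_path: Path):
    cfg = tmp_path / "config.json"
    cfg.write_text('{"timeout": 30}')
    assert load_config(cfg) == '{"timeout": 30}'
```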
Read more: Tightly Coupled Monolith
How to narrow it down
- Are all environments provisioned from the same infrastructure code? If not, environment
drift is the most likely cause. Start with
Snowflake Environments.
- Are environment setup and configuration manual? If different people configured different
environments, the variance is a direct result of manual processes. Start with
Manual Deployments.
- Do the failing tests depend on external services, filesystem paths, or system
configuration? If tests assume specific external state rather than declaring explicit
dependencies, the code’s coupling to its environment is the issue. Start with
Tightly Coupled Monolith.
Ready to fix this? The most common cause is Snowflake Environments. Start with its How to Fix It section for week-by-week steps.
3 - High Coverage but Tests Miss Defects
Test coverage numbers look healthy but defects still reach production.
What you are seeing
Your dashboard shows 80% or 90% code coverage, but bugs keep getting through. Defects show up
in production that feel like they should have been caught. The team points to the coverage
number as proof that testing is solid, yet the results tell a different story.
People start losing trust in the test suite. Some developers stop running tests locally because
they do not believe the tests will catch anything useful. Others add more tests, pushing
coverage higher, without the defect rate improving.
Common causes
Inverted Test Pyramid
When most of your tests are end-to-end or integration tests, they exercise many code paths in a
single run - which inflates coverage numbers. But these tests often verify that a workflow
completes without errors, not that each piece of logic produces the correct result. A test that
clicks through a form and checks for a success message covers dozens of functions without
validating any of them in detail.
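A sketch of how this happens (all names are hypothetical): one workflow-level test executes every
line of three functions while validating none of their results:

```python
def validate(order: dict) -> None:
    if not order.get("items"):
        raise ValueError("empty order")

def price(order: dict) -> int:
    return sum(item["cents"] * item["qty"] for item in order["items"])

def persist(order: dict) -> None:
    order["saved"] = True  # stand-in for a database write

def checkout(order: dict) -> dict:
    validate(order)
    total = price(order)
    persist(order)
    return {"status": "ok", "total": total}

# Covers every line of validate, price, and persist - and would still
# pass if price() doubled every total.
def test_checkout_completes():
    order = {"items": [{"cents": 500, "qty": 2}]}
    assert checkout(order)["status"] == "ok"
```

A unit test asserting that price(order) equals 1000 would catch what this one cannot.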
Read more: Inverted Test Pyramid
Pressure to Skip Testing
When teams face pressure to hit a coverage target, testing becomes theater. Developers write
tests with trivial assertions - checking that a function returns without throwing, or that a
value is not null - just to get the number up. The coverage metric looks healthy, but the tests
do not actually verify behavior. They exist to satisfy a gate, not to catch defects.
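The pattern looks like this in miniature (the helper and its bug are hypothetical):

```python
from datetime import date

def days_until(deadline: date, today: date) -> int:
    # Hypothetical off-by-one bug: should be (deadline - today).days
    return (deadline - today).days + 1

# Theater: raises coverage, passes regardless of the bug.
def test_days_until_returns_value():
    assert days_until(date(2024, 1, 10), date(2024, 1, 1)) is not None

# Behavior: pins the expected outcome and exposes the bug.
def test_days_until_counts_days():
    assert days_until(date(2024, 1, 10), date(2024, 1, 1)) == 9
```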
Read more: Pressure to Skip Testing
Code Coverage Mandates
When the organization gates the pipeline on a coverage target, teams optimize for the number
rather than for defect detection. Developers write assertion-free tests, cover trivial code, or
add single integration tests that execute hundreds of lines without validating any of them. The
coverage metric rises while the tests remain unable to catch meaningful defects.
Read more: Code Coverage Mandates
Manual Testing Only
When test automation is absent or minimal, teams sometimes generate superficial tests or rely on
coverage from integration-level runs that touch many lines without asserting meaningful outcomes.
The coverage tool counts every line that executes, regardless of whether any test validates the
result.
Read more: Manual Testing Only
How to narrow it down
- Do most tests assert on behavior and expected outcomes, or do they just verify that code
runs without errors? If tests mostly check for no-exceptions or non-null returns, the
problem is testing theater - tests written to hit a number, not to catch defects. Start with
Pressure to Skip Testing.
- Are the majority of your tests end-to-end or integration tests? If most of the suite runs
through a browser, API, or multi-service flow rather than testing units of logic directly,
start with Inverted Test Pyramid.
- Does the pipeline gate on a specific coverage percentage? If the team writes tests
primarily to keep coverage above a mandated threshold, start with
Code Coverage Mandates.
- Were tests added retroactively to meet a coverage target? If the bulk of tests were
written after the code to satisfy a coverage gate rather than to verify design decisions,
start with
Pressure to Skip Testing.
Ready to fix this? The most common cause is Code Coverage Mandates. Start with its How to Fix It section for week-by-week steps.
4 - A Large Codebase Has No Automated Tests
Zero test coverage in a production system being actively modified. Nobody is confident enough to change the code safely.
What you are seeing
Every modification to this codebase is a gamble. The system has no automated tests. Changes are validated through manual testing, if they are validated at all. Developers work carefully but know that any change could trigger failures in code they did not touch, because the system has no seams and no isolation. The only way to know if a change works is to deploy it and observe what breaks.
Refactoring is effectively off the table. Improving the design of the code requires changing it in ways that should not alter behavior - but with no tests, there is no way to verify that behavior was preserved. Developers choose to add code around existing code rather than improve it, because change is unsafe. The codebase grows more complex with every feature because improving the underlying structure carries too much risk.
The team knows the situation is unsustainable but cannot see a path out. “We should write tests” appears in every retrospective. The problem is that adding tests to an untestable codebase requires refactoring first - and refactoring safely requires tests. The team is stuck in a loop with no obvious entry point.
Common causes
Manual Testing Only
The team has relied on manual testing as the primary quality gate. Automated tests were never required, never prioritized, and never resourced. The codebase was built without testability as a design constraint, which means the architecture does not accommodate automated testing without structural change.
Making the transition requires a deliberate commitment: new code is always written with tests, existing code gets tests when it is modified, and high-risk areas are prioritized for retrofitted coverage. Over months, the areas of the codebase where developers cannot safely make changes shrink, and the cycle of deploying to discover breakage is replaced by a test suite that catches failures before production.
Read more: Manual Testing Only
Tightly Coupled Monolith
Code without dependency injection, without interfaces, and without clear module boundaries cannot be tested without a major structural overhaul. Every function calls other functions directly. Every component reaches into every other component. Writing a test for one function requires instantiating the entire system.
Introducing seams - interfaces, dependency injection, module boundaries - makes code testable. This work is not glamorous and its value is invisible until tests start getting written. But it is the prerequisite for meaningful test coverage in a tightly coupled system. Once the seams exist, functions can be tested in isolation rather than requiring a full application instantiation - and developers stop needing to deploy to find out if a change is safe.
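As a sketch of what a seam looks like in practice - every name here is hypothetical - the
function below declares its collaborators as parameters, so a test can supply lightweight
doubles instead of booting the real database and mailer:

```python
from typing import Protocol

class OrderStore(Protocol):
    def mark_shipped(self, order_id: str) -> None: ...

class Notifier(Protocol):
    def send(self, recipient: str, message: str) -> None: ...

# The seam: collaborators are injected rather than constructed inside.
def ship_order(order_id: str, recipient: str, store: OrderStore, notifier: Notifier) -> None:
    store.mark_shipped(order_id)
    notifier.send(recipient, f"Order {order_id} has shipped")

class FakeStore:
    def __init__(self) -> None:
        self.shipped: list[str] = []
    def mark_shipped(self, order_id: str) -> None:
        self.shipped.append(order_id)

class FakeNotifier:
    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []
    def send(self, recipient: str, message: str) -> None:
        self.sent.append((recipient, message))

def test_ship_order_marks_and_notifies():
    store, notifier = FakeStore(), FakeNotifier()
    ship_order("A-42", "sam@example.com", store, notifier)
    assert store.shipped == ["A-42"]
    assert notifier.sent == [("sam@example.com", "Order A-42 has shipped")]
```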
Read more: Tightly Coupled Monolith
Pressure to Skip Testing
If management has historically prioritized features over tests, the codebase will reflect that history. Tests were deferred sprint by sprint. Technical debt accumulated. The team that exists today is inheriting the decisions of teams that operated under different constraints, but the codebase carries the record of every time testing lost to deadline pressure.
Reversing this requires organizational commitment to treat test coverage as a delivery requirement, not as optional work that gets squeezed out when time is short. Without that commitment, the same pressure that created the untested codebase will prevent escaping it - and developers will keep gambling on every deploy.
Read more: Pressure to Skip Testing
How to narrow it down
- Can any single function in the codebase be tested without instantiating the entire application? If not, the architecture does not have the seams needed for unit tests. Start with Tightly Coupled Monolith.
- Has the team ever had a sustained period of writing tests as part of normal development? If not, the practice was never established. Start with Manual Testing Only.
- Did historical management decisions consistently deprioritize testing? If test debt accumulated from external pressure, the organizational habit needs to change before the technical situation can improve. Start with Pressure to Skip Testing.
Ready to fix this? The most common cause is Manual Testing Only. Start with its How to Fix It section for week-by-week steps.
5 - Refactoring Breaks Tests
Internal code changes that do not alter behavior cause widespread test failures.
What you are seeing
A developer renames a method, extracts a class, or reorganizes modules - changes that should not
affect external behavior. But dozens of tests fail. The failures are not catching real bugs.
They are breaking because the tests depend on implementation details that changed.
Developers start avoiding refactoring because the cost of updating tests is too high. Code
quality degrades over time because cleanup work is too expensive. When someone does refactor,
they spend more time fixing tests than improving the code.
Common causes
Inverted Test Pyramid
When the test suite is dominated by end-to-end and integration tests, those tests tend to be
tightly coupled to implementation details - CSS selectors, API response shapes, DOM structure,
or specific sequences of internal calls. A refactoring that changes none of the observable
behavior still breaks these tests because they assert on how the system works rather than what
it does.
Unit tests focused on behavior (“given this input, expect this output”) survive refactoring.
Tests coupled to implementation (“this method was called with these arguments”) do not.
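A minimal contrast (the tax helper is hypothetical): the first test survives any refactoring
that preserves the result; the second breaks the moment the helper is renamed or inlined, even
though behavior is unchanged:

```python
from unittest.mock import patch

def compute_tax(subtotal_cents: int, rate: float) -> int:
    return round(subtotal_cents * rate)

def total_with_tax(subtotal_cents: int, rate: float) -> int:
    return subtotal_cents + compute_tax(subtotal_cents, rate)

# Behavior-focused: given this input, expect this output.
def test_total_with_tax():
    assert total_with_tax(1000, 0.1) == 1100

# Implementation-coupled: asserts how the work was done internally.
def test_total_delegates_to_tax_helper():
    with patch(f"{__name__}.compute_tax", return_value=100) as tax:
        assert total_with_tax(1000, 0.1) == 1100
        tax.assert_called_once_with(1000, 0.1)
```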
Read more: Inverted Test Pyramid
Tightly Coupled Monolith
When components lack clear interfaces, tests reach into the internals of other modules. A
refactoring in module A breaks tests for module B - not because B’s behavior changed, but
because B’s tests were calling A’s internal methods directly. Without well-defined boundaries,
every internal change ripples across the test suite.
Read more: Tightly Coupled Monolith
How to narrow it down
- Do the broken tests assert on internal method calls, mock interactions, or DOM structure?
If yes, the tests are coupled to implementation rather than behavior. This is a test design
issue - start with Inverted Test Pyramid for guidance
on building a behavior-focused test suite.
- Are the broken tests end-to-end or UI tests that fail because of layout or selector
changes? If yes, you have too many tests at the wrong level of the pyramid. Start with
Inverted Test Pyramid.
- Do the broken tests span multiple modules - testing code in one area but breaking because
of changes in another? If yes, the problem is missing boundaries between components. Start
with Tightly Coupled Monolith.
Ready to fix this? The most common cause is Inverted Test Pyramid. Start with its How to Fix It section for week-by-week steps.
6 - Test Environments Take Too Long to Reset Between Runs
The team cannot run the full regression suite on every change because resetting the test environment and database takes too long.
What you are seeing
The team has a regression test suite that covers critical business flows. Running the tests
themselves takes twenty minutes. Resetting the test environment - restoring the database to a
known state, restarting services, clearing caches, reloading reference data - takes another
forty minutes. The total cycle is an hour. With multiple teams queuing for the same environment,
a developer might wait half a day to get feedback on a single change.
The team makes a practical decision: run the full regression suite nightly, or before a release,
but not on every change. Individual changes get a subset of tests against a partially reset
environment. Bugs that depend on data state - stale records, unexpected reference data, leftover
test artifacts - slip through because the partial reset does not catch them. The full suite
catches them later, but by then several changes have been merged and isolating which one
introduced the regression takes a multi-person investigation.
Some teams stop running the full suite entirely. The reset time is so long that the suite
becomes a release gate rather than a development tool. Developers lose confidence in the
suite because they rarely see it run and the failures they do see are often environment
artifacts rather than real bugs.
Common causes
Shared Test Environments
When multiple teams share a single test environment, the environment is never in a clean state.
One team’s tests leave data behind. Another team’s tests depend on data that was just deleted.
Resetting the environment means restoring it to a state that works for all teams, which
requires coordination and takes longer than resetting a single-team environment.
The shared environment also creates queuing. Only one test run can use the environment at a
time. Each team waits for the previous run to finish and the environment to reset before
starting their own.
Read more: Shared Test Environments
Manual Regression Testing Gates
When the regression suite is treated as a manual checkpoint rather than an automated pipeline
stage, the environment setup is also manual or semi-automated. Scripts that restore the
database, restart services, and verify the environment is ready have accumulated over time
without being optimized. Nobody has invested in making the reset fast because the suite was
never intended to run on every change.
Read more: Manual Regression Testing Gates
Too Many Hard Dependencies in the Test Suite
When tests require live databases, running services, and real network connections for every
assertion, the environment reset is slow because every dependency must be restored to a known
state. A test that validates billing logic should not need a running payment gateway. A test
that checks order validation should not need a populated product catalog database.
The fix is to match each test to the right layer. Functional tests that verify business rules
use in-memory databases or controlled fixtures - no environment reset needed. Contract tests
verify service boundaries with virtual services instead of live instances. Only a small number
of end-to-end tests need the fully assembled environment, and those run outside the pipeline’s
critical path. When the pipeline’s critical path depends on heavyweight integration for every
assertion, the reset time is a direct consequence of testing at the wrong layer.
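A sketch of a functional test at the right layer, using Python's built-in sqlite3 as an
in-memory stand-in (the schema and rule are hypothetical): no live catalog database, no reset
step, nothing shared between runs:

```python
import sqlite3

def order_is_valid(conn: sqlite3.Connection, sku: str, qty: int) -> bool:
    # Hypothetical rule: the SKU must exist and the requested quantity
    # must not exceed stock on hand.
    row = conn.execute("SELECT stock FROM products WHERE sku = ?", (sku,)).fetchone()
    return row is not None and qty <= row[0]

def test_order_validation_needs_no_environment():
    conn = sqlite3.connect(":memory:")  # controlled fixture, created per test
    conn.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, stock INTEGER)")
    conn.execute("INSERT INTO products VALUES ('WIDGET-1', 5)")
    assert order_is_valid(conn, "WIDGET-1", 5)
    assert not order_is_valid(conn, "WIDGET-1", 6)
    assert not order_is_valid(conn, "UNKNOWN", 1)
```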
Read more: Inverted Test Pyramid
Testing Only at the End
When testing is deferred to a late stage - after development, after integration, before
release - the tests assume a fully assembled system with a production-like database. Resetting that
system is inherently slow because it involves restoring a large database, restarting multiple
services, and verifying cross-service connectivity. The tests were designed for a heavyweight
environment because they run at a heavyweight stage.
Tests designed to run early - functional tests with controlled data, contract tests between
services - do not need environment resets. They run in isolation with their own data fixtures.
Read more: Testing Only at the End
How to narrow it down
- Is the environment shared across multiple teams or test suites? If teams queue for a
single environment, the reset time is compounded by coordination. Start with
Shared Test Environments.
- Does the reset process involve restoring a large database from backup? If the database
restore is the bottleneck, the tests depend on global data state rather than controlling
their own data. Start with
Manual Regression Testing Gates
and refactor tests to use isolated data fixtures.
- Do most tests require live databases, running services, or network connections? If the
majority of tests need the fully assembled environment, the suite is testing at the wrong
layer. Functional tests with in-memory databases and virtual services for
external dependencies would eliminate the reset bottleneck for most assertions. Start with
Inverted Test Pyramid.
- Does the full suite only run before releases, not on every change? If the suite is a
release gate rather than a pipeline stage, it was designed for a different feedback loop.
Start with
Testing Only at the End and move
tests earlier in the pipeline.
Ready to fix this? The most common cause is Shared Test Environments. Start with its How to Fix It section for week-by-week steps.
7 - Test Suite Is Too Slow to Run
The test suite takes 30 minutes or more. Developers stop running it locally and push without verifying.
What you are seeing
The full test suite takes 30 minutes, an hour, or longer. Developers do not run it locally because
they cannot afford to wait. Instead, they push their changes and let CI run the tests. Feedback
arrives long after the developer has moved on. If a test fails, the developer must context-switch
back, recall what they were doing, and debug the failure.
Some developers run only a subset of tests locally (the ones for their module) and skip the rest.
This catches some issues but misses integration problems between modules. Others skip local testing
entirely and treat the CI pipeline as their test runner, which overloads the shared pipeline and
increases wait times for everyone.
The team has discussed parallelizing the tests, splitting the suite, or adding more CI capacity.
These discussions stall because the root cause is not infrastructure. It is the shape of the test
suite itself.
Common causes
Inverted Test Pyramid
When the majority of tests are end-to-end or integration tests, the suite is inherently slow. E2E
tests launch browsers, start services, make network calls, and wait for responses. Each test takes
seconds or minutes instead of milliseconds. A suite of 500 E2E tests will always be slower than a
suite of 5,000 unit tests that verify the same logic at a lower level. The fix is not faster
hardware. It is moving test coverage down the pyramid.
Read more: Inverted Test Pyramid
Tightly Coupled Monolith
When the codebase has no clear module boundaries, tests cannot be scoped to individual components.
A test for one feature must set up the entire application because the feature depends on
everything. Test setup and teardown dominate execution time because there is no way to isolate the
system under test.
Read more: Tightly Coupled Monolith
Manual Testing Only
Sometimes the test suite is slow because the team added automated tests as an afterthought, using
E2E tests to backfill coverage for code that was not designed for unit testing. The resulting suite
is a collection of heavyweight tests that exercise the full stack for every scenario because the
code provides no lower-level testing seams.
Read more: Manual Testing Only
How to narrow it down
- What is the ratio of unit tests to E2E/integration tests? If E2E tests outnumber unit
tests, the test pyramid is inverted and the suite is slow by design. Start with
Inverted Test Pyramid.
- Can tests be run for a single module in isolation? If running one module’s tests requires
starting the entire application, the architecture prevents test isolation. Start with
Tightly Coupled Monolith.
- Were the automated tests added retroactively to a codebase with no testing seams? If tests
were bolted on after the fact using E2E tests because the code cannot be unit-tested, the
codebase needs refactoring for testability. Start with
Manual Testing Only.
Ready to fix this? The most common cause is Inverted Test Pyramid. Start with its How to Fix It section for week-by-week steps.
8 - Tests Interfere with Each Other Through Shared Data
Tests share mutable state in a common database. Results vary by run order, making failures unreliable signals of real bugs.
What you are seeing
Your test suite is technically running, but the results are a coin flip. A test that passed yesterday fails today because another test ran first and left dirty data in the shared database. You spend thirty minutes debugging a failure only to find the root cause was a record inserted by an unrelated test two hours ago. When you rerun the suite in isolation, everything passes. When you run it in CI with the full suite, it fails at random.
Shared database state is the source of the chaos. The database schema and seed data were set up once, years ago, by someone who has since left. Nobody is sure what state the database is supposed to be in before any given test. Some tests clean up after themselves; most do not. Some tests depend on records created by other tests. The execution order matters, but nobody explicitly controls it - so the suite is fragile by construction.
The downstream effect is that your team has stopped trusting test failures. When a red build appears, the first instinct is not “there is a bug” but “someone broke the test data again.” You rerun the build, it goes green, and you ship. Real bugs make it to production because the signal-to-noise ratio of your test suite has collapsed.
Common causes
Manual Testing Only
Teams that have relied on manual testing tend to reach for a shared database as the natural extension of how testers have always worked - against a shared test environment. When automated tests are added later, they inherit the same model: one environment, one database, shared by everyone. Nobody designed a data strategy; it evolved from how the team already worked.
When teams shift to isolated test data - each test owns and tears down its own data - interference disappears. Tests become deterministic. A failing test means code is broken, not the environment.
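What “each test owns its own data” can look like, sketched with a pytest fixture and an
in-memory database (all names are hypothetical):

```python
import sqlite3
import pytest

@pytest.fixture
def db():
    # Every test gets a private database; creation and teardown are
    # part of the test's own lifecycle, so nothing leaks between tests.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
    yield conn
    conn.close()

def test_create_user(db):
    db.execute("INSERT INTO users (email) VALUES ('a@example.com')")
    count = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    assert count == 1  # cannot be polluted by another test's leftover rows
```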
Read more: Manual Testing Only
Inverted Test Pyramid
When most automated tests are end-to-end or integration tests that exercise a real database, test data problems compound. Each test requires realistic, complex data to be in place. The more tests that depend on a shared database, the more opportunities for interference and the harder it becomes to manage the data lifecycle.
Shifting toward a pyramid with a large base of unit tests reduces database dependency dramatically. Unit tests run against in-memory structures and do not touch shared state. The integration and end-to-end tests that remain can be designed more carefully with isolated, purpose-built datasets. With fewer tests competing for shared database rows, the random CI failures that triggered “just rerun it” reflexes become rare, and a red build is a signal worth investigating.
Read more: Inverted Test Pyramid
Snowflake Environments
When test environments are hand-crafted and not reproducible from code, database state drifts over time. Schema migrations get applied inconsistently. Seed data scripts run at different times in different environments. Each environment develops its own data personality, and tests written against one environment fail on another.
Reproducible environments - created from code on demand and destroyed after use - eliminate drift. When the database is provisioned fresh from a migration script and a known seed set for each test run, the starting state is always predictable. Tests that produced different results on different machines or at different times start producing consistent results, and the team can stop dismissing CI failures as environment noise.
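A toy version of the idea - migrations and seed data as code, applied fresh for every run (the
schema and values are hypothetical):

```python
import sqlite3

MIGRATIONS = [
    "CREATE TABLE plans (code TEXT PRIMARY KEY, monthly_cents INTEGER)",
]
SEED = [
    "INSERT INTO plans VALUES ('basic', 900)",
    "INSERT INTO plans VALUES ('pro', 2900)",
]

def provision() -> sqlite3.Connection:
    # The starting state is defined entirely by code under version
    # control, so it is identical on every machine, every run.
    conn = sqlite3.connect(":memory:")
    for stmt in MIGRATIONS + SEED:
        conn.execute(stmt)
    return conn

def test_starts_from_known_seed_state():
    conn = provision()
    rows = conn.execute("SELECT code FROM plans ORDER BY code").fetchall()
    assert rows == [("basic",), ("pro",)]
```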
Read more: Snowflake Environments
How to narrow it down
- Do tests pass when run individually but fail when run together? Mutual interference from shared mutable state is the most likely cause. Start with Inverted Test Pyramid.
- Does the test suite pass on one machine but fail in CI? The CI environment's database differs from the developer's local one. Start with Snowflake Environments.
- Is there no documented strategy for setting up and tearing down test data? The team never established a data strategy. Start with Manual Testing Only.
Ready to fix this? The most common cause is Inverted Test Pyramid. Start with its How to Fix It section for week-by-week steps.
9 - Tests Randomly Pass or Fail
The pipeline fails, the developer reruns it without changing anything, and it passes.
What you are seeing
A developer pushes a change. The pipeline fails on a test they did not touch, in a module they
did not change. They click rerun. It passes. They merge. This happens multiple times a day across
the team. Nobody investigates failures on the first occurrence because the odds favor flakiness
over a real problem.
The team has adapted: retry-until-green is a routine step, not an exception. Some pipelines are
configured to automatically rerun failed tests. Tests are tagged as “known flaky” and skipped.
Real regressions hide behind the noise because the team has been trained to ignore failures.
Common causes
Inverted Test Pyramid
When the test suite is dominated by end-to-end tests, flakiness is structural. E2E tests depend
on network connectivity, shared test environments, external service availability, and browser
rendering timing. Any of these can produce a different result on each run. A suite built mostly
on E2E tests will always be flaky because it is built on non-deterministic foundations.
Replacing E2E tests with functional tests that use test doubles for external dependencies makes
the suite deterministic by design. The test produces the same result every time because it
controls all its inputs.
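A sketch of the substitution (the rates client is hypothetical): a production implementation
would call a live API over the network; the test double returns fixed answers, so the test
cannot flake:

```python
class StubRates:
    # Test double: fixed answers, no network, same result on every run.
    def __init__(self, rates: dict[tuple[str, str], float]) -> None:
        self._rates = rates

    def rate(self, base: str, quote: str) -> float:
        return self._rates[(base, quote)]

def convert(amount_cents: int, base: str, quote: str, client) -> int:
    # The function receives its rate source, so the test controls the input.
    return round(amount_cents * client.rate(base, quote))

def test_convert_is_deterministic():
    client = StubRates({("USD", "EUR"): 0.9})
    assert convert(10_000, "USD", "EUR", client) == 9_000
```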
Read more: Inverted Test Pyramid
Snowflake Environments
When the CI environment is configured differently from other environments - or drifts over time -
tests pass locally but fail in CI, or pass in CI on Tuesday but fail on Wednesday. The
inconsistency is not in the test or the code but in the environment the test runs in.
Tests that depend on specific environment configurations, installed packages, file system layout,
or network access are vulnerable to environment drift. Infrastructure-as-code eliminates this
class of flakiness by ensuring environments are identical and reproducible.
Read more: Snowflake Environments
Tightly Coupled Monolith
When components share mutable state - a database, a cache, a filesystem directory - tests that
run concurrently or in a specific order can interfere with each other. Test A writes to a shared
table. Test B reads from the same table and gets unexpected data. The tests pass individually
but fail together, or pass in one order but fail in another.
Without clear component boundaries, tests cannot be isolated. The flakiness is a symptom of
architectural coupling, not a testing problem.
Read more: Tightly Coupled Monolith
How to narrow it down
- Do the flaky tests hit real external services or shared environments? If yes, the tests
are non-deterministic by design. Start with
Inverted Test Pyramid and replace them with
functional tests using test doubles.
- Do tests pass locally but fail in CI, or vice versa? If yes, the environments differ.
Start with Snowflake Environments.
- Do tests pass individually but fail when run together, or fail in a different order? If
yes, tests share mutable state. Start with
Tightly Coupled Monolith for the
architectural root cause, and isolate test data as an immediate fix.
Ready to fix this? The most common cause is Inverted Test Pyramid. Start with its How to Fix It section for week-by-week steps.