These anti-patterns affect how teams build confidence that their code is safe to deploy. They create slow pipelines, flaky feedback, and manual gates that prevent the continuous flow of changes to production.
Testing
- 1: Manual Testing Only
- 2: Manual Regression Testing Gates
- 3: Testing Only at the End
- 4: Inverted Test Pyramid
- 5: Code Coverage Mandates
- 6: QA Signoff as a Release Gate
- 7: No Contract Testing Between Services
- 8: Rubber-Stamping AI-Generated Code
- 9: Manually Triggered Tests
1 - Manual Testing Only
Category: Testing & Quality | Quality Impact: Critical
What This Looks Like
The team deploys by manually verifying things work. Someone clicks through the application, checks a few screens, and declares it good. There is no test suite. No test runner configured. No test directory in the repository. The CI server, if one exists, builds the code and stops there.
When a developer asks “how do I know if my change broke something?” the answer is either “you don’t” or “someone from QA will check it.” Bugs discovered in production are treated as inevitable. Nobody connects the lack of automated tests to the frequency of production incidents because there is no baseline to compare against.
Common variations:
- Tests exist but are never run. Someone wrote tests a year ago. The test suite is broken and nobody has fixed it. The tests are checked into the repository but are not part of any pipeline or workflow.
- Manual test scripts as the safety net. A spreadsheet or wiki page lists hundreds of manual test cases. Before each release, someone walks through them by hand. The process takes days. It is the only verification the team has.
- Testing is someone else’s job. Developers write code. A separate QA team tests it days or weeks later. The feedback loop is so long that developers have moved on to other work by the time defects are found.
- “The code is too legacy to test.” The team has decided the codebase is untestable. Functions are thousands of lines long, everything depends on global state, and there are no seams where test doubles could be inserted. This belief becomes self-fulfilling - nobody tries because everyone agrees it is impossible.
The telltale sign: when a developer makes a change, the only way to verify it works is to deploy it and see what happens.
Why This Is a Problem
Without automated tests, every change is a leap of faith. The team has no fast, reliable way to know whether code works before it reaches users. Every downstream practice that depends on confidence in the code - continuous integration, automated deployment, frequent releases - is blocked.
It reduces quality
When there are no automated tests, defects are caught by humans or by users. Humans are slow, inconsistent, and unable to check everything. A manual tester cannot verify 500 behaviors in an hour, but an automated suite can. The behaviors that are not checked are the ones that break.
Developers writing code without tests have no feedback on whether their logic is correct until someone else exercises it. A function that handles an edge case incorrectly will not be caught until a user hits that edge case in production. By then, the developer has moved on and lost context on the code they wrote.
With even a basic suite of automated tests, developers get feedback in minutes. They catch their own mistakes while the code is fresh. The suite runs the same checks every time, never forgetting an edge case and never getting tired.
It increases rework
Without tests, rework comes from two directions. First, bugs that reach production must be investigated, diagnosed, and fixed - work that an automated test would have prevented. Second, developers are afraid to change existing code because they have no way to verify they have not broken something. This fear leads to workarounds: copy-pasting code instead of refactoring, adding conditional branches instead of restructuring, and building new modules alongside old ones instead of modifying what exists.
Over time, the codebase becomes a patchwork of workarounds layered on workarounds. Each change takes longer because the code is harder to understand and more fragile. The absence of tests is not just a testing problem - it is a design problem that compounds with every change.
Teams with automated tests refactor confidently. They rename functions, extract modules, and simplify logic knowing that the test suite will catch regressions. The codebase stays clean because changing it is safe.
It makes delivery timelines unpredictable
Without automated tests, the time between “code complete” and “deployed” is dominated by manual verification. How long that verification takes depends on how many changes are in the batch, how available the testers are, and how many defects they find. None of these variables are predictable.
A change that a developer finishes on Monday might not be verified until Thursday. If defects are found, the cycle restarts. Lead time from commit to production is measured in weeks, and the variance is enormous. Some changes take three days, others take three weeks, and the team cannot predict which.
Automated tests collapse the verification step to minutes. The time from “code complete” to “verified” becomes a constant, not a variable. Lead time becomes predictable because the largest source of variance has been removed.
Impact on continuous delivery
Automated tests are the foundation of continuous delivery. Without them, there is no automated quality gate. Without an automated quality gate, there is no safe way to deploy frequently. Without frequent deployment, there is no fast feedback from production. Every CD practice assumes that the team can verify code quality automatically. A team with no test automation is not on a slow path to CD - they have not started.
How to Fix It
Starting test automation on an untested codebase feels overwhelming. The key is to start small, establish the habit, and expand coverage incrementally. You do not need to test everything before you get value - you need to test something and keep going.
Step 1: Set up the test infrastructure
Before writing a single test, make it trivially easy to run tests:
- Choose a test framework for your primary language. Pick the most popular one - do not deliberate.
- Add the framework to the project. Configure it. Write a single test that asserts `true == true` and verify it passes.
- Add a `test` script or command to the project so that anyone can run the suite with a single command (e.g., `npm test`, `pytest`, `mvn test`).
- Add the test command to the CI pipeline so that tests run on every push.
The goal for week one is not coverage. It is infrastructure: a working test runner in the pipeline that the team can build on.
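In a Python project, for example, the week-one setup can be a single file. This is a minimal sketch; `pytest` and the `tests/` directory name are conventions, not requirements - substitute your stack's equivalents.

```python
# tests/test_sanity.py -- proves the test runner works end to end.
# Run locally and in CI with a single command: `pytest`

def test_sanity():
    # Deliberately trivial: the only goal is a green run in the pipeline.
    assert True
```

Once this passes in CI, the infrastructure exists and every later test is just another file.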
Step 2: Write tests for every new change
Establish a team rule: every new change must include at least one automated test. Not “every new feature” - every change. Bug fixes get a regression test that fails without the fix and passes with it. New functions get a test that verifies the core behavior. Refactoring gets a test that pins the existing behavior before changing it.
This rule is more important than retroactive coverage. New code enters the codebase tested. The tested portion grows with every commit. After a few months, the most actively changed code has coverage, which is exactly where coverage matters most.
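As a sketch of the bug-fix rule, consider a hypothetical `parse_price` function that crashed on empty input in production. The fix ships with a regression test that fails on the old code and passes on the new:

```python
# Hypothetical bug fix: parse_price("") used to raise ValueError in
# production. The fix returns None, and the regression test pins it.

def parse_price(text):
    """Parse a price string like '$19.99' into a float, or None if empty."""
    cleaned = text.strip().lstrip("$")
    if not cleaned:          # the fix: empty input is not an error
        return None
    return float(cleaned)

def test_empty_input_returns_none():
    # Fails on the pre-fix code (which raised), passes with the fix.
    assert parse_price("") is None

def test_parses_dollar_amount():
    # Pins the core behavior so the fix cannot silently break it.
    assert parse_price("$19.99") == 19.99
```

The bug can never silently return; the test documents both the defect and the intended behavior.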
Step 3: Target high-change areas for retroactive coverage (Weeks 3-6)
Use your version control history to find the files that change most often. These are the files where bugs are most likely and where tests provide the most value:
- List the 10 files with the most commits in the last six months.
- For each file, write tests for its core public behavior. Do not try to test every line - test the functions that other code depends on.
- If the code is hard to test because of tight coupling, wrap it. Create a thin adapter around the untestable code and test the adapter. This is the Strangler Fig pattern applied to testing.
Step 4: Make untestable code testable incrementally (Weeks 4-8)
If the codebase resists testing, introduce seams one at a time:
| Problem | Technique |
|---|---|
| Function does too many things | Extract the pure logic into a separate function and test that |
| Hard-coded database calls | Introduce a repository interface, inject it, test with a fake |
| Global state or singletons | Pass dependencies as parameters instead of accessing globals |
| No dependency injection | Start with “poor man’s DI” - default parameters that can be overridden in tests |
You do not need to refactor the entire codebase. Each time you touch a file, leave it slightly more testable than you found it.
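The last row of the table - "poor man's DI" - can be sketched with a default parameter. The function names here (`smtp_send`, `send_welcome_email`) are hypothetical; the pattern is what matters.

```python
# "Poor man's DI": the dependency defaults to the real implementation,
# and tests override it with a fake -- no framework required.

def smtp_send(address, body):
    # Stands in for a real network call; never invoked by the test.
    raise RuntimeError("would talk to a real SMTP server")

def send_welcome_email(user_email, send=smtp_send):
    body = f"Welcome, {user_email}!"
    send(user_email, body)
    return body

def test_welcome_email_body():
    sent = []
    # Override the default with a fake that just records the call.
    body = send_welcome_email("a@example.com",
                              send=lambda addr, b: sent.append((addr, b)))
    assert sent == [("a@example.com", "Welcome, a@example.com!")]
    assert body == "Welcome, a@example.com!"
```

Production code calls `send_welcome_email(email)` unchanged; only tests pass the extra argument.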
Step 5: Set a coverage floor and ratchet it up
Once you have meaningful coverage in actively changed code, set a coverage threshold in the pipeline:
- Measure current coverage. Say it is 15%.
- Set the pipeline to fail if coverage drops below 15%.
- Every two weeks, raise the floor by 2-5 percentage points.
The floor prevents backsliding. The ratchet ensures progress. The team does not need to hit 90% coverage - they need to ensure that coverage only goes up.
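In a Python project using coverage.py, for example, the floor can live in configuration. `fail_under` is a real coverage.py option; the 15% value and the `pyproject.toml` location are per-project choices.

```toml
# pyproject.toml -- the pipeline's coverage step fails if total
# coverage drops below the floor. Raise fail_under by 2-5 points
# every two weeks (the ratchet).
[tool.coverage.report]
fail_under = 15
```

Most coverage tools in other ecosystems (JaCoCo, Istanbul/nyc) support an equivalent threshold setting.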
| Objection | Response |
|---|---|
| “The codebase is too legacy to test” | You do not need to test the legacy code directly. Wrap it in testable adapters and test those. Every new change gets a test. Coverage grows from the edges inward. |
| “We don’t have time to write tests” | You are already spending that time on manual verification and production debugging. Tests shift that cost to the left where it is cheaper. Start with one test per change - the overhead is minutes, not hours. |
| “We need to test everything before it’s useful” | One test that catches one regression is more useful than zero tests. The value is immediate and cumulative. You do not need full coverage to start getting value. |
| “Developers don’t know how to write tests” | Pair a developer who has testing experience with one who does not. If nobody on the team has experience, invest one day in a testing workshop. The skill is learnable in a week. |
Measuring Progress
| Metric | What to look for |
|---|---|
| Test count | Should increase every sprint |
| Code coverage of actively changed files | More meaningful than overall coverage - focus on files changed in the last 30 days |
| Build duration | Should increase slightly as tests are added, but stay under 10 minutes |
| Defects found in production vs. in tests | Ratio should shift toward tests over time |
| Change fail rate | Should decrease as test coverage catches regressions before deployment |
| Manual testing effort per release | Should decrease as automated tests replace manual verification |
Team Discussion
Use these questions in a retrospective to explore how this anti-pattern affects your team:
- What percentage of our testing is automated today? How long would it take to run a full regression manually?
- Which parts of the system are we most afraid to change? Is that fear connected to missing test coverage?
- If we could automate one manual testing step this sprint, what would have the highest immediate impact?
Related Content
- Testing Fundamentals - How to build a test strategy for CD
- Build Automation - Tests need a pipeline to run in
- Inverted Test Pyramid - The next problem to solve once you have tests
- Manual Regression Testing Gates - The manual testing this replaces
- Deterministic Pipeline - Tests as automated quality gates
- Testing & Observability Gaps - Defect categories that survive without automated test coverage
2 - Manual Regression Testing Gates
Category: Testing & Quality | Quality Impact: Critical
What This Looks Like
Before every release, the team enters a testing phase. Testers open a spreadsheet or test management tool containing hundreds of scripted test cases. They walk through each one by hand: click this button, enter this value, verify this result. The testing takes days. Sometimes it takes weeks. Nothing ships until every case is marked pass or fail, and every failure is triaged.
Developers stop working on new features during this phase because testers need a stable build to test against. Code freezes go into effect. Bug fixes discovered during testing must be applied carefully to avoid invalidating tests that have already passed. The team enters a holding pattern where the only work that matters is getting through the test cases.
The testing effort grows with every release. New features add new test cases, but old test cases are rarely removed because nobody is confident they are redundant. A team that tested for three days six months ago now tests for five. The spreadsheet has 800 rows. Every release takes longer to validate than the last.
Common variations:
- The regression spreadsheet. A master spreadsheet of every test case the team has ever written. Before each release, a tester works through every row. The spreadsheet is the institutional memory of what the software is supposed to do, and nobody trusts anything else.
- The dedicated test phase. The sprint cadence is two weeks of development followed by one week of testing. The test week is a mini-waterfall phase embedded in an otherwise agile process. Nothing can ship until the test phase is complete.
- The test environment bottleneck. Manual testing requires a specific environment that is shared across teams. The team must wait for their slot. When the environment is broken by another team’s testing, everyone waits for it to be restored.
- The sign-off ceremony. A QA lead or manager must personally verify a subset of critical paths and sign a document before the release can proceed. If that person is on vacation, the release waits.
- The compliance-driven test cycle. Regulatory requirements are interpreted as requiring manual execution of every test case with documented evidence. Each test run produces screenshots and sign-off forms. The documentation takes as long as the testing itself.
The telltale sign: if the question “can we release today?” is always answered with “not until QA finishes,” manual regression testing is gating your delivery.
Why This Is a Problem
Manual regression testing feels responsible. It feels thorough. But it creates a bottleneck that grows worse with every feature the team builds, and the thoroughness it promises is an illusion.
It reduces quality
Manual testing is less reliable than it appears. A human executing the same test case for the hundredth time will miss things. Attention drifts. Steps get skipped. Edge cases that seemed important when the test was written get glossed over when the tester is on row 600 of a spreadsheet. Studies on manual testing consistently show that testers miss 15-30% of defects that are present in the software they are testing.
The test cases themselves decay. They were written for the version of the software that existed when the feature shipped. As the product evolves, some cases become irrelevant, others become incomplete, and nobody updates them systematically. The team is executing a test plan that partially describes software that no longer exists.
The feedback delay compounds the quality problem. A developer who wrote code two weeks ago gets a bug report from a tester during the regression cycle. The developer has lost context on the change. They re-read their own code, try to remember what they were thinking, and fix the bug with less confidence than they would have had the day they wrote it.
Automated tests catch the same classes of bugs in seconds, with perfect consistency, every time the code changes. They do not get tired on row 600. They do not skip steps. They run against the current version of the software, not a test plan written six months ago. And they give feedback immediately, while the developer still has full context.
It increases rework
The manual testing gate creates a batch-and-queue cycle. Developers write code for two weeks, then testers spend a week finding bugs in that code. Every bug found during the regression cycle is rework: the developer must stop what they are doing, reload the context of a completed story, diagnose the issue, fix it, and send it back to the tester for re-verification. The re-verification may invalidate other test cases, requiring additional re-testing.
The batch size amplifies the rework. When two weeks of changes are tested together, a bug could be in any of dozens of commits. Narrowing down the cause takes longer because there are more variables. Had the same bug been caught by an automated test minutes after it was introduced, the developer would have fixed it in the same sitting - one context switch instead of many.
The rework also affects testers. A bug fix during the regression cycle means the tester must re-run affected test cases. If the fix changes behavior elsewhere, the tester must re-run those cases too. A single bug fix can cascade into hours of re-testing, pushing the release date further out.
With automated regression tests, bugs are caught as they are introduced. The fix happens immediately. There is no regression cycle, no re-testing cascade, and no context-switching penalty.
It makes delivery timelines unpredictable
The regression testing phase takes as long as it takes. The team cannot predict how many bugs the testers will find, how long each fix will take, or how much re-testing the fixes will require. A release planned for Friday might slip to the following Wednesday. Or the following Friday.
This unpredictability cascades through the organization. Product managers cannot commit to delivery dates because they do not know how long testing will take. Stakeholders learn to pad their expectations. “We’ll release in two weeks” really means “we’ll release in two to four weeks, depending on what QA finds.”
The unpredictability also creates pressure to cut corners. When the release is already three days late, the team faces a choice: re-test thoroughly after a late bug fix, or ship without full re-testing. Under deadline pressure, most teams choose the latter. The manual testing gate that was supposed to ensure quality becomes the reason quality is compromised.
Automated regression suites produce predictable, repeatable results. The suite runs in the same amount of time every time. There is no testing phase to slip. The team knows within minutes of every commit whether the software is releasable.
It creates a permanent scaling problem
Manual testing effort scales linearly with application size. Every new feature adds test cases. The test suite never shrinks. A team that takes three days to test today will take four days in six months and five days in a year. The testing phase consumes an ever-growing fraction of the team’s capacity.
This scaling problem is invisible at first. Three days of testing feels manageable. But the growth is relentless. The team that started with 200 test cases now has 800. The test phase that was two days is now a week. And because the test cases were written by different people at different times, nobody can confidently remove any of them without risking a missed regression.
Automated tests scale differently. Adding a new automated test adds milliseconds to the suite duration, not hours to the testing phase. A team with 10,000 automated tests runs them in the same 10 minutes as a team with 1,000. The cost of confidence is fixed, not linear.
Impact on continuous delivery
Manual regression testing is fundamentally incompatible with continuous delivery. CD requires that any commit can be released at any time. A manual testing gate that takes days means the team can release at most once per testing cycle. If the gate takes a week, the team releases at most every two or three weeks - regardless of how fast their pipeline is or how small their changes are.
The manual gate also breaks the feedback loop that CD depends on. CD gives developers confidence that their change works by running automated checks within minutes. A manual gate replaces that fast feedback with a slow, batched, human process that cannot keep up with the pace of development.
You cannot have continuous delivery with a manual regression gate. The two are mutually exclusive. The gate must be automated before CD is possible.
How to Fix It
Step 1: Catalog your manual test cases and categorize them
Before automating anything, understand what the manual test suite actually covers. For every test case in the regression suite:
- Identify what behavior it verifies.
- Classify it: is it testing business logic, a UI flow, an integration boundary, or a compliance requirement?
- Rate its value: has this test ever caught a real bug? When was the last time?
- Rate its automation potential: can this be tested at a lower level (unit, functional, API)?
Most teams discover that a large percentage of their manual test cases are either redundant (the same behavior is tested multiple times), outdated (the feature has changed), or automatable at a lower level.
Step 2: Automate the highest-value cases first (Weeks 2-4)
Pick the 20 test cases that cover the most critical paths - the ones that would cause the most damage if they regressed. Automate them:
- Business logic tests become unit tests.
- API behavior tests become functional tests.
- Critical user journeys become a small set of E2E smoke tests.
Do not try to automate everything at once. Start with the cases that give the most confidence per minute of execution time. The goal is to build a fast automated suite that covers the riskiest scenarios so the team no longer depends on manual execution for those paths.
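As a sketch, a hypothetical row from the regression spreadsheet - "create an order over $100, verify shipping is free" - moves down the stack to a unit test on the pricing logic. The function and threshold here are illustrative, not from the source.

```python
# Manual test case "orders over $100 ship free" becomes a unit test
# on the pricing rule -- milliseconds per run instead of minutes of
# clicking through the checkout flow.

def shipping_cost(order_total):
    FREE_SHIPPING_THRESHOLD = 100.00
    return 0.0 if order_total > FREE_SHIPPING_THRESHOLD else 7.95

def test_orders_over_threshold_ship_free():
    assert shipping_cost(100.01) == 0.0

def test_orders_at_or_below_threshold_pay_shipping():
    # Boundary case the spreadsheet never specified precisely.
    assert shipping_cost(100.00) == 7.95
    assert shipping_cost(12.50) == 7.95
```

Note that the automated version forces a decision about the boundary (`> 100` vs. `>= 100`) that a manual script can leave ambiguous.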
Step 3: Run automated tests in the pipeline on every commit
Move the new automated tests into the CI pipeline so they run on every push. This is the critical shift: testing moves from a phase at the end of development to a continuous activity that happens with every change.
Every commit now gets immediate feedback on the critical paths. If a regression is introduced, the developer knows within minutes - not weeks.
Step 4: Shrink the manual suite as automation grows (Weeks 4-8)
Each week, pick another batch of manual test cases and either automate or retire them:
- Automate cases where the behavior is stable and testable at a lower level.
- Retire cases that are redundant with existing automated tests or that test behavior that no longer exists.
- Keep manual only for genuinely exploratory testing that requires human judgment - usability evaluation, visual design review, or complex workflows that resist automation.
Track the shrinkage. If the manual suite had 800 cases and now has 400, that is progress. If the manual testing phase took five days and now takes two, that is measurable improvement.
Step 5: Replace the testing phase with continuous testing (Weeks 6-8+)
The goal is to eliminate the dedicated testing phase entirely:
| Before | After |
|---|---|
| Code freeze before testing | No code freeze - trunk is always testable |
| Testers execute scripted cases | Automated suite runs on every commit |
| Bugs found days or weeks after coding | Bugs found minutes after coding |
| Testing phase blocks release | Release readiness checked automatically |
| QA sign-off required | Pipeline pass is the sign-off |
| Testers do manual regression | Testers do exploratory testing, write automated tests, and improve test infrastructure |
Step 6: Address the objections (Ongoing)
| Objection | Response |
|---|---|
| “Automated tests can’t catch everything a human can” | Correct. But humans cannot execute 800 test cases reliably in a day, and automated tests can. Automate the repeatable checks and free humans for the exploratory testing where their judgment adds value. |
| “We need manual testing for compliance” | Most compliance frameworks require evidence that testing was performed, not that humans performed it. Automated test reports with pass/fail results, timestamps, and traceability to requirements satisfy most audit requirements better than manual spreadsheets. Confirm with your compliance team. |
| “Our testers don’t know how to write automated tests” | Pair testers with developers. The tester contributes domain knowledge - what to test and why - while the developer contributes automation skills. Over time, the tester learns automation and the developer learns testing strategy. |
| “We can’t automate tests for our legacy system” | Start with new code. Every new feature gets automated tests. For legacy code, automate the most critical paths first and expand coverage as you touch each area. The legacy system does not need 100% automation overnight. |
| “What if we automate a test wrong and miss a real bug?” | Manual tests miss real bugs too - consistently. An automated test that is wrong can be fixed once and stays fixed. A manual tester who skips a step makes the same mistake next time. Automation is not perfect, but it is more reliable and more improvable than manual execution. |
Measuring Progress
| Metric | What to look for |
|---|---|
| Manual test case count | Should decrease steadily as cases are automated or retired |
| Manual testing phase duration | Should shrink toward zero |
| Automated test count in pipeline | Should grow as manual cases are converted |
| Release frequency | Should increase as the manual gate shrinks |
| Development cycle time | Should decrease as the testing phase is eliminated |
| Time from code complete to release | Should converge toward pipeline duration, not testing phase duration |
Related Content
- Testing Fundamentals - The test architecture that replaces manual regression suites
- Deterministic Pipeline - Automated tests in the pipeline replace manual gates
- Inverted Test Pyramid - Manual regression testing often coexists with an inverted pyramid
- Build Automation - The pipeline infrastructure needed to run tests on every commit
- Value Stream Mapping - Reveals how much time the manual testing phase adds to lead time
3 - Testing Only at the End
Category: Testing & Quality | Quality Impact: High
What This Looks Like
The team works in two-week sprints. Development happens in the first week and a half. The last few days are “QA time,” when testers receive the completed work and begin exercising it. Bugs found during QA must either be fixed quickly before the deadline or pushed to the next sprint. Bugs found after the sprint closes are treated as defects and added to a bug backlog. The bug backlog grows faster than the team can clear it.
Developers consider a task “done” when their code review is merged. Testers receive the work without having been involved in defining what “tested” means. They write test cases after the fact based on the specification - if one exists - and their own judgment about what matters. The developers are already working on the next sprint by the time bugs are reported. Context has decayed. A bug found two weeks after the code was written is harder to diagnose than the same bug found two hours after.
Common variations:
- The sequential handoff. Development completes all features. Work is handed to QA. QA returns a bug list. Development fixes the bugs. Work is handed back to QA for regression testing. This cycle repeats until QA signs off. The release date is determined by how many cycles occur.
- The last-mile test environment. A test environment is only provisioned for the QA phase. Developers have no environment that resembles production and cannot test their own work in realistic conditions. All realistic testing happens at the end.
- The sprint-end test blitz. Testers are not idle during the sprint - they are catching up on testing from two sprints ago while development works on the current sprint. The lag means bugs from last sprint are still being found when the sprint they caused has been closed for two weeks.
- The separate QA team. A dedicated QA team sits organizationally separate from development. They are not in sprint planning, not in design discussions, and not consulted until code exists. Their role is validation, not quality engineering.
The telltale sign: developers and testers work on the same sprint but testers are always testing work from a previous sprint. The team is running two development cycles in parallel, offset by one iteration.
Why This Is a Problem
Testing at the end of development is a legacy of the waterfall model, where phases were sequential by design. In that model, verification was deferred to a dedicated phase after construction, on the assumption that a structured late-stage check was the most efficient way to find defects. Agile and CD have changed those assumptions. Rework cost is lowest when defects are caught immediately, which requires testing to happen throughout development.
It reduces quality
Bugs caught late are more expensive to fix for two reasons. First, context decay: the developer who wrote the code is no longer in that code. They are working on something new. When a bug report arrives two weeks after the code was written, they must reconstruct their understanding of the code before they can understand the bug. This reconstruction is slow and error-prone.
Second, cascade effects: code written after the buggy code may depend on the bug. A calculation that produces incorrect results might be consumed by downstream logic that was written assuming the incorrect result was correct. Fixing the original bug now requires fixing everything downstream too. The further the bug travels through the codebase before being caught, the more code depends on the incorrect behavior.
When testing happens throughout development - when the developer writes a test before or alongside the code - the bug is caught in seconds or minutes. The developer has full context. The fix is immediate. Nothing downstream has been built on the incorrect behavior yet.
It increases rework
End-of-sprint testing consistently produces a volume of bugs that exceeds the team’s capacity to fix them before the deadline. The backlog of unfixed bugs grows. Teams routinely carry a bug backlog of dozens or hundreds of issues. Each issue in that backlog represents work that was done, found to be wrong, and not yet corrected - work in progress that is neither done nor abandoned.
The rework is compounded by the handoff model itself. A tester writes a bug report. A developer reads it, interprets it, fixes it, and marks it resolved. The tester verifies the fix. If the fix is wrong, another cycle begins. Each cycle includes the overhead of the handoff: context switching, communication delays, and the cost of re-familiarizing with the problem. A bug that a developer could fix in 10 minutes if caught during development might take two hours across multiple handoff cycles.
When developers and testers collaborate during development - discussing acceptance criteria before coding, running tests as code is written - the handoff cycle does not exist. Problems are found and fixed in a single context by people who both understand the problem.
It makes delivery timelines unpredictable
The duration of an end-of-development testing phase is proportional to the number of bugs found, which is not knowable in advance. Teams plan for a fixed QA window - say, three days - but if testing finds 20 critical bugs, the window stretches to two weeks. The release date, which was based on the planned QA window, is now wrong.
This unpredictability affects every stakeholder. Product managers cannot commit to delivery dates because QA is a variable they cannot control. Developers cannot start new work cleanly because they may be pulled back to fix bugs from the previous sprint. Testers are under pressure to move faster, which leads to shallower testing and more bugs escaping to production.
The further from development that testing occurs, the more the feedback cycle looks like a batch process: large batches of work go in one end, a variable quantity of bugs come out the other end, and the time to process the batch is unpredictable.
It creates organizational dysfunction
When testing is a separate downstream phase, the relationship between developers and testers becomes adversarial by structure. Developers want to minimize the bug count that reaches QA. Testers want to find every bug. Both objectives are reasonable, but the structure sets them in opposition: developers feel reviewed and found wanting, and testers feel their work is treated as an obstacle to release. Testers who could catch a bug in the design conversation instead spend their time writing bug reports two weeks after the code shipped, then defending their findings to developers who have already moved on. The structure wastes everyone's time.
This dysfunction persists even when individual developers and testers have good working relationships. The structure rewards developers for code that passes QA and testers for finding bugs, not for shared ownership of quality outcomes. Testers are not consulted on design decisions where their perspective could prevent bugs from being written in the first place.
Impact on continuous delivery
CD requires automated testing throughout the pipeline. A team that relies on a manual, end-of-development QA phase cannot automate it into the pipeline. The pipeline runs, but the human testing phase sits outside it. The pipeline provides only partial safety. Deployment frequency is limited to the frequency of QA cycles, not the frequency of pipeline runs.
Moving to CD requires shifting the testing model fundamentally. Testing must happen at every stage: as code is written (unit tests), as it is integrated (integration tests run in CI), and as it is promoted toward production (acceptance tests in the pipeline). The QA function shifts from end-stage bug finding to quality engineering: designing test strategies, building automation, and ensuring coverage throughout the pipeline. That shift cannot happen incrementally within the existing end-of-development model - it requires changing what testing means.
How to Fix It
Shifting testing earlier is as much a cultural and organizational change as a technical one. The goal is shared ownership of quality between developers and testers, with testing happening continuously throughout the development process.
Step 1: Involve testers in story definition
The first shift is the earliest in the process: bring testers into the conversation before development begins.
- In the next sprint planning, include a tester in story refinement.
- For each story, agree on acceptance criteria and the test cases that will verify them before coding starts.
- The developer and tester agree: “when these tests pass, this story is done.”
This single change improves quality in two ways. Testers catch ambiguities and edge cases during definition, before the code is written. And developers have a clear, testable definition of done that does not depend on the tester’s interpretation after the fact.
Step 2: Write automated tests alongside the code (Weeks 2-3)
For each story, require that automated tests be written as part of the development work.
- The developer writes the unit tests as the code is written.
- The tester authors or contributes acceptance test scripts during the sprint, not after.
- Both sets of tests run in CI on every commit. A failing test is a blocking issue.
The tests do not replace the tester’s judgment - they capture the acceptance criteria as executable specifications. The tester’s role shifts from manual execution to test strategy and exploratory testing for behaviors not covered by the automated suite.
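The "executable specification" idea can be sketched concretely. Everything below is hypothetical - `apply_discount` is a toy function invented for illustration, not code from any real system - but it shows acceptance criteria agreed in planning becoming tests that gate the story:

```python
# Hypothetical example: a story's acceptance criteria written as executable tests.
# `apply_discount` is a toy implementation included only so the sketch runs.

def apply_discount(total, code):
    if code == "SAVE10":
        return round(total * 0.9, 2)
    raise ValueError("unknown discount code")

# Agreed before coding: "SAVE10 takes 10% off the order total."
def test_save10_reduces_total_by_ten_percent():
    assert apply_discount(100.0, "SAVE10") == 90.0

# Agreed before coding: "an unknown code is rejected, not silently ignored."
def test_unknown_code_is_rejected():
    try:
        apply_discount(100.0, "BOGUS")
        assert False, "expected the code to be rejected"
    except ValueError:
        pass

test_save10_reduces_total_by_ten_percent()
test_unknown_code_is_rejected()
```

When both tests pass in CI, the story meets its agreed definition of done - no after-the-fact interpretation required.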
Step 3: Give developers a production-like environment for self-testing (Weeks 2-4)
If developers test only on their local machines and testers test on a shared environment, the testing conditions diverge. Bugs that appear only in integrated environments surface during QA, not during development.
- Provision a personal or pull-request-level environment for each developer. Infrastructure as code makes this feasible at low cost.
- Developers must verify their changes in a production-like environment before marking a story ready for review.
- The shared QA environment shifts from being where testing happens to hosting additional integration testing - it is no longer the first environment in which the code is verified.
Step 4: Define a “definition of done” that includes tests
If the team’s definition of done allows a story to be marked complete without passing automated tests, the incentive to write tests is weak. Change the definition.
- A story is not done unless it has automated acceptance tests that pass in CI.
- A story is not done unless the developer has tested it in a production-like environment.
- A story is not done unless the tester has reviewed the test coverage and agreed it is sufficient.
This makes quality a shared gate, not a downstream handoff.
Step 5: Shift the QA function toward quality engineering (Weeks 4-8)
As automated testing takes over the verification function that manual QA was performing, the tester’s role evolves. This transition requires explicit support and re-skilling.
- Identify what currently takes the most tester time. If it is manual regression testing, that is the automation target.
- Work with testers to automate the highest-value regression tests first.
- Redirect freed tester capacity toward exploratory testing, test strategy, and pipeline quality engineering.
Testers who build automation for the pipeline provide more value than testers who manually execute scripts. They also find more bugs, because they work earlier in the process when bugs are cheaper to fix.
Step 6: Measure bug escape rate and shift the metric forward (Ongoing)
Teams that test only at the end measure quality by the number of bugs found in QA. That metric rewards QA effort, not quality outcomes. Change what is measured.
- Track where bugs are found: in development, in CI, in code review, in QA, in production.
- The goal is to shift discovery leftward. More bugs found in development is good. Fewer bugs found in QA is good. Zero bugs in production is the target.
- Review the distribution in retrospectives. When a bug reaches QA, ask: why was this not caught earlier? What test would have caught it?
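The tracking itself can be a simple tally. A minimal sketch, with invented stage names and bug records standing in for whatever your issue tracker records:

```python
from collections import Counter

# Hypothetical bug records: where each defect was discovered.
bugs = [
    {"id": 1, "found_in": "development"},
    {"id": 2, "found_in": "ci"},
    {"id": 3, "found_in": "qa"},
    {"id": 4, "found_in": "qa"},
    {"id": 5, "found_in": "production"},
]

# Stages in pipeline order - "leftward" means earlier in this list.
STAGES = ["development", "ci", "code_review", "qa", "production"]

def discovery_distribution(bugs):
    """Percentage of bugs found at each stage, in pipeline order."""
    counts = Counter(b["found_in"] for b in bugs)
    total = len(bugs)
    return {stage: round(100 * counts[stage] / total, 1) for stage in STAGES}

print(discovery_distribution(bugs))
```

Reviewed sprint over sprint, the distribution shows whether discovery is actually shifting left or just being discussed.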
| Objection | Response |
|---|---|
| “Testers are expensive - we can’t have them involved in every story” | Testers involved in definition prevent bugs from being written. A tester’s hour in planning prevents five developer hours of bug fix and retest cycle. The cost of early involvement is far lower than the cost of late discovery. |
| “Developers are not good at testing their own work” | That is true for exploratory testing of complete features. It is not true for unit tests of code they just wrote. The fix is not to separate testing from development - it is to build a test discipline that covers both developer-written tests and tester-written acceptance scenarios. |
| “We would need to slow down to write tests” | Teams that write tests as they go are faster overall. The time spent on tests is recovered in reduced debugging, reduced rework, and faster diagnosis when things break. The first sprint with tests is slower. The tenth sprint is faster. |
| “Our testers do not know how to write automation” | Automation is a skill that is learnable. Start with the testers contributing acceptance criteria in plain language and developers automating them. Grow tester automation skills over time. |
Measuring Progress
| Metric | What to look for |
|---|---|
| Bug discovery distribution | Should shift earlier - more bugs found in development and CI, fewer in QA and production |
| Development cycle time | Should decrease as rework from late-discovered bugs is reduced |
| Change fail rate | Should decrease as automated tests catch regressions before deployment |
| Automated test count in CI | Should increase as tests are written alongside code |
| Bug backlog size | Should decrease or stop growing as fewer bugs escape development |
| Mean time to repair | Should decrease as bugs are caught closer to when the code was written |
Related Content
- Testing Fundamentals - Building the automated test suite that supports continuous testing
- QA Signoff as a Release Gate - The downstream consequence of end-of-development testing
- Manual Testing Only - The broader pattern of which this is a subset
- Work Decomposition - Smaller stories make continuous testing more practical
- Metrics-Driven Improvement - Using bug discovery distribution to guide improvement
4 - Inverted Test Pyramid
Category: Testing & Quality | Quality Impact: High
What This Looks Like
The team has tests, but the wrong kind. Running the full suite takes 30 minutes or more. Tests fail randomly. Developers rerun the pipeline and hope for green. When a test fails, the first question is “is that a real failure or a flaky test?” rather than “what did I break?”
Common variations:
- The ice cream cone. Most testing is manual. Below that, a large suite of end-to-end browser tests. A handful of integration tests. Almost no unit tests. The manual testing takes days, the E2E suite takes hours, and nothing runs fast enough to give developers feedback while they code.
- The E2E-first approach. The team believes end-to-end tests are “real” tests because they test the “whole system.” Unit tests are dismissed as “not testing anything useful” because they use mocks. The result is a suite of 500 Selenium tests that take 45 minutes and fail 10% of the time.
- The integration test swamp. Every test boots a real database, calls real services, and depends on shared test environments. Tests are slow because they set up and tear down heavy infrastructure. They are flaky because they depend on network availability and shared mutable state.
- The UI test obsession. The team writes tests exclusively through the UI layer. Business logic that could be verified in milliseconds with a unit test is instead tested through a full browser automation flow that takes seconds per assertion.
- The “we have coverage” illusion. Code coverage is high because the E2E tests exercise most code paths. But the tests are so slow and brittle that developers do not run them locally. They push code and wait 40 minutes to learn if it works. If a test fails, they assume it is flaky and rerun.
The telltale sign: developers do not trust the test suite. They push code and go get coffee. When tests fail, they rerun before investigating. When a test is red for days, nobody is alarmed.
Why This Is a Problem
An inverted test pyramid does not just slow the team down. It actively undermines every benefit that testing is supposed to provide.
The suite is too slow to give useful feedback
The purpose of a test suite is to tell developers whether their change works - fast enough that they can act on the feedback while they still have context. A suite that runs in seconds gives feedback during development. A suite that runs in minutes gives feedback before the developer moves on. A suite that runs in 30 or more minutes gives feedback after the developer has started something else entirely.
When the suite takes 40 minutes, developers do not run it locally. They push to CI and context-switch to a different task. When the result comes back, they have lost the mental model of the code they changed. Investigating a failure takes longer because they have to re-read their own code. Fixing the failure takes longer because they are now juggling two streams of work.
A well-structured suite - built on functional tests with test doubles and unit tests for complex logic - runs in under 10 minutes. Developers run it locally before pushing. Failures are caught while the code is still fresh. The feedback loop is tight enough to support continuous integration.
Flaky tests destroy trust
End-to-end tests are inherently non-deterministic. They depend on network connectivity, shared test environments, external service availability, browser rendering timing, and dozens of other factors outside the developer’s control. A test that fails because a third-party API was slow for 200 milliseconds looks identical to a test that fails because the code is wrong.
When 10% of the suite fails randomly on any given run, developers learn to ignore failures. They rerun the pipeline, and if it passes the second time, they assume the first failure was noise. This behavior is rational given the incentives, but it is catastrophic for quality. Real failures hide behind the noise. A test that detects a genuine regression gets rerun and ignored alongside the flaky tests.
Unit tests and functional tests with test doubles are deterministic. They produce the same result every time. When a deterministic test fails, the developer knows with certainty that they broke something. There is no rerun. There is no “is that real?” The failure demands investigation.
Maintenance cost grows faster than value
End-to-end tests are expensive to write and expensive to maintain. A single E2E test typically involves:
- Setting up test data across multiple services
- Navigating through UI flows with waits and retries
- Asserting on UI elements that change with every redesign
- Handling timeouts, race conditions, and flaky selectors
When a feature changes, every E2E test that touches that feature must be updated. A redesign of the checkout page breaks 30 E2E tests even if the underlying behavior has not changed. The team spends more time maintaining E2E tests than writing new features.
Functional tests and unit tests are cheap to write and cheap to maintain. They test behavior from the actor’s perspective, not UI layout or browser flows. A functional test that verifies a discount is applied correctly does not care whether the button is blue or green. When the discount logic changes, a handful of focused tests need updating - not thirty browser flows.
It couples your pipeline to external systems
When most of your tests are end-to-end or integration tests that hit real services, your ability to deploy depends on every system in the chain being available and healthy. If the payment provider’s sandbox is down, your pipeline fails. If the shared staging database is slow, your tests time out. If another team deployed a breaking change to a shared service, your tests fail even though your code is correct.
This is the opposite of what CD requires. Continuous delivery demands that your team can deploy independently, at any time, regardless of the state of external systems. A test architecture built on E2E tests makes your deployment hostage to every dependency in your ecosystem.
A suite built on unit tests, functional tests, and contract tests runs entirely within your control. External dependencies are replaced with test doubles that are validated by contract tests. Your pipeline can tell you “this change is safe to deploy” even if every external system is offline.
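The contract-test half of that arrangement can be sketched in a few lines. The check validates the *shape* of a live response against what the test double assumes, not specific data values. `EXPECTED_FIELDS` and `check_contract` are illustrative names, not a real library's API:

```python
# Hypothetical contract check: does the live service's response still match
# the shape our test double assumes? Format, not specific data.
EXPECTED_FIELDS = {"price": float, "currency": str}

def check_contract(response: dict) -> list:
    """Return a list of contract violations (empty means the double is still valid)."""
    problems = []
    for field, ftype in EXPECTED_FIELDS.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], ftype):
            problems.append(f"wrong type for {field}: {type(response[field]).__name__}")
    return problems

# A conforming response passes regardless of the actual values.
assert check_contract({"price": 9.99, "currency": "USD"}) == []
# A drifted contract is reported, triggering investigation - not a blocked build.
assert check_contract({"price": "9.99"}) == [
    "wrong type for price: str",
    "missing field: currency",
]
```

In practice a consumer-driven contract tool (Pact is a common choice) plays this role; the point is that the check runs asynchronously against the real service while the gating suite stays on localhost.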
Impact on continuous delivery
The inverted pyramid makes CD impossible in practice even if all the other pieces are in place. The pipeline takes too long to support frequent integration. Flaky failures erode trust in the automated quality gates. Developers bypass the tests or batch up changes to avoid the wait. The team gravitates toward manual verification before deploying because they do not trust the automated suite.
A team that deploys weekly with a 40-minute flaky suite cannot deploy daily without either fixing the test architecture or abandoning automated quality gates. Neither option is acceptable. Fixing the architecture is the only sustainable path.
How to Fix It
The goal is a test suite that is fast, gives you confidence, and costs less to maintain than the value it provides. The target architecture looks like this:
| Test type | Role | Runs in pipeline? | Uses real external services? |
|---|---|---|---|
| Unit | Verify high-complexity logic - business rules, calculations, edge cases | Yes, gates the build | No |
| Functional | Verify component behavior from the actor’s perspective with test doubles for external dependencies | Yes, gates the build | No (localhost only) |
| Contract | Validate that test doubles still match live external services | Asynchronously, does not gate | Yes |
| E2E | Smoke-test critical business paths in a fully integrated environment | Post-deploy verification only | Yes |
Functional tests are the workhorse. They test what the system does for its actors - a user interacting with a UI, a service consuming an API - without coupling to internal implementation or external infrastructure. They are fast because they avoid real I/O. They are deterministic because they use test doubles for anything outside the component boundary. They survive refactoring because they assert on outcomes, not method calls.
Unit tests complement functional tests for code with high cyclomatic complexity where you need to exercise many permutations quickly - branching business rules, validation logic, calculations with boundary conditions. Do not write unit tests for trivial code just to increase coverage.
E2E tests exist only for the small number of critical paths that genuinely require a fully integrated environment to validate. A typical application needs fewer than a dozen.
Step 1: Audit and stabilize
Map your current test distribution. Count tests by type, measure total duration, and identify every test that requires a real external service or produces intermittent failures.
Quarantine every flaky test immediately - move it out of the pipeline-gating suite. For each one, decide: fix it if the flakiness has a solvable cause, replace it with a deterministic functional test, or delete it if the behavior is already covered elsewhere. Flaky tests erode confidence and train developers to ignore failures. Target zero flaky tests in the gating suite by end of week.
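The flakiness half of the audit can be mechanized: run the same commit twice and flag any test whose result changed with no code change. A toy sketch with invented test names and results:

```python
# Hypothetical results from two runs of the suite against the SAME commit.
run_1 = {"test_checkout": "fail", "test_login": "pass", "test_search": "fail"}
run_2 = {"test_checkout": "pass", "test_login": "pass", "test_search": "fail"}

def flaky_tests(first, second):
    """Tests whose result changed between identical runs - non-deterministic."""
    return sorted(name for name in first if first[name] != second[name])

print(flaky_tests(run_1, run_2))  # test_search failed both times: a real failure, not flaky
```

Anything this surfaces goes to quarantine; a test that fails consistently (like `test_search` here) is a genuine failure and stays in the gate.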
Step 2: Build functional tests for your highest-risk components (Weeks 2-4)
Pick the components with the highest defect rate or the most E2E test coverage. For each one:
- Identify the actors - who or what interacts with this component?
- Write functional tests from the actor’s perspective. A user submitting a form, a service calling an API endpoint, a consumer reading from a queue. Test through the component’s public interface.
- Replace external dependencies with test doubles. Use in-memory databases or testcontainers for data stores, HTTP stubs (WireMock, nock, MSW) for external APIs, and fakes or spies for message queues. Prefer running a dependency locally over mocking it entirely - don’t poke more holes in reality than you need to stay deterministic.
- Add contract tests to validate that your test doubles still match the real services. Contract tests verify format, not specific data. Run them asynchronously - they should not block the build, but failures should trigger investigation.
As functional tests come online, remove the E2E tests that covered the same behavior. Each replacement makes the suite faster and more reliable.
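A minimal sketch of what such a functional test looks like, using only the Python standard library. `fetch_price` and the pricing stub are hypothetical stand-ins for a component and the external API it depends on - in practice you would use a purpose-built stub tool like those named above:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical component under test: fetches a price from an external service.
def fetch_price(base_url, sku):
    with urlopen(f"{base_url}/price/{sku}") as resp:
        return json.load(resp)["price"]

# Test double: a localhost stub standing in for the real pricing service.
class PricingStub(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"price": 42.0}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass

def test_fetch_price_against_stub():
    server = HTTPServer(("127.0.0.1", 0), PricingStub)  # ephemeral port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        base = f"http://127.0.0.1:{server.server_port}"
        assert fetch_price(base, "SKU-1") == 42.0
    finally:
        server.shutdown()

test_fetch_price_against_stub()
```

The test exercises real HTTP behavior - serialization, status handling, the component's public interface - while remaining deterministic and entirely on localhost.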
Step 3: Add unit tests where complexity demands them (Weeks 2-4)
While building out functional tests, identify the high-complexity logic within each component - discount calculations, eligibility rules, parsing, validation. Write unit tests for these using TDD: failing test first, implementation, then refactor.
Test public APIs, not private methods. If a refactoring that preserves behavior breaks your unit tests, the tests are coupled to implementation details. Move that coverage up to a functional test.
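For instance, a branching rule like the ones described above can be exercised at its boundaries in milliseconds. `shipping_cost` is an invented example, not from any real codebase:

```python
# Hypothetical branching rule - the kind of logic unit tests exist for.
def shipping_cost(total, is_member):
    if total >= 50 or is_member:
        return 0.0
    return 5.99

# Boundary-value unit tests hitting each branch through the public API.
assert shipping_cost(50.0, False) == 0.0    # exactly at the threshold
assert shipping_cost(49.99, False) == 5.99  # just below the threshold
assert shipping_cost(10.0, True) == 0.0     # membership overrides the total
```

Renaming internals or restructuring the implementation leaves these tests green; only a behavior change breaks them.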
Step 4: Reduce E2E to critical-path smoke tests (Weeks 4-6)
With functional tests covering component behavior, most E2E tests are now redundant. For each remaining E2E test, ask: “Does this test a scenario that functional tests with test doubles already cover?” If yes, remove it.
Keep E2E tests only for the critical business paths that require a fully integrated environment - paths where the interaction between independently deployed systems is the thing you need to verify. Horizontal E2E tests that span multiple teams should never block the pipeline due to their failure surface area. Move surviving E2E tests to a post-deploy verification suite.
Step 5: Set the standard for new code (Ongoing)
Every change gets tests. Establish the team norm for what kind:
- Functional tests are the default. Every new feature, endpoint, or workflow gets tests from the actor’s perspective, with test doubles for external dependencies.
- Unit tests are for complex logic. Business rules with many branches, calculations with edge cases, parsing and validation.
- E2E tests are rare. Added only for new critical business paths where functional tests cannot provide equivalent confidence.
- Bug fixes get a regression test at the level that catches the defect most directly.
Test code is a first-class citizen that requires as much design and maintenance as production code. Duplication in tests is acceptable - tests should be readable and independent, not DRY at the expense of clarity.
Address the objections
| Objection | Response |
|---|---|
| “Functional tests with test doubles don’t test anything real” | They test real behavior from the actor’s perspective. A functional test verifies the logic of order submission and that the component handles each possible response correctly - success, validation failure, timeout - without waiting on a live service. Contract tests running asynchronously validate that your test doubles still match the real service contracts. |
| “E2E tests catch bugs that other tests miss” | A small number of critical-path E2E tests catch bugs that cross system boundaries. But hundreds of E2E tests do not catch proportionally more - they add flakiness and wait time. Most integration bugs are caught by functional tests with well-maintained test doubles validated by contract tests. |
| “We can’t delete E2E tests - they’re our safety net” | A flaky safety net gives false confidence. Replace E2E tests with deterministic functional tests that catch bugs reliably, then keep a small E2E smoke suite for post-deploy verification of critical paths. |
| “Our code is too tightly coupled to test at the component level” | That is an architecture problem. Start by writing functional tests for new code and refactoring existing code as you touch it. Use the Strangler Fig pattern to wrap untestable code in a testable layer. |
| “We don’t have time to redesign the test suite” | You are already paying the cost in slow feedback, flaky builds, and manual verification. The fix is incremental: replace one E2E test with a functional test each day. After a month, the suite is measurably faster and more reliable. |
Measuring Progress
| Metric | What to look for |
|---|---|
| Test suite duration | Should decrease toward under 10 minutes |
| Flaky test count in gating suite | Should reach and stay at zero |
| Functional test coverage of key components | Should increase as E2E tests are replaced |
| E2E test count | Should decrease to a small set of critical-path smoke tests |
| Pipeline pass rate | Should increase as non-deterministic tests are removed from the gate |
| Developers running tests locally | Should increase as the suite gets faster |
| External dependencies in gating tests | Should reach zero (localhost only) |
Team Discussion
Use these questions in a retrospective to explore how this anti-pattern affects your team:
- When a new regression is caught in production, what type of test would have caught it earlier - unit, integration, or end-to-end?
- How long does our end-to-end test suite take to run? Would we be able to run it on every commit?
- If we could only write one new test today, what is the riskiest untested behavior we would cover?
Related Content
- Testing Fundamentals - The test architecture guide for CD pipelines
- Unit Tests - Writing fast, deterministic tests for logic
- Functional Tests - Testing your system in isolation with test doubles
- Contract Tests - Verifying that test doubles match reality
- Test Doubles - Techniques for replacing external dependencies in tests
- End-to-End Tests - When and how to use E2E tests appropriately
- Testing & Observability Gaps - The defect categories this anti-pattern fails to catch
5 - Code Coverage Mandates
Category: Testing & Quality | Quality Impact: Medium
What This Looks Like
The organization sets a coverage target - 80%, 90%, sometimes 100% - and gates the pipeline on it. Teams scramble to meet the number. The dashboard turns green. Leadership points to the metric as evidence that quality is improving. But production defect rates do not change.
Common variations:
- The assertion-free test. Developers write tests that call functions and catch no exceptions but never assert on the return value. The coverage tool records the lines as covered. The test verifies nothing.
- The getter/setter farm. The team writes tests for trivial accessors, configuration constants, and boilerplate code to push coverage up. Complex business logic with real edge cases remains untested because it is harder to write tests for.
- The one-assertion integration test. A single integration test boots the application, hits an endpoint, and checks for a 200 response. The test covers hundreds of lines across dozens of functions. None of those functions have their logic validated individually.
- The retroactive coverage sprint. A team behind on the target spends a week writing tests for existing code. The tests are written by people who did not write the code, against behavior they do not fully understand. The tests pass today but encode current behavior as correct whether it is or not.
The telltale sign: coverage goes up and defect rates stay flat. The team has more tests but not more confidence.
Why This Is a Problem
A coverage mandate confuses activity with outcome. The goal is defect prevention, but the metric measures line execution. Teams optimize for the metric and the goal drifts out of focus.
It reduces quality
Coverage measures whether a line of code executed during a test run, not whether the test verified anything meaningful about that line. A test that calls `calculateDiscount(100, 0.1)` without asserting on the return value covers the function completely. It catches zero bugs.
When the mandate is the goal, teams write the cheapest tests that move the number. Trivial code gets thorough tests. Complex code - the code most likely to contain defects - gets shallow coverage because testing it properly takes more time and thought. The coverage number rises while the most defect-prone code remains effectively untested.
Teams that focus on testing behavior rather than hitting a number write fewer tests that catch more bugs. They test the discount calculation with boundary values, error cases, and edge conditions. Each test exists because it verifies something the team needs to be true, not because it moves a metric.
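The difference can be made concrete. Both tests below produce identical line coverage; only the second can ever fail when the logic is wrong. `calculate_discount` is a stand-in for the `calculateDiscount` mentioned above:

```python
# Illustrative stand-in for the function discussed in the text.
def calculate_discount(total, rate):
    return total * rate

# Assertion-free "coverage" test: executes every line, verifies nothing.
def test_discount_runs():
    calculate_discount(100, 0.1)

# Behavior test: fails if the calculation is wrong.
def test_discount_amount():
    assert round(calculate_discount(100, 0.1), 10) == 10.0

test_discount_runs()
test_discount_amount()
```

A coverage tool scores both tests the same; a mutation tool would flag the first immediately.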
It increases rework
Tests written to satisfy a mandate tend to be tightly coupled to implementation. When the team writes a test for a private method just to cover it, any refactoring of that method breaks the test even if the public behavior is unchanged. The team spends time updating tests that were never catching bugs in the first place.
Retroactive coverage efforts are especially wasteful. A developer spends a day writing tests for code someone else wrote months ago. They do not fully understand the intent, so they encode current behavior as correct. When a bug is later found in that code, the test passes - it asserts on the buggy behavior.
Teams that write tests alongside the code they are developing avoid this. The test reflects the developer’s intent at the moment of writing. It verifies the behavior they designed, not the behavior they observed after the fact.
It makes delivery timelines unpredictable
Coverage gates add a variable tax to every change. A developer finishes a feature, pushes it, and the pipeline rejects it because coverage dropped by 0.3%. Now they have to write tests for unrelated code to bring the number back up before the feature can ship.
The unpredictability compounds when the mandate is aggressive. A team at 89% with a 90% target cannot ship any change that touches untested legacy code without first writing tests for that legacy code. Features that should take a day take three because the coverage tax is unpredictable and unrelated to the work at hand.
Impact on continuous delivery
CD requires fast, reliable feedback from the test suite. Coverage mandates push teams toward test suites that are large but weak - many tests, few meaningful assertions, slow execution. The suite takes longer to run because there are more tests. It catches fewer defects because the tests were written to cover lines, not to verify behavior. Developers lose trust in the suite because passing tests do not correlate with working software.
The mandate also discourages refactoring, which is critical for maintaining a codebase that supports CD. Every refactoring risks dropping coverage, triggering the gate, and blocking the pipeline. Teams avoid cleanup work because the coverage cost is too high. The codebase accumulates complexity that makes future changes slower and riskier.
How to Fix It
Step 1: Audit what the coverage number actually represents
Pick 20 tests at random from the suite. For each one, answer:
- Does this test assert on a meaningful outcome?
- Would this test fail if the code it covers had a bug?
- Is the code it covers important enough to test?
If more than half fail these questions, the coverage number is misleading the organization. Present the findings to stakeholders alongside the production defect rate.
Step 2: Replace the coverage gate with a coverage floor
A coverage gate rejects any change that drops coverage below the target. A coverage floor rejects any change that reduces coverage from where it is. The difference matters.
- Measure current coverage. Set that as the floor.
- Configure the pipeline to fail only if a change decreases coverage.
- Remove the absolute target (80%, 90%, etc.).
The floor prevents backsliding without forcing developers to write pointless tests to meet an arbitrary number. Coverage can only go up, but it goes up because developers are writing real tests for real changes.
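The floor logic is simple enough to sketch. `check_floor` is hypothetical; a real pipeline would read `current` from the build's coverage report and keep the floor in the repository:

```python
# Sketch of a coverage floor: fail only on decrease, ratchet upward otherwise.
def check_floor(current: float, floor: float) -> tuple:
    """Return (passed, new_floor)."""
    if current < floor:
        return False, floor              # block the change: coverage dropped
    return True, max(floor, current)     # pass, and raise the floor to match

ok, new_floor = check_floor(current=84.2, floor=83.5)
assert ok and new_floor == 84.2   # coverage rose: pass and ratchet up

ok, new_floor = check_floor(current=82.9, floor=83.5)
assert not ok                     # coverage fell below the floor: block
```

Unlike an absolute target, this never demands tests for code the change did not touch - it only forbids making things worse.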
Step 3: Introduce mutation testing on high-risk code (Weeks 3-4)
Mutation testing measures test effectiveness, not test coverage. A mutation testing tool modifies your code in small ways (changing `>` to `>=`, flipping a boolean, removing a statement) and checks whether your tests detect the change. If a mutation survives - the code changed but all tests still pass - you have a gap in your test suite.
Start with the modules that have the highest defect rate. Run mutation testing on those modules and use the surviving mutants to identify where tests are weak. Write targeted tests to kill surviving mutants. This focuses testing effort where it matters most.
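A deliberately naive sketch of the idea - real tools (PIT, Stryker, mutmut) rewrite source automatically, but the survive/kill logic looks like this. All names here are invented:

```python
import operator

# Code under test, parameterized so the sketch can "mutate" its comparison.
def make_checker(op):
    def is_eligible(age):
        return op(age, 18)
    return is_eligible

original = make_checker(operator.ge)  # age >= 18
mutant = make_checker(operator.gt)    # mutation: >= changed to >

def weak_suite_passes(check):
    # No boundary case - this suite cannot tell >= from >.
    return check(30) is True and check(10) is False

def strong_suite_passes(check):
    # The boundary case at exactly 18 distinguishes them.
    return weak_suite_passes(check) and check(18) is True

assert weak_suite_passes(original) and weak_suite_passes(mutant)          # mutant survives: gap
assert strong_suite_passes(original) and not strong_suite_passes(mutant)  # mutant killed
```

Each surviving mutant points at a missing test; writing the test that kills it (here, the boundary case) is the actionable output.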
Step 4: Shift the metric to defect detection (Weeks 4-6)
Replace coverage as the primary quality metric with metrics that measure outcomes:
| Old metric | New metric |
|---|---|
| Line coverage percentage | Escaped defect rate (defects found in production per release) |
| Coverage trend | Mutation score on high-risk modules |
| Tests added per sprint | Defects caught by tests per sprint |
Report both sets of metrics for a transition period. As the team sees that mutation scores and escaped defect rates are better indicators of test suite health, the coverage number becomes informational rather than a gate.
Step 5: Address the objections
| Objection | Response |
|---|---|
| “Without a coverage target, developers won’t write tests” | A coverage floor prevents backsliding. Code review catches missing tests. Mutation testing catches weak tests. These mechanisms are more effective than a number that incentivizes the wrong behavior. |
| “Our compliance framework requires coverage targets” | Most compliance frameworks require evidence of testing, not a specific coverage number. Mutation scores, defect detection rates, and test-per-change policies satisfy auditors better than a coverage percentage that does not correlate with quality. |
| “Coverage went up and we had fewer bugs - it’s working” | Correlation is not causation. Check whether the coverage increase came from meaningful tests or from assertion-free line touching. If the mutation score did not also improve, the coverage increase is cosmetic. |
| “We need a number to track improvement” | Track mutation score instead. It measures what coverage pretends to measure - whether your tests actually detect bugs. |
Measuring Progress
| Metric | What to look for |
|---|---|
| Escaped defect rate | Should decrease as test effectiveness improves |
| Mutation score (high-risk modules) | Should increase as weak tests are replaced with behavior-focused ones |
| Change fail rate | Should decrease as real defects are caught before production |
| Tests with meaningful assertions (sample audit) | Should increase over time |
| Time spent writing retroactive coverage tests | Should decrease toward zero |
| Pipeline rejections due to coverage gate | Should drop to zero once gate is replaced with floor |
Related Content
- Testing Fundamentals - The test architecture guide for CD pipelines
- Inverted Test Pyramid - When most tests are at the wrong level
- Pressure to Skip Testing - When teams face pressure that undermines test quality
- Unit Tests - Writing fast, deterministic tests for logic
- ACD - Why coverage mandates are especially dangerous when agents optimize for coverage rather than intent
6 - QA Signoff as a Release Gate
Category: Testing & Quality | Quality Impact: High
What This Looks Like
Before any deployment to production, a specific person - often a QA lead or test manager - must give explicit approval. The approval is based on running a manual test script, performing exploratory testing, and using their personal judgment about whether the system is ready. The release cannot proceed until that person says so.
The process seems reasonable until the blocking effects become visible. The QA lead has three releases queued for approval simultaneously. One is straightforward - a minor config change. One is a large feature that requires two days of testing. One is a hotfix for a production issue that is costing the company money every hour it is unresolved. All three are waiting in line for the same person.
Common variations:
- The approval committee. No single person can approve a release - a group of stakeholders must all sign off. Any one member can block or delay the release. Scheduling the committee meeting is itself a multi-day coordination exercise.
- The inherited process. The QA signoff gate was established years ago after a serious production incident. The specific person who initiated the process has left the company. The process remains, enforced by institutional memory and change-aversion, even though the team’s test automation has grown significantly since then.
- The scope creep gate. The signoff was originally limited to major releases. Over time, it expanded to include minor releases, then patches, then hotfixes. Every deployment now requires the same approval regardless of scope or risk level.
- The invisible queue. The QA lead does not formally track what is waiting for approval. Developers must ask individually, check in repeatedly, and sometimes discover that their deployment has been waiting for a week because the request was not seen.
The telltale sign: the deployment frequency ceiling is the QA lead’s available hours per week. If they are on holiday, releases stop.
Why This Is a Problem
Manual release gates are a quality control mechanism designed for a world where testing automation did not exist. They made sense when the only way to know if a system worked was to have a skilled human walk through it. In an environment with comprehensive automated testing, manual gates are a bottleneck that provides marginal additional safety at high throughput cost.
It reduces quality
When three releases are queued and the QA lead has two days, each release gets a fraction of the attention it would receive if reviewed alone. The scenarios that do not get covered are exactly where the next production incident will come from. Manual testing at the end of a release cycle is inherently incomplete. A skilled tester can exercise a subset of the system’s behavior in the time available. They bring experience and judgment, but they cannot replicate the coverage of a well-built automated suite. An automated regression suite runs the same hundreds of scenarios every time. A manual tester prioritizes based on what seems most important and what they have time for.
The bounded time for manual testing means that when there is a large change set to test, each scenario gets less attention. Testers are under pressure to approve or reject quickly because there are queued releases waiting. Rushed testing finds fewer bugs than thorough testing. The gate that appears to protect quality is actually reducing the quality of the safety check because of the throughput pressure it creates.
When the automated test suite is the gate, it runs the same scenarios every time regardless of load or time pressure. It does not get rushed. Adding more coverage requires writing tests, not extending someone’s working hours.
It increases rework
A bug that a developer would fix in 30 minutes if caught immediately consumes three hours of combined developer and tester time when it cycles through a gate review. Multiply that by the number of releases in the queue. Manual testing as a gate produces a batch of bug reports at the end of the development cycle. The developer whose code is blocked must context-switch from their current work to fix the reported bugs. The fixes then go back through the gate. If the QA lead finds new issues in the fix, the cycle repeats.
Each round of the manual gate cycle adds overhead: the tester’s time, the developer’s context switch, the communication overhead of the bug report and fix exchange, and the calendar time waiting for the next gate review.
The rework also affects other developers indirectly. If one release is blocked at the gate, other releases that depend on it are also blocked. A blocked release holds back the testing of dependent work that cannot be approved without the preceding release.
It makes delivery timelines unpredictable
The time a release spends at the manual gate is determined by the QA lead’s schedule, not by the release’s complexity. A simple change might wait days because the QA lead is occupied with a complex one. A complex change that requires two days of testing may wait an additional two days because the QA lead is unavailable when testing is complete.
This gate time is entirely invisible in development estimates. Developers estimate how long it takes to build a feature. They do not estimate QA lead availability. When a feature that took three days to develop sits at the gate for a week, the total time from start to deployment is ten days. Stakeholders experience the release as late even though development finished on time.
Sprint velocity metrics are also distorted. The team shows high velocity because they count tickets as complete when development finishes. But from a user perspective, nothing is done until it is deployed and in production. The manual gate disconnects “done” from “deployed.”
It creates a single point of failure
When one person controls deployment, the deployment frequency is capped by that person’s capacity and availability. Vacation, illness, and competing priorities all stop deployments. This is not a hypothetical risk - it is a pattern every team with a manual gate experiences repeatedly.
The concentration of authority also makes that person’s judgment a variable in every release. Their threshold for approval changes based on context: how tired they are, how much pressure they feel, how risk-tolerant they are on any given day. Two identical releases may receive different treatment. This inconsistency is not a criticism of the individual - it is a structural consequence of encoding quality standards in a human judgment call rather than in explicit, automated criteria.
Impact on continuous delivery
A manual release gate is definitionally incompatible with continuous delivery. CD requires that the pipeline provides the quality signal, and that signal is sufficient to authorize deployment. A human gate that overrides or supplements the pipeline signal inserts a manual step that the pipeline cannot automate around.
Teams with manual gates are limited to deploying as often as a human can review and approve releases. Realistically, this is once or twice a week per approver. CD targets multiple deployments per day. The gap is not closable by optimizing the manual process - it requires replacing the manual gate with automated criteria that the pipeline can evaluate.
The manual gate also makes deployment a high-ceremony event. When deployment requires scheduling a review and obtaining sign-off, teams batch changes to make each deployment worth the ceremony. Batching increases risk, which makes the approval process feel more important, which increases the ceremony further. CD requires breaking this cycle by making deployment routine.
How to Fix It
Replacing a manual release gate requires building the automated confidence to substitute for the manual judgment. The gate is not removed on day one - it is replaced incrementally as automation earns trust.
Step 1: Audit what the gate is actually catching
The goal of this step is to understand what value the manual gate provides so it can be replaced with something equivalent, not just removed.
- Review the last six months of QA signoff outcomes. How many releases were rejected and why?
- For the rejections, categorize the bugs found: what type were they, how severe, what was their root cause?
- Identify which bugs would have been caught by automated tests if those tests existed.
- Identify which bugs required human judgment that no automated test could replicate.
Most teams find that 80-90% of gate rejections are for bugs that an automated test would have caught. The remaining cases requiring genuine human judgment are usually exploratory findings about usability or edge cases in new features - a much smaller scope for manual review than a full regression pass.
Step 2: Automate the regression checks that the gate is compensating for (Weeks 2-6)
For every bug category from Step 1 that an automated test would have caught, write the test.
- Prioritize by frequency: the bug types that caused the most rejections get tests first.
- Add the tests to CI so they run on every commit.
- Track the gate rejection rate as automation coverage increases. Rejections from automatically testable bugs should decrease.
The goal is to reach a point where a gate rejection would only happen for something genuinely outside the automated suite’s coverage. At that point, the gate is reviewing a much smaller and more focused scope.
Step 3: Formalize the automated approval criteria
Define exactly what a pipeline must show before a deployment is considered approved. Write it down. Make it visible.
Typical automated approval criteria:
- All unit and integration tests pass.
- All acceptance tests pass.
- Code coverage has not decreased below the threshold.
- No new high-severity security vulnerabilities in the dependency scan.
- Performance tests show no regression from baseline.
These criteria are not opinions. They are executable. When all criteria pass, deployment is authorized without manual review.
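Because the criteria are executable, they can be expressed directly as code the pipeline evaluates. A minimal sketch in Python; the keys and thresholds are illustrative, and in practice each value would come from earlier pipeline stages (test runners, coverage reports, scanners):

```python
def deployment_approved(results):
    """Evaluate pipeline results against explicit, executable criteria.

    `results` is a dict produced by earlier pipeline stages; the key
    names and thresholds here are illustrative.
    """
    criteria = [
        results["unit_tests_passed"],
        results["acceptance_tests_passed"],
        results["coverage"] >= results["coverage_floor"],
        results["new_high_severity_vulns"] == 0,
        # Allow at most a 5% performance regression from baseline.
        results["p95_latency_ms"] <= results["latency_baseline_ms"] * 1.05,
    ]
    return all(criteria)

run = {
    "unit_tests_passed": True,
    "acceptance_tests_passed": True,
    "coverage": 0.82,
    "coverage_floor": 0.80,
    "new_high_severity_vulns": 0,
    "p95_latency_ms": 210,
    "latency_baseline_ms": 205,
}
assert deployment_approved(run) is True

# A new high-severity vulnerability blocks the deployment.
run["new_high_severity_vulns"] = 1
assert deployment_approved(run) is False
```

The point is not this particular function but that every criterion is a boolean the pipeline can compute, with no room for a judgment call.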
Step 4: Run manual and automated gates in parallel (Weeks 4-8)
Do not remove the manual gate immediately. Run both processes simultaneously for a period.
- The pipeline evaluates automated criteria and records pass or fail.
- The QA lead still performs manual review.
- Track every case where manual review finds something the automated criteria missed.
Each case where manual review finds something automation missed is an opportunity to add an automated test. Each case where automated criteria caught everything is evidence that the manual gate is redundant.
After four to eight weeks of parallel operation, the data either confirms that the manual gate is providing significant additional value (rare) or shows that it is confirming what the pipeline already knows (common). The data makes the decision about removing the gate defensible.
Step 5: Replace the gate with risk-scoped manual testing
When parallel operation shows that automated criteria are sufficient for most releases, change the manual review scope.
- For changes below a defined risk threshold (bug fixes, configuration changes, low-risk features), automated criteria are sufficient. No manual review required.
- For changes above the threshold (major new features, significant infrastructure changes), a focused manual review covers only the new behavior. Not a full regression pass.
- Exploratory testing continues on a scheduled cadence - not as a gate but as a proactive quality activity.
This gives the QA lead a role proportional to the actual value they provide: focused expert review of high-risk changes and exploratory quality work, not rubber-stamping releases that the pipeline has already validated.
Step 6: Document and distribute deployment authority (Ongoing)
A single approver is a single point of failure regardless of whether the approval is automated or manual. Distribute deployment authority explicitly.
- Any engineer can trigger a production deployment if the pipeline passes.
- The team agrees on the automated criteria that constitute approval.
- No individual holds veto power over a passing pipeline.
Expect pushback and address it directly:
| Objection | Response |
|---|---|
| “Automated tests can’t replace human judgment” | Correct. But most of what the manual gate tests is not judgment - it is regression verification. Narrow the manual review scope to the cases that genuinely require judgment. For everything else, automated tests are more thorough and more consistent than a manual check. |
| “We had a serious incident because we skipped QA” | The incident happened because a gap in automated coverage was not caught. The fix is to close the coverage gap, not to keep a human in the loop for all releases. A human in the loop for a release that already has comprehensive automated coverage adds no safety. |
| “Compliance requires a human approval before every production change” | Automated pipeline approvals with an audit log satisfy most compliance frameworks, including SOC 2 and ISO 27001. Review the specific compliance requirement with legal or a compliance specialist before assuming it requires manual gates. |
| “Removing the gate will make the QA lead feel sidelined” | Shifting from gate-keeper to quality engineer is a broader and more impactful role. Work with the QA lead to design what their role looks like in a pipeline-first model. Quality engineering, test strategy, and exploratory testing are all high-value activities that do not require blocking every release. |
Measuring Progress
| Metric | What to look for |
|---|---|
| Gate wait time | Should decrease as automated criteria replace manual review scope |
| Release frequency | Should increase as the per-release ceremony drops |
| Lead time | Should decrease as gate wait time is removed from the delivery cycle |
| Gate rejection rate | Should decrease as automated tests catch bugs before they reach the gate |
| Change fail rate | Should remain stable or improve as automated criteria are strengthened |
| Mean time to repair | Should decrease as deployments, including hotfixes, are no longer queued behind a manual gate |
Related Content
- Testing Only at the End - The upstream pattern that makes the manual gate feel necessary
- Manual Regression Testing Gates - The specific regression testing practice that often drives this gate
- Testing Fundamentals - Building the automated coverage that replaces manual gate function
- Pipeline Architecture - Encoding quality criteria in the pipeline rather than in individual approvals
- Metrics-Driven Improvement - Using data from the gate audit to prioritize test automation investment
7 - No Contract Testing Between Services
Category: Testing & Quality | Quality Impact: High
What This Looks Like
The payment service and the inventory service are developed and tested by separate teams. Each service has a comprehensive test suite. Both suites pass on every build. Then the teams deploy to the shared staging environment and run integration tests. The payment service’s call to the inventory service returns an unexpected response format: a field the payment service expects as a string is now returned as a number. The deployment is blocked. The two teams spend half a day in meetings tracing when the response format changed and which team is responsible for fixing it.
This happens because neither team tested the integration point. The inventory team tested that their service worked correctly. The payment team tested that their service worked correctly - but against a mock that reflected their own assumption about the response format, not the actual inventory service behavior. The services were tested in isolation against different assumptions, and those assumptions diverged without anyone noticing.
Common variations:
- The stale mock. One service tests against a mock that was accurate six months ago. The real service has been updated several times since then. The mock drifts. The consumer service tests pass but the integration fails.
- The undocumented API. The service has no formal API specification. Consumers infer the contract from the code, from old documentation, or from experimentation. Different consumers make different inferences. When the provider changes, the consumers that made the wrong inference break.
- The implicit contract. The provider team does not think of themselves as maintaining a contract. They change the response structure because it suits their internal refactoring. They do not notify consumers because they did not know anyone was relying on the exact structure.
- The integration environment as the only test. Teams avoid writing contract tests because “we can just test in staging.” The integration environment is available infrequently, is shared among all teams, and is often broken for reasons unrelated to the change being tested. It is a poor substitute for fast, isolated contract verification.
The telltale sign: integration failures are discovered in a shared environment rather than in each team’s own pipeline. The staging environment is the first place where the contract incompatibility becomes visible.
Why This Is a Problem
Services that test in isolation but break when integrated have defeated the purpose of both isolation and integration testing. The isolation provides confidence that each service is internally correct, but says nothing about whether services work together. The integration testing catches the problem too late - after both teams have completed their work and scheduled deployments.
It reduces quality
Integration bugs caught in a shared environment are expensive to diagnose. The failure is observed by both teams, but the cause could be in either service, in the environment, or in the network between them. Diagnosing which change caused the regression requires both teams to investigate, correlate recent changes, and agree on root cause. This is time-consuming even when both teams cooperate - and the incentive to cooperate can be strained when one team’s deployment is blocking the other’s.
Without contract tests, the provider team has no automated feedback about whether their changes break consumers. They can refactor their internal structures freely because the only check is an integration test that runs in a shared environment, infrequently, and not on the provider’s own pipeline. By the time the breakage is discovered, the provider team has moved on from the context of the change.
With contract tests, the provider’s pipeline runs consumer expectations against every build. A change that would break a consumer fails the provider’s own build, immediately, in the context where the breaking change was made. The provider team knows about the breaking change before it leaves their pipeline.
It increases rework
Two teams spend half a day in meetings tracing when a response field changed from string to number - work that contract tests would have caught in the provider’s pipeline before the consumer team was ever involved. When a contract incompatibility is discovered in a shared environment, the investigation and fix cycle involves multiple teams. Someone must diagnose the failure. Someone must determine which side of the interface needs to change. Someone must make the change. The change must be reviewed, tested, and deployed. If the provider team makes the fix, the consumer team must verify it. If the consumer team makes the fix, they may be building on incorrect assumptions about the provider’s future behavior.
This multi-team rework cycle is expensive regardless of how well the teams communicate. It requires context switching from whatever both teams are working on, coordination overhead, and a second trip through deployment. A consumer change that was ready to deploy is now blocked while the provider team makes a fix that was not planned in their sprint.
Without contract tests, this rework cycle is the normal mode for discovering interface incompatibilities. With contract tests, the incompatibility is caught in the provider’s pipeline as a one-team problem, before any consumer is affected.
It makes delivery timelines unpredictable
Teams that rely on a shared integration environment for contract verification must coordinate their deployments. Service A cannot deploy until it has been tested with the current version of Service B in the shared environment. If Service B is broken due to an unrelated issue, Service A is blocked even though Service A has nothing to do with Service B’s problem.
This coupling of deployment schedules eliminates the independent delivery cadences that a service architecture is supposed to provide. When one service’s integration environment test fails, all services waiting to be tested are delayed. The deployment queue becomes a bottleneck that grows whenever any component has a problem.
Each integration failure in the shared environment is also an unplanned event. Sprints budget for development and known testing cycles. They do not budget for multi-team integration investigations. When an integration failure blocks a deployment, both teams are working on an unplanned activity with no clear end date. The sprint commitments for both teams are now at risk.
It defeats the independence benefit of a service architecture
Service B is blocked from deploying because the shared integration environment is broken - not by a problem in Service B, but by an unrelated failure in Service C. Independent deployability in name is not independent deployability in practice. The primary operational benefit of a service architecture is independent deployability: each service can be deployed on its own schedule by its own team. That benefit is available only if each team can verify their service’s correctness without depending on the availability of all other services.
Without contract tests, the teams have built isolated development pipelines but must converge on a shared integration environment before deploying. The integration environment is the coupling point. It is the equivalent of a shared deployment step in a monolith, except less reliable because the environment involves real network calls, shared infrastructure, and the simultaneous states of multiple services.
Contract testing replaces the shared integration environment dependency with a fast, local, team-owned verification. Each team verifies their side of every contract in their own pipeline. Integration failures are caught as breaking changes, not as runtime failures in shared infrastructure.
Impact on continuous delivery
CD requires fast, reliable feedback. A shared integration environment that catches contract failures is neither fast nor reliable. It is slow because it requires all services to be deployed to one place and exercised together. It is unreliable because any component failure degrades confidence in the whole environment.
Without contract tests, teams must either wait for integration environment results before deploying - limiting frequency to the environment’s availability and stability - or accept the risk that their deployment might break consumers when it reaches production. Neither option supports continuous delivery. The first caps deployment frequency at integration test cadence. The second ships contract violations to production.
How to Fix It
Contract testing is the practice of making API expectations explicit and verifying them automatically on both the provider and consumer side. The most practical implementation for most teams is consumer-driven contract testing: consumers publish their expectations, providers verify their service satisfies them.
Step 1: Identify the highest-risk integration points
Not all service integrations carry equal risk. Start where contract failures cause the most pain.
- List all service-to-service integrations. For each one, identify the last time a contract failure occurred and what it blocked.
- Rank by two factors: frequency of change (integrations between actively developed services) and blast radius (integrations where a failure blocks critical paths).
- Pick the two or three integrations at the top of the ranking. These are the pilot candidates for contract testing.
Do not try to add contract tests for every integration at once. A pilot with two integrations teaches the team the tooling and workflow before scaling.
Step 2: Choose a contract testing approach
Two common approaches:
Consumer-driven contracts: the consumer writes tests that describe their expectations of the provider. A tool like Pact captures these expectations as a contract file. The provider runs the contract file against their service to verify it satisfies the consumer’s expectations.
Provider-side contract verification with a schema: the provider publishes an OpenAPI or JSON Schema specification. Consumers generate test clients from the schema. Both sides regenerate their artifacts whenever the schema changes and verify their code compiles and passes against it.
Consumer-driven contracts are more precise - they capture exactly what each consumer uses, not the full API surface. Schema-based approaches are simpler to start and require less tooling. For most teams starting out, the schema approach is the right entry point.
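To illustrate the schema approach without any extra tooling, here is a deliberately simplified, stdlib-only check against a shared field-type map. A real implementation would use an OpenAPI or JSON Schema validator rather than hand-rolled type checks; the schema and payloads are hypothetical.

```python
# Simplified stand-in for schema validation: provider and consumer
# pipelines both check payloads against the same shared field-type map.
ITEM_SCHEMA = {"id": str, "quantity": int, "sku": str}

def violations(payload, schema):
    """Return a list of fields that are missing or have the wrong type."""
    problems = []
    for field, expected_type in schema.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(payload[field]).__name__}"
            )
    return problems

# The provider silently changed `id` from string to number - exactly
# the drift described above. The check catches it in the pipeline.
assert violations({"id": "42", "quantity": 3, "sku": "A-1"}, ITEM_SCHEMA) == []
assert violations({"id": 42, "quantity": 3, "sku": "A-1"}, ITEM_SCHEMA) == [
    "id: expected str, got int"
]
```

When both sides run the same check on every build, a type change fails a pipeline on the day it is made instead of surfacing weeks later in staging.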
Step 3: Write consumer contract tests for the pilot integrations (Weeks 2-3)
For each pilot integration, the consumer team writes tests that explicitly state their expectations of the provider.
In JavaScript using Pact:
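A sketch using the PactV3 API from `@pact-foundation/pact`. `InventoryClient`, the endpoint path, and the field names are hypothetical stand-ins for the consumer’s real HTTP client and API:

```javascript
const { PactV3, MatchersV3 } = require('@pact-foundation/pact');
const { like } = MatchersV3;

// Hypothetical stand-in for the consumer's real HTTP client.
const InventoryClient = require('./inventoryClient');

const provider = new PactV3({
  consumer: 'payment-service',
  provider: 'inventory-service',
});

describe('inventory service contract', () => {
  it('returns an item with a string id', () => {
    provider
      .given('an item with id 42 exists')
      .uponReceiving('a request for item 42')
      .withRequest({ method: 'GET', path: '/items/42' })
      .willRespondWith({
        status: 200,
        headers: { 'Content-Type': 'application/json' },
        // The consumer's expectation is explicit: `id` is a string.
        body: { id: like('42'), quantity: like(3) },
      });

    // Pact starts a mock provider; the consumer code runs against it,
    // and the interaction is recorded into a pact (contract) file.
    return provider.executeTest(async (mockServer) => {
      const client = new InventoryClient(mockServer.url);
      const item = await client.getItem('42');
      expect(typeof item.id).toBe('string');
    });
  });
});
```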
The test documents what the consumer expects and verifies the consumer handles that response correctly. The Pact file generated by the test is the contract artifact.
Step 4: Add provider verification to the provider’s pipeline (Weeks 2-3)
The provider team adds a step to their pipeline that runs the consumer contract files against their service.
In Java with Pact:
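A sketch using pact-jvm’s JUnit 5 support. The broker URL, port, and state name are illustrative; the provider application is assumed to be running locally before the test executes:

```java
import au.com.dius.pact.provider.junit5.HttpTestTarget;
import au.com.dius.pact.provider.junit5.PactVerificationContext;
import au.com.dius.pact.provider.junit5.PactVerificationInvocationContextProvider;
import au.com.dius.pact.provider.junitsupport.Provider;
import au.com.dius.pact.provider.junitsupport.State;
import au.com.dius.pact.provider.junitsupport.loader.PactBroker;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.TestTemplate;
import org.junit.jupiter.api.extension.ExtendWith;

// Fetches every consumer contract for "inventory-service" from the
// broker and replays each interaction against the running provider.
@Provider("inventory-service")
@PactBroker(url = "https://pact-broker.example.com") // illustrative URL
class InventoryServiceContractTest {

    @BeforeEach
    void setTarget(PactVerificationContext context) {
        // The provider is assumed to be started on localhost:8080.
        context.setTarget(new HttpTestTarget("localhost", 8080));
    }

    @TestTemplate
    @ExtendWith(PactVerificationInvocationContextProvider.class)
    void verifyConsumerContracts(PactVerificationContext context) {
        context.verifyInteraction();
    }

    @State("an item with id 42 exists")
    void itemExists() {
        // Seed whatever data this interaction requires.
    }
}
```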
When the provider’s pipeline runs this test, it fetches the consumer’s contract file, sets up the required state, and verifies that the provider’s real response matches the consumer’s expectations. A change that would break the consumer fails the provider’s pipeline.
Step 5: Integrate with a contract broker
For the contract tests to work across team boundaries, contract files must be shared automatically.
- Deploy a Pact Broker or use PactFlow (hosted). This is a central store for contract files.
- Consumer pipelines publish contracts to the broker after tests pass.
- Provider pipelines fetch consumer contracts from the broker and run verification.
- The broker tracks which provider versions satisfy which consumer contracts.
With the broker in place, both teams’ pipelines are connected through the contract without requiring any direct coordination. The provider knows immediately when a change breaks a consumer. The consumer knows when their version of the contract has been verified by the provider.
Step 6: Use the “can I deploy?” check before every production deployment
The broker provides a query: given the version of Service A I am about to deploy, and the versions of all other services currently in production, are all contracts satisfied?
Add this check as a pipeline gate before any production deployment. If the check fails, the service cannot deploy until the contract incompatibility is resolved.
This replaces the shared integration environment as the final contract verification step. The check is fast, runs against data already collected by previous pipeline runs, and provides a definitive answer without requiring a live deployment.
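As a sketch of the pipeline step, using the `pact-broker` CLI (the pacticipant name, version variable, and broker URL are illustrative):

```shell
#!/bin/sh
# Pipeline gate before production deployment: ask the broker whether
# this version of the service is compatible with everything currently
# in production. Names and URL are illustrative.
pact-broker can-i-deploy \
  --pacticipant payment-service \
  --version "${GIT_COMMIT}" \
  --to-environment production \
  --broker-base-url https://pact-broker.example.com
# The exit code is non-zero if any contract is unverified or failing,
# which fails this pipeline step and blocks the deployment.
```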
| Objection | Response |
|---|---|
| “Contract testing is a lot of setup for simple integrations” | The upfront setup cost is real. Evaluate it against the cost of the integration failures you have had in the last six months. For active services with frequent changes, the setup cost is recovered quickly. For stable services that change rarely, the cost may not be justified - start with the active ones. |
| “The provider team cannot take on more testing work right now” | Start with the consumer side only. Consumer tests that run against mocks provide value immediately, even before the provider adds verification. Add provider verification later when capacity allows. |
| “We use gRPC / GraphQL / event-based messaging - Pact doesn’t support that” | Pact supports gRPC and message-based contracts. GraphQL has dedicated contract testing tools. The principle - publish expectations, verify them against the real service - applies to any protocol. |
| “Our integration environment already catches these issues” | It catches them late, blocks multiple teams, and is expensive to diagnose. Contract tests catch the same issues in the provider’s pipeline, before any other team is affected. |
Measuring Progress
| Metric | What to look for |
|---|---|
| Integration failures in shared environments | Should decrease as contract tests catch incompatibilities in individual pipelines |
| Time to diagnose integration failures | Should decrease as failures are caught closer to the change that caused them |
| Change fail rate | Should decrease as production contract violations are caught by pipeline checks |
| Lead time | Should decrease as integration verification no longer requires coordination through a shared environment |
| Service-to-service integrations with contract coverage | Should increase as the practice scales from pilot integrations |
| Release frequency | Should increase as teams can deploy independently without waiting for integration environment slots |
Related Content
- Testing Fundamentals - Building the test strategy that includes contract testing
- Shared Database Across Services - A common cause of implicit contracts that are hard to version
- Production-Like Environments - Reducing reliance on shared integration environments
- Architecture Decoupling - Designing service boundaries that make contracts stable
- Pipeline Architecture - Incorporating contract verification into the deployment pipeline
8 - Rubber-Stamping AI-Generated Code
Category: Testing & Quality | Quality Impact: Critical
What This Looks Like
A developer uses an AI assistant to implement a feature. The AI produces working code. The developer glances at it, confirms the tests pass, and commits. In the code review, the reviewer reads the diff but does not challenge the approach because the tests are green and the code looks reasonable. Nobody asks: “What is this change supposed to do?” or “What acceptance criteria did you verify it against?”
The team has adopted AI tooling to move faster, but the review standard has not changed to match. Before AI, developers implicitly understood intent because they built the solution themselves. With AI, developers commit code without articulating what it should do or how they validated it. The gap between “tests pass” and “I verified it does what we need” is where bugs and vulnerabilities hide.
Common variations:
- The approval-without-criteria. The reviewer approves because the tests pass and the code is syntactically clean. Nobody checks whether the change satisfies the stated acceptance criteria or handles the security constraints defined for the work item. Vulnerabilities - SQL injection, broken access control, exposed secrets - ship because the reviewer checked that it compiles, not that it meets requirements.
- The AI-fixes-AI loop. A bug is found in AI-generated code. The developer asks the AI to fix it. The AI produces a patch. The developer commits the patch without revisiting what the original change was supposed to do or whether the fix satisfies the same criteria.
- The missing edge cases. The AI generates code that handles the happy path correctly. The developer does not add tests for edge cases because they did not think of them - they delegated the thinking to the AI. The AI did not think of them either.
- The false confidence. The team’s test suite has high line coverage. AI-generated code passes the suite. The team believes the code is correct because coverage is high. But coverage measures execution, not correctness. Lines are exercised without the assertions that would catch wrong behavior.
The telltale sign: when a bug appears in AI-generated code, the developer who committed it cannot describe what the change was supposed to do or what acceptance criteria it was verified against.
Why This Is a Problem
It creates unverifiable code
Code committed without acceptance criteria is code that nobody can verify later. When a bug appears three months later, the team has no record of what the change was supposed to do. They cannot distinguish “the code is wrong” from “the code is correct but the requirements changed” because the requirements were never stated.
Without documented intent and acceptance criteria, the team treats AI-generated code as a black box. Black boxes get patched around rather than fixed, accumulating workarounds that make the code progressively harder to change.
It introduces security vulnerabilities
AI models generate code based on patterns in training data. Those patterns include insecure code. An AI assistant will produce code with SQL injection vulnerabilities, hardcoded secrets, missing input validation, or broken authentication flows if the prompt does not explicitly constrain against them - and sometimes even if it does.
A developer who defines security constraints as acceptance criteria before generating code would catch many of these issues because the criteria would include “rejects SQL fragments in input” or “secrets are read from environment, never hardcoded.” Without those criteria, the developer has nothing to verify against. The vulnerability ships.
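The criterion “rejects SQL fragments in input” is concrete enough to verify mechanically. A minimal sketch using Python’s stdlib `sqlite3` (table and function names are hypothetical) contrasts the interpolation pattern AI assistants often reproduce with the parameterized query that satisfies the criterion:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

def find_user_unsafe(name: str) -> list:
    # String interpolation: input like "x' OR '1'='1" rewrites the
    # query and returns every row. This is the pattern to catch in review.
    return conn.execute(
        f"SELECT name FROM users WHERE name = '{name}'"
    ).fetchall()

def find_user_safe(name: str) -> list:
    # Parameterized query: the driver treats input as data, never as SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

`find_user_unsafe("x' OR '1'='1")` returns the whole table; the parameterized version returns nothing, which is exactly what a criterion-driven test would assert.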
It degrades the team’s domain knowledge
When developers delegate implementation to AI and commit without articulating intent and acceptance criteria, the team stops making domain knowledge explicit. Over time, the criteria for “correct” exist only in the AI’s training data - which is frozen, generic, and unaware of the team’s specific constraints.
This knowledge loss is invisible at first. The team is shipping features faster. But when something goes wrong - a production incident, an unexpected interaction, a requirement change - the team discovers they have no documented record of what the system is supposed to do, only what the AI happened to generate.
Impact on continuous delivery
CD requires that every change is deployable with high confidence. Confidence comes from knowing what the change does, verifying it against acceptance criteria, and knowing how to detect if it fails. When developers commit code without articulating intent or criteria, the confidence is synthetic: based on test results, not on verified requirements.
Synthetic confidence fails under stress. When a production incident involves AI-generated code, the team’s mean time to recovery increases because they have no documented intent to compare against. When a requirement changes, the developers cannot assess the impact because there is no record of what the current behavior was supposed to be.
How to Fix It
Step 1: Establish the “own it or don’t commit it” rule (Week 1)
Add a working agreement: any code committed to the repository - regardless of whether a human or an AI wrote it - must be owned by the committing developer. Ownership means the developer can answer three questions: what does this change do, what acceptance criteria did I verify it against, and how would I detect if it were wrong in production?
This does not mean the developer must trace every line of implementation. It means they must understand the change’s intent, its expected behavior, and its validation strategy. The AI handles the “how.” The developer owns the “what” and the “how do we know it works.” See the Agent Delivery Contract for how this ownership model works in practice.
- Add the rule to the team’s working agreements.
- In code reviews, reviewers ask the author: what does this change do, what criteria did you verify, and what would a failure look like? If the author cannot answer, the review is not approved until they can.
- Track how often reviews are sent back for insufficient ownership. This is a leading indicator of how much unexamined code is reaching the review stage.
Step 2: Require acceptance criteria before AI-assisted implementation (Weeks 2-3)
Before a developer asks an AI to implement a feature, the acceptance criteria must be written and reviewed. The criteria serve two purposes: they constrain the AI’s output, and they give the developer a checklist to verify the result against.
- Each work item must include specific, testable acceptance criteria before implementation starts.
- AI prompts should reference the acceptance criteria explicitly.
- The developer verifies the AI output against every criterion before committing.
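One low-overhead way to make this concrete is to write the criteria as executable tests before prompting the AI. A sketch under assumed criteria for a hypothetical `normalize_email` work item:

```python
# Acceptance criteria, written before any implementation exists:
#   1. The address is lowercased.
#   2. Surrounding whitespace is stripped.
#   3. Input without exactly one "@" is rejected.

def normalize_email(raw: str) -> str:
    # AI-generated implementation, verified against every criterion below.
    addr = raw.strip().lower()
    if addr.count("@") != 1:
        raise ValueError(f"invalid email address: {raw!r}")
    return addr

def test_lowercases():
    assert normalize_email("Alice@Example.COM") == "alice@example.com"

def test_strips_whitespace():
    assert normalize_email("  bob@example.com ") == "bob@example.com"

def test_rejects_missing_at():
    try:
        normalize_email("not-an-email")
    except ValueError:
        return
    assert False, "expected ValueError"
```

The tests double as the prompt’s constraints and the committer’s verification checklist; a criterion the developer cannot express as a check is a criterion nobody will verify.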
Step 3: Add security-focused review for AI-generated code (Weeks 2-4)
AI-generated code has a higher baseline risk of security vulnerabilities because the AI optimizes for functional correctness, not security.
- Add static application security testing (SAST) tools to the pipeline that flag common vulnerability patterns.
- For AI-assisted changes, the code review checklist includes: input validation, access control, secret handling, and injection prevention.
- Track the rate of security findings in AI-generated code vs human-written code. If AI-generated code has a higher rate, tighten the review criteria.
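To make the checklist items concrete, here is a minimal sketch of the secret-handling item (variable and function names are hypothetical): the constant is the pattern a SAST tool flags; the function is the fix that satisfies “secrets are read from environment, never hardcoded.”

```python
import os

API_KEY = "sk-live-1234abcd"  # hardcoded fake secret: a typical SAST finding

def api_key_from_env() -> str:
    # Secrets come from the environment (or a secret manager), never the repo.
    key = os.environ.get("SERVICE_API_KEY")
    if key is None:
        raise RuntimeError("SERVICE_API_KEY is not set")
    return key
```

Failing loudly when the variable is unset is deliberate: a missing secret should stop the deployment, not fall back to a default.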
Step 4: Strengthen the test suite to catch AI blind spots (Weeks 3-6)
AI-generated code passes your tests. The question is whether your tests are good enough to catch wrong behavior.
- Add mutation testing to measure test suite effectiveness. If mutants survive in AI-generated code, the tests are not asserting on the right things.
- Require edge case tests for every AI-generated function: null inputs, boundary values, malformed data, concurrent access where applicable.
- Review test coverage not by lines executed but by behaviors verified. A function with 100% line coverage and no assertions on error paths is undertested.
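The mutation-testing idea can be shown by hand (real tools such as mutmut for Python or PIT for Java automate it): mutate one operator and check whether the suite notices. All names below are illustrative.

```python
def is_adult(age: int) -> bool:
    return age >= 18

def mutant_is_adult(age: int) -> bool:
    # Mutation: ">=" becomes ">". A suite with no boundary-value
    # assertion cannot tell this apart from the original.
    return age > 18

def weak_suite(fn) -> bool:
    # Tests far from the boundary: passes for the original AND the
    # mutant, so the mutant survives.
    return fn(30) is True and fn(5) is False

def strong_suite(fn) -> bool:
    # Adds the boundary case, which kills the mutant.
    return fn(30) is True and fn(5) is False and fn(18) is True
```

A surviving mutant means the tests executed the line without asserting on the behavior the mutation changed - exactly the gap AI-generated code slips through.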
| Objection | Response |
|---|---|
| “This slows down the speed benefit of AI tools” | The speed benefit is real only if the code is correct. Shipping bugs faster is not a speed improvement - it is a rework multiplier. A 10-minute review that catches a vulnerability saves days of incident response. |
| “Our developers are experienced - they can spot problems in AI output” | Experience helps, but scanning code is not the same as verifying it against criteria. Experienced developers who rubber-stamp AI output still miss bugs because they are reviewing implementation rather than checking whether it satisfies stated requirements. The rule creates the expectation to verify against criteria. |
| “We have high test coverage already” | Coverage measures execution, not correctness. A test that executes a code path but does not assert on its behavior provides coverage without confidence. Mutation testing reveals whether the coverage is meaningful. |
| “Requiring developers to explain everything is too much overhead” | The rule is not “trace every line.” It is “explain what the change does and how you validated it.” A developer who owns the change can answer those questions in two minutes. A developer who cannot answer them should not commit it. |
Measuring Progress
| Metric | What to look for |
|---|---|
| Code reviews returned for insufficient ownership | Should start high and decrease as developers internalize the review standard |
| Security findings in AI-generated code | Should decrease as review and static analysis improve |
| Defects in AI-generated code vs human-written code | Should converge as the team applies equal rigor to both |
| Mutation testing survival rate | Should decrease as test assertions become more specific |
| Mean time to resolve defects in AI-generated code | Should decrease as documented intent and criteria make it faster to identify what went wrong |
Related Content
- AI-Generated Code Ships Without Developer Understanding - The symptom this anti-pattern produces
- Pitfalls and Metrics - Failure modes when adopting AI coding tools
- AI Adoption Roadmap - Prerequisites for safe AI-assisted development
- Testing Fundamentals - Building tests that verify behavior, not just execution
- Inverted Test Pyramid - A test structure that lets incorrect AI code pass undetected
- Working Agreements - Making review standards explicit and enforceable
9 - Manually Triggered Tests
Category: Testing & Quality | Quality Impact: High
What This Looks Like
Your team has tests. They are written, they pass when they run, and everyone agrees they are valuable. The problem is that no automated process runs them. Developers are expected to execute the test suite locally before pushing changes, but “expected to” and “actually do” diverge quickly under deadline pressure. A pipeline might exist, but triggering it requires navigating to a UI and clicking a button - something that gets skipped when the fix feels obvious or when the deploy is already late.
The result is that test execution becomes a social contract rather than a mechanical guarantee. Some developers run everything religiously. Others run only the tests closest to the code they changed. New team members do not yet know which tests matter. When a build breaks in production, the postmortem reveals that no one ran the full suite before the deploy because it felt redundant, or because the manual trigger step had not been documented anywhere visible.
The pattern often hides behind phrases like “we always test before releasing” - which is technically true, because a human can usually be found who will run the tests if asked. But “usually” and “when asked” are not the same as “every time, automatically, as a hard gate.”
Common variations:
- Local-only testing. Developers run tests on their own machines, but no CI system runs the suite on every push, so divergent local environments produce inconsistent results.
- Optional pipeline jobs. A CI configuration exists but the test stage is marked optional or is commented out, making it easy to deploy without test results.
- Manual QA handoff. Automated tests exist for unit coverage, but integration and regression tests require a QA engineer to schedule and run a separate test pass before each release.
- Ticket-triggered testing. A separate team owns the test environment, and running tests requires filing a request that may take hours or days to fulfill.
The telltale sign: the team cannot point to a system that will refuse to deploy code if the tests have not passed within the last pipeline run.
Why This Is a Problem
When test execution depends on human initiative, you lose the only property that makes tests useful as a safety net: consistency.
It reduces quality
A regression ships to production not because the tests would have missed it, but because no one ran them. The postmortem reveals the test existed and would have caught the bug in seconds. Tests that run inconsistently catch bugs inconsistently. A developer who is confident in a small change skips the full suite and ships a regression. Another developer who is new to the codebase does not know which manual steps to follow and pushes code that breaks an integration nobody thought to test locally.
Teams in this state tend to underestimate their actual defect rate. They measure bugs reported in production, but they do not measure the bugs that would have been caught if tests had run on every commit. Over time the test suite itself degrades - tests that only run sometimes reveal flakiness that nobody bothers to fix, which makes developers less likely to trust results, which makes them less likely to run tests at all.
A fully automated pipeline treats tests as a non-negotiable gate. Every commit triggers the same sequence, every developer gets the same feedback, and the suite either passes or it does not. There is no room for “I figured it would be fine.”
It increases rework
A defect introduced on Monday sits in the codebase until Thursday, when someone finally runs the tests. By then, three more developers have committed code that depends on the broken behavior. The fix is no longer a ten-minute correction - it is a multi-commit investigation. When a bug escapes because tests were not run, it travels further before it is caught. By the time it surfaces in a staging environment or in production, the fix requires understanding what changed across multiple commits from multiple developers, which multiplies the debugging effort.
Manual testing cycles also introduce waiting time. A developer who needs a QA engineer to run the integration suite before merging is blocked for however long that takes. That waiting time is pure waste - the code is written, the developer is ready to move on, but the process cannot proceed until a human completes a step that a machine could do in minutes. Those waits compound across a team of ten developers, each waiting multiple times per week.
Automated tests that run on every commit catch regressions at the point of introduction, when the developer who wrote the code is still mentally loaded with the context needed to fix it quickly.
It makes delivery timelines unpredictable
A release nominally scheduled for Friday reveals on Thursday afternoon that three tests are failing and two of them touch the payment flow. No one knew because no one had run the full suite since Monday. Because tests run irregularly, the team cannot say with confidence whether the code in the main branch is deployable right now.
The discovery of quality problems at release time compresses the fix window to its smallest possible size, which is exactly when pressure to skip process is highest. Teams respond by either delaying the release or shipping with known failures, both of which erode trust and create follow-on work. Neither outcome would be necessary if the same tests had been running automatically on every commit throughout the sprint.
Impact on continuous delivery
CD requires that the main branch be releasable at any time. That property cannot be maintained without automated tests running on every commit. Manually triggered tests create gaps in verification that can last hours or days, meaning the team never actually knows whether the codebase is in a deployable state between manual runs.
The feedback loop that CD depends on - commit, verify, fix, repeat - collapses when verification is optional. Developers lose the fast signal that automated tests provide, start making larger changes between test runs to amortize the manual effort, and the batch size of unverified work grows. CD requires small batches and fast feedback; manually triggered tests produce the opposite.
How to Fix It
Step 1: Audit what tests exist and where they live
Before automating, understand what you have. List every test suite - unit, integration, end-to-end, contract - and document how each one is currently triggered. Note which ones are already in a CI pipeline versus which require manual steps. This inventory becomes the prioritized list for automation.
Step 2: Wire the fastest tests to every commit
Start with the tests that run in under two minutes - typically unit tests and fast integration tests. Configure your CI system to run these automatically on every push to every branch. The goal is to get the shortest meaningful feedback loop running without any human involvement. Flaky tests that would slow this down should be quarantined and fixed rather than ignored.
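One way to keep local and CI runs identical is a single gate script that both invoke. A hedged sketch (the file name `ci_fast_gate.py` is hypothetical; it assumes fast tests live under `tests/unit` and pytest is the runner):

```python
import subprocess
import sys

FAST_TEST_DIR = "tests/unit"  # assumption: fast tests live here

def fast_gate_command() -> list:
    # The exact command the CI system runs on every push. Building it in
    # one place means developers and the pipeline run the same thing.
    return [sys.executable, "-m", "pytest", FAST_TEST_DIR, "-q"]

def run_fast_gate() -> int:
    # check=False: we want the exit code so CI can mark the stage
    # failed, not an exception.
    return subprocess.run(fast_gate_command(), check=False).returncode

if __name__ == "__main__":
    sys.exit(run_fast_gate())
```

The CI configuration then calls `python ci_fast_gate.py` on every push; a nonzero exit fails the stage, and developers run the identical command locally.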
Step 3: Add integration and contract tests to the pipeline
After the fast gate is stable, add the slower test suites as subsequent stages in the pipeline. These may run in parallel to keep total pipeline duration reasonable. Make these stages required - a pipeline run that skips them should not be allowed to proceed to deployment.
Step 4: Remove or deprecate manual triggers
Once the automated pipeline covers what the manual process covered, remove the manual trigger options or mark them clearly as deprecated. The goal is to make “run tests manually” unnecessary, not to maintain it as a parallel path. If stakeholders are accustomed to requesting manual test runs, communicate the change and the new process for reviewing test results.
Step 5: Enforce the pipeline as the deployment gate
Configure your deployment tooling to require a passing pipeline run before any deployment proceeds. In GitHub-based workflows this is a branch protection rule. In other systems it is a pipeline dependency. The pipeline must be the only path to production - not a recommendation but a hard gate.
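Where branch protection is not available, the same gate can be scripted. A hedged sketch against GitHub’s combined-status endpoint (`GET /repos/{owner}/{repo}/commits/{ref}/status`; error handling trimmed for brevity):

```python
import json
import urllib.request

def combined_state(owner: str, repo: str, ref: str, token: str) -> str:
    # Returns the commit's combined CI status: "success", "pending",
    # or "failure".
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/commits/{ref}/status",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["state"]

def may_deploy(state: str) -> bool:
    # The hard gate: only a passing pipeline run unlocks deployment.
    return state == "success"
```

The deploy tooling calls `may_deploy(combined_state(...))` and refuses to proceed on anything but `success` - making the pipeline the only path to production.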
| Objection | Response |
|---|---|
| “Our tests take too long to run automatically every time.” | Start by automating only the fast tests. Speed up the slow ones over time using parallelization. Running slow tests automatically is still better than running no tests automatically. |
| “Developers should be trusted to run tests before pushing.” | Trust is not a reliability mechanism. Automation runs every time without judgment calls about whether it is necessary. |
| “We do not have a CI system set up.” | Most source control hosts (GitHub, GitLab, Bitbucket) include CI tooling at no additional cost. Setup time is typically under a day for basic pipelines. |
| “Our tests are flaky and will block everyone if we make them required.” | Flaky tests are a separate problem that needs fixing, but that does not mean tests should stay optional. Quarantine known flaky tests and fix them while running the stable ones automatically. |
Measuring Progress
| Metric | What to look for |
|---|---|
| Build duration | Decreasing as flaky or redundant tests are fixed and parallelized; stable execution time per commit |
| Change fail rate | Declining trend as automated tests catch regressions before they reach production |
| Lead time | Reduction in the time between commit and deployable state as manual test wait times are eliminated |
| Mean time to repair | Shorter repair cycles because defects are caught earlier when the developer still has context |
| Development cycle time | Reduced waiting time between code complete and merge as manual QA handoff steps are eliminated |