A test architecture that lets your pipeline deploy confidently, regardless of external system availability, is a core CD capability. The child pages cover each test type.
A CD pipeline’s job is to force every artifact to prove it is worthy of delivery. That proof only works when test changes ship with the code they validate. If a developer adds a feature but the corresponding tests arrive in a later commit, the pipeline approved an artifact it never actually verified. That is not a CD pipeline. It is a CI pipeline with a deploy step. Tests and production code must always travel together through the pipeline as a single unit of change.
Beyond the Test Pyramid
The test pyramid says: write many fast unit tests at the base, fewer integration tests in the middle, and only a handful of end-to-end tests at the top. The underlying principle is sound - lower-level tests are faster, more deterministic, and cheaper to maintain.
The principle behind the shape
The pyramid’s shape communicates a principle: prefer fast, deterministic tests that you fully control. Tests at the
base are cheap to write, fast to run, and reliable. Tests at the top are slow, expensive, and depend on systems outside
your control. The more weight you put at the base, the faster and more reliable your pipeline becomes - to a point. We also have the engineering goal of achieving the most functional coverage with the fewest tests. Every test costs money to maintain and adds time to the pipeline.
The testing trophy
The testing trophy, popularized by Kent C. Dodds, rebalances the pyramid by putting component tests at the center. Where the pyramid emphasizes unit tests at the base, the trophy argues that component tests give you the most confidence per test because they exercise realistic user behavior through a component’s public interface while still using test doubles for external dependencies.
The trophy also makes static analysis explicit as the foundation. Linting, type checking, and formatting catch entire categories of defects for free - no test code to write or maintain.
Both models agree on the principle: keep end-to-end tests few and focused, and maximize fast, deterministic coverage. The trophy simply shifts where that coverage concentrates. For teams building component-heavy applications, the trophy distribution often produces better results than a strict pyramid.
Teams often miss this underlying principle and treat either shape as a metric. They count tests by type and debate ratios - “do we have enough unit tests?” or “are our integration tests too many?” - when the real question is:
Can our pipeline determine that a change is safe to deploy without depending on any system we do not control?
A pipeline that answers yes can deploy at any time - even when a downstream service is down, a third-party API is slow, or a partner team hasn’t shipped yet. That independence is what CD requires, and it is the reason the pyramid favors the base.
What this looks like in practice
A test architecture that achieves this has three responsibilities:
Fast, deterministic tests - unit, component, and contract tests - run on every commit using test doubles for external dependencies. They give a reliable go/no-go signal in minutes.
Acceptance tests validate that a deployed artifact is deliverable. Acceptance testing is not a single test type. It is a pipeline stage that can include component tests, load tests, chaos tests, resilience tests, and compliance tests. Any test that runs after CI to gate promotion to production is an acceptance test.
Integration tests validate that contract test doubles still match the real external systems. They run in a dedicated test environment with versioned test data, on demand or on a schedule, providing monitoring rather than gating.
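The drift check an integration test performs can be sketched in a few lines. This is an illustrative sketch, not a specific tool's API: `stubbedStockResponse`, `fetchLiveStockResponse`, and `doubleMatchesLive` are hypothetical names, and the "live" call is simulated so the example is self-contained.

```javascript
// The response shape the contract test double promises to consumers.
const stubbedStockResponse = { available: true, quantity: 10 };

// Stand-in for a real HTTP call to the live provider in the test environment.
async function fetchLiveStockResponse() {
  return { available: false, quantity: 0, warehouse: "east-1" };
}

// Drift check: every field the double defines must exist on the live
// response with the same type. Extra live fields are tolerated.
function doubleMatchesLive(double, live) {
  return Object.keys(double).every(
    (key) => key in live && typeof live[key] === typeof double[key]
  );
}

async function main() {
  const live = await fetchLiveStockResponse();
  const ok = doubleMatchesLive(stubbedStockResponse, live);
  // A failure here means the double has drifted: alert the team rather
  // than block a deployment, since this runs on a schedule, not pre-merge.
  console.log(ok ? "double matches live system" : "DRIFT DETECTED");
  return ok;
}

main();
```

Because this comparison is about shapes and types rather than values, it stays meaningful even when the live system's data changes between runs.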
The anti-pattern: the ice cream cone
Most teams that struggle with CD have inverted the pyramid - too many slow, flaky end-to-end tests and too few fast, focused ones. Manual gates block every release. The pipeline cannot give a fast, reliable answer, so deployments become high-ceremony events.
Test Architecture
A test architecture is the deliberate structure of how different test types work together across
your pipeline to give you deployment confidence. Use the table below to decide what type of test
to write and where it runs. This is not a comprehensive list. It shows how common tests impact
pipeline design and how teams should structure their suites. See the
Pipeline Reference Architecture
for a complete quality gate sequence.
The critical insight: everything that blocks merge is deterministic and under your
control. Acceptance tests gate production promotion after verifying the deployed artifact.
Everything that involves real external systems runs post-deployment. This is what gives you
the independence to deploy any time, regardless of the state of the world around you.
Pre-merge vs post-merge
The table maps to two distinct phases of your pipeline, each with different goals and
constraints.
Pre-merge (before code lands on trunk): Run unit, component, and contract tests. These must all be
deterministic and fast. Target: under 10 minutes total. This is the quality gate that every
change must pass. If pre-merge tests are slow, developers batch up changes or skip local runs,
both of which undermine continuous integration.
Post-merge (after code lands on trunk, before or after deployment): Re-run the full
deterministic suite against the integrated trunk. Then run acceptance tests, E2E smoke tests, and
synthetic monitoring post-deploy.
Integration tests run separately in a test environment, on demand or on a schedule. Target: under
60 minutes for the full post-merge cycle.
Why re-run pre-merge tests post-merge? Two changes can each pass pre-merge independently but
conflict when combined on trunk. The post-merge run catches these integration effects.
If a post-merge failure occurs, the team fixes it immediately. Trunk must always be releasable.
This post-merge re-run is what teams traditionally call regression testing: running all previous tests against the current artifact to confirm that existing behavior still works after a change. In CD, regression testing is not a separate test type or a special suite. Every test in the pipeline is a regression test. The deterministic suite runs on every commit, and the full suite runs post-merge. If all tests pass, the artifact has been regression-tested.
Good Practices
Do
Run tests on every commit. If tests do not run automatically, they will be skipped.
Keep the deterministic suite under 10 minutes. If it is slower, developers will stop
running it locally.
Fix broken tests immediately. A broken test is equivalent to a broken build.
Delete tests that do not provide value. A test that never fails and tests trivial behavior
is maintenance cost with no benefit.
Test behavior, not implementation. Use a
black box approach - verify what the code
does, not how it does it. As Ham Vocke advises: “if I enter values x and y, will the
result be z?” - not the sequence of internal calls that produce z. Avoid
white box testing that asserts on internals.
Use test doubles for external dependencies. Your deterministic tests should run without
network access to external systems.
Validate test doubles with contract tests. Test doubles that drift from reality give false
confidence.
Treat test code as production code. Give it the same care, review, and refactoring
attention.
Run automated accessibility checks on every commit. WCAG compliance scans are fast,
deterministic, and catch violations that are invisible to sighted developers. Treat them
like security scans: automate the detectable rules and reserve manual review for
subjective judgment.
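Several of these practices - test doubles for external dependencies, black-box assertions through the public interface - combine in one small sketch. The names here (`OrderService`, `inventoryDouble`) are illustrative, not from any specific codebase:

```javascript
// The service takes its inventory client as a constructor argument, so
// deterministic tests can inject a double instead of a real HTTP client.
class OrderService {
  constructor(inventoryClient) {
    this.inventoryClient = inventoryClient;
  }

  // Behavior under test: an order is confirmed only when enough stock exists.
  async createOrder(itemId, quantity) {
    const stock = await this.inventoryClient.checkStock(itemId);
    if (!stock.available || stock.quantity < quantity) {
      return { status: "rejected" };
    }
    return { status: "confirmed", itemId, quantity };
  }
}

// Test double: same interface as the real client, no network access.
const inventoryDouble = {
  async checkStock() {
    return { available: true, quantity: 5 };
  },
};

async function demo() {
  const service = new OrderService(inventoryDouble);
  const order = await service.createOrder("item-42", 2);
  // Assert on what the code does (the returned order), not how it does it.
  console.log(order.status);
  return order;
}

demo();
```

The test never asserts on the sequence of internal calls - only on the result visible through the public interface, so refactoring the internals cannot break it.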
Do Not
Do not tolerate flaky tests. Quarantine or delete them immediately.
Do not gate your pipeline on non-deterministic tests. E2E and integration test failures
should trigger review or alerts, not block deployment.
Do not couple your deployment to external system availability. If a third-party API being
down prevents you from deploying, your test architecture has a critical gap.
Do not write tests after the fact as a checkbox exercise. Tests written without
understanding the behavior they verify add noise, not value.
Do not test private methods directly. Test the public interface; private methods are tested
indirectly.
Do not share mutable state between tests. Each test should set up and tear down its own
state.
Do not use sleep/wait for timing-dependent tests. Use explicit waits, polling, or
event-driven assertions.
Do not require a running database or external service for unit or component tests. That
makes them integration or end-to-end tests - which is fine, but categorize them correctly
and run them post-deployment, not as a pre-merge gate.
Do not make exploratory or usability testing a release gate. These activities are
continuous and inform product direction; they are not a pass/fail checkpoint before deployment.
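The sleep/wait guidance can be illustrated with a minimal polling helper. This is a hypothetical sketch - `waitFor` is not a specific framework's API, though most test frameworks ship an equivalent explicit-wait utility:

```javascript
// Poll a condition until it holds or a timeout elapses, instead of
// guessing a fixed sleep duration that is either too short (flaky)
// or too long (slow).
async function waitFor(condition, { timeoutMs = 2000, intervalMs = 20 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return true;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`condition not met within ${timeoutMs}ms`);
}

// Example: an async operation completes at an unpredictable time.
let jobDone = false;
setTimeout(() => { jobDone = true; }, 50);

waitFor(() => jobDone).then(() => console.log("job observed as done"));
```

The test completes as soon as the condition is true, so it is both faster than a pessimistic sleep and more reliable than an optimistic one.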
Related Content
ACD - How acceptance criteria make testing the constraint that governs agent-generated code
1 - Component Tests
Deterministic tests that verify a complete frontend component or backend service through its public interface, using test doubles for all external dependencies.
Definition
A component test verifies a complete component - either a frontend component rendered
in isolation, or a backend service exercised through its public interface - with
test doubles replacing all external dependencies.
No real databases or downstream services are involved, and no network calls leave the process. The test
treats the component as a black box:
inputs go in through the public interface (API endpoint, rendered UI), observable
outputs come out, and the test asserts only on those outputs.
This is broader than a sociable unit test:
where a sociable unit test allows in-process collaborators for a single behavior, a
component test exercises the entire assembled component through its public interface.
The goal is to verify the assembled behavior of a component - that its modules,
business logic, and interface layer work together correctly - without depending on
any system the team does not control.
When to Use
You need to verify a complete user-facing feature from input to output within
a single deployable unit.
You want to test how the UI, business logic, and data layer collaborate without
depending on live external services or databases.
You need to simulate realistic user workflows (filling in forms, navigating pages,
submitting API requests) while keeping the test fast and repeatable.
You are validating acceptance criteria for a user story and want a test that
maps directly to the specified behavior.
You need to verify keyboard navigation, focus management, and screen reader
announcements as part of feature verification.
If the test needs a real external dependency (live database, live downstream service),
it is an end-to-end test. If it tests a single
unit in isolation, it is a unit test.
Characteristics
| Property | Value |
|---|---|
| Speed | Milliseconds to seconds |
| Determinism | Always deterministic |
| Scope | A complete frontend component or backend service |
| Dependencies | All external systems replaced with test doubles |
| Network | Localhost only |
| Database | None or in-memory only |
| Breaks build | Yes |
Examples
Backend Service
A component test for a REST API, exercising the full application stack with the
downstream inventory service replaced by a test double:
Backend component test - order creation with mocked inventory service
```javascript
describe("POST /orders", () => {
  it("should create an order and return 201", async () => {
    // Arrange: mock the inventory service response
    httpMock("https://inventory.internal")
      .onGet("/stock/item-42")
      .reply(200, { available: true, quantity: 10 });

    // Act: send a request through the full application stack
    const response = await request(app)
      .post("/orders")
      .send({ itemId: "item-42", quantity: 2 });

    // Assert: verify the public interface response
    expect(response.status).toBe(201);
    expect(response.body.orderId).toBeDefined();
    expect(response.body.status).toBe("confirmed");
  });

  it("should return 409 when inventory is insufficient", async () => {
    httpMock("https://inventory.internal")
      .onGet("/stock/item-42")
      .reply(200, { available: true, quantity: 0 });

    const response = await request(app)
      .post("/orders")
      .send({ itemId: "item-42", quantity: 2 });

    expect(response.status).toBe(409);
    expect(response.body.error).toMatch(/insufficient/i);
  });
});
```
Frontend Component
A component test exercising a login flow with a mocked authentication service:
Frontend component test - login flow with mocked auth service
```jsx
describe("Login page", () => {
  it("should redirect to the dashboard after successful login", async () => {
    mockAuthService.login.mockResolvedValue({ token: "abc123" });
    render(<App />);

    await userEvent.type(screen.getByLabelText("Email"), "ada@example.com");
    await userEvent.type(screen.getByLabelText("Password"), "s3cret");
    await userEvent.click(screen.getByRole("button", { name: "Sign in" }));

    expect(await screen.findByText("Dashboard")).toBeInTheDocument();
  });
});
```
Accessibility Verification
Component tests already exercise the UI from the actor’s perspective, making them the
natural place to verify that interactions work for all users. Accessibility assertions
fit alongside existing assertions rather than in a separate test suite.
Accessibility component test - keyboard navigation and WCAG assertions
```jsx
// accessibility scanner setup
describe("Checkout flow", () => {
  it("should be completable using only the keyboard", async () => {
    render(<CheckoutPage />);

    await userEvent.tab();
    expect(screen.getByLabelText("Card number")).toHaveFocus();
    await userEvent.type(screen.getByLabelText("Card number"), "4111111111111111");
    await userEvent.tab();
    await userEvent.type(screen.getByLabelText("Expiry"), "12/27");
    await userEvent.tab();
    await userEvent.keyboard("{Enter}");

    expect(await screen.findByText("Order confirmed")).toBeInTheDocument();

    const results = await accessibilityScanner(document.body);
    expect(results).toHaveNoViolations();
  });
});
```
Anti-Patterns
Using live external services: making real network calls to external systems makes
the test non-deterministic and slow. Replace everything outside the component boundary
with test doubles.
Using a live database: a live database introduces ordering dependencies and shared
state between tests. Use in-memory databases or mocked data layers.
Ignoring the actor’s perspective: component tests should interact with the system
the way a user or API consumer would. Reaching into internal state or bypassing the
public interface defeats the purpose.
Duplicating unit test coverage: component tests should focus on feature-level
behavior and happy/critical paths. Leave exhaustive edge case and permutation testing
to unit tests.
Slow test setup: if bootstrapping the component takes too long, invest in faster
initialization (in-memory stores, lazy loading) rather than skipping component tests.
Deferring accessibility testing to manual audits: automated WCAG checks in
component tests catch violations on every commit. Quarterly audits find problems that
are weeks old.
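As a sketch of the in-memory data layer alternative: each test constructs its own repository instance, so tests stay fast and share no state. `InMemoryOrderRepository` is a hypothetical name, assuming a `create`/`findById` interface like the one in the examples above:

```javascript
// In-memory repository with the same interface as a database-backed one,
// keeping component tests deterministic and free of shared state.
class InMemoryOrderRepository {
  constructor() {
    this.orders = new Map();
    this.nextId = 1;
  }

  async create(fields) {
    const order = { id: String(this.nextId++), status: "pending", ...fields };
    this.orders.set(order.id, order);
    return order;
  }

  async findById(id) {
    return this.orders.get(id) ?? null;
  }
}

// Each test builds a fresh repository, so there are no ordering
// dependencies between tests and no teardown of external state.
async function demo() {
  const repo = new InMemoryOrderRepository();
  const order = await repo.create({ itemId: "item-42", quantity: 2 });
  const found = await repo.findById(order.id);
  console.log(found.status);
  return found;
}

demo();
```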
Connection to CD Pipeline
Component tests run after unit tests in the pipeline and provide the broadest fast,
deterministic feedback before code is promoted:
Local development: run before committing. Deterministic scope keeps them fast
enough to run locally without slowing the development loop.
PR verification: CI executes the full suite; failures block merge.
Trunk verification: the same tests run on the merged HEAD to catch conflicts.
Pre-deployment gate: component tests can serve as the final deterministic gate
before a build artifact is promoted.
Because component tests are deterministic, they should always break the build on
failure. A healthy CD pipeline relies
on a strong component test suite to verify assembled behavior - not just individual
units - before any code reaches an environment with real dependencies.
2 - Contract Tests
Deterministic tests that verify interface boundaries with external systems using test doubles. Also called narrow integration tests. Validated by integration tests running against real systems.
Definition
A contract test (also called a narrow integration test) is a deterministic test that
validates your code’s interaction with an external system’s interface using
test doubles. It verifies that the boundary
layer code - HTTP clients, database query layers, message producers - correctly handles
the expected request/response shapes, field names, types, and status codes.
A contract test validates interface structure, not business behavior. It answers
“does my code correctly interact with the interface I expect?” not “is the logic correct?”
Business logic belongs in component tests.
Because contract tests use test doubles rather than live systems, they are
deterministic and run on every commit as part of the pipeline. They block the build
on failure, just like unit and component tests.
Integration tests validate that contract
test doubles still match the real external systems by running against live dependencies
post-deployment.
Consumer and Provider Perspectives
Every contract has two sides. The questions each side is trying to answer are different.
Consumer contract testing
The consumer is the service or component that depends on an external API. A consumer
contract test asks:
“Do the fields I depend on still exist, in the types I expect, with the status codes
I handle?”
Consumer tests assert only on the subset of the API the consumer actually uses - not
everything the provider exposes. A consumer that only needs id and email from a user
object should not assert on address or phone. This allows providers to add new fields
freely without breaking consumers.
Following Postel’s Law - “be conservative in what you send, be liberal in what you accept” -
consumer tests should accept any valid response that contains the fields they need, and
tolerate fields they do not use.
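A tolerant reader in the consumer's boundary layer is one way to apply this. The sketch below is illustrative (`toUserView` is a hypothetical mapper, not a specific library): it validates only the fields the consumer depends on and deliberately ignores everything else.

```javascript
// Map a provider response down to only the fields this consumer uses,
// so new or unknown provider fields can never break it.
function toUserView(providerResponse) {
  const { id, email } = providerResponse;
  // Fail loudly only on the fields this consumer actually depends on.
  if (typeof id !== "string" || typeof email !== "string") {
    throw new Error("provider response missing required fields");
  }
  // address, phone, and any future fields are deliberately ignored.
  return { id, email };
}

// A provider response carrying extra fields the consumer does not use.
const response = {
  id: "u-1",
  email: "ada@example.com",
  address: "1 Analytical Way", // tolerated, not asserted on
  loyaltyTier: "gold",         // a field added later; also tolerated
};

console.log(toUserView(response));
```

A consumer contract built on this mapper only needs to pin `id` and `email`, which is exactly the freedom the provider needs to evolve the rest of the payload.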
What a consumer is trying to discover:
Has the provider changed or removed a field I depend on?
Has the provider changed a type I expect (string to integer, object to array)?
Has the provider changed a status code I handle?
Does the provider still accept the request format I send?
Provider contract testing
The provider is the service that owns the API. A provider contract test asks:
“Have my changes broken any of my consumers?”
A provider runs contract tests to verify that its API responses still satisfy the
expectations of every known consumer. This gives early warning - before any consumer
deploys and discovers the breakage - that a change is breaking.
What a provider is trying to discover:
Have I removed or renamed a field that a consumer depends on?
Have I changed a type in a way that breaks deserialization for a consumer?
Have I changed error behavior (status codes, error formats) that consumers handle?
Is my API still backward compatible with all published consumer expectations?
Approaches to Contract Testing
Consumer-driven contract development
In consumer-driven contracts (CDC), the consumer writes the contract. The consumer
defines their expectations as executable tests - what request they will send and what
response shape they require. These expectations are published to a shared contract broker and the provider runs them
as part of their own build.
The flow:
Consumer team writes tests defining their expectations against a mock provider.
The consumer tests generate a contract artifact.
The contract is published to a shared contract broker.
The provider team runs the consumer’s contract expectations against their real
implementation.
If the provider’s implementation satisfies the contract, the provider can deploy
with confidence it will not break this consumer. If not, the teams negotiate before
merging the breaking change.
CDC works well for evolving systems: it grounds the API design in actual consumer
needs rather than the provider’s assumptions about what consumers will use.
Contract-first development
In contract-first development, the interface is defined as a formal artifact -
an OpenAPI specification, a Protobuf schema, an Avro schema, or similar - before
any implementation is written. Both the consumer and provider code are generated from
or validated against that artifact.
The flow:
Teams agree on the interface contract (usually during design or story refinement).
The contract is committed to version control.
Consumer and provider teams develop independently, each generating or validating
their code against the contract.
Tests on both sides verify conformance to the contract - not to each other’s
implementation.
Contract-first works well for new APIs and parallel development: it lets consumer
and provider teams work simultaneously without waiting for a real implementation, and
makes the interface an explicit design decision rather than an emergent one.
Choosing between them
| Situation | Prefer |
|---|---|
| Existing API with multiple consumers, evolving over time | Consumer-driven (CDC) |
| New API, teams working in parallel | Contract-first |
| Third-party API you do not control | Consumer-only contract tests (no provider side) |
| Public API with external consumers you cannot reach | Provider tests against published spec |
The two approaches are not mutually exclusive. A team may define an initial contract-first
schema and then adopt CDC tooling as the number of consumers grows.
Characteristics
| Property | Value |
|---|---|
| Speed | Milliseconds to seconds |
| Determinism | Always deterministic (uses test doubles) |
| Scope | Interface boundary between two systems |
| Dependencies | All replaced with test doubles |
| Network | None or localhost only |
| Database | None |
| Breaks build | Yes |
Examples
A consumer contract test using a consumer-driven contract tool:
Consumer contract test - order service consuming inventory API
```javascript
describe("Order Service - Inventory Provider Contract", () => {
  it("should receive stock availability in the expected format", async () => {
    // Define what the consumer expects from the provider
    await contractTool.addInteraction({
      state: "item-42 is in stock",
      uponReceiving: "a request for item-42 stock",
      withRequest: { method: "GET", path: "/stock/item-42" },
      willRespondWith: {
        status: 200,
        body: {
          // Only assert on fields the consumer actually uses
          available: matchType(true), // boolean
          quantity: matchType(10),    // integer
        },
      },
    });

    // Exercise the consumer code against the mock provider
    const result = await inventoryClient.checkStock("item-42");
    expect(result.available).toBe(true);
  });
});
```
A provider verification test that runs consumer expectations against the real implementation:
Provider verification - running consumer contracts against the real API
```javascript
describe("Inventory Service - Provider Verification", () => {
  it("should satisfy all registered consumer contracts", async () => {
    await contractBroker.verifyProvider({
      provider: "InventoryService",
      providerBaseUrl: "http://localhost:3001",
      brokerUrl: "https://contract-broker.internal",
      providerVersion: process.env.GIT_SHA,
    });
  });
});
```
A contract-first schema validation test verifying a provider response against an OpenAPI spec:
Contract-first test - OpenAPI schema validation
```javascript
describe("GET /stock/:id - OpenAPI contract", () => {
  it("should return a response conforming to the published schema", async () => {
    const response = await fetch("http://localhost:3001/stock/item-42");
    const body = await response.json();

    // Validate against the OpenAPI schema, not specific values
    expect(response.status).toBe(200);
    expect(typeof body.available).toBe("boolean");
    expect(typeof body.quantity).toBe("number");
    // Additional fields the consumer does not use are not asserted on
  });
});
```
Anti-Patterns
Asserting on business logic: contract tests verify structure, not behavior. A contract
test that asserts quantity > 0 when in stock is crossing into business logic territory.
That belongs in component tests.
Asserting on fields the consumer does not use: over-specified consumer contracts make
providers brittle. Only assert on what your code actually reads.
Testing specific data values: asserting that name equals "Alice" makes the test
brittle. Assert on types, required fields, and status codes instead.
Hitting live systems in contract tests: contract tests must use test doubles to stay
deterministic. Validating doubles against live systems is the role of
integration tests, which run post-deployment.
Running infrequently: contract tests should run often enough to catch drift before it
causes a production incident. High-volatility APIs may need hourly runs.
Skipping provider verification in CDC: publishing consumer expectations is only half
the pattern. The provider must actually run those expectations for CDC to work.
Connection to CD Pipeline
Contract tests run on every commit as part of the deterministic pipeline:
Contract tests in the pipeline
| Stage | Test type | Determinism | Pipeline effect |
|---|---|---|---|
| On every commit | Unit tests | Deterministic | Blocks |
| On every commit | Component tests | Deterministic | Blocks |
| On every commit | Contract tests | Deterministic | Blocks |
| Post-deployment | Integration tests | Non-deterministic | Validates contract doubles |
| Post-deployment | E2E smoke tests | Non-deterministic | Triggers rollback |
Contract tests verify that your boundary layer code correctly interacts with the
interfaces you depend on. Integration tests
validate that those test doubles still match the real external systems by running
against live dependencies post-deployment.
3 - End-to-End Tests
Tests that exercise two or more real components up to the full system. Non-deterministic by nature; never a pre-merge gate.
Definition
An end-to-end test exercises real components working together - no
test doubles replace the dependencies under
test. The scope ranges from two services calling each other,
to a service talking to a real database, to a complete user journey through every
layer of the system.
The defining characteristic is that real external dependencies are present: actual
databases, live downstream services, real message brokers, or third-party APIs.
Because those dependencies introduce timing, state, and availability factors outside
the test’s control, end-to-end tests are typically non-deterministic. They fail
for reasons unrelated to code correctness - network instability, service unavailability,
test data collisions, or third-party rate limits.
Terminology note
“Integration test” and “end-to-end test” are often used interchangeably in the
industry. Martin Fowler distinguishes between narrow integration tests (which use test
doubles at the boundary - what this site calls
contract tests) and broad integration tests
(which use real dependencies). This site treats them as distinct categories:
integration tests validate that contract
test doubles still match the real external systems, while end-to-end tests exercise
user journeys or multi-service flows through real systems.
Scope
End-to-end tests cover a spectrum based on how many components are real:
| Scope | Example |
|---|---|
| Narrow | A service making real calls to a real database |
| Service-to-service | Order service calling the real inventory service |
| Multi-service | A user journey spanning three live services |
| Full system | A browser test through a staging environment with all dependencies live |
All of these involve real external dependencies. All share the same fundamental
non-determinism risk. Use the narrowest scope that gives you the confidence you need.
When to Use
Use end-to-end tests sparingly. They are the most expensive test type to write,
run, and maintain. Use them for:
Smoke testing a deployed environment to verify that key integrations are
functioning after a deployment.
Happy-path validation of critical business flows that cannot be verified any
other way (e.g., a payment flow that depends on a real payment provider).
Cross-team workflows that span multiple deployables and cannot be isolated
within a single component test.
Do not use end-to-end tests to cover edge cases, error handling, or input
validation. Those scenarios belong in unit or
component tests, which are faster, cheaper, and
deterministic.
Vertical vs. horizontal
Vertical end-to-end tests target features owned by a single team:
An order is created and the confirmation email is sent.
A user uploads a file and it appears in their document list.
Horizontal end-to-end tests span multiple teams:
A user navigates from homepage through search, product detail, cart, and checkout.
Horizontal tests have a large failure surface and are significantly more fragile.
They are not suitable for blocking the pipeline; run them on a schedule and
review failures out of band.
Characteristics
| Property | Value |
|---|---|
| Speed | Seconds to minutes per test |
| Determinism | Typically non-deterministic |
| Scope | Two or more real components, up to the full system |
| Dependencies | Real services, databases, brokers, third-party APIs |
| Network | Full network access |
| Database | Live databases |
| Breaks build | No - triggers review or rollback, not a pre-merge gate |
Examples
A narrow end-to-end test verifying a service against a real database:
Narrow E2E - order service against a real database
```javascript
describe("OrderRepository (real database)", () => {
  it("should persist and retrieve an order by ID", async () => {
    const order = await orderRepository.create({
      itemId: "item-42",
      quantity: 2,
      customerId: "cust-99",
    });

    const retrieved = await orderRepository.findById(order.id);

    expect(retrieved.itemId).toBe("item-42");
    expect(retrieved.status).toBe("pending");
  });
});
```
A full-system browser test using a browser automation framework:
Full-system E2E - add to cart and checkout with browser automation
```javascript
test("user can add an item to cart and check out", async ({ page }) => {
  await page.goto("https://staging.example.com");
  await page.getByRole("link", { name: "Running Shoes" }).click();
  await page.getByRole("button", { name: "Add to Cart" }).click();
  await page.getByRole("link", { name: "Cart" }).click();
  await expect(page.getByText("Running Shoes")).toBeVisible();

  await page.getByRole("button", { name: "Checkout" }).click();
  await expect(page.getByText("Order confirmed")).toBeVisible();
});
```
Anti-Patterns
Using end-to-end tests as the primary safety net: this is the ice cream cone
anti-pattern. The majority of your confidence should come from unit and
component tests, which are fast and
deterministic. End-to-end tests are expensive insurance for the gaps.
Blocking the pipeline: end-to-end tests must never be a pre-merge gate. Their
non-determinism will eventually block a deploy for reasons unrelated to code quality.
Blocking on horizontal tests: horizontal tests span too many teams and failure
surfaces. Run them on a schedule and review failures as a team.
Ignoring flaky failures: track frequency and root cause. A test that fails for
environmental reasons is not providing a code quality signal - fix it or remove it.
Testing edge cases here: exhaustive permutation testing in end-to-end tests is
slow, expensive, and duplicates what unit and component tests should cover.
Not capturing failure context: end-to-end failures are expensive to debug. Capture
screenshots, network logs, and video recordings automatically on failure.
Connection to CD Pipeline
End-to-end tests run after deployment, not before:
A team may choose to gate on a small, highly reliable set of vertical end-to-end
smoke tests immediately after deployment. This is acceptable only if the team invests
in keeping those tests stable. A flaky smoke gate is worse than no gate: it trains
developers to ignore failures.
Use contract tests to verify that the
test doubles in your component tests still
match reality. This gives you deterministic pre-merge confidence without depending on
live external systems.
4 - Test Feedback Speed
Why test suite speed matters for developer effectiveness and how cognitive limits set the targets.
Why speed has a threshold
The 10-minute CI target and the preference for sub-second unit tests are not arbitrary. They come
from how human cognition handles interrupted work. When a developer makes a change and waits for
test results, three things determine whether that feedback is useful: whether the developer still
holds the mental model of the change, whether they can act on the result immediately, and whether
the wait is short enough that they do not context-switch to something else.
Research on task interruption and working memory consistently shows that context switches are
expensive. Gloria Mark’s research at UC Irvine found that it takes an average of 23 minutes for
a person to fully regain deep focus after being interrupted during a task, and that interrupted
tasks take twice as long and contain twice as many errors as uninterrupted
ones.[1] If the test suite itself takes 30 minutes, the total cost of a single
feedback cycle approaches an hour - and most of that time is spent re-loading context, not fixing
code.
The cognitive breakpoints
Jakob Nielsen’s foundational research on response times identified three thresholds that govern
how users perceive and respond to system delays: 0.1 seconds (feels instantaneous), 1 second
(noticeable but flow is maintained), and 10 seconds (attention limit - the user starts thinking
about other things).[2] These thresholds, rooted in human perceptual and
cognitive limits, apply directly to developer tooling.
Different feedback speeds produce fundamentally different developer behaviors:
| Feedback time | Developer behavior | Cognitive impact |
|---|---|---|
| Under 1 second | Feels instantaneous. The developer stays in flow, treating the test result as part of the editing cycle.[2] | Working memory is fully intact. The change and the result are experienced as a single action. |
| 1 to 10 seconds | The developer waits. Attention may drift briefly but returns without effort. | Working memory is intact. The developer can act on the result immediately. |
| 10 seconds to 2 minutes | The developer starts to feel the wait. They may glance at another window or check a message, but they do not start a new task. | Working memory begins to decay. The developer can still recover context quickly, but each additional second increases the chance of distraction.[2] |
| 2 to 10 minutes | The developer context-switches. They check email, review a PR, or start thinking about a different problem. When the result arrives, they must actively return to the original task. | Working memory is partially lost. Rebuilding context takes several minutes depending on the complexity of the change.[1] |
| Over 10 minutes | The developer fully disengages and starts a different task. The test result arrives as an interruption to whatever they are now doing. | Working memory of the original change is gone. Rebuilding it takes upward of 23 minutes.[1] Investigating a failure means re-reading code they wrote an hour ago. |
The 10-minute CI target exists because it is the boundary between “developer waits and acts on
the result” and “developer starts something else and pays a full context-switch penalty.” Below
10 minutes, feedback is actionable. Above 10 minutes, feedback becomes an interruption. DORA’s
research on continuous integration reinforces this: tests should complete in under 10 minutes to
support the fast feedback loops that high-performing teams depend on.[3]
What this means for test architecture
These cognitive breakpoints should drive how you structure your test suite:
Local development (under 1 second). Unit tests for the code you are actively changing should
run in watch mode, re-executing on every save. At this speed, TDD becomes natural - the test
result is part of the writing process, not a separate step. This is where you test complex logic
with many permutations.
Pre-push verification (under 2 minutes). The full unit test suite and the component tests
for the component you changed should complete before you push. At this speed, the developer
stays engaged and acts on failures immediately. This is where you catch regressions.
CI pipeline (under 10 minutes). The full deterministic suite - all unit tests, all component
tests, all contract tests - should complete within 10 minutes of commit. At this speed, the
developer has not yet fully disengaged from the change. If CI fails, they can investigate while
the code is still fresh.
Post-deploy verification (minutes to hours). E2E smoke tests and integration test validation
run after deployment. These are non-deterministic, slower, and less frequent. Failures at this
level trigger investigation, not immediate developer action.
When a test suite exceeds 10 minutes, the solution is not to accept slower feedback. It is to
redesign the suite: replace E2E tests with component tests using test doubles, parallelize test
execution, and move non-deterministic tests out of the gating path.
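These tiers often map onto separate commands in the project manifest. A sketch assuming a Jest-based JavaScript project (the script names and project labels are illustrative):

```json
{
  "scripts": {
    "test:watch": "jest --watch",
    "test:prepush": "jest --selectProjects unit component",
    "test:ci": "jest --ci --maxWorkers=50%"
  }
}
```

Here `test:watch` serves the sub-second local loop, `test:prepush` the under-2-minute gate, and `test:ci` the full deterministic suite that must finish inside 10 minutes.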
Impact on application architecture
Test feedback speed is not just a testing concern - it puts pressure on how you design your
systems. A monolithic application with a single test suite that takes 40 minutes to run forces
every developer to pay the full context-switch penalty on every change, regardless of which
module they touched.
Breaking a system into smaller, independently testable components is often motivated as much by
test speed as by deployment independence. When a component has its own focused test suite that
runs in under 2 minutes, the developer working on that component gets fast, relevant feedback.
They do not wait for tests in unrelated modules to finish.
This creates a virtuous cycle: smaller components with clear boundaries produce faster test
suites, which enable more frequent integration, which encourages smaller changes, which are
easier to test. Conversely, a tightly coupled monolith produces a slow, tangled test suite that
discourages frequent integration, which leads to larger changes, which are harder to test and
more likely to fail.
Architecture decisions that improve test feedback speed include:
Clear component boundaries with well-defined interfaces, so each component can be tested
in isolation with test doubles for its dependencies.
Separating business logic from infrastructure so that core rules can be unit tested in
milliseconds without databases, queues, or network calls.
Independently deployable services with their own test suites, so a change to one service
does not require running the entire system’s tests.
Avoiding shared mutable state between components, which forces integration tests and
introduces non-determinism.
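The second point - separating business logic from infrastructure - can be sketched like this (all names here are illustrative):

```javascript
// Core business rule: pure, synchronous, no I/O - unit testable in milliseconds.
function applyLoyaltyDiscount(totalCents, loyaltyYears) {
  return loyaltyYears >= 5 ? totalCents - Math.floor(totalCents * 0.1) : totalCents;
}

// Infrastructure (database, payment gateway) is injected, never imported by the
// core, so tests can pass test doubles and the rule above never touches a network.
async function checkout(orderId, { orders, payments }) {
  const order = await orders.findById(orderId);
  const amount = applyLoyaltyDiscount(order.totalCents, order.customer.loyaltyYears);
  return payments.charge(order.customer.id, amount);
}
```

The pricing rule gets exhaustive millisecond-speed unit tests; `checkout` needs only a handful of tests with doubled gateways.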
If your test suite is slow and you cannot make it faster by optimizing test execution alone, the
architecture is telling you something. A system that is hard to test quickly is also hard to
change safely - and both problems have the same root cause.
The compounding cost of slow feedback
Slow feedback does not just waste time - it changes behavior. When the suite takes 40 minutes,
developers adapt:
They batch changes to avoid running the suite more than necessary, creating larger and riskier
commits.
They stop running tests locally because the wait is unacceptable during active development.
They push to CI and context-switch, paying the full rebuild penalty on every cycle.
They rerun failures instead of investigating, because re-reading the code they wrote an hour
ago is expensive enough that “maybe it was flaky” feels like a reasonable bet.
Each of these behaviors degrades quality independently. Together, they make continuous integration
impossible. A team that cannot get feedback on a change within 10 minutes cannot sustain the
practice of integrating changes multiple times per day.[4]
Sources
Nicole Forsgren, Jez Humble, and Gene Kim, Accelerate: The Science of Lean Software and DevOps, IT Revolution Press, 2018.
Further reading
Build Duration - Measuring and improving CI pipeline speed
5 - Integration Tests
Tests that exercise real external dependencies to validate that contract test doubles still match reality. Non-deterministic; never a pre-merge gate.
“Integration test” is widely used but inconsistently defined. On this site, integration
tests are tests that involve real external dependencies - actual databases, live
downstream services, real message brokers, or third-party APIs. They are non-deterministic
because those dependencies introduce timing, state, and availability factors outside the
test’s control.
Integration tests serve a specific role in the test architecture: they validate that the
test doubles used in your
contract tests still match reality. Without
integration tests, contract test doubles can silently drift from the real behavior of the
systems they simulate - giving false confidence.
Because integration tests depend on live systems, they run post-deployment or on a
schedule - never as a pre-merge gate. Failures trigger review or rollback decisions, not
build failures.
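To make the drift check concrete, here is a sketch (all names are illustrative; a real integration test would fetch `realUser` from the live service rather than a literal):

```javascript
// Canned response mirrored from the contract-test double (illustrative data).
const cannedUser = { id: "u1", name: "Ada Lovelace", email: "ada@example.com" };

// True when every field the double promises still exists, with the same type,
// in the real response. Extra fields in the real response are allowed.
function matchesDoubleShape(doubleValue, realValue) {
  if (typeof doubleValue !== typeof realValue) return false;
  if (doubleValue === null || typeof doubleValue !== "object") return true;
  if (realValue === null) return false;
  return Object.keys(doubleValue).every(
    (key) => key in realValue && matchesDoubleShape(doubleValue[key], realValue[key])
  );
}

// In the scheduled integration test, realUser would come from the live API, e.g.:
//   const realUser = await fetch(`${baseUrl}/users/u1`).then((r) => r.json());
const realUser = { id: "u1", name: "Grace Hopper", email: "grace@example.com", plan: "pro" };
```

Different field values are fine (the system is live); a missing or retyped field means the contract-test double has drifted and needs updating.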
For tests that validate interface boundaries using test doubles (deterministic), see
Contract Tests.
For full-system browser tests and multi-service smoke tests, see
End-to-End Tests.
6 - Static Analysis
Code analysis tools that evaluate non-running code for security vulnerabilities, complexity, and best practice violations.
Definition
Static analysis (also called static testing) evaluates non-running code against rules for
known good practices. Unlike other test types that execute code and observe behavior, static
analysis inspects source code, configuration files, and dependency manifests to detect
problems before the code ever runs.
Static analysis serves several key purposes:
Catches errors that would otherwise surface at runtime.
Warns of excessive complexity that degrades the ability to change code safely.
Identifies security vulnerabilities and coding patterns that provide attack vectors.
Enforces coding standards by removing subjective style debates from code reviews.
Alerts to dependency issues such as outdated packages, known CVEs, license
incompatibilities, or supply-chain compromises.
When to Use
Static analysis should run continuously, at every stage where feedback is possible:
In the IDE: real-time feedback as developers type, via editor plugins and language
server integrations.
On save: format-on-save and lint-on-save catch issues immediately.
Pre-commit: hooks prevent problematic code from entering version control.
In CI: the full suite of static checks runs on every PR and on the trunk after merge,
verifying that earlier local checks were not bypassed.
Static analysis is always applicable. Every project, regardless of language or platform,
benefits from linting, formatting, and dependency scanning.
Characteristics
| Property | Value |
|---|---|
| Speed | Seconds (typically the fastest test category) |
| Determinism | Always deterministic |
| Scope | Entire codebase (source, config, dependencies) |
| Dependencies | None (analyzes code at rest) |
| Network | None (except dependency scanners) |
| Database | None |
| Breaks build | Yes |
Examples
Linting
A .eslintrc.json configuration enforcing test quality rules:
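One possible sketch, assuming the community eslint-plugin-jest package is installed:

```json
{
  "plugins": ["jest"],
  "rules": {
    "jest/expect-expect": "error",
    "jest/no-focused-tests": "error",
    "jest/no-disabled-tests": "warn",
    "jest/valid-expect": "error"
  }
}
```

`jest/expect-expect`, for example, fails any test that runs code without asserting anything - catching the "no assertions" anti-pattern automatically.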
Type Checking
Statically typed languages catch type mismatches at compile time, eliminating entire classes
of runtime errors. Java, for example, rejects incompatible argument types before the code runs:
Java type checking example
```java
public static double calculateTotal(double price, int quantity) {
    return price * quantity;
}

// Compiler error: incompatible types: String cannot be converted to double
calculateTotal("19.99", 3);
```
Dependency Scanning
Dependency scanning tools check installed packages against databases of known vulnerabilities:
npm audit output example
```shell
$ npm audit

found 2 vulnerabilities (1 moderate, 1 high)

moderate: Prototype Pollution in lodash <4.17.21
high: Remote Code Execution in log4j <2.17.1
```
| Check type | What it catches |
|---|---|
| Complexity analysis | Flags overly deep or long code blocks that breed defects |
| Type checking | Prevents type-related bugs, replacing some unit tests |
| Security scanning | Detects known vulnerabilities and dangerous coding patterns |
| Dependency scanning | Checks for outdated, hijacked, or insecurely licensed deps |
| Accessibility linting | Detects missing alt text, ARIA violations, contrast failures, semantic HTML issues |
Accessibility Linting
Accessibility linting catches deterministic WCAG violations the same way a security scanner
catches known vulnerability patterns. Automated checks cover structural issues (missing alt
text, invalid ARIA attributes, insufficient contrast ratios, broken heading hierarchy) while
manual review covers subjective aspects like whether alt text is actually meaningful.
An accessibility checker configuration running WCAG 2.1 AA checks against rendered pages:
Accessibility checker configuration for WCAG 2.1 AA
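One possible sketch, using pa11y-ci as the example tool (the tool choice and URLs are illustrative):

```json
{
  "defaults": {
    "standard": "WCAG2AA",
    "timeout": 10000
  },
  "urls": [
    "http://localhost:3000/",
    "http://localhost:3000/login"
  ]
}
```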
An accessibility scanner test asserting that a rendered component has no violations:
Accessibility scanner test verifying no WCAG violations
```jsx
// accessibility scanner setup (e.g. import scanner and extend assertions)
it("should have no accessibility violations", async () => {
  const { container } = render(<LoginForm />);
  const results = await accessibilityScanner(container);
  expect(results).toHaveNoViolations();
});
```
Anti-Patterns
Disabling rules instead of fixing code: suppressing linter warnings or ignoring
security findings erodes the value of static analysis over time.
Not customizing rules: default rulesets are a starting point. Write custom rules for
patterns that come up repeatedly in code reviews.
Running static analysis only in CI: by the time CI reports a formatting error, the
developer has context-switched. IDE plugins and pre-commit hooks provide immediate feedback.
Ignoring dependency vulnerabilities: known CVEs in dependencies are a direct attack
vector. Treat high-severity findings as build-breaking.
Treating static analysis as optional: static checks should be mandatory and enforced.
If developers can bypass them, they will.
Connection to CD Pipeline
Static analysis is the first gate in the CD pipeline, providing the fastest feedback:
IDE / local development: plugins run in real time as code is written.
Pre-commit: hooks run linters, formatters, and accessibility checks on changed
components, blocking commits that violate rules.
PR verification: CI runs the full static analysis suite (linting, type checking,
security scanning, dependency auditing, accessibility linting) and blocks merge on
failure.
Trunk verification: the same checks re-run on the merged HEAD to catch anything
missed.
Scheduled scans: dependency and security scanners run on a schedule to catch newly
disclosed vulnerabilities in existing dependencies.
Because static analysis requires no running code, no test environment, and no external
dependencies, it is the cheapest and fastest form of quality verification. A mature CD
pipeline treats static analysis failures the same as test failures: they break the build.
7 - Test Doubles
Patterns for isolating dependencies in tests: stubs, mocks, fakes, spies, and dummies.
Definition
Test doubles are stand-in objects that replace real production dependencies during testing.
The term comes from the film industry’s “stunt double.” Just as a stunt double replaces an
actor for dangerous scenes, a test double replaces a costly or non-deterministic dependency
to make tests fast, isolated, and reliable.
Test doubles allow you to:
Remove non-determinism by replacing network calls, databases, and file systems with
predictable substitutes.
Control test conditions by forcing specific states, error conditions, or edge cases that
would be difficult to reproduce with real dependencies.
Increase speed by eliminating slow I/O operations.
Isolate the system under test so that failures point directly to the code being tested,
not to an external dependency.
Types of Test Doubles
| Type | Description | Example Use Case |
|---|---|---|
| Dummy | Passed around but never actually used. Fills parameter lists. | A required logger parameter in a constructor. |
| Stub | Provides canned answers to calls made during the test. Does not respond to anything outside what is programmed. | Returning a fixed user object from a repository. |
| Spy | A stub that also records information about how it was called (arguments, call count, order). | Verifying that an analytics event was sent once. |
| Mock | Pre-programmed with expectations about which calls will be made. Verification happens on the mock itself. | Asserting that sendEmail() was called with specific arguments. |
| Fake | Has a working implementation, but takes shortcuts not suitable for production. | An in-memory database replacing PostgreSQL. |
Choosing the Right Double
Use stubs when you need to supply data but do not care how it was requested.
Use spies when you need to verify call arguments or call count.
Use mocks when the interaction itself is the primary thing being verified.
Use fakes when you need realistic behavior but cannot use the real system.
Use dummies when a parameter is required by the interface but irrelevant to the test.
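For example, a fake repository with real but in-memory behavior might look like this sketch (names are illustrative):

```javascript
// Fake: a genuinely working repository, backed by a Map instead of PostgreSQL.
// Suitable for tests only - it loses all data when the process exits.
function makeFakeUserRepository() {
  const rows = new Map();
  return {
    save(user) {
      rows.set(user.id, { ...user }); // copy so later mutation cannot leak in
    },
    findById(id) {
      return rows.get(id) ?? null;
    },
  };
}
```

Unlike a stub, the fake honors real read-after-write behavior, so tests can exercise multi-step workflows against it.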
When to Use
Test doubles are used in every layer of deterministic testing:
Unit tests: nearly all dependencies are replaced with test doubles to
achieve full isolation.
Component tests: all dependencies that cross the component boundary
(external APIs, databases, downstream services) are replaced to maintain determinism.
Test doubles should be used less in later pipeline stages.
End-to-end tests use no test doubles by design.
Examples
A JavaScript stub providing a canned response:
JavaScript stub returning a fixed user
```javascript
// Stub: return a fixed user regardless of input
const userRepository = {
  findById: stub().returns(
    Promise.resolve({
      id: "u1",
      name: "Ada Lovelace",
      email: "ada@example.com",
    })
  ),
};

const user = await userService.getUser("u1");
expect(user.name).toBe("Ada Lovelace");
```
A Java spy verifying interaction:
Java spy verifying call count with a mocking framework
```java
@Test
public void shouldCallUserServiceExactlyOnce() {
    UserService spyService = spy(userService);
    doReturn(testUser).when(spyService).getUserInfo("u123");

    User result = spyService.getUserInfo("u123");

    verify(spyService, times(1)).getUserInfo("u123");
    assertEquals("Ada", result.getName());
}
```
Anti-Patterns
Mocking what you do not own: wrapping a third-party API in a thin adapter and mocking
the adapter is safer than mocking the third-party API directly. Direct mocks couple your
tests to the library’s implementation.
Over-mocking: replacing every collaborator with a mock turns the test into a mirror of
the implementation. Tests become brittle and break on every refactor. Only mock what is
necessary to maintain determinism.
Not validating test doubles: if the real dependency changes its contract, your test
doubles silently drift. Use contract tests to keep doubles honest.
Complex mock setup: if setting up mocks requires dozens of lines, the system under test
may have too many dependencies. Consider refactoring the production code rather than adding
more mocks.
Using mocks to test implementation details: asserting on the exact sequence and count
of internal method calls creates change-detector tests. Prefer asserting on observable
output.
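The "mock what you own" adapter approach above can be sketched like this (names are illustrative; `httpClient` stands in for any third-party client):

```javascript
// Thin adapter owned by the team; tests double this interface, not the vendor library.
function makeWeatherAdapter(httpClient) {
  return {
    async currentTempCelsius(city) {
      const response = await httpClient.get(`/weather?city=${encodeURIComponent(city)}`);
      return response.data.temperatureCelsius;
    },
  };
}
```

If the vendor client changes its API, only the adapter changes; every test that doubles the adapter keeps passing.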
Connection to CD Pipeline
Test doubles are a foundational technique that enables the fast, deterministic tests required
for continuous delivery:
Early pipeline stages (static analysis, unit tests, component tests, contract tests) rely
heavily on test doubles to stay fast and deterministic. This is where the majority of defects
are caught.
Later pipeline stages (integration tests, E2E tests, production monitoring) use fewer or
no test doubles, trading speed for realism.
Integration tests run post-deployment to validate that the test doubles used in contract
tests still match the real external systems.
The guiding principle from Justin Searls applies: “Don’t poke too many holes in reality.”
Use test doubles when you must, but prefer real implementations when they are fast and
deterministic.
8 - Unit Tests
Fast, deterministic tests that verify a unit of behavior through its public interface, asserting on what the code does rather than how it works.
Definition
A unit test is a deterministic test that exercises a unit of behavior (a single
meaningful action or decision your code makes) and verifies that the observable outcome is
correct. The “unit” is not a function, method, or class. It is a behavior: given these inputs,
the system produces this result. A single behavior may involve one function or several
collaborating objects. What matters is that the test treats the code as a
black box and asserts only on what it produces,
not on how it produces it.
All external dependencies are replaced with test doubles so the test runs
quickly and produces the same result every time.
Solitary vs. sociable unit tests
A solitary unit test replaces all collaborators with test doubles and exercises a single
class or function in complete isolation.
A sociable unit test allows real in-process collaborators to participate - for example,
a service object calling a real domain model - while still replacing any external I/O (network,
database, file system) with test doubles. Both styles are unit tests as long as no real external
dependency is involved.
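A sociable unit test might look like this sketch (names are illustrative): the real `Money` value object participates, while the payment gateway, which crosses an I/O boundary, is stubbed:

```javascript
// Real in-process collaborator: a simple value object.
class Money {
  constructor(cents) { this.cents = cents; }
  add(other) { return new Money(this.cents + other.cents); }
}

// System under test: uses the real Money, but the gateway is external I/O.
function makeOrderService(paymentGateway) {
  return {
    payOrder(lineItems) {
      const total = lineItems.reduce((sum, item) => sum.add(item), new Money(0));
      paymentGateway.charge(total.cents); // doubled in the test
      return total;
    },
  };
}

// Sociable unit test: only the external dependency is replaced with a double.
const charges = [];
const service = makeOrderService({ charge: (cents) => charges.push(cents) });
const total = service.payOrder([new Money(250), new Money(100)]);
```

The test asserts on observable outcomes (the returned total, the recorded charge), not on how `Money` and the service collaborate internally.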
When the scope expands to an entire frontend component or a complete backend service exercised
through its public API, that is a component test.
White box testing (asserting on internal method
calls, call order, or private state) creates change-detector tests that break during routine
refactoring without catching real defects. Prefer testing through the public interface (methods,
APIs, exported functions) and asserting on return values, state changes visible to consumers,
or observable side effects.
The purpose of unit tests is to:
Verify that a unit of behavior produces the correct observable outcome.
Cover high-complexity logic where many input permutations exist, such as business rules, calculations, and state transitions.
Keep cyclomatic complexity visible and manageable through good separation of concerns.
When to Use
During development: run the relevant subset of unit tests continuously while writing
code. TDD (Red-Green-Refactor) is the most effective workflow.
On every commit: use pre-commit hooks or watch-mode test runners so broken tests never
reach the remote repository.
In CI: execute the full unit test suite on every pull request and on the trunk after
merge to verify nothing was missed locally.
Unit tests are the right choice when the behavior under test can be exercised without network
access, file system access, or database connections. If you need any of those, you likely need
a component test or an end-to-end test instead.
Characteristics
| Property | Value |
|---|---|
| Speed | Milliseconds per test |
| Determinism | Always deterministic |
| Scope | A single unit of behavior |
| Dependencies | All replaced with test doubles |
| Network | None |
| Database | None |
| Breaks build | Yes |
Examples
A JavaScript unit test verifying a pure utility function:
JavaScript unit test for castArray utility
```javascript
// castArray.test.js
describe("castArray", () => {
  it("should wrap non-array items in an array", () => {
    expect(castArray(1)).toEqual([1]);
    expect(castArray("a")).toEqual(["a"]);
    expect(castArray({ a: 1 })).toEqual([{ a: 1 }]);
  });

  it("should return array values by reference", () => {
    const array = [1];
    expect(castArray(array)).toBe(array);
  });

  it("should return an empty array when no arguments are given", () => {
    expect(castArray()).toEqual([]);
  });
});
```
A Java unit test using a mocking framework to isolate the system under test:
Java unit test with mocking framework stub isolating the controller
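A self-contained sketch of the idea (the names `UserService` and `UserController` are illustrative, and a hand-written stub stands in for the mocking framework so the snippet compiles on its own):

```java
// A stub isolates the controller from the real UserService, so the test is
// fast, deterministic, and needs no network or database.
interface UserService {
    String getUserName(String id);
}

class StubUserService implements UserService {
    @Override
    public String getUserName(String id) {
        return "Ada"; // canned answer - the stub never calls a real backend
    }
}

class UserController {
    private final UserService service;

    UserController(UserService service) {
        this.service = service;
    }

    String greeting(String id) {
        return "Hello, " + service.getUserName(id) + "!";
    }
}

public class UserControllerTest {
    public static void main(String[] args) {
        UserController controller = new UserController(new StubUserService());
        String result = controller.greeting("u123");
        if (!result.equals("Hello, Ada!")) {
            throw new AssertionError("unexpected greeting: " + result);
        }
        System.out.println("test passed");
    }
}
```

The assertion exercises only the controller's public interface; swapping the stub for a framework-generated one changes the setup, not the test's shape.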
Anti-Patterns
White box testing: asserting on internal
state, call order, or private method behavior rather than observable output. These
change-detector tests break during refactoring without catching real defects. Test through
the public interface instead.
Testing private methods: private implementations are meant to change. They are
exercised indirectly through the behavior they support. Test the public interface instead.
No assertions: a test that runs code without asserting anything provides false
confidence. Lint rules can catch this automatically.
Disabling or skipping tests: skipped tests erode confidence over time. Fix or remove
them.
Confusing “unit” with “function”: a unit of behavior may span multiple collaborating
objects. Forcing one-test-per-function creates brittle tests that mirror the implementation
structure rather than verifying meaningful outcomes.
Ice cream cone testing: relying primarily on slow E2E tests while neglecting fast unit
tests inverts the test pyramid and slows feedback.
Chasing coverage numbers: gaming coverage metrics (e.g., running code paths without
meaningful assertions) creates a false sense of confidence. Focus on behavior coverage
instead.
Connection to CD Pipeline
Unit tests occupy the base of the test pyramid. They run in the earliest stages of the
CD pipeline and provide the fastest feedback loop:
Local development: watch mode reruns tests on every save.
Pre-commit: hooks run the suite before code reaches version control.
PR verification: CI runs the full suite and blocks merge on failure.
Trunk verification: CI reruns tests on the merged HEAD to catch integration issues.
Because unit tests are fast and deterministic, they should always break the build on failure.
A healthy CD pipeline depends on a large, reliable suite of
black box unit tests that verify behavior
rather than implementation, giving developers the confidence to refactor freely and ship
small changes frequently.
9 - Testing Glossary
Definitions for testing terms as they are used on this site.
These definitions reflect how this site uses each term. They are not universal definitions -
other communities may use the same words differently.
Component Test
A deterministic test that verifies a complete frontend component or backend service through
its public interface, with test doubles for all external dependencies. See
Component Tests for full definition and examples.
Black Box Testing
A testing approach where the test exercises code through its public interface and asserts
only on observable outputs - return values, state changes visible to consumers, or side
effects such as messages sent. The test has no knowledge of internal implementation details.
Black box tests are resilient to refactoring because they verify what the code does, not
how it does it. Contrast with white box testing.
Acceptance Test
Automated tests that verify a system behaves as specified. Acceptance tests
exercise user workflows in a
production-like environment and confirm the implementation
matches the acceptance criteria. They answer “did we build what was specified?” rather than
“does the code work?” They do not validate whether the specification itself is correct -
only real user feedback can confirm we are building the right thing.
In CD, acceptance testing is a pipeline stage, not a single test type. It can include
component tests, load tests, chaos tests, resilience tests, and compliance tests. Any test
that runs after CI to gate promotion to production is an acceptance test.
Sociable Unit Test
A unit test that allows real in-process collaborators to participate -
for example, a service object calling a real domain model or value object - while still
replacing any external I/O (network, database, file system) with test doubles. The “unit”
being tested is a behavior that spans multiple in-process objects. When the scope expands
to the entire public interface of a frontend component or backend service, that is a
component test.
Solitary Unit Test
A unit test that replaces all collaborators with
test doubles and exercises a single class or
function in complete isolation. Contrast with sociable unit test,
which allows real in-process collaborators while still replacing external I/O.
Test-Driven Development (TDD)
A development practice where tests are written before the production code that makes them
pass. TDD supports CD by ensuring high test coverage, driving simple design, and producing
a fast, reliable test suite. TDD feeds into the testing fundamentals
required in Phase 1.
Synthetic Monitoring
Automated scripts that continuously execute realistic user journeys or API calls against a
live production (or production-like) environment and alert when those journeys fail or degrade.
Unlike passive monitoring that watches for errors in real user traffic, synthetic monitoring
proactively simulates user behavior on a schedule - so problems are detected even during low
traffic periods. Synthetic monitors are non-deterministic (they depend on live external systems)
and are never a pre-merge gate. Failures trigger alerts or rollback decisions, not build blocks.
Virtual Service
A test double that simulates a real external service over the network, responding to HTTP
requests with pre-configured or recorded responses. Unlike in-process stubs or mocks, a
virtual service runs as a standalone process and is accessed via real network calls, making
it suitable for component testing and end-to-end testing where your application needs to
make actual HTTP requests against a dependency. Service virtualization tools can create
virtual services from recorded traffic or API specifications. See
Test Doubles.
White Box Testing
A testing approach where the test has knowledge of and asserts on internal implementation
details - specific methods called, call order, internal state, or code paths taken. White
box tests verify how the code works, not what it produces. These tests are fragile
because any refactoring of internals breaks them, even when behavior is unchanged. Avoid
white box testing in unit tests; prefer black box testing that asserts
on observable outcomes.