Architecting Tests for CD

Test architecture, types, and good practices for building confidence in your delivery pipeline.

A test architecture that lets your pipeline deploy confidently, regardless of external system availability, is a core CD capability. The child pages cover each test type.

A CD pipeline’s job is to force every artifact to prove it is worthy of delivery. That proof only works when test changes ship with the code they validate. If a developer adds a feature but the corresponding tests arrive in a later commit, the pipeline approved an artifact it never actually verified. That is not a CD pipeline. It is a CI pipeline with a deploy step. Tests and production code must always travel together through the pipeline as a single unit of change.

Beyond the Test Pyramid

The test pyramid: a triangle with Unit Tests at the wide base (fast, cheap, many), Integration/Component in the middle, and End-to-End at the narrow top (slow, expensive, few). Arrows on the sides indicate cost and speed increase toward the top.

The test pyramid says: write many fast unit tests at the base, fewer integration tests in the middle, and only a handful of end-to-end tests at the top. The underlying principle is sound - lower-level tests are faster, more deterministic, and cheaper to maintain.

The principle behind the shape

The pyramid’s shape communicates a principle: prefer fast, deterministic tests that you fully control. Tests at the base are cheap to write, fast to run, and reliable. Tests at the top are slow, expensive, and depend on systems outside your control. The more weight you put at the base, the faster and more reliable your pipeline becomes - to a point. We also have the engineering goal of achieving the most functional coverage with the fewest tests. Every test costs money to maintain and adds time to the pipeline.

The testing trophy

The testing trophy: a trophy-shaped diagram where Component Tests form the large diamond-shaped body, Unit Tests form the narrow stem, Static Analysis forms the base pedestal, and End-to-End tests form a small triangle at the peak.

The testing trophy, popularized by Kent C. Dodds, rebalances the pyramid by putting component tests at the center. Where the pyramid emphasizes unit tests at the base, the trophy argues that component tests give you the most confidence per test because they exercise realistic user behavior through a component’s public interface while still using test doubles for external dependencies.

The trophy also makes static analysis explicit as the foundation. Linting, type checking, and formatting catch entire categories of defects for free - no test code to write or maintain.

Both models agree on the principle: keep end-to-end tests few and focused, and maximize fast, deterministic coverage. The trophy simply shifts where that coverage concentrates. For teams building component-heavy applications, the trophy distribution often produces better results than a strict pyramid.

Teams often miss this underlying principle and treat either shape as a metric. They count tests by type and debate ratios - “do we have enough unit tests?” or “do we have too many integration tests?” - when the real question is:

Can our pipeline determine that a change is safe to deploy without depending on any system we do not control?

A pipeline that answers yes can deploy at any time - even when a downstream service is down, a third-party API is slow, or a partner team hasn’t shipped yet. That independence is what CD requires, and it is the reason the pyramid favors the base.

What this looks like in practice

A test architecture that achieves this has three responsibilities:

  1. Fast, deterministic tests - unit, component, and contract tests - run on every commit using test doubles for external dependencies. They give a reliable go/no-go signal in minutes.
  2. Acceptance tests validate that a deployed artifact is deliverable. Acceptance testing is not a single test type. It is a pipeline stage that can include component tests, load tests, chaos tests, resilience tests, and compliance tests. Any test that runs after CI to gate promotion to production is an acceptance test.
  3. Integration tests validate that contract test doubles still match the real external systems. They run in a dedicated test environment with versioned test data, on demand or on a schedule, providing monitoring rather than gating.

The anti-pattern: the ice cream cone

The ice cream cone anti-pattern: an inverted test distribution where most testing effort goes to manual and end-to-end tests at the top, with too few fast unit tests at the bottom

Most teams that struggle with CD have inverted the pyramid - too many slow, flaky end-to-end tests and too few fast, focused ones. Manual gates block every release. The pipeline cannot give a fast, reliable answer, so deployments become high-ceremony events.

Test Architecture

A test architecture is the deliberate structure of how different test types work together across your pipeline to give you deployment confidence. Use the table below to decide what type of test to write and where it runs. This is not a comprehensive list. It shows how common tests impact pipeline design and how teams should structure their suites. See the Pipeline Reference Architecture for a complete quality gate sequence.

Four-lane CD pipeline diagram. Pipeline lane: Commit triggers pre-merge and CI checks (Static Analysis, Unit Tests, Component Tests, Contract Tests - deterministic, blocks merge), then Build, Deploy to test environment, Acceptance Tests in test environment (Component, Load, Chaos, Resilience, Compliance - gates promotion to production), Deploy to production, and a green Live checkmark. Post-deploy lane: Production Verification (Health Checks, Real User Monitoring, SLO) triggered after production deploy - non-deterministic, triggers alerts, never blocks promotion. Async lane: Integration Tests validate contract test doubles against real systems - non-deterministic, post-deploy, failures trigger review. Continuous lane: Exploratory Testing and Usability Testing run continuously alongside delivery and never block.
| Pipeline Stage | What You Need to Verify | Test Type | Speed | Deterministic? | Blocks Deploy? |
| --- | --- | --- | --- | --- | --- |
| CI | A function or method behaves correctly | Unit | Milliseconds | Yes | Yes |
| CI | A complete component or service works through its public interface | Component | Milliseconds to seconds | Yes | Yes |
| CI | Your code correctly interacts with external system interfaces | Contract | Milliseconds to seconds | Yes | Yes |
| CI | Code quality, security, and style compliance | Static Analysis | Seconds | Yes | Yes |
| CI | UI meets WCAG accessibility standards | Static Analysis + Component | Seconds | Yes | Yes |
| Acceptance Testing | Deployed artifact meets acceptance criteria | Deploy, Smoke, Load, Resilience, Compliance, etc. | Minutes | No | Yes - gates production |
| Post-deploy (production) | Critical user journeys work in production | E2E smoke | Seconds to minutes | No | No - triggers rollback |
| Post-deploy (production) | Production health and SLOs | Synthetic monitoring | Continuous | No | No - triggers alerts |
| On demand/scheduled | Contract test doubles still match real external systems | Integration | Seconds to minutes | No | No - triggers review |
| Continuous | Unexpected behavior, edge cases, real-world workflows | Exploratory Testing | Varies | No | Never |
| Continuous | Real users can accomplish goals effectively | Usability Testing | Varies | No | Never |

The critical insight: everything that blocks merge is deterministic and under your control. Acceptance tests gate production promotion after verifying the deployed artifact. Everything that involves real external systems runs post-deployment. This is what gives you the independence to deploy any time, regardless of the state of the world around you.

Pre-merge vs post-merge

The table maps to two distinct phases of your pipeline, each with different goals and constraints.

Pre-merge (before code lands on trunk): Run unit, component, and contract tests. These must all be deterministic and fast. Target: under 10 minutes total. This is the quality gate that every change must pass. If pre-merge tests are slow, developers batch up changes or skip local runs, both of which undermine continuous integration.

Post-merge (after code lands on trunk, before or after deployment): Re-run the full deterministic suite against the integrated trunk. Then run acceptance tests, E2E smoke tests, and synthetic monitoring post-deploy. Integration tests run separately in a test environment, on demand or on a schedule. Target: under 60 minutes for the full post-merge cycle.

Why re-run pre-merge tests post-merge? Two changes can each pass pre-merge independently but conflict when combined on trunk. The post-merge run catches these integration effects.
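As a sketch of how such a conflict can arise - the functions here are hypothetical, invented purely to illustrate the point:

```typescript
// Change A tightens validation: quantity must now be a positive integer.
// Its pre-merge tests pass against the trunk it branched from.
function validateOrder(order: { quantity: number }): boolean {
  return Number.isInteger(order.quantity) && order.quantity > 0;
}

// Change B, written against the older validator that still accepted 0,
// uses quantity 0 to mean "reserve without purchase". Its own pre-merge
// tests also pass, because they ran before Change A landed.
function reserveItem(itemId: string): { itemId: string; quantity: number } {
  return { itemId, quantity: 0 };
}

// Combined on trunk, B's reservations now fail A's validation - a
// conflict neither change's pre-merge run could have seen.
const reservationIsValid = validateOrder(reserveItem("item-42"));
// reservationIsValid === false
```

Each change is individually green; only the post-merge run against the integrated trunk exposes the combination.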

If a post-merge failure occurs, the team fixes it immediately. Trunk must always be releasable.

This post-merge re-run is what teams traditionally call regression testing: running all previous tests against the current artifact to confirm that existing behavior still works after a change. In CD, regression testing is not a separate test type or a special suite. Every test in the pipeline is a regression test. The deterministic suite runs on every commit, and the full suite runs post-merge. If all tests pass, the artifact has been regression-tested.

Good Practices

Do

  • Run tests on every commit. If tests do not run automatically, they will be skipped.
  • Keep the deterministic suite under 10 minutes. If it is slower, developers will stop running it locally.
  • Fix broken tests immediately. A broken test is equivalent to a broken build.
  • Delete tests that do not provide value. A test that never fails and tests trivial behavior is maintenance cost with no benefit.
  • Test behavior, not implementation. Use a black box approach - verify what the code does, not how it does it. As Ham Vocke advises: “if I enter values x and y, will the result be z?” - not the sequence of internal calls that produce z. Avoid white box testing that asserts on internals.
  • Use test doubles for external dependencies. Your deterministic tests should run without network access to external systems.
  • Validate test doubles with contract tests. Test doubles that drift from reality give false confidence.
  • Treat test code as production code. Give it the same care, review, and refactoring attention.
  • Run automated accessibility checks on every commit. WCAG compliance scans are fast, deterministic, and catch violations that are invisible to sighted developers. Treat them like security scans: automate the detectable rules and reserve manual review for subjective judgment.
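The “test behavior, not implementation” guideline above can be illustrated with a small sketch; applyDiscount is a hypothetical function, not part of the examples elsewhere on this page:

```typescript
// Hypothetical function under test.
function applyDiscount(subtotal: number, discountPercent: number): number {
  // How the discount is computed is an implementation detail.
  const factor = 1 - discountPercent / 100;
  return Math.round(subtotal * factor * 100) / 100;
}

// Black box: enter x and y, assert the result is z.
const total = applyDiscount(200, 10);
// total === 180

// White box (avoid): asserting that `factor` was computed a particular
// way, or spying on Math.round, couples the test to internals that
// should be free to change without breaking any test.
```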

Do Not

  • Do not tolerate flaky tests. Quarantine or delete them immediately.
  • Do not gate your pipeline on non-deterministic tests. E2E and integration test failures should trigger review or alerts, not block deployment.
  • Do not couple your deployment to external system availability. If a third-party API being down prevents you from deploying, your test architecture has a critical gap.
  • Do not write tests after the fact as a checkbox exercise. Tests written without understanding the behavior they verify add noise, not value.
  • Do not test private methods directly. Test the public interface; private methods are tested indirectly.
  • Do not share mutable state between tests. Each test should set up and tear down its own state.
  • Do not use sleep/wait for timing-dependent tests. Use explicit waits, polling, or event-driven assertions.
  • Do not require a running database or external service for unit or component tests. That makes them integration or end-to-end tests - which is fine, but categorize them correctly and run them post-deployment, not as a pre-merge gate.
  • Do not make exploratory or usability testing a release gate. These activities are continuous and inform product direction; they are not a pass/fail checkpoint before deployment.
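The sleep/wait guidance above can be sketched as a small polling helper. The names and defaults here are illustrative, not from any specific test library:

```typescript
// Poll a condition instead of sleeping a fixed duration: the test
// proceeds as soon as the condition holds, and fails with a clear
// error if it never does.
async function waitFor(
  condition: () => boolean,
  timeoutMs = 2000,
  intervalMs = 20,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!condition()) {
    if (Date.now() > deadline) {
      throw new Error(`Condition not met within ${timeoutMs}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage: the assertion runs as soon as the event fires, instead of
// always paying a worst-case sleep.
async function demo(): Promise<boolean> {
  let eventFired = false;
  setTimeout(() => { eventFired = true; }, 50);
  await waitFor(() => eventFired);
  return eventFired;
}
```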

Additional concepts drawn from Ham Vocke, The Practical Test Pyramid, and Toby Clemson, Testing Strategies in a Microservice Architecture.

1 - Component Tests

Deterministic tests that verify a complete frontend component or backend service through its public interface, using test doubles for all external dependencies.
Component test pattern: a test actor hits the public interface of a component boundary. Inside the boundary, real internal modules (API Layer, Business Logic, Data Adapter) are wired together. Outside the boundary, a Database and External API are represented by test doubles.

Definition

A component test verifies a complete component - either a frontend component rendered in isolation, or a backend service exercised through its public interface - with test doubles replacing all external dependencies. No real databases, downstream services, or network calls leave the process. The test treats the component as a black box: inputs go in through the public interface (API endpoint, rendered UI), observable outputs come out, and the test asserts only on those outputs.

This is broader than a sociable unit test: where a sociable unit test allows in-process collaborators for a single behavior, a component test exercises the entire assembled component through its public interface.

The goal is to verify the assembled behavior of a component - that its modules, business logic, and interface layer work together correctly - without depending on any system the team does not control.

When to Use

  • You need to verify a complete user-facing feature from input to output within a single deployable unit.
  • You want to test how the UI, business logic, and data layer collaborate without depending on live external services or databases.
  • You need to simulate realistic user workflows (filling in forms, navigating pages, submitting API requests) while keeping the test fast and repeatable.
  • You are validating acceptance criteria for a user story and want a test that maps directly to the specified behavior.
  • You need to verify keyboard navigation, focus management, and screen reader announcements as part of feature verification.

If the test needs a real external dependency (live database, live downstream service), it is an end-to-end test. If it tests a single unit in isolation, it is a unit test.

Characteristics

| Property | Value |
| --- | --- |
| Speed | Milliseconds to seconds |
| Determinism | Always deterministic |
| Scope | A complete frontend component or backend service |
| Dependencies | All external systems replaced with test doubles |
| Network | Localhost only |
| Database | None or in-memory only |
| Breaks build | Yes |

Examples

Backend Service

A component test for a REST API, exercising the full application stack with the downstream inventory service replaced by a test double:

Backend component test - order creation with mocked inventory service
describe("POST /orders", () => {
  it("should create an order and return 201", async () => {
    // Arrange: mock the inventory service response
    httpMock("https://inventory.internal")
      .onGet("/stock/item-42")
      .reply(200, { available: true, quantity: 10 });

    // Act: send a request through the full application stack
    const response = await request(app)
      .post("/orders")
      .send({ itemId: "item-42", quantity: 2 });

    // Assert: verify the public interface response
    expect(response.status).toBe(201);
    expect(response.body.orderId).toBeDefined();
    expect(response.body.status).toBe("confirmed");
  });

  it("should return 409 when inventory is insufficient", async () => {
    httpMock("https://inventory.internal")
      .onGet("/stock/item-42")
      .reply(200, { available: true, quantity: 0 });

    const response = await request(app)
      .post("/orders")
      .send({ itemId: "item-42", quantity: 2 });

    expect(response.status).toBe(409);
    expect(response.body.error).toMatch(/insufficient/i);
  });
});

Frontend Component

A component test exercising a login flow with a mocked authentication service:

Frontend component test - login flow with mocked auth service
describe("Login page", () => {
  it("should redirect to the dashboard after successful login", async () => {
    mockAuthService.login.mockResolvedValue({ token: "abc123" });

    render(<App />);
    await userEvent.type(screen.getByLabelText("Email"), "ada@example.com");
    await userEvent.type(screen.getByLabelText("Password"), "s3cret");
    await userEvent.click(screen.getByRole("button", { name: "Sign in" }));

    expect(await screen.findByText("Dashboard")).toBeInTheDocument();
  });
});

Accessibility Verification

Component tests already exercise the UI from the actor’s perspective, making them the natural place to verify that interactions work for all users. Accessibility assertions fit alongside existing assertions rather than in a separate test suite.

Accessibility component test - keyboard navigation and WCAG assertions
// accessibility scanner setup

describe("Checkout flow", () => {
  it("should be completable using only the keyboard", async () => {
    render(<CheckoutPage />);

    await userEvent.tab();
    expect(screen.getByLabelText("Card number")).toHaveFocus();

    await userEvent.type(screen.getByLabelText("Card number"), "4111111111111111");
    await userEvent.tab();
    await userEvent.type(screen.getByLabelText("Expiry"), "12/27");
    await userEvent.tab();
    await userEvent.keyboard("{Enter}");

    expect(await screen.findByText("Order confirmed")).toBeInTheDocument();

    const results = await accessibilityScanner(document.body);
    expect(results).toHaveNoViolations();
  });
});

Anti-Patterns

  • Using live external services: making real network calls to external systems makes the test non-deterministic and slow. Replace everything outside the component boundary with test doubles.
  • Using a live database: a live database introduces ordering dependencies and shared state between tests. Use in-memory databases or mocked data layers.
  • Ignoring the actor’s perspective: component tests should interact with the system the way a user or API consumer would. Reaching into internal state or bypassing the public interface defeats the purpose.
  • Duplicating unit test coverage: component tests should focus on feature-level behavior and happy/critical paths. Leave exhaustive edge case and permutation testing to unit tests.
  • Slow test setup: if bootstrapping the component takes too long, invest in faster initialization (in-memory stores, lazy loading) rather than skipping component tests.
  • Deferring accessibility testing to manual audits: automated WCAG checks in component tests catch violations on every commit. Quarterly audits find problems that are weeks old.

Connection to CD Pipeline

Component tests run after unit tests in the pipeline and provide the broadest fast, deterministic feedback before code is promoted:

  1. Local development: run before committing. Deterministic scope keeps them fast enough to run locally without slowing the development loop.
  2. PR verification: CI executes the full suite; failures block merge.
  3. Trunk verification: the same tests run on the merged HEAD to catch conflicts.
  4. Pre-deployment gate: component tests can serve as the final deterministic gate before a build artifact is promoted.

Because component tests are deterministic, they should always break the build on failure. A healthy CD pipeline relies on a strong component test suite to verify assembled behavior - not just individual units - before any code reaches an environment with real dependencies.

2 - Contract Tests

Deterministic tests that verify interface boundaries with external systems using test doubles. Also called narrow integration tests. Validated by integration tests running against real systems.
Consumer-driven contract flow: consumer team runs a component test against a provider test double, generating a contract artifact. The provider team runs a verification step against the real service using the consumer contract. Both sides discover different things: consumers check for fields and types they depend on; providers check they have not broken any consumer.

Definition

A contract test (also called a narrow integration test) is a deterministic test that validates your code’s interaction with an external system’s interface using test doubles. It verifies that the boundary layer code - HTTP clients, database query layers, message producers - correctly handles the expected request/response shapes, field names, types, and status codes.

A contract test validates interface structure, not business behavior. It answers “does my code correctly interact with the interface I expect?” not “is the logic correct?” Business logic belongs in component tests.

Because contract tests use test doubles rather than live systems, they are deterministic and run on every commit as part of the pipeline. They block the build on failure, just like unit and component tests.

Integration tests validate that contract test doubles still match the real external systems by running against live dependencies post-deployment.

Consumer and Provider Perspectives

Every contract has two sides. The questions each side is trying to answer are different.

Consumer contract testing

The consumer is the service or component that depends on an external API. A consumer contract test asks:

“Do the fields I depend on still exist, in the types I expect, with the status codes I handle?”

Consumer tests assert only on the subset of the API the consumer actually uses - not everything the provider exposes. A consumer that only needs id and email from a user object should not assert on address or phone. This allows providers to add new fields freely without breaking consumers.

Following Postel’s Law - “be conservative in what you send, be liberal in what you accept” - consumer tests should accept any valid response that contains the fields they need, and tolerate fields they do not use.
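Applied to consumer code, this looks like a tolerant reader; parseStockResponse and its fields are illustrative, echoing the inventory example used elsewhere on this page:

```typescript
// A tolerant reader: the consumer checks only the fields it uses and
// ignores everything else the provider sends.
interface StockInfo {
  available: boolean;
  quantity: number;
}

function parseStockResponse(body: Record<string, unknown>): StockInfo {
  const available = body.available;
  const quantity = body.quantity;
  if (typeof available !== "boolean" || typeof quantity !== "number") {
    throw new Error("Response is missing a field this consumer depends on");
  }
  // Fields the consumer does not use (warehouse, sku, ...) are ignored,
  // so the provider can add them freely without breaking this consumer.
  return { available, quantity };
}

// A provider adding a new field does not break this consumer:
const parsed = parseStockResponse({ available: true, quantity: 10, warehouse: "EU-1" });
// parsed is { available: true, quantity: 10 }
```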

What a consumer is trying to discover:

  • Has the provider changed or removed a field I depend on?
  • Has the provider changed a type I expect (string to integer, object to array)?
  • Has the provider changed a status code I handle?
  • Does the provider still accept the request format I send?

Provider contract testing

The provider is the service that owns the API. A provider contract test asks:

“Have my changes broken any of my consumers?”

A provider runs contract tests to verify that its API responses still satisfy the expectations of every known consumer. This gives early warning - before any consumer deploys and discovers the breakage - that a change is breaking.

What a provider is trying to discover:

  • Have I removed or renamed a field that a consumer depends on?
  • Have I changed a type in a way that breaks deserialization for a consumer?
  • Have I changed error behavior (status codes, error formats) that consumers handle?
  • Is my API still backward compatible with all published consumer expectations?

Approaches to Contract Testing

Consumer-driven contract development

In consumer-driven contracts (CDC), the consumer writes the contract. The consumer defines their expectations as executable tests - what request they will send and what response shape they require. These expectations are published to a shared contract broker and the provider runs them as part of their own build.

The flow:

  1. Consumer team writes tests defining their expectations against a mock provider.
  2. The consumer tests generate a contract artifact.
  3. The contract is published to a shared contract broker.
  4. The provider team runs the consumer’s contract expectations against their real implementation.
  5. If the provider’s implementation satisfies the contract, the provider can deploy with confidence it will not break this consumer. If not, the teams negotiate before merging the breaking change.

CDC works well for evolving systems: it grounds the API design in actual consumer needs rather than the provider’s assumptions about what consumers will use.

Contract-first development

In contract-first development, the interface is defined as a formal artifact - an OpenAPI specification, a Protobuf schema, an Avro schema, or similar - before any implementation is written. Both the consumer and provider code are generated from or validated against that artifact.

The flow:

  1. Teams agree on the interface contract (usually during design or story refinement).
  2. The contract is committed to version control.
  3. Consumer and provider teams develop independently, each generating or validating their code against the contract.
  4. Tests on both sides verify conformance to the contract - not to each other’s implementation.

Contract-first works well for new APIs and parallel development: it lets consumer and provider teams work simultaneously without waiting for a real implementation, and makes the interface an explicit design decision rather than an emergent one.

Choosing between them

| Situation | Prefer |
| --- | --- |
| Existing API with multiple consumers, evolving over time | Consumer-driven (CDC) |
| New API, teams working in parallel | Contract-first |
| Third-party API you do not control | Consumer-only contract tests (no provider side) |
| Public API with external consumers you cannot reach | Provider tests against published spec |

The two approaches are not mutually exclusive. A team may define an initial contract-first schema and then adopt CDC tooling as the number of consumers grows.

Characteristics

| Property | Value |
| --- | --- |
| Speed | Milliseconds to seconds |
| Determinism | Always deterministic (uses test doubles) |
| Scope | Interface boundary between two systems |
| Dependencies | All replaced with test doubles |
| Network | None or localhost only |
| Database | None |
| Breaks build | Yes |

Examples

A consumer contract test using a consumer-driven contract tool:

Consumer contract test - order service consuming inventory API
describe("Order Service - Inventory Provider Contract", () => {
  it("should receive stock availability in the expected format", async () => {
    // Define what the consumer expects from the provider
    await contractTool.addInteraction({
      state: "item-42 is in stock",
      uponReceiving: "a request for item-42 stock",
      withRequest: { method: "GET", path: "/stock/item-42" },
      willRespondWith: {
        status: 200,
        body: {
          // Only assert on fields the consumer actually uses
          available: matchType(true),   // boolean
          quantity: matchType(10),      // integer
        },
      },
    });

    // Exercise the consumer code against the mock provider
    const result = await inventoryClient.checkStock("item-42");
    expect(result.available).toBe(true);
  });
});

A provider verification test that runs consumer expectations against the real implementation:

Provider verification - running consumer contracts against the real API
describe("Inventory Service - Provider Verification", () => {
  it("should satisfy all registered consumer contracts", async () => {
    await contractBroker.verifyProvider({
      provider: "InventoryService",
      providerBaseUrl: "http://localhost:3001",
      brokerUrl: "https://contract-broker.internal",
      providerVersion: process.env.GIT_SHA,
    });
  });
});

A contract-first schema validation test verifying a provider response against an OpenAPI spec:

Contract-first test - OpenAPI schema validation
describe("GET /stock/:id - OpenAPI contract", () => {
  it("should return a response conforming to the published schema", async () => {
    const response = await fetch("http://localhost:3001/stock/item-42");
    const body = await response.json();

    // Validate against the OpenAPI schema, not specific values
    expect(response.status).toBe(200);
    expect(typeof body.available).toBe("boolean");
    expect(typeof body.quantity).toBe("number");
    // Additional fields the consumer does not use are not asserted on
  });
});

Anti-Patterns

  • Asserting on business logic: contract tests verify structure, not behavior. A contract test that asserts quantity > 0 when in stock is crossing into business logic territory. That belongs in component tests.
  • Asserting on fields the consumer does not use: over-specified consumer contracts make providers brittle. Only assert on what your code actually reads.
  • Testing specific data values: asserting that name equals "Alice" makes the test brittle. Assert on types, required fields, and status codes instead.
  • Hitting live systems in contract tests: contract tests must use test doubles to stay deterministic. Validating doubles against live systems is the role of integration tests, which run post-deployment.
  • Running infrequently: contract tests should run often enough to catch drift before it causes a production incident. High-volatility APIs may need hourly runs.
  • Skipping provider verification in CDC: publishing consumer expectations is only half the pattern. The provider must actually run those expectations for CDC to work.

Connection to CD Pipeline

Contract tests run on every commit as part of the deterministic pipeline:

Contract tests in the pipeline
| When | Tests | Deterministic? | Effect |
| --- | --- | --- | --- |
| On every commit | Unit tests | Yes | Blocks |
| On every commit | Component tests | Yes | Blocks |
| On every commit | Contract tests | Yes | Blocks |
| Post-deployment | Integration tests | No | Validates contract doubles |
| Post-deployment | E2E smoke tests | No | Triggers rollback |

Contract tests verify that your boundary layer code correctly interacts with the interfaces you depend on. Integration tests validate that those test doubles still match the real external systems by running against live dependencies post-deployment.

3 - End-to-End Tests

Tests that exercise two or more real components up to the full system. Non-deterministic by nature; never a pre-merge gate.
End-to-end test scope spectrum. Narrow scope: a test drives a real service that calls a real database. Full-system scope: a browser drives a real frontend, which calls a real backend, which calls a real database. All components are real at every scope - no test doubles.

Definition

An end-to-end test exercises real components working together - no test doubles replace the dependencies under test. The scope ranges from two services calling each other, to a service talking to a real database, to a complete user journey through every layer of the system.

The defining characteristic is that real external dependencies are present: actual databases, live downstream services, real message brokers, or third-party APIs. Because those dependencies introduce timing, state, and availability factors outside the test’s control, end-to-end tests are typically non-deterministic. They fail for reasons unrelated to code correctness - network instability, service unavailability, test data collisions, or third-party rate limits.

Terminology note

“Integration test” and “end-to-end test” are often used interchangeably in the industry. Martin Fowler distinguishes between narrow integration tests (which use test doubles at the boundary - what this site calls contract tests) and broad integration tests (which use real dependencies). This site treats them as distinct categories: integration tests validate that contract test doubles still match the real external systems, while end-to-end tests exercise user journeys or multi-service flows through real systems.

Scope

End-to-end tests cover a spectrum based on how many components are real:

| Scope | Example |
| --- | --- |
| Narrow | A service making real calls to a real database |
| Service-to-service | Order service calling the real inventory service |
| Multi-service | A user journey spanning three live services |
| Full system | A browser test through a staging environment with all dependencies live |

All of these involve real external dependencies. All share the same fundamental non-determinism risk. Use the narrowest scope that gives you the confidence you need.

When to Use

Use end-to-end tests sparingly. They are the most expensive test type to write, run, and maintain. Use them for:

  • Smoke testing a deployed environment to verify that key integrations are functioning after a deployment.
  • Happy-path validation of critical business flows that cannot be verified any other way (e.g., a payment flow that depends on a real payment provider).
  • Cross-team workflows that span multiple deployables and cannot be isolated within a single component test.

Do not use end-to-end tests to cover edge cases, error handling, or input validation. Those scenarios belong in unit or component tests, which are faster, cheaper, and deterministic.

Vertical vs. horizontal

Vertical end-to-end tests target features owned by a single team:

  • An order is created and the confirmation email is sent.
  • A user uploads a file and it appears in their document list.

Horizontal end-to-end tests span multiple teams:

  • A user navigates from homepage through search, product detail, cart, and checkout.

Horizontal tests have a large failure surface and are significantly more fragile. They are not suitable for blocking the pipeline; run them on a schedule and review failures out of band.

Characteristics

Property      Value
Speed         Seconds to minutes per test
Determinism   Typically non-deterministic
Scope         Two or more real components, up to the full system
Dependencies  Real services, databases, brokers, third-party APIs
Network       Full network access
Database      Live databases
Breaks build  No - triggers review or rollback, not a pre-merge gate

Examples

A narrow end-to-end test verifying a service against a real database:

Narrow E2E - order service against a real database
describe("OrderRepository (real database)", () => {
  it("should persist and retrieve an order by ID", async () => {
    const order = await orderRepository.create({
      itemId: "item-42",
      quantity: 2,
      customerId: "cust-99",
    });

    const retrieved = await orderRepository.findById(order.id);
    expect(retrieved.itemId).toBe("item-42");
    expect(retrieved.status).toBe("pending");
  });
});

A full-system browser test using a browser automation framework:

Full-system E2E - add to cart and checkout with browser automation
test("user can add an item to cart and check out", async ({ page }) => {
  await page.goto("https://staging.example.com");
  await page.getByRole("link", { name: "Running Shoes" }).click();
  await page.getByRole("button", { name: "Add to Cart" }).click();

  await page.getByRole("link", { name: "Cart" }).click();
  await expect(page.getByText("Running Shoes")).toBeVisible();

  await page.getByRole("button", { name: "Checkout" }).click();
  await expect(page.getByText("Order confirmed")).toBeVisible();
});

Anti-Patterns

  • Using end-to-end tests as the primary safety net: this is the ice cream cone anti-pattern. The majority of your confidence should come from unit and component tests, which are fast and deterministic. End-to-end tests are expensive insurance for the gaps.
  • Blocking the pipeline: end-to-end tests must never be a pre-merge gate. Their non-determinism will eventually block a deploy for reasons unrelated to code quality.
  • Blocking on horizontal tests: horizontal tests span too many teams and failure surfaces. Run them on a schedule and review failures as a team.
  • Ignoring flaky failures: track frequency and root cause. A test that fails for environmental reasons is not providing a code quality signal - fix it or remove it.
  • Testing edge cases here: exhaustive permutation testing in end-to-end tests is slow, expensive, and duplicates what unit and component tests should cover.
  • Not capturing failure context: end-to-end failures are expensive to debug. Capture screenshots, network logs, and video recordings automatically on failure.

Connection to CD Pipeline

End-to-end tests run after deployment, not before:

E2E tests in the pipeline
Stage 1 (every commit)    Unit tests              Deterministic    Blocks
                          Component tests         Deterministic    Blocks
                          Contract tests          Deterministic    Blocks

Post-deployment           Integration tests       Non-deterministic   Validates contract doubles
                          E2E smoke tests         Non-deterministic   Triggers rollback
                          Scheduled E2E suites    Non-deterministic   Review out of band
                          Synthetic monitoring    Non-deterministic   Triggers alerts

A team may choose to gate on a small, highly reliable set of vertical end-to-end smoke tests immediately after deployment. This is acceptable only if the team invests in keeping those tests stable. A flaky smoke gate is worse than no gate: it trains developers to ignore failures.

Use contract tests to verify that the test doubles in your component tests still match reality. This gives you deterministic pre-merge confidence without depending on live external systems.

4 - Test Feedback Speed

Why test suite speed matters for developer effectiveness and how cognitive limits set the targets.

Why speed has a threshold

The 10-minute CI target and the preference for sub-second unit tests are not arbitrary. They come from how human cognition handles interrupted work. When a developer makes a change and waits for test results, three things determine whether that feedback is useful: whether the developer still holds the mental model of the change, whether they can act on the result immediately, and whether the wait is short enough that they do not context-switch to something else.

Research on task interruption and working memory consistently shows that context switches are expensive. Gloria Mark’s research at UC Irvine found that it takes an average of 23 minutes for a person to fully regain deep focus after being interrupted during a task, and that interrupted tasks take twice as long and contain twice as many errors as uninterrupted ones.1 If the test suite itself takes 30 minutes, the total cost of a single feedback cycle approaches an hour - and most of that time is spent re-loading context, not fixing code.

The cognitive breakpoints

Jakob Nielsen’s foundational research on response times identified three thresholds that govern how users perceive and respond to system delays: 0.1 seconds (feels instantaneous), 1 second (noticeable but flow is maintained), and 10 seconds (attention limit - the user starts thinking about other things).2 These thresholds, rooted in human perceptual and cognitive limits, apply directly to developer tooling.

Different feedback speeds produce fundamentally different developer behaviors:

Under 1 second: feels instantaneous. The developer stays in flow, treating the test result as part of the editing cycle.2 Working memory is fully intact; the change and the result are experienced as a single action.

1 to 10 seconds: the developer waits. Attention may drift briefly but returns without effort. Working memory is intact; the developer can act on the result immediately.

10 seconds to 2 minutes: the developer starts to feel the wait. They may glance at another window or check a message, but they do not start a new task. Working memory begins to decay; the developer can still recover context quickly, but each additional second increases the chance of distraction.2

2 to 10 minutes: the developer context-switches. They check email, review a PR, or start thinking about a different problem. When the result arrives, they must actively return to the original task. Working memory is partially lost; rebuilding context takes several minutes depending on the complexity of the change.1

Over 10 minutes: the developer fully disengages and starts a different task. The test result arrives as an interruption to whatever they are now doing. Working memory of the original change is gone; rebuilding it takes upward of 23 minutes.1 Investigating a failure means re-reading code they wrote an hour ago.

The 10-minute CI target exists because it is the boundary between “developer waits and acts on the result” and “developer starts something else and pays a full context-switch penalty.” Below 10 minutes, feedback is actionable. Above 10 minutes, feedback becomes an interruption. DORA’s research on continuous integration reinforces this: tests should complete in under 10 minutes to support the fast feedback loops that high-performing teams depend on.3

What this means for test architecture

These cognitive breakpoints should drive how you structure your test suite:

Local development (under 1 second). Unit tests for the code you are actively changing should run in watch mode, re-executing on every save. At this speed, TDD becomes natural - the test result is part of the writing process, not a separate step. This is where you test complex logic with many permutations.

Pre-push verification (under 2 minutes). The full unit test suite and the component tests for the component you changed should complete before you push. At this speed, the developer stays engaged and acts on failures immediately. This is where you catch regressions.

CI pipeline (under 10 minutes). The full deterministic suite - all unit tests, all component tests, all contract tests - should complete within 10 minutes of commit. At this speed, the developer has not yet fully disengaged from the change. If CI fails, they can investigate while the code is still fresh.

Post-deploy verification (minutes to hours). E2E smoke tests and integration test validation run after deployment. These are non-deterministic, slower, and less frequent. Failures at this level trigger investigation, not immediate developer action.

When a test suite exceeds 10 minutes, the solution is not to accept slower feedback. It is to redesign the suite: replace E2E tests with component tests using test doubles, parallelize test execution, and move non-deterministic tests out of the gating path.

Impact on application architecture

Test feedback speed is not just a testing concern - it puts pressure on how you design your systems. A monolithic application with a single test suite that takes 40 minutes to run forces every developer to pay the full context-switch penalty on every change, regardless of which module they touched.

Breaking a system into smaller, independently testable components is often motivated as much by test speed as by deployment independence. When a component has its own focused test suite that runs in under 2 minutes, the developer working on that component gets fast, relevant feedback. They do not wait for tests in unrelated modules to finish.

This creates a virtuous cycle: smaller components with clear boundaries produce faster test suites, which enable more frequent integration, which encourages smaller changes, which are easier to test. Conversely, a tightly coupled monolith produces a slow, tangled test suite that discourages frequent integration, which leads to larger changes, which are harder to test and more likely to fail.

Architecture decisions that improve test feedback speed include:

  • Clear component boundaries with well-defined interfaces, so each component can be tested in isolation with test doubles for its dependencies.
  • Separating business logic from infrastructure so that core rules can be unit tested in milliseconds without databases, queues, or network calls.
  • Independently deployable services with their own test suites, so a change to one service does not require running the entire system’s tests.
  • Avoiding shared mutable state between components, which forces integration tests and introduces non-determinism.

If your test suite is slow and you cannot make it faster by optimizing test execution alone, the architecture is telling you something. A system that is hard to test quickly is also hard to change safely - and both problems have the same root cause.

The compounding cost of slow feedback

Slow feedback does not just waste time - it changes behavior. When the suite takes 40 minutes, developers adapt:

  • They batch changes to avoid running the suite more than necessary, creating larger and riskier commits.
  • They stop running tests locally because the wait is unacceptable during active development.
  • They push to CI and context-switch, paying the full rebuild penalty on every cycle.
  • They rerun failures instead of investigating, because re-reading the code they wrote an hour ago is expensive enough that “maybe it was flaky” feels like a reasonable bet.

Each of these behaviors degrades quality independently. Together, they make continuous integration impossible. A team that cannot get feedback on a change within 10 minutes cannot sustain the practice of integrating changes multiple times per day.4

Sources



  1. Gloria Mark, Daniela Gudith, and Ulrich Klocke, “The Cost of Interrupted Work: More Speed and Stress,” Proceedings of CHI 2008, ACM.

  2. Jakob Nielsen, “Response Times: The 3 Important Limits,” Nielsen Norman Group, 1993 (updated 2014). Based on research originally published in Miller 1968 and Card et al. 1991.

  3. “Continuous Integration,” DORA capabilities research, Google Cloud.

  4. Nicole Forsgren, Jez Humble, and Gene Kim, Accelerate: The Science of Lean Software and DevOps, IT Revolution Press, 2018.

5 - Integration Tests

Tests that exercise real external dependencies to validate that contract test doubles still match reality. Non-deterministic; never a pre-merge gate.

“Integration test” is widely used but inconsistently defined. On this site, integration tests are tests that involve real external dependencies - actual databases, live downstream services, real message brokers, or third-party APIs. They are non-deterministic because those dependencies introduce timing, state, and availability factors outside the test’s control.

Integration tests serve a specific role in the test architecture: they validate that the test doubles used in your contract tests still match reality. Without integration tests, contract test doubles can silently drift from the real behavior of the systems they simulate - giving false confidence.

Because integration tests depend on live systems, they run post-deployment or on a schedule - never as a pre-merge gate. Failures trigger review or rollback decisions, not build failures.

For tests that validate interface boundaries using test doubles (deterministic), see Contract Tests.

For full-system browser tests and multi-service smoke tests, see End-to-End Tests.

6 - Static Analysis

Code analysis tools that evaluate non-running code for security vulnerabilities, complexity, and best practice violations.

Definition

Static analysis (also called static testing) evaluates non-running code against rules for known good practices. Unlike other test types that execute code and observe behavior, static analysis inspects source code, configuration files, and dependency manifests to detect problems before the code ever runs.

Static analysis serves several key purposes:

  • Catches errors that would otherwise surface at runtime.
  • Warns of excessive complexity that degrades the ability to change code safely.
  • Identifies security vulnerabilities and coding patterns that provide attack vectors.
  • Enforces coding standards by removing subjective style debates from code reviews.
  • Alerts to dependency issues such as outdated packages, known CVEs, license incompatibilities, or supply-chain compromises.

When to Use

Static analysis should run continuously, at every stage where feedback is possible:

  • In the IDE: real-time feedback as developers type, via editor plugins and language server integrations.
  • On save: format-on-save and lint-on-save catch issues immediately.
  • Pre-commit: hooks prevent problematic code from entering version control.
  • In CI: the full suite of static checks runs on every PR and on the trunk after merge, verifying that earlier local checks were not bypassed.

Static analysis is always applicable. Every project, regardless of language or platform, benefits from linting, formatting, and dependency scanning.

Characteristics

Property      Value
Speed         Seconds (typically the fastest test category)
Determinism   Always deterministic
Scope         Entire codebase (source, config, dependencies)
Dependencies  None (analyzes code at rest)
Network       None (except dependency scanners)
Database      None
Breaks build  Yes

Examples

Linting

A .eslintrc.json configuration enforcing test quality rules:

Linter configuration for test quality rules
{
  "plugins": ["jest"],
  "rules": {
    "jest/no-disabled-tests": "warn",
    "jest/expect-expect": "error",
    "jest/no-commented-out-tests": "error",
    "jest/valid-expect": "error",
    "no-unused-vars": "error",
    "no-console": "warn"
  }
}

Type Checking

Statically typed languages catch type mismatches at compile time, eliminating entire classes of runtime errors. Java, for example, rejects incompatible argument types before the code runs:

Java type checking example
public static double calculateTotal(double price, int quantity) {
    return price * quantity;
}

// Compiler error: incompatible types: String cannot be converted to double
calculateTotal("19.99", 3);

Dependency Scanning

Dependency scanning tools scan for known vulnerabilities:

npm audit output example
$ npm audit
found 2 vulnerabilities (1 moderate, 1 high)
  moderate: Prototype Pollution in lodash < 4.17.21
  high:     Arbitrary Code Injection in underscore 1.3.2 - 1.12.0

Types of Static Analysis

Type                   Purpose
Linting                Catches common errors and enforces good practices
Formatting             Enforces consistent code style, removing subjective debates
Complexity analysis    Flags overly deep or long code blocks that breed defects
Type checking          Prevents type-related bugs, replacing some unit tests
Security scanning      Detects known vulnerabilities and dangerous coding patterns
Dependency scanning    Checks for outdated, hijacked, or insecurely licensed deps
Accessibility linting  Detects missing alt text, ARIA violations, contrast failures, semantic HTML issues

Accessibility Linting

Accessibility linting catches deterministic WCAG violations the same way a security scanner catches known vulnerability patterns. Automated checks cover structural issues (missing alt text, invalid ARIA attributes, insufficient contrast ratios, broken heading hierarchy) while manual review covers subjective aspects like whether alt text is actually meaningful.

An accessibility checker configuration running WCAG 2.1 AA checks against rendered pages:

Accessibility checker configuration for WCAG 2.1 AA
{
  "defaults": {
    "standard": "WCAG2AA",
    "timeout": 10000,
    "wait": 1000
  },
  "urls": [
    "http://localhost:1313/docs/",
    "http://localhost:1313/docs/testing/"
  ]
}

An accessibility scanner test asserting that a rendered component has no violations:

Accessibility scanner test verifying no WCAG violations
// accessibility scanner setup (e.g. import scanner and extend assertions)

it("should have no accessibility violations", async () => {
  const { container } = render(<LoginForm />);
  const results = await accessibilityScanner(container);
  expect(results).toHaveNoViolations();
});

Anti-Patterns

  • Disabling rules instead of fixing code: suppressing linter warnings or ignoring security findings erodes the value of static analysis over time.
  • Not customizing rules: default rulesets are a starting point. Write custom rules for patterns that come up repeatedly in code reviews.
  • Running static analysis only in CI: by the time CI reports a formatting error, the developer has context-switched. IDE plugins and pre-commit hooks provide immediate feedback.
  • Ignoring dependency vulnerabilities: known CVEs in dependencies are a direct attack vector. Treat high-severity findings as build-breaking.
  • Treating static analysis as optional: static checks should be mandatory and enforced. If developers can bypass them, they will.

Connection to CD Pipeline

Static analysis is the first gate in the CD pipeline, providing the fastest feedback:

  1. IDE / local development: plugins run in real time as code is written.
  2. Pre-commit: hooks run linters, formatters, and accessibility checks on changed components, blocking commits that violate rules.
  3. PR verification: CI runs the full static analysis suite (linting, type checking, security scanning, dependency auditing, accessibility linting) and blocks merge on failure.
  4. Trunk verification: the same checks re-run on the merged HEAD to catch anything missed.
  5. Scheduled scans: dependency and security scanners run on a schedule to catch newly disclosed vulnerabilities in existing dependencies.

Because static analysis requires no running code, no test environment, and no external dependencies, it is the cheapest and fastest form of quality verification. A mature CD pipeline treats static analysis failures the same as test failures: they break the build.

7 - Test Doubles

Patterns for isolating dependencies in tests: stubs, mocks, fakes, spies, and dummies.

Definition

Test doubles are stand-in objects that replace real production dependencies during testing. The term comes from the film industry’s “stunt double.” Just as a stunt double replaces an actor for dangerous scenes, a test double replaces a costly or non-deterministic dependency to make tests fast, isolated, and reliable.

Test doubles allow you to:

  • Remove non-determinism by replacing network calls, databases, and file systems with predictable substitutes.
  • Control test conditions by forcing specific states, error conditions, or edge cases that would be difficult to reproduce with real dependencies.
  • Increase speed by eliminating slow I/O operations.
  • Isolate the system under test so that failures point directly to the code being tested, not to an external dependency.

Types of Test Doubles

Dummy: passed around but never actually used; fills parameter lists. Example use case: a required logger parameter in a constructor.

Stub: provides canned answers to calls made during the test; does not respond to anything outside what is programmed. Example use case: returning a fixed user object from a repository.

Spy: a stub that also records information about how it was called (arguments, call count, order). Example use case: verifying that an analytics event was sent once.

Mock: pre-programmed with expectations about which calls will be made; verification happens on the mock itself. Example use case: asserting that sendEmail() was called with specific arguments.

Fake: has a working implementation, but takes shortcuts not suitable for production. Example use case: an in-memory database replacing PostgreSQL.

Choosing the Right Double

  • Use stubs when you need to supply data but do not care how it was requested.
  • Use spies when you need to verify call arguments or call count.
  • Use mocks when the interaction itself is the primary thing being verified.
  • Use fakes when you need realistic behavior but cannot use the real system.
  • Use dummies when a parameter is required by the interface but irrelevant to the test.

When to Use

Test doubles are used in every layer of deterministic testing:

  • Unit tests: nearly all dependencies are replaced with test doubles to achieve full isolation.
  • Component tests: all dependencies that cross the component boundary (external APIs, databases, downstream services) are replaced to maintain determinism.

Test doubles should be used less in later pipeline stages. End-to-end tests use no test doubles by design.

Examples

A JavaScript stub providing a canned response:

JavaScript stub returning a fixed user
// Stub: return a fixed user regardless of input
const userRepository = {
  findById: stub().returns(Promise.resolve({
    id: "u1",
    name: "Ada Lovelace",
    email: "ada@example.com",
  })),
};

const user = await userService.getUser("u1");
expect(user.name).toBe("Ada Lovelace");

A Java spy verifying interaction:

Java spy verifying call count with a mocking framework
@Test
public void shouldCallUserServiceExactlyOnce() {
    UserService spyService = spy(userService);
    doReturn(testUser).when(spyService).getUserInfo("u123");

    User result = spyService.getUserInfo("u123");

    verify(spyService, times(1)).getUserInfo("u123");
    assertEquals("Ada", result.getName());
}

A fake in-memory repository:

JavaScript fake in-memory repository
class FakeUserRepository {
  constructor() {
    this.users = new Map();
  }
  save(user) {
    this.users.set(user.id, user);
  }
  findById(id) {
    return this.users.get(id) || null;
  }
}

Anti-Patterns

  • Mocking what you do not own: wrapping a third-party API in a thin adapter and mocking the adapter is safer than mocking the third-party API directly. Direct mocks couple your tests to the library’s implementation.
  • Over-mocking: replacing every collaborator with a mock turns the test into a mirror of the implementation. Tests become brittle and break on every refactor. Only mock what is necessary to maintain determinism.
  • Not validating test doubles: if the real dependency changes its contract, your test doubles silently drift. Use contract tests to keep doubles honest.
  • Complex mock setup: if setting up mocks requires dozens of lines, the system under test may have too many dependencies. Consider refactoring the production code rather than adding more mocks.
  • Using mocks to test implementation details: asserting on the exact sequence and count of internal method calls creates change-detector tests. Prefer asserting on observable output.
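The first anti-pattern deserves a sketch. A thin adapter you own, wrapped around a hypothetical vendor SDK (PaymentGateway and createCharge are illustrative names), gives tests a narrow seam to double instead of the vendor library's internals:

```javascript
// Thin adapter over a hypothetical third-party payment SDK.
// PaymentGateway is the seam you own; tests double this adapter
// (or the sdk argument it wraps), never the vendor's internals.
class PaymentGateway {
  constructor(sdk) {
    this.sdk = sdk; // vendor client, injected
  }
  async charge(amountCents, cardToken) {
    // Translate the vendor's shape into the domain's shape
    const result = await this.sdk.createCharge({
      amount: amountCents,
      source: cardToken,
    });
    return { ok: result.status === "succeeded", chargeId: result.id };
  }
}

// In component tests, a stub of the narrow interface is enough:
const stubGateway = {
  charge: async () => ({ ok: true, chargeId: "ch_test" }),
};
```

If the vendor changes its API, only the adapter and its contract tests change; the rest of the test suite keeps stubbing the same small interface.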

Connection to CD Pipeline

Test doubles are a foundational technique that enables the fast, deterministic tests required for continuous delivery:

  • Early pipeline stages (static analysis, unit tests, component tests, contract tests) rely heavily on test doubles to stay fast and deterministic. This is where the majority of defects are caught.
  • Later pipeline stages (integration tests, E2E tests, production monitoring) use fewer or no test doubles, trading speed for realism.
  • Integration tests run post-deployment to validate that the test doubles used in contract tests still match the real external systems.

The guiding principle from Justin Searls applies: “Don’t poke too many holes in reality.” Use test doubles when you must, but prefer real implementations when they are fast and deterministic.

8 - Unit Tests

Fast, deterministic tests that verify a unit of behavior through its public interface, asserting on what the code does rather than how it works.
Solitary unit test: test actor sends input to a Unit Under Test; all collaborators are replaced by test doubles. Sociable unit test: test actor sends input to a Unit Under Test that uses real in-process collaborators; only external I/O is replaced by a test double.

Definition

A unit test is a deterministic test that exercises a unit of behavior (a single meaningful action or decision your code makes) and verifies that the observable outcome is correct. The “unit” is not a function, method, or class. It is a behavior: given these inputs, the system produces this result. A single behavior may involve one function or several collaborating objects. What matters is that the test treats the code as a black box and asserts only on what it produces, not on how it produces it.

All external dependencies are replaced with test doubles so the test runs quickly and produces the same result every time.

Solitary vs. sociable unit tests

A solitary unit test replaces all collaborators with test doubles and exercises a single class or function in complete isolation.

A sociable unit test allows real in-process collaborators to participate - for example, a service object calling a real domain model - while still replacing any external I/O (network, database, file system) with test doubles. Both styles are unit tests as long as no real external dependency is involved.

When the scope expands to an entire frontend component or a complete backend service exercised through its public API, that is a component test.

White box testing (asserting on internal method calls, call order, or private state) creates change-detector tests that break during routine refactoring without catching real defects. Prefer testing through the public interface (methods, APIs, exported functions) and asserting on return values, state changes visible to consumers, or observable side effects.

The purpose of unit tests is to:

  • Verify that a unit of behavior produces the correct observable outcome.
  • Cover high-complexity logic where many input permutations exist, such as business rules, calculations, and state transitions.
  • Keep cyclomatic complexity visible and manageable through good separation of concerns.

When to Use

  • During development: run the relevant subset of unit tests continuously while writing code. TDD (Red-Green-Refactor) is the most effective workflow.
  • On every commit: use pre-commit hooks or watch-mode test runners so broken tests never reach the remote repository.
  • In CI: execute the full unit test suite on every pull request and on the trunk after merge to verify nothing was missed locally.

Unit tests are the right choice when the behavior under test can be exercised without network access, file system access, or database connections. If you need any of those, you likely need a component test or an end-to-end test instead.

Characteristics

Property      Value
Speed         Milliseconds per test
Determinism   Always deterministic
Scope         A single unit of behavior
Dependencies  All replaced with test doubles
Network       None
Database      None
Breaks build  Yes

Examples

A JavaScript unit test verifying a pure utility function:

JavaScript unit test for castArray utility
// castArray.test.js
describe("castArray", () => {
  it("should wrap non-array items in an array", () => {
    expect(castArray(1)).toEqual([1]);
    expect(castArray("a")).toEqual(["a"]);
    expect(castArray({ a: 1 })).toEqual([{ a: 1 }]);
  });

  it("should return array values by reference", () => {
    const array = [1];
    expect(castArray(array)).toBe(array);
  });

  it("should return an empty array when no arguments are given", () => {
    expect(castArray()).toEqual([]);
  });
});

A Java unit test using a mocking framework to isolate the system under test:

Java unit test with mocking framework stub isolating the controller
@Test
public void shouldReturnUserDetails() {
    // Arrange
    User mockUser = new User("Ada", "Engineering");
    when(userService.getUserInfo("u123")).thenReturn(mockUser);

    // Act
    User result = userController.getUser("u123");

    // Assert
    assertEquals("Ada", result.getName());
    assertEquals("Engineering", result.getDepartment());
}

Anti-Patterns

  • White box testing: asserting on internal state, call order, or private method behavior rather than observable output. These change-detector tests break during refactoring without catching real defects. Test through the public interface instead.
  • Testing private methods: private implementations are meant to change. They are exercised indirectly through the behavior they support. Test the public interface instead.
  • No assertions: a test that runs code without asserting anything provides false confidence. Lint rules can catch this automatically.
  • Disabling or skipping tests: skipped tests erode confidence over time. Fix or remove them.
  • Confusing “unit” with “function”: a unit of behavior may span multiple collaborating objects. Forcing one-test-per-function creates brittle tests that mirror the implementation structure rather than verifying meaningful outcomes.
  • Ice cream cone testing: relying primarily on slow E2E tests while neglecting fast unit tests inverts the test pyramid and slows feedback.
  • Chasing coverage numbers: gaming coverage metrics (e.g., running code paths without meaningful assertions) creates a false sense of confidence. Focus on behavior coverage instead.

Connection to CD Pipeline

Unit tests occupy the base of the test pyramid. They run in the earliest stages of the CD pipeline and provide the fastest feedback loop:

  1. Local development: watch mode reruns tests on every save.
  2. Pre-commit: hooks run the suite before code reaches version control.
  3. PR verification: CI runs the full suite and blocks merge on failure.
  4. Trunk verification: CI reruns tests on the merged HEAD to catch integration issues.

Because unit tests are fast and deterministic, they should always break the build on failure. A healthy CD pipeline depends on a large, reliable suite of black box unit tests that verify behavior rather than implementation, giving developers the confidence to refactor freely and ship small changes frequently.
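The first two feedback loops above can be wired up locally. This sketch assumes Vitest as the runner and husky as the hook manager; any equivalent tools work the same way:

```shell
# Sketch only: Vitest and husky are assumptions, not prescriptions.

# Step 1 - watch mode: rerun tests on every save
npx vitest                # watch mode is Vitest's default outside CI

# Step 2 - pre-commit hook: run the suite before code reaches version control
npx husky init
echo "npx vitest run" > .husky/pre-commit

# Steps 3-4 run the same command ("npx vitest run") in CI,
# once on the PR branch and again on the merged trunk HEAD.
```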

9 - Testing Glossary

Definitions for testing terms as they are used on this site.

These definitions reflect how this site uses each term. They are not universal definitions - other communities may use the same words differently.

Component Test

A deterministic test that verifies a complete frontend component or backend service through its public interface, with test doubles for all external dependencies. See Component Tests for full definition and examples.

Referenced in: Component Tests, End-to-End Tests, Tests Randomly Pass or Fail, Unit Tests

Black Box Testing

A testing approach where the test exercises code through its public interface and asserts only on observable outputs - return values, state changes visible to consumers, or side effects such as messages sent. The test has no knowledge of internal implementation details. Black box tests are resilient to refactoring because they verify what the code does, not how it does it. Contrast with white box testing.

Referenced in: CD Testing, Unit Tests

Acceptance Tests

Automated tests that verify a system behaves as specified. Acceptance tests exercise user workflows in a production-like environment and confirm the implementation matches the acceptance criteria. They answer “did we build what was specified?” rather than “does the code work?” They do not validate whether the specification itself is correct - only real user feedback can confirm we are building the right thing.

In CD, acceptance testing is a pipeline stage, not a single test type. It can include component tests, load tests, chaos tests, resilience tests, and compliance tests. Any test that runs after CI to gate promotion to production is an acceptance test.

Referenced in: CD Testing, Pipeline Reference Architecture

Sociable Unit Test

A unit test that allows real in-process collaborators to participate - for example, a service object calling a real domain model or value object - while still replacing any external I/O (network, database, file system) with test doubles. The “unit” being tested is a behavior that spans multiple in-process objects. When the scope expands to the entire public interface of a frontend component or backend service, that is a component test.

Referenced in: Unit Tests, Component Tests

Solitary Unit Test

A unit test that replaces all collaborators with test doubles and exercises a single class or function in complete isolation. Contrast with sociable unit test, which allows real in-process collaborators while still replacing external I/O.

Referenced in: Unit Tests
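A minimal JavaScript sketch of the sociable/solitary distinction; PriceCalculator and Discount are hypothetical in-process collaborators, not code from this site:

```javascript
// Hypothetical in-process collaborators.
class Discount {
  constructor(rate) { this.rate = rate; }
  apply(amount) { return amount * (1 - this.rate); }
}

class PriceCalculator {
  constructor(discount) { this.discount = discount; }
  total(amount) { return this.discount.apply(amount); }
}

// Sociable: the real Discount participates in the test;
// only external I/O (network, database) would be doubled.
const sociable = new PriceCalculator(new Discount(0.5)).total(100);
console.assert(sociable === 50, "sociable test uses the real collaborator");

// Solitary: the collaborator is replaced with a stub,
// isolating PriceCalculator completely.
const stubDiscount = { apply: () => 50 };
const solitary = new PriceCalculator(stubDiscount).total(100);
console.assert(solitary === 50, "solitary test doubles every collaborator");
```

Both tests verify the same observable behavior; they differ only in how much of the real object graph is allowed to participate.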

TDD (Test-Driven Development)

A development practice where tests are written before the production code that makes them pass. TDD supports CD by ensuring high test coverage, driving simple design, and producing a fast, reliable test suite. TDD feeds into the testing fundamentals required in Phase 1.

Referenced in: CD for Greenfield Projects, Integration Frequency, Inverted Test Pyramid, Small Batches, TBD Migration Guide, Trunk-Based Development, Unit Tests

Synthetic Monitoring

Automated scripts that continuously execute realistic user journeys or API calls against a live production (or production-like) environment and alert when those journeys fail or degrade. Unlike passive monitoring that watches for errors in real user traffic, synthetic monitoring proactively simulates user behavior on a schedule, so problems are detected even during low-traffic periods. Synthetic monitors are non-deterministic (they depend on live external systems) and are never a pre-merge gate. Failures trigger alerts or rollback decisions, not build blocks.

Referenced in: Architecting Tests for CD, End-to-End Tests

Virtual Service

A test double that simulates a real external service over the network, responding to HTTP requests with pre-configured or recorded responses. Unlike in-process stubs or mocks, a virtual service runs as a standalone process and is accessed via real network calls, making it suitable for component testing and end-to-end testing where your application needs to make actual HTTP requests against a dependency. Service virtualization tools can create virtual services from recorded traffic or API specifications. See Test Doubles.

Referenced in: Component Tests, End-to-End Tests, Testing Fundamentals

White Box Testing

A testing approach where the test has knowledge of and asserts on internal implementation details - specific methods called, call order, internal state, or code paths taken. White box tests verify how the code works, not what it produces. These tests are fragile because any refactoring of internals breaks them, even when behavior is unchanged. Avoid white box testing in unit tests; prefer black box testing that asserts on observable outcomes.

Referenced in: CD Testing, Unit Tests