Production Visibility and Team Health

Symptoms related to production observability, incident detection, environment parity, and team sustainability.

These symptoms indicate problems with how your team sees and responds to production issues. When problems are invisible until customers report them, or when the team is burning out from process overhead, the delivery system is working against the people in it. Each page describes what you are seeing and links to the anti-patterns most likely causing it.

How to use this section

Start with the symptom that matches what your team experiences. Each symptom page explains what you are seeing, identifies the most likely root causes (anti-patterns), and provides diagnostic questions to narrow down which cause applies to your situation. Follow the anti-pattern link to find concrete fix steps.

Related anti-pattern categories: Monitoring and Observability Anti-Patterns, Organizational and Cultural Anti-Patterns

Related guides: Progressive Rollout, Working Agreements, Metrics-Driven Improvement

1 - The Team Ignores Alerts Because There Are Too Many

Alert volume is so high that pages fire for non-issues. Real problems are lost in the noise.

What you are seeing

The on-call phone goes off fourteen times in a week. Eight of the pages are non-issues that resolve on their own. Three are false positives from a known monitoring misconfiguration that nobody has prioritized fixing. One is a real problem. The on-call engineer, conditioned by a week of false positives, dismisses the real page as another false alarm. The real problem goes unaddressed for four hours.

The team has more alerts than they can respond to meaningfully. Every metric has an alert. The thresholds were set during a brief period when everything was running smoothly and nobody has touched them since. When a database is slow, thirty alerts fire simultaneously for every downstream metric that depends on database performance. The alert storm is worse than the underlying problem.

Alert fatigue develops slowly. It starts with a few noisy alerts that are tolerated because fixing them is less urgent than current work. Each new service adds more alerts with optimistically guessed thresholds. Over time, the signal disappears into the noise, and on-call becomes an exercise in learned helplessness. Real incidents are discovered by users before they are discovered by the team.

Common causes

Blind operations

Teams that have not developed observability as a discipline often configure alerts as an afterthought. Every metric gets an alert, thresholds are guessed rather than calibrated, and alert correlation - multiple alerts from one underlying cause - is never considered. This approach produces alert storms, not actionable signals.

Good alerting requires deliberate design: alerts should be tied to user-visible symptoms rather than internal metrics, thresholds should be calibrated to real traffic patterns, and correlated alerts should suppress to a single notification. This design requires treating observability as a continuous practice rather than a one-time setup.
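
The suppression idea above can be sketched in a few lines. This is an illustrative Python sketch, not a real alerting system (tools such as Prometheus Alertmanager implement grouping and inhibition natively); the field names and the correlation-key scheme are assumptions:

```python
from collections import defaultdict

def suppress_correlated(alerts):
    """Collapse correlated alerts into one notification per root cause.

    Each alert is a dict with a 'correlation_key' (e.g. the upstream
    dependency it fires for) and an optional 'symptom' flag marking
    user-visible impact. Only symptom-level alerts page; correlated
    metric-level alerts are folded in as context instead of paging.
    """
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[alert["correlation_key"]].append(alert)

    notifications = []
    for key, group in grouped.items():
        symptoms = [a for a in group if a.get("symptom")]
        notifications.append({
            "correlation_key": key,
            "page": bool(symptoms),           # page only on user-visible impact
            "primary": (symptoms or group)[0]["name"],
            "suppressed": len(group) - 1,     # correlated alerts folded in
        })
    return notifications
```

With this shape, the thirty database-related alerts from the scenario above become a single page whose primary signal is the user-visible symptom.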

Read more: Blind operations

Missing deployment pipeline

A pipeline provides a natural checkpoint for validating monitoring configuration as part of each deployment. Without a pipeline, monitoring is configured manually at deployment time and never revisited in a structured way. Alert thresholds set at initial deployment are never recalibrated as traffic patterns change.

A pipeline that includes monitoring configuration as code - alert thresholds defined alongside the service code they monitor - makes alert configuration a versioned, reviewable artifact rather than a manual configuration that drifts.
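
As a sketch of what monitoring configuration as code can look like, the following defines alert rules next to the service they monitor and validates them as a pipeline step. The rule fields, services, and thresholds are hypothetical, not a real monitoring tool's schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    window_minutes: int
    description: str

# Versioned in the service repository, reviewed like any other change.
CHECKOUT_ALERTS = [
    AlertRule("http_5xx_rate", 0.01, 5,
              "More than 1% of requests failing over 5 minutes"),
    AlertRule("p99_latency_seconds", 2.0, 10,
              "p99 latency above 2 seconds over 10 minutes"),
]

def validate(rules):
    """Pipeline step: reject configurations that would produce obvious noise."""
    for rule in rules:
        assert rule.threshold > 0, f"{rule.metric}: threshold must be positive"
        assert rule.window_minutes >= 5, (
            f"{rule.metric}: windows under 5 minutes page on transient blips")
        assert rule.description, f"{rule.metric}: every alert needs a description"
    return True
```

Because the rules live in the repository, recalibrating a threshold is a reviewable commit rather than an untracked change in a monitoring UI.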

Read more: Missing deployment pipeline

How to narrow it down

  1. What percentage of pages this week required action? If less than half required action, the alert signal-to-noise ratio is too low. Start with Blind operations.
  2. Are alert thresholds defined as code or set manually in a UI? Manual threshold configuration drifts and is never revisited. Start with Missing deployment pipeline.
  3. Do alerts fire at the symptom level (user-visible problems) or the metric level (internal system measurements)? Metric-level alerts create alert storms when one root cause affects many metrics. Start with Blind operations.

Ready to fix this? The most common cause is Blind operations. Start with its How to Fix It section for week-by-week steps.

2 - Team Burnout and Unsustainable Pace

The team is exhausted. Every sprint is a crunch sprint. There is no time for learning, improvement, or recovery.

What you are seeing

The team is always behind. Sprint commitments are missed or met only through overtime. Developers work evenings and weekends to hit deadlines, then start the next sprint already tired. There is no buffer for unplanned work, so every production incident or stakeholder escalation blows up the plan.

Nobody has time for learning, experimentation, or process improvement. Suggestions like “let’s improve our test suite” or “let’s automate that deployment” are met with “we don’t have time.” The irony is that the manual work those improvements would eliminate is part of what keeps the team too busy.

Attrition risk is high. The most experienced developers leave first because they have options. Their departure increases the load on whoever remains, accelerating the cycle.

Common causes

Thin-Spread Teams

When a small team owns too many products, every developer is stretched across multiple codebases. Context switching consumes 20 to 40 percent of their capacity. The team looks fully utilized but delivers less than a focused team half its size. The utilization trap (“keep everyone busy”) masks the real problem: the team has more responsibilities than it can sustain.

Read more: Thin-Spread Teams

Deadline-Driven Development

When every sprint is driven by an arbitrary deadline, the team never operates at a sustainable pace. There is no recovery period after a crunch because the next deadline starts immediately. Quality is the first casualty, which creates rework, which consumes future capacity, which makes the next deadline even harder to meet. The cycle accelerates until the team collapses.

Read more: Deadline-Driven Development

Unbounded WIP

When there is no limit on work in progress, the team starts many things and finishes few. Every developer juggles multiple items, each getting fragmented attention. The sensation of being constantly busy but never finishing anything is a direct contributor to burnout. The team is working hard on everything and completing nothing.

Read more: Unbounded WIP

Push-Based Work Assignment

When work is assigned to individuals, asking for help carries a cost: it pulls a teammate away from their own assigned stories. So developers struggle alone rather than swarming. Workloads are also uneven because managers cannot precisely predict how long work will take at assignment time. Some people finish early and wait for reassignment; others are chronically overloaded. The overloaded developers cannot refuse new assignments without appearing unproductive, so the pace becomes unsustainable for the people carrying the heaviest loads.

Read more: Push-Based Work Assignment

Velocity as Individual Metric

When individual story points are tracked, developers cannot afford to help each other, take time to learn, or invest in quality. Every hour must produce measurable output. The pressure to perform individually eliminates the slack that teams need to stay healthy. Helping a teammate, mentoring a junior developer, or improving a build script all become career risks because they do not produce points.

Read more: Velocity as Individual Metric

How to narrow it down

  1. Is the team responsible for more products than it can sustain? If developers are spread across many products with constant context switching, the workload exceeds what the team structure can handle. Start with Thin-Spread Teams.
  2. Is every sprint driven by an external deadline? If the team has not had a sprint without deadline pressure in months, the pace is unsustainable by design. Start with Deadline-Driven Development.
  3. Does the team have more items in progress than team members? If WIP is unbounded and developers juggle multiple items, the team is thrashing rather than delivering. Start with Unbounded WIP.
  4. Are individuals measured by story points or velocity? If developers feel pressure to maximize personal output at the expense of collaboration and sustainability, the measurement system is contributing to burnout. Start with Velocity as Individual Metric.
  5. Are workloads distributed unevenly, with some people chronically overloaded while others wait for new assignments? If the team cannot self-balance because work is assigned rather than pulled, the assignment model is driving the unsustainable pace. Start with Push-Based Work Assignment.

Ready to fix this? The most common cause is Thin-Spread Teams. Start with its How to Fix It section for week-by-week steps.


3 - When Something Breaks, Nobody Knows What to Do

There are no documented response procedures. Critical knowledge lives in one person’s head. Incidents are improvised every time.

What you are seeing

An alert fires at 2 AM. The on-call engineer looks at the dashboard and sees something is wrong with the payment service, but they have never been involved in a payment service incident before. They know the service is critical. They do not know the recovery procedure, the escalation path, the safe restart sequence, or the architectural context needed to diagnose the problem.

They wake up the one person who knows the payment service. That person is on vacation in a different time zone but responds anyway, walking through the recovery steps over a video call and explaining the system while simultaneously trying to diagnose the problem. The incident takes four hours to resolve, two of which are spent on knowledge transfer that should have been documented.

The team conducts a post-mortem. The action item is “document the payment service runbook.” The action item is added to the backlog. It does not get prioritized. Three months later, there is another 2 AM incident and the same knowledge transfer happens again.

Common causes

Knowledge silos

When system knowledge is not externalized into runbooks, architectural documentation, and operational procedures, it disappears when the person who holds it is unavailable. Incident response is the most time-pressured context in which to rediscover missing knowledge. The gap between “what we know collectively” and “what is documented” only becomes visible when the person who fills that gap is not present.

Teams that treat runbook maintenance as part of incident response - updating documentation immediately after resolving an incident, while the context is fresh - gradually close the gap. The runbook improves with every incident rather than remaining stale between rare documentation efforts.

Read more: Knowledge silos

Blind operations

Without adequate observability, diagnosing the cause of an incident requires deep system knowledge rather than reading dashboards. An on-call engineer with good observability can often identify the root cause of an incident from metrics, logs, and traces without needing the one person who understands the system internals. An on-call engineer without observability is flying blind, dependent on tribal knowledge.

Good observability turns incident response from an expert-only activity into something any trained engineer can do from a dashboard. The runbook points at the right metrics; the metrics tell the story.

Read more: Blind operations

Manual deployments

Systems deployed manually often have complex, undocumented operational characteristics. The manual deployment knowledge and the incident response knowledge are often held by the same person - because the person who knows how to deploy a service also knows how it behaves and how to recover it. This concentration of knowledge is a single point of failure.

Read more: Manual deployments

How to narrow it down

  1. Does every service have a runbook that an on-call engineer unfamiliar with the service could follow? If not, incident response requires specific people. Start with Knowledge silos.
  2. Can the on-call engineer determine the likely cause of an incident from dashboards alone? If diagnosing incidents requires deep system knowledge, observability is insufficient. Start with Blind operations.
  3. Is there a single person whose absence would make incident response significantly harder for multiple services? That person is a single point of failure. Start with Knowledge silos.

Ready to fix this? The most common cause is Knowledge silos. Start with its How to Fix It section for week-by-week steps.

4 - Production Issues Discovered by Customers

The team finds out about production problems from support tickets, not alerts.

What you are seeing

The team deploys a change. Someone asks “is it working?” Nobody knows. There is no dashboard to check. There are no metrics to compare before and after. The team waits. If nobody complains within an hour, they assume the deployment was successful.

When something does go wrong, the team finds out from a customer support ticket, a Slack message from another team, or an executive asking why the site is slow. The investigation starts with SSH-ing into a server and reading raw log files. Hours pass before anyone understands what happened, what caused it, or how many users were affected.

Common causes

Blind Operations

The team has no application-level metrics, no centralized logging, and no alerting. The infrastructure may report that servers are running, but nobody can tell whether the application is actually working correctly. Without instrumentation, the only way to discover a problem is to wait for someone to experience it and report it.

Read more: Blind Operations

Manual Deployments

When deployments involve human steps (running scripts by hand, clicking through a console), there is no automated verification step. The deployment process ends when the human finishes the steps, not when the system confirms it is healthy. Without an automated pipeline that checks health metrics after deploying, verification falls to manual spot-checking or waiting for complaints.

Read more: Manual Deployments

Missing Deployment Pipeline

When there is no automated path from commit to production, there is nowhere to integrate automated health checks. A deployment pipeline can include post-deploy verification that compares metrics before and after. Without a pipeline, verification is entirely manual and usually skipped under time pressure.
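
A post-deploy verification step of the kind described above can be as simple as comparing metric snapshots captured before and after the rollout. A minimal sketch, with illustrative metric names and threshold defaults rather than recommended values:

```python
def verify_deployment(before, after, max_error_increase=0.005,
                      max_latency_ratio=1.2):
    """Compare pre- and post-deploy metric snapshots; return problems found.

    `before` and `after` are dicts like {"error_rate": 0.002,
    "p95_latency": 0.8} captured from the monitoring system. An empty
    return value means the deployment passes verification; a non-empty
    one should fail the pipeline run.
    """
    problems = []
    if after["error_rate"] - before["error_rate"] > max_error_increase:
        problems.append("error rate regressed")
    if after["p95_latency"] > before["p95_latency"] * max_latency_ratio:
        problems.append("p95 latency regressed")
    return problems
```

Run as the final pipeline stage, this turns "wait and see if anyone complains" into an automatic pass/fail signal.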

Read more: Missing Deployment Pipeline

How to narrow it down

  1. Does the team have application-level metrics and alerts? If no, the team has no way to detect problems automatically. Start with Blind Operations.
  2. Is the deployment process automated with health checks? If deployments are manual or automated without post-deploy verification, problems go undetected until users report them. Start with Manual Deployments or Missing Deployment Pipeline.
  3. Does the team check a dashboard after every deployment? If the answer is “sometimes” or “we click through the app manually,” the verification step is unreliable. Start with Blind Operations to build automated verification.

Ready to fix this? The most common cause is Blind Operations. Start with its How to Fix It section for week-by-week steps.

5 - Logs Exist but Cannot Be Searched or Correlated

Every service writes logs, but they are not aggregated or queryable. Debugging requires SSH access to individual servers.

What you are seeing

Debugging a production problem requires SSH access to individual servers and manual correlation across log files. An engineer SSHes into the production server, navigates to the log directory, and greps through gigabytes of log files looking for error messages. The logs from three services involved in the failing request are on three different servers with three different log formats. Correlating events into a coherent timeline requires copying relevant lines into a document and sorting by timestamp manually.

Log rotation has pruned most of what might be relevant from two weeks ago when the issue likely started. The logs that exist are unstructured text mixed with stack traces. Field names differ between services: one logs user_id, another logs userId, a third logs uid. A query to find all errors from a specific user in the past hour would take thirty minutes to run manually across all servers.

The team knows this is a problem but treats it as “we need to add a log aggregation system eventually.” Eventually has not arrived. In the meantime, debugging production issues is slow, often incomplete, and dependent on whoever has the institutional knowledge to navigate the logging infrastructure.

Common causes

Blind operations

Unstructured, unaggregated logs are one form of not having instrumented a system for observability. Logs that cannot be searched or correlated are only marginally more useful than no logs at all. Observability requires structured logs with consistent field names, aggregated into a searchable store, with the ability to correlate log events across services by request ID or trace context.

Structured logging requires deliberate adoption: a standard log format, consistent field names, correlation identifiers on every log entry. When these are in place, a query that previously required thirty minutes of manual grepping across servers runs in seconds from a single interface.
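
A minimal sketch of these ideas using Python's standard library: one JSON format, one set of field names, and a correlation ID on every entry. The service and field names are illustrative assumptions, not a prescribed schema:

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with consistent field names."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "request_id": getattr(record, "request_id", None),  # correlation ID
            "user_id": getattr(record, "user_id", None),        # one name everywhere
            "message": record.getMessage(),
        })

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The same request_id travels with the request to every downstream
# service, so one query reassembles the full cross-service timeline.
request_id = str(uuid.uuid4())
logger.info("payment authorized",
            extra={"service": "checkout", "request_id": request_id,
                   "user_id": "u-123"})
```

Once every service logs this shape into an aggregated store, "all errors for user u-123 in the past hour" is a single query instead of a grep across servers.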

Read more: Blind operations

Knowledge silos

Understanding how to navigate the logging infrastructure - which servers hold which logs, what the rotation schedule is, which grep patterns produce useful results - is knowledge that concentrates in the people who have done enough debugging to learn it. New team members cannot effectively debug production issues independently because they do not know the informal map of where things are.

When logs are aggregated into a centralized, searchable system, the knowledge of where to look is built into the tooling. Any team member can write a query without knowing the physical location of log files.

Read more: Knowledge silos

How to narrow it down

  1. Can the team search logs across all services from a single interface? If debugging requires SSH access to individual servers, logs are not aggregated. Start with Blind operations.
  2. Can the team trace a single request across multiple services using a shared correlation ID? If not, distributed debugging is manual assembly work. Start with Blind operations.
  3. Can new team members debug production issues independently, without help from senior engineers? If debugging requires knowing the informal map of log locations and formats, the knowledge is siloed. Start with Knowledge silos.

Ready to fix this? The most common cause is Blind operations. Start with its How to Fix It section for week-by-week steps.

6 - Leadership Sees CD as a Technical Nice-to-Have

Management does not understand why CD matters. No budget for tooling. No time allocated for improvement.

What you are seeing

Pipeline improvement work loses to feature delivery every sprint. The team wants to invest in deployment automation, test infrastructure, and pipeline improvements. The engineering manager supports this in principle. But every sprint, when capacity is allocated, the product backlog wins. There are features to ship, commitments to keep, a roadmap to deliver against. Pipeline improvements are real work - weeks of investment - but they do not appear on any roadmap and do not map to revenue-generating features.

When the team escalates to leadership, the response is supportive but non-committal: “Yes, we need to do that. Find a way to fit it in.” The team tries to fit it in - at the margins, in slack time, adjacent to feature work. The improvement work is slow, fragmented, and regularly displaced. Three years in, the pipeline is incrementally better, but the fundamental problems remain.

What is missing is organizational priority. CD adoption requires sustained investment - not a one-time sprint but ongoing capacity allocated to improving the delivery system. Without a sponsor who can protect that capacity from feature demand, improvement work will always lose to delivery pressure.

Common causes

Velocity as individual metric

When management measures progress by story points or feature delivery rate, investment in pipeline infrastructure looks like a reduction in output. A sprint where half the team works on deployment automation produces fewer feature story points than a sprint where everyone delivers features. Leaders optimizing for short-term throughput will consistently deprioritize that investment.

When lead time and deployment frequency are tracked alongside feature delivery, pipeline investment has a visible ROI. Leadership can see the case for it in the same dashboard they use for feature delivery - and pipeline work stops competing invisibly against features that do show up on a scoreboard.

Read more: Velocity as individual metric

Missing product ownership

Without a product owner who understands that delivery capability is itself a product attribute, pipeline work has no advocate in planning. Features with product owners get prioritized. Infrastructure work without sponsors does not. The team needs someone with organizational standing who can represent improvement work as a priority in the same planning conversation as feature work.

Read more: Missing product ownership

Deadline-driven development

When the organization is organized around fixed delivery dates, any work that does not directly advance the date looks like overhead. CD adoption requires investing in the delivery system itself, which competes with delivering to the schedule. Until management understands that delivery capability is what makes future schedules achievable, the investment will not be protected.

Read more: Deadline-driven development

How to narrow it down

  1. Does management measure and track delivery lead time, deployment frequency, and change fail rate? If not, the measurement system does not reward CD investment. Start with Velocity as individual metric.
  2. Is there an organizational sponsor who advocates for delivery capability improvements in planning? If improvement work has no sponsor, it will always lose to features with sponsors. Start with Missing product ownership.
  3. Is delivery organized around fixed commitment dates? If yes, anything not tied to the date is implicitly deprioritized. Start with Deadline-driven development.

Ready to fix this? The most common cause is Velocity as individual metric. Start with its How to Fix It section for week-by-week steps.

7 - Runbooks and Architecture Docs Are Years Out of Date

Deployment procedures, architecture diagrams, and operational runbooks describe a system that no longer matches reality.

What you are seeing

The runbook for the API service describes a deployment process involving a tool the team migrated away from two years ago. The architecture diagram shows four services; there are now eleven. The “how to add a new service” guide assumes a project structure that was refactored in the last rewrite. The documents are not wrong - they were accurate when written - but nobody updated them as the system evolved.

The team has learned to use documentation as a rough starting point and rely on tribal knowledge for the details. Senior engineers know which documents are outdated and which are still accurate. Newer team members cannot make this distinction and waste time following outdated procedures. Incidents that could be resolved in minutes take hours because the runbook does not match the system the on-call engineer is looking at.

The documentation gap compounds over time. Each change that is not documented increases the gap between documentation and reality. Eventually the gap is so large that nobody trusts any documentation, and all knowledge defaults to person-to-person transfer.

Common causes

Knowledge silos

When documentation is the only path from tribal knowledge to shared knowledge, and the team does not value documentation as a practice, knowledge accumulates in people rather than in records. The runbook written under pressure during an incident is the only runbook that gets written. Day-to-day changes that affect operations never get documented because the documentation habit is not part of the development workflow.

Teams that treat documentation as part of the definition of done - the change is not done until it is documented - produce documentation that stays current. Each change author updates the relevant runbooks and architectural records as part of completing the work.

Read more: Knowledge silos

Manual deployments

Systems deployed manually have deployment procedures that are highly contextual, learned by doing, and resistant to documentation. The deployment is a craft practice: the person executing it knows which steps to skip in which situations, which warnings to ignore, and which undocumented behaviors to watch for. Documenting this craft knowledge is difficult because it is tacit.

Automating the deployment process forces documentation into code. The pipeline definition is the authoritative deployment procedure. When the deployment changes, the pipeline definition changes. The code is always current because the code is the process.
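
As an illustration of "the code is the process", here is a deployment procedure expressed as an ordered list of executable steps. The commands are hypothetical stand-ins, not a real tool's CLI; a real pipeline would shell out to its build and deploy tooling:

```python
def deploy(version, run=print):
    """A deployment procedure expressed as code rather than a wiki page.

    Every step the old runbook described in prose is an explicit,
    ordered command. Because this function is what actually runs,
    the 'documentation' cannot drift from the real procedure.
    """
    steps = [
        f"build --version {version}",
        f"push-artifact --version {version}",
        "run-smoke-tests --env staging",
        f"rollout --env production --version {version}",
        "verify-health --env production",
    ]
    for step in steps:
        run(step)  # the step list IS the current procedure
    return steps
```

When the deployment changes, this function changes in the same commit, and the review of the code is the review of the procedure.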

Read more: Manual deployments

Snowflake environments

When environments evolve by hand, the gap between documented architecture and the actual running architecture grows with every undocumented change. An architecture diagram drawn at the last major redesign does not show the database added directly to production for a performance fix, the caching layer added informally, or the service split that happened in a hackathon. Infrastructure as code makes the infrastructure itself the documentation.

Read more: Snowflake environments

How to narrow it down

  1. Can the on-call engineer follow the runbook for a critical service without help from someone who knows the service? If not, the runbook is out of date. Start with Knowledge silos.
  2. Is the deployment procedure defined as pipeline code or as written documentation? Written documentation drifts; pipeline code is the process itself. Start with Manual deployments.
  3. Does the architecture documentation match the current production system? If the diagram and the reality diverge, the environments were changed without corresponding documentation. Start with Snowflake environments.

Ready to fix this? The most common cause is Knowledge silos. Start with its How to Fix It section for week-by-week steps.

8 - Production Problems Are Discovered Hours or Days Late

Issues in production are not discovered until users report them. There is no automated detection or alerting.

What you are seeing

A deployment goes out on Tuesday. On Thursday, a support ticket comes in: a feature is broken for a subset of users. The team investigates and discovers the problem was introduced in Tuesday’s deploy. For two days, users experienced the issue while the team had no idea.

Or a performance degradation appears gradually. Response times creep up over a week. Nobody notices until a customer complains or a business metric drops. The team checks the dashboards and sees the degradation started after a specific deploy, but the deploy was days ago and the trail is cold.

The team deploys carefully and then “watches for a while.” Watching means checking a few URLs manually or refreshing a dashboard for fifteen minutes. If nothing obviously breaks in that window, the deployment is declared successful. Problems that manifest slowly, affect a subset of users, or appear under specific conditions go undetected.

Common causes

Blind Operations

When the team has no monitoring, no alerting, and no aggregated logging, production is a black box. The only signal that something is wrong comes from users, support staff, or business reports. The team cannot detect problems because they have no instruments to detect them with. Adding observability (metrics, structured logging, distributed tracing, alerting) gives the team eyes on production.

Read more: Blind Operations

Undone Work

When the team’s definition of done does not include post-deployment verification, nobody is responsible for confirming that the deployment is healthy. The story is “done” when the code is merged or deployed, not when it is verified in production. Health checks, smoke tests, and canary analysis are not part of the workflow because the workflow ends before production.

Read more: Undone Work

Manual Deployments

When deployments are manual, there is no automated post-deploy verification step. An automated pipeline can include health checks, smoke tests, and rollback triggers as part of the deployment sequence. A manual deployment ends when the human finishes the runbook. Whether the deployment is actually healthy is a separate question that may or may not get answered.
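
The difference can be sketched as a deployment wrapper whose final step is verification with an automatic rollback trigger. The `deploy`, `healthy`, and `rollback` callables are hypothetical seams a pipeline would supply; `healthy()` might hit a health endpoint or query error-rate metrics:

```python
import time

def deploy_with_verification(deploy, healthy, rollback,
                             checks=5, interval_s=30.0):
    """Post-deploy verification with an automatic rollback trigger.

    The deployment is not 'done' when the deploy step finishes; it is
    done when repeated health checks pass. A failing check rolls back
    instead of leaving a broken release in production.
    """
    deploy()
    for _ in range(checks):
        if not healthy():
            rollback()
            return "rolled back"
        time.sleep(interval_s)
    return "verified"
```

The key design point is that failure handling is part of the sequence: a human never has to notice the problem for the rollback to happen.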

Read more: Manual Deployments

How to narrow it down

  1. Does the team have production monitoring with alerting thresholds? If not, the team cannot detect problems that users do not report. Start with Blind Operations.
  2. Does the team’s definition of done include post-deploy verification? If stories are closed before production health is confirmed, nobody owns the detection step. Start with Undone Work.
  3. Does the deployment process include automated health checks? If deployments end when the human finishes the script, there is no automated verification. Start with Manual Deployments.

Ready to fix this? The most common cause is Blind Operations. Start with its How to Fix It section for week-by-week steps.


9 - It Works on My Machine

Code that works in one developer’s environment fails in another, in CI, or in production. Environment differences make results unreproducible.

What you are seeing

A developer runs the application locally and everything works. They push to CI and the build fails. Or a teammate pulls the same branch and gets a different result. Or a bug report comes in that nobody can reproduce locally.

The team spends hours debugging only to discover the issue is environmental: a different Node version, a missing system library, a different database encoding, or a service running on the developer’s machine that is not available in CI. The code is correct. The environments are different.

New team members experience this acutely. Setting up a development environment takes days of following an outdated wiki page, asking teammates for help, and discovering undocumented dependencies. Every developer’s machine accumulates unique configuration over time, making “works on my machine” a common refrain and a useless debugging signal.

Common causes

Snowflake Environments

When development environments are set up manually and maintained individually, each developer’s machine becomes unique. One developer installed Python 3.9, another has 3.11. One has PostgreSQL 14, another has 15. These differences are invisible until someone hits a version-specific behavior. Reproducible, containerized development environments eliminate the variance by ensuring every developer works in an identical setup.
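
Short of full containerization, even making version expectations explicit catches drift early. A sketch under the assumption of a hypothetical manifest versioned with the code; in practice this role is played by a container image or a lockfile:

```python
import sys

# Hypothetical environment manifest, committed alongside the code.
EXPECTED = {
    "python": (3, 11),
}

def check_environment(expected=EXPECTED):
    """Fail fast with a clear message instead of 'works on my machine'."""
    problems = []
    actual = sys.version_info[:2]
    if actual != expected["python"]:
        problems.append(
            f"python {actual[0]}.{actual[1]} != expected "
            f"{expected['python'][0]}.{expected['python'][1]}")
    return problems
```

Run at startup or as the first CI step, a check like this turns an invisible version mismatch into an explicit error message.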

Read more: Snowflake Environments

Manual Deployments

When environment setup is a manual process documented in a wiki or README, it is never followed identically. Each developer interprets the instructions slightly differently, installs a slightly different version, or skips a step that seems optional. The manual process guarantees divergence over time. Infrastructure as code and automated setup scripts ensure consistency.

Read more: Manual Deployments

Tightly Coupled Monolith

When the application has implicit dependencies on its environment (specific file paths, locally running services, system-level configuration), it is inherently sensitive to environmental differences. Well-designed code with explicit, declared dependencies works the same way everywhere. Code that reaches into its runtime environment for undeclared dependencies works only where those dependencies happen to exist.
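
The contrast can be shown in a toy example; the `Database` client and the URLs are illustrative placeholders:

```python
class Database:
    """Stand-in client; a real one would open a connection."""
    def __init__(self, url):
        self.url = url

# Implicit environmental dependency: assumes a database happens to be
# running locally -- the classic source of 'works on my machine'.
def connect_implicit():
    return Database("localhost:5432")

# Explicit, declared dependency: every environment (laptop, CI,
# production) supplies its own value, so behavior is uniform.
def connect_explicit(database_url):
    return Database(database_url)
```

The declared version moves the difference between environments out of the code and into configuration, where it is visible and reviewable.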

Read more: Tightly Coupled Monolith

How to narrow it down

  1. Do all developers use the same OS, runtime versions, and dependency versions? If not, environment divergence is the most likely cause. Start with Snowflake Environments.
  2. Is the development environment setup automated or manual? If it is a wiki page that takes a day to follow, the manual process creates the divergence. Start with Manual Deployments.
  3. Does the application depend on local services, file paths, or system configuration that is not declared in the codebase? If the application has implicit environmental dependencies, it will behave differently wherever those dependencies differ. Start with Tightly Coupled Monolith.

Ready to fix this? The most common cause is Snowflake Environments. Start with its How to Fix It section for week-by-week steps.