Root Cause Analysis Using Prometheus & Grafana
Hospital Production Environment • 2025
When developers reported intermittent "Server Unavailable" errors affecting a production web application, I took the initiative to deploy monitoring infrastructure (Prometheus + Grafana + windows_exporter) on the IIS server to investigate. Through systematic data collection and analysis, I identified a memory leak in a co-hosted application that was causing resource exhaustion. The issue was escalated to developers with evidence-based findings and successfully resolved.
Random "Server Unavailable" errors reported by end users
Intermittent - difficult to reproduce on demand
User workflow interruptions, multiple user complaints
Windows Server running IIS with multiple co-hosted applications
The Challenge: With no obvious error patterns in application logs and intermittent failures, traditional troubleshooting wasn't revealing the root cause. I needed visibility into system-level resource consumption to identify what was happening during the outage windows.
I took a data-driven, observability-first approach to investigate the issue. Rather than making configuration changes or restarting services blindly, I first instrumented the system to collect evidence.
Action Taken:
Installed windows_exporter on the IIS server and enabled specific collectors:
Why: Needed visibility into resource consumption at both system and application levels
Action Taken:
Configured our existing Prometheus instance to scrape the windows_exporter endpoint:
Updated prometheus.yml and validated the change with promtool check config.
Why: Centralized metrics storage for historical analysis and correlation
Action Taken:
Built Grafana dashboards to visualize key metrics:
Why: Visual correlation of metrics to identify patterns during incident windows
Action Taken:
Monitored the dashboards and correlated metrics with incident reports:
Why: Transform user complaints into measurable, reproducible observations
Finding:
A co-hosted application on the same IIS server had a memory leak:
Key Insight: The reported application was a victim, not the cause. Without monitoring, we might have wasted time debugging the wrong application.
Action Taken:
Escalated the findings to the development team with the supporting graphs and metrics.
Result: Issue completely resolved. No recurrence of "Server Unavailable" errors.
windows_exporter installation:
Enabled collectors: iis,cpu,memory,net,process
Prometheus scrape configuration:
- job_name: 'iis-webserver'
  static_configs:
    - targets: [':9182']
      labels:
        environment: 'production'
        role: 'webserver'
Validation:
promtool check config prometheus.yml
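Once Prometheus reloads the validated configuration, the new target can be confirmed with a quick instant query against the job name defined in the scrape configuration (it returns 1 while the exporter is reachable):

```promql
up{job="iis-webserver"}
```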
CPU utilization, memory usage, disk I/O, network throughput
Request queue depth, requests/sec, active connections, response times
Per-pool memory consumption over time (this revealed the leak)
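These panels map onto a small set of windows_exporter queries. A sketch of the underlying PromQL (exact metric names vary slightly between windows_exporter versions):

```promql
# CPU utilization (%), averaged across cores
100 - (avg(rate(windows_cpu_time_total{mode="idle"}[5m])) * 100)

# Available memory as a fraction of installed physical memory
windows_memory_available_bytes / windows_cs_physical_memory_bytes

# Per-process working set for IIS worker processes (w3wp.exe)
windows_process_working_set_bytes{process=~"w3wp.*"}
```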
One application pool showed continuous memory growth over 4-6 hour periods. Memory would climb from ~500MB to 3GB+, eventually consuming most available RAM.
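A steady climb like this can be surfaced automatically with PromQL's trend functions, for example by projecting working-set growth a few hours ahead (an illustrative threshold, not the exact query used during the investigation):

```promql
# Fires if the current growth rate would push a worker process past ~3 GB within 4 hours
predict_linear(windows_process_working_set_bytes{process=~"w3wp.*"}[1h], 4 * 3600) > 3e9
```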
When system memory dropped below 15% available, IIS HTTP request queue depth spiked from ~5 to 100+ requests. Worker process became unresponsive.
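The 15% available-memory threshold observed here translates directly into an alertable expression (a sketch; a Prometheus alerting rule or Grafana alert would wrap this):

```promql
(windows_memory_available_bytes / windows_cs_physical_memory_bytes) * 100 < 15
```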
Because multiple applications shared the same IIS server, the memory leak in ONE application caused outages for ALL applications on that host.
The "Server Unavailable" errors were caused by a noisy neighbor problem - a memory leak in a co-hosted application starved system resources, making the entire IIS instance unresponsive. Application logs didn't show crashes because the issue was resource exhaustion, not application exceptions.
Development team fixed the memory leak. Post-deployment monitoring confirmed stable memory usage.
Zero recurrence of "Server Unavailable" errors. No further user complaints.
Monitoring infrastructure remains in place for future incident response and capacity planning.
Investigation methodology and dashboards documented for operations team use.
Without monitoring, we might have spent days restarting services, changing configurations, or debugging the wrong application. Data-driven investigation led directly to root cause.
Co-hosting multiple applications on the same server creates risk. One misbehaving application can impact all tenants. Monitoring helps identify these "noisy neighbor" problems.
Presenting developers with graphs and metrics (not just "users are complaining") led to faster acknowledgment and resolution. Data builds credibility.
Taking initiative to deploy monitoring tools without being asked demonstrates ownership and engineering mindset. It's not enough to react to problems - build systems to prevent them.
Systematic investigation using data and metrics to identify true cause vs. symptoms
Deployed and configured Prometheus, Grafana, and windows_exporter in production
Deep understanding of IIS, application pools, and Windows Server resource management
Worked with developers to translate infrastructure findings into code fixes
Self-directed deployment of monitoring stack to solve a recurring problem
Documented investigation process and created runbooks for future incidents
This investigation demonstrates my approach to infrastructure challenges: build observability, let data drive decisions, collaborate across teams, and document for the next person. I'm looking for DevOps roles where this mindset is valued.