IIS Application Outage Investigation

Root Cause Analysis Using Prometheus & Grafana

Executive Summary

Incident Investigation & Resolution

Hospital Production Environment • 2025

When developers reported intermittent "Server Unavailable" errors affecting a production web application, I took the initiative to deploy monitoring infrastructure (Prometheus + Grafana + windows_exporter) on the IIS server to investigate. Through systematic data collection and analysis, I identified a memory leak in a co-hosted application that was causing resource exhaustion. The issue was escalated to developers with evidence-based findings and successfully resolved.

Memory leak identified
User complaints eliminated
Ongoing monitoring established

Problem Statement

Reported Issue

Symptom:

Random "Server Unavailable" errors reported by end users

Frequency:

Intermittent - difficult to reproduce on demand

Impact:

User workflow interruptions, multiple user complaints

Environment:

Windows Server running IIS with multiple co-hosted applications

The Challenge: With no obvious error patterns in application logs and intermittent failures, traditional troubleshooting wasn't revealing the root cause. I needed visibility into system-level resource consumption to identify what was happening during the outage windows.

My Approach

I took a data-driven, observability-first approach to investigate the issue. Rather than making configuration changes or restarting services blindly, I first instrumented the system to collect evidence.

01

Instrumentation

Action Taken:

Installed windows_exporter on the IIS server and enabled specific collectors:

  • IIS collector - Request queue depth, connections, requests/sec
  • .NET collector - Application pool memory, GC metrics
  • CPU collector - Processor utilization per core
  • Memory collector - Available memory, paging activity
  • Process collector - Per-process memory consumption

Why: Needed visibility into resource consumption at both system and application levels
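A quick way to confirm each enabled collector is actually working: windows_exporter publishes a per-collector health metric. These PromQL queries (metric name taken from recent windows_exporter releases; verify against the installed version) can be run in the Prometheus UI once scraping begins:

```promql
# 1 if the named collector ran successfully on the last scrape, 0 if it failed
windows_exporter_collector_success{collector="iis"}

# List every collector currently failing on any scraped host
windows_exporter_collector_success == 0
```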

02

Metrics Collection

Action Taken:

Configured our existing Prometheus instance to scrape the windows_exporter endpoint:

  • Added scrape target to prometheus.yml
  • Validated configuration with promtool check config
  • Verified metrics collection in Prometheus UI
  • Set appropriate scrape interval (30s) for granular data

Why: Centralized metrics storage for historical analysis and correlation

03

Visualization & Monitoring

Action Taken:

Built Grafana dashboards to visualize key metrics:

  • System Overview - CPU, memory, disk I/O
  • IIS Health - HTTP request queue size, active connections
  • Application Pools - Memory usage per pool over time
  • .NET Runtime - GC pauses, heap size, exceptions

Why: Visual correlation of metrics to identify patterns during incident windows
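The dashboard panels above map onto PromQL queries of roughly this shape. A sketch only — exact metric names vary between windows_exporter versions, so treat these as illustrative rather than copy-paste ready:

```promql
# CPU utilization (%) averaged across all cores
100 - (avg(rate(windows_cpu_time_total{mode="idle"}[5m])) * 100)

# Available physical memory, in GB
windows_os_physical_memory_free_bytes / 1024 / 1024 / 1024

# Per-worker-process memory: IIS application pools run as w3wp.exe
windows_process_working_set_bytes{process="w3wp"}
```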

04

Data Analysis & Correlation

Action Taken:

Monitored the dashboards and correlated metrics with incident reports:

  • Matched user-reported outage times with dashboard timelines
  • Identified pattern: memory consumption steadily increasing
  • Traced spike to specific application pool (not the reported app)
  • Observed HTTP queue depth increase during memory pressure

Why: Transform user complaints into measurable, reproducible observations
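The "steadily increasing memory" pattern in this step can also be detected directly in PromQL rather than by eyeballing graphs: predict_linear extrapolates a gauge's trend, and deriv exposes its growth rate. A hedged sketch (the process metric name is an assumption; check it against your exporter version):

```promql
# Projected available memory four hours from now, based on the last hour's trend;
# a near-zero or negative result flags exhaustion before users see errors
predict_linear(windows_os_physical_memory_free_bytes[1h], 4 * 3600)

# Growth rate (bytes/sec) of each IIS worker process over the last hour;
# a persistently positive derivative on one w3wp instance pinpoints the leaking pool
deriv(windows_process_working_set_bytes{process="w3wp"}[1h])
```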

05

Root Cause Identification

Finding:

A co-hosted application on the same IIS server had a memory leak:

  • Memory consumption increased steadily over hours
  • As available memory depleted, IIS became unresponsive
  • HTTP request queue grew as worker process struggled
  • Eventually caused "Server Unavailable" for ALL apps on that server

Key Insight: The reported application was a victim, not the cause. Without monitoring, we might have wasted time debugging the wrong application.

06

Escalation & Resolution

Action Taken:

  • Documented findings with dashboard screenshots
  • Prepared evidence showing memory leak pattern
  • Escalated to development team with specific application identified
  • Developers confirmed memory leak in code
  • Patch deployed to fix the leak
  • Continued monitoring post-fix to validate resolution

Result: Issue completely resolved. No recurrence of "Server Unavailable" errors.

Technical Implementation

Deployment

windows_exporter installation:

  • Downloaded MSI installer from official GitHub releases
  • Installed as Windows service with auto-start configuration
  • Configured firewall rule to allow Prometheus scraping (port 9182)
  • Enabled collectors: iis,cpu,memory,net,netframework_clrmemory,process (net covers network interfaces; the netframework_* collectors supply the .NET GC metrics)
  • Validated metrics endpoint accessibility
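The deployment steps above can be sketched as an elevated PowerShell session. ENABLED_COLLECTORS and LISTEN_PORT are documented windows_exporter MSI properties, but the installer filename is a placeholder for whatever release was downloaded:

```powershell
# Install windows_exporter as an auto-starting Windows service with the chosen collectors
msiexec /i windows_exporter.msi ENABLED_COLLECTORS="iis,cpu,memory,net,process" LISTEN_PORT=9182 /qn

# Open the scrape port so Prometheus can reach the exporter
New-NetFirewallRule -DisplayName "windows_exporter" -Direction Inbound -Action Allow -Protocol TCP -LocalPort 9182

# Verify the metrics endpoint responds locally
Invoke-WebRequest -Uri http://localhost:9182/metrics -UseBasicParsing | Select-Object -ExpandProperty StatusCode
```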

Configuration

Prometheus scrape configuration:

scrape_configs:
  - job_name: 'iis-webserver'
    scrape_interval: 30s
    static_configs:
      - targets: [':9182']
        labels:
          environment: 'production'
          role: 'webserver'

Validation:

promtool check config prometheus.yml

Dashboards Created

System Resources Dashboard

CPU utilization, memory usage, disk I/O, network throughput

IIS Performance Dashboard

Request queue depth, requests/sec, active connections, response times

Application Pool Memory Tracking

Per-pool memory consumption over time (this revealed the leak)

Root Cause Analysis

What the Data Showed

Observation 1: Memory Pattern

One application pool showed continuous memory growth over 4-6 hour periods. Memory would climb from ~500MB to 3GB+, eventually consuming most available RAM.

Observation 2: IIS Response Degradation

When available system memory dropped below 15%, the IIS HTTP request queue depth spiked from ~5 to 100+ requests and the worker process became unresponsive.

Observation 3: Multi-Tenant Impact

Because multiple applications shared the same IIS server, the memory leak in ONE application caused outages for ALL applications on that host.

Conclusion:

The "Server Unavailable" errors were caused by a "noisy neighbor" problem: a memory leak in a co-hosted application starved system resources, making the entire IIS instance unresponsive. Application logs didn't show crashes because the issue was resource exhaustion, not application exceptions.

Outcome & Impact

Issue Resolved

Development team fixed the memory leak. Post-deployment monitoring confirmed stable memory usage.

User Impact

Zero recurrence of "Server Unavailable" errors. No further user complaints.

Ongoing Monitoring

Monitoring infrastructure remains in place for future incident response and capacity planning.
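With the stack left in place, the 15% available-memory threshold observed during the incident translates naturally into a Prometheus alerting rule. A sketch under stated assumptions: rule names and thresholds are illustrative, and the total-memory metric comes from the cs collector, which would need to be enabled alongside the others:

```yaml
groups:
  - name: iis-webserver
    rules:
      - alert: LowAvailableMemory
        # Fires when available physical memory stays below 15% for 10 minutes --
        # the pressure point at which the IIS request queue began to back up
        expr: |
          windows_os_physical_memory_free_bytes
            / windows_cs_physical_memory_bytes < 0.15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Available memory below 15% on {{ $labels.instance }}"
```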

Knowledge Transfer

Investigation methodology and dashboards documented for operations team use.

Key Takeaways

Observability Beats Guesswork

Without monitoring, we might have spent days restarting services, changing configurations, or debugging the wrong application. Data-driven investigation led directly to root cause.

Multi-Tenant Risks

Co-hosting multiple applications on the same server creates risk. One misbehaving application can impact all tenants. Monitoring helps identify these "noisy neighbor" problems.

Evidence-Based Escalation

Presenting developers with graphs and metrics (not just "users are complaining") led to faster acknowledgment and resolution. Data builds credibility.

Proactive Infrastructure

Taking initiative to deploy monitoring tools without being asked demonstrates ownership and engineering mindset. It's not enough to react to problems - build systems to prevent them.

Skills Demonstrated

Root Cause Analysis

Systematic investigation using data and metrics to identify true cause vs. symptoms

Observability Engineering

Deployed and configured Prometheus, Grafana, and windows_exporter in production

Windows Infrastructure

Deep understanding of IIS, application pools, and Windows Server resource management

Cross-Team Collaboration

Worked with developers to translate infrastructure findings into code fixes

Initiative & Ownership

Self-directed deployment of monitoring stack to solve a recurring problem

Technical Documentation

Documented investigation process and created runbooks for future incidents

Technologies Used

windows_exporter
Prometheus
Grafana
IIS Web Server
PromQL
Windows Server

Observability-Driven Problem Resolution

This investigation demonstrates my approach to infrastructure challenges: build observability, let data drive decisions, collaborate across teams, and document for the next person. I'm looking for DevOps roles where this mindset is valued.