Root Cause Analysis Using Prometheus & Grafana
Hospital Production Environment • 2025
When developers reported intermittent "Server Unavailable" errors affecting a production web application, I took the initiative to deploy monitoring infrastructure (Prometheus + Grafana + windows_exporter) on the IIS server to investigate. Through systematic data collection and analysis, I identified a memory leak in a co-hosted application that was causing resource exhaustion. The issue was escalated to developers with evidence-based findings and successfully resolved.
Random "Server Unavailable" errors reported by end users
Intermittent - difficult to reproduce on demand
User workflow interruptions, multiple user complaints
Windows Server running IIS with multiple co-hosted applications
The Challenge: With no obvious error patterns in application logs and intermittent failures, traditional troubleshooting wasn't revealing the root cause. I needed visibility into system-level resource consumption to identify what was happening during the outage windows.
I took a data-driven, observability-first approach to investigate the issue. Rather than making configuration changes or restarting services blindly, I first instrumented the system to collect evidence.
Action Taken:
Installed windows_exporter on the IIS server and enabled specific collectors:
Why: Needed visibility into resource consumption at both system and application levels
Action Taken:
Configured our existing Prometheus instance to scrape the windows_exporter endpoint:
Updated prometheus.yml and validated the change with promtool check config.
Why: Centralized metrics storage for historical analysis and correlation
Action Taken:
Built Grafana dashboards to visualize key metrics:
Why: Visual correlation of metrics to identify patterns during incident windows
Action Taken:
Monitored the dashboards and correlated metrics with incident reports:
Why: Transform user complaints into measurable, reproducible observations
Finding:
A co-hosted application on the same IIS server had a memory leak:
Key Insight: The reported application was a victim, not the cause. Without monitoring, we might have wasted time debugging the wrong application.
Action Taken:
Escalated the findings to the development team with the supporting graphs and metrics.
Result: Issue completely resolved. No recurrence of "Server Unavailable" errors.
windows_exporter installation:
Enabled collectors: iis,cpu,memory,net,process
Prometheus scrape configuration:
- job_name: 'iis-webserver'
  static_configs:
    - targets: [':9182']
      labels:
        environment: 'production'
        role: 'webserver'
Validation:
promtool check config prometheus.yml
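Once Prometheus reloads the validated configuration, the new target can be confirmed with a quick instant query against the job name defined in the scrape configuration (it returns 1 while the exporter is reachable):

```promql
up{job="iis-webserver"}
```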
CPU utilization, memory usage, disk I/O, network throughput
Request queue depth, requests/sec, active connections, response times
Per-pool memory consumption over time (this revealed the leak)
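These panels map onto a small set of windows_exporter queries. A sketch of the underlying PromQL (exact metric names vary slightly between windows_exporter versions):

```promql
# CPU utilization (%), averaged across cores
100 - (avg(rate(windows_cpu_time_total{mode="idle"}[5m])) * 100)

# Available memory as a fraction of installed physical memory
windows_memory_available_bytes / windows_cs_physical_memory_bytes

# Per-process working set for IIS worker processes (w3wp.exe)
windows_process_working_set_bytes{process=~"w3wp.*"}
```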
One application pool showed continuous memory growth over 4-6 hour periods. Memory would climb from ~500MB to 3GB+, eventually consuming most available RAM.
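A steady climb like this can be surfaced automatically with PromQL's trend functions, for example by projecting working-set growth a few hours ahead (an illustrative threshold, not the exact query used during the investigation):

```promql
# Fires if the current growth rate would push a worker process past ~3 GB within 4 hours
predict_linear(windows_process_working_set_bytes{process=~"w3wp.*"}[1h], 4 * 3600) > 3e9
```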
When system memory dropped below 15% available, IIS HTTP request queue depth spiked from ~5 to 100+ requests. Worker process became unresponsive.
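The 15% available-memory threshold observed here translates directly into an alertable expression (a sketch; a Prometheus alerting rule or Grafana alert would wrap this):

```promql
(windows_memory_available_bytes / windows_cs_physical_memory_bytes) * 100 < 15
```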
Because multiple applications shared the same IIS server, the memory leak in ONE application caused outages for ALL applications on that host.
The "Server Unavailable" errors were caused by a noisy neighbor problem - a memory leak in a co-hosted application starved system resources, making the entire IIS instance unresponsive. Application logs didn't show crashes because the issue was resource exhaustion, not application exceptions.
Development team fixed the memory leak. Post-deployment monitoring confirmed stable memory usage.
Zero recurrence of "Server Unavailable" errors. No further user complaints.
Monitoring infrastructure remains in place for future incident response and capacity planning.
Investigation methodology and dashboards documented for operations team use.
Without monitoring, we might have spent days restarting services, changing configurations, or debugging the wrong application. Data-driven investigation led directly to root cause.
Co-hosting multiple applications on the same server creates risk. One misbehaving application can impact all tenants. Monitoring helps identify these "noisy neighbor" problems.
Presenting developers with graphs and metrics (not just "users are complaining") led to faster acknowledgment and resolution. Data builds credibility.
Taking initiative to deploy monitoring tools without being asked demonstrates ownership and engineering mindset. It's not enough to react to problems - build systems to prevent them.
Systematic investigation using data and metrics to identify true cause vs. symptoms
Deployed and configured Prometheus, Grafana, and windows_exporter in production
Deep understanding of IIS, application pools, and Windows Server resource management
Worked with developers to translate infrastructure findings into code fixes
Self-directed deployment of monitoring stack to solve a recurring problem
Documented investigation process and created runbooks for future incidents
This investigation demonstrates my approach to infrastructure challenges: build observability, let data drive decisions, collaborate across teams, and document for the next person. I'm looking for DevOps roles where this mindset is valued.