Prometheus and Grafana: A Practical Monitoring Guide
Monitoring is not optional—it’s the difference between proactively fixing issues and getting woken up at 3 AM by angry customers. This guide walks you through setting up Prometheus and Grafana, the most widely adopted open-source monitoring stack in the industry.
We’ll start with the fundamentals and progress to production-ready configurations that work across Docker, Kubernetes, and traditional EC2/VM environments.
Why Prometheus and Grafana?
Before diving into setup, let’s understand why this stack has become the industry standard.
The Monitoring Landscape
| Tool | Type | Best For |
|---|---|---|
| Prometheus | Metrics collection & storage | Time-series data, alerting |
| Grafana | Visualization | Dashboards, graphs, exploration |
| CloudWatch | AWS-native monitoring | AWS services, basic metrics |
| Datadog | SaaS monitoring | Full observability (paid) |
| ELK Stack | Log aggregation | Log analysis, search |
Why Choose Prometheus + Grafana?
Prometheus:
- Pull-based model (more reliable than push)
- Powerful query language (PromQL)
- Built-in alerting
- Service discovery (especially for Kubernetes)
- No external dependencies (single binary)
- Free and open source
Grafana:
- Beautiful, customizable dashboards
- Supports multiple data sources (Prometheus, CloudWatch, Elasticsearch, etc.)
- Alerting capabilities
- Large community with pre-built dashboards
- Free and open source
Understanding the Architecture
Here’s how the monitoring stack fits together:
┌────────────────────────────────────────────────────────────┐
│                    YOUR INFRASTRUCTURE                     │
├────────────────────────────────────────────────────────────┤
│                                                            │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐           │
│   │  App 1   │     │  App 2   │     │  App 3   │           │
│   │  :8080   │     │  :8081   │     │  :8082   │           │
│   │ /metrics │     │ /metrics │     │ /metrics │           │
│   └────┬─────┘     └────┬─────┘     └────┬─────┘           │
│        └────────────────┼────────────────┘                 │
│                  SCRAPE │ (pull metrics)                   │
│                         ▼                                  │
│                 ┌─────────────────┐                        │
│                 │   PROMETHEUS    │                        │
│                 │      :9090      │                        │
│                 │  ┌───────────┐  │                        │
│                 │  │   TSDB    │  │ (Time Series Database) │
│                 │  └───────────┘  │                        │
│                 └───────┬─────────┘                        │
│            ┌────────────┼────────────┐                     │
│            ▼            ▼            ▼                     │
│      ┌───────────┐ ┌──────────┐ ┌────────────┐             │
│      │  GRAFANA  │ │ ALERTMGR │ │  API/APPS  │             │
│      │   :3000   │ │  :9093   │ │ (queries)  │             │
│      └───────────┘ └────┬─────┘ └────────────┘             │
│                         │                                  │
│                         ▼                                  │
│             ┌───────────────────────┐                      │
│             │ Slack / PagerDuty /   │                      │
│             │ Email / Webhook       │                      │
│             └───────────────────────┘                      │
└────────────────────────────────────────────────────────────┘
Key Components
| Component | Port | Purpose |
|---|---|---|
| Prometheus | 9090 | Scrapes and stores metrics |
| Grafana | 3000 | Visualizes metrics |
| Alertmanager | 9093 | Routes and manages alerts |
| Node Exporter | 9100 | Exposes host/VM metrics |
| cAdvisor | 8080 | Exposes container metrics |
Pull vs Push Model
Prometheus uses a pull-based model:
Push Model (traditional):             Pull Model (Prometheus):
┌─────┐          ┌───────┐            ┌─────┐           ┌──────────┐
│ App │──push──▶ │Monitor│            │ App │◀──scrape──│Prometheus│
└─────┘          └───────┘            └─────┘           └──────────┘
                                      (app exposes /metrics)
Why pull is better:
- Prometheus controls the scrape interval
- Easier to detect if a target is down (scrape fails)
- No need for apps to know where to push
- Simpler firewall rules (Prometheus initiates connections)
Core Concepts
Before setting things up, let’s understand the terminology.
Metric Types
Prometheus has four core metric types:
1. Counter
A value that only goes up (or resets to zero on restart).
# Example: Total HTTP requests
http_requests_total{method="GET", status="200"} 1234
Use for: Request counts, error counts, bytes transferred
2. Gauge
A value that can go up or down.
# Example: Current memory usage
node_memory_MemFree_bytes 1073741824
Use for: Temperature, memory usage, queue size, active connections
3. Histogram
Samples observations and counts them in configurable buckets.
# Example: Request duration distribution
http_request_duration_seconds_bucket{le="0.1"} 500
http_request_duration_seconds_bucket{le="0.5"} 800
http_request_duration_seconds_bucket{le="1.0"} 950
http_request_duration_seconds_count 1000
http_request_duration_seconds_sum 450.5
Use for: Request latencies, response sizes
4. Summary
Similar to histogram but calculates quantiles on the client side.
# Example: Request duration quantiles
http_request_duration_seconds{quantile="0.5"} 0.05
http_request_duration_seconds{quantile="0.9"} 0.1
http_request_duration_seconds{quantile="0.99"} 0.5
Use for: When you need specific percentiles (p50, p90, p99)
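One practical difference: histogram buckets are plain counters, so they can be aggregated across instances before computing a quantile, whereas client-side summary quantiles cannot be meaningfully averaged. Using the metrics above:
# Valid: combine buckets from all instances, then compute the quantile
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Misleading: averaging per-instance summary quantiles does not give a true global p99
avg(http_request_duration_seconds{quantile="0.99"})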
Labels
Labels are key-value pairs that add dimensions to metrics:
http_requests_total{method="GET", endpoint="/api/users", status="200"} 1234
http_requests_total{method="POST", endpoint="/api/users", status="201"} 567
http_requests_total{method="GET", endpoint="/api/users", status="500"} 12
Best practices:
- Keep label cardinality low (avoid user IDs and request IDs; see the example below)
- Use labels for dimensions you’ll filter/group by
- Be consistent with naming across services
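For example, a per-user label creates one time series per user and can exhaust Prometheus memory; the hypothetical series below show the pattern to avoid versus bounded dimensions:
# Avoid: unbounded cardinality (one series per user and per request)
http_requests_total{user_id="8271", request_id="c91f4b2a"} 1
# Prefer: a small, fixed set of label values
http_requests_total{method="GET", endpoint="/api/users", status="200"} 1234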
Scrape Targets
A target is an endpoint Prometheus scrapes for metrics:
# prometheus.yml
scrape_configs:
- job_name: 'my-app'
static_configs:
- targets: ['app1:8080', 'app2:8080', 'app3:8080']
Each target exposes metrics at /metrics endpoint in this format:
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1234
http_requests_total{method="POST",status="201"} 567
Setting Up Prometheus
Let’s set up Prometheus in different environments. I’ll show Docker first (universal), then Kubernetes.
Option 1: Docker Compose (Works Everywhere)
This setup works on EC2, any VM, or your local machine.
Create a project directory:
mkdir monitoring && cd monitoring
Create docker-compose.yml:
version: '3.8'
services:
prometheus:
image: prom/prometheus:v2.47.0
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=15d'
- '--web.enable-lifecycle'
restart: unless-stopped
grafana:
image: grafana/grafana:10.1.0
container_name: grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_USER=admin
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
restart: unless-stopped
# Exports host metrics (CPU, memory, disk, network)
node-exporter:
image: prom/node-exporter:v1.6.1
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/rootfs'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
restart: unless-stopped
# Exports container metrics
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.47.2
container_name: cadvisor
ports:
- "8080:8080"
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
privileged: true
restart: unless-stopped
alertmanager:
image: prom/alertmanager:v0.26.0
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
- alertmanager_data:/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
restart: unless-stopped
volumes:
prometheus_data:
grafana_data:
alertmanager_data:
Create prometheus/prometheus.yml:
global:
scrape_interval: 15s # How often to scrape targets
evaluation_interval: 15s # How often to evaluate rules
external_labels:
monitor: 'my-monitor'
environment: 'production'
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Load alert rules
rule_files:
- /etc/prometheus/alerts.yml
# Scrape configurations
scrape_configs:
# Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter (host metrics)
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
# cAdvisor (container metrics)
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
# Your applications (add your apps here)
# - job_name: 'my-app'
# static_configs:
# - targets: ['app:8080']
Create prometheus/alerts.yml:
groups:
- name: infrastructure
rules:
# Instance down for more than 1 minute
- alert: InstanceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Instance down"
description: " of job has been down for more than 1 minute."
# High CPU usage
- alert: HighCpuUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on "
description: "CPU usage is above 80% (current: %)"
# High memory usage
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on "
description: "Memory usage is above 85% (current: %)"
# Disk space low
- alert: DiskSpaceLow
expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "Low disk space on "
description: "Disk usage is above 85% on (current: %)"
Create alertmanager/alertmanager.yml:
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'severity']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'default-receiver'
routes:
- match:
severity: critical
receiver: 'critical-receiver'
receivers:
- name: 'default-receiver'
# Configure your notification channel here
# Example: Slack webhook
# slack_configs:
# - api_url: 'https://hooks.slack.com/services/xxx/xxx/xxx'
# channel: '#alerts'
- name: 'critical-receiver'
# Critical alerts go to a different channel
# slack_configs:
# - api_url: 'https://hooks.slack.com/services/xxx/xxx/xxx'
# channel: '#critical-alerts'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
Create Grafana provisioning for auto-setup:
mkdir -p grafana/provisioning/datasources
mkdir -p grafana/provisioning/dashboards
Create grafana/provisioning/datasources/prometheus.yml:
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: true
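We also created a dashboards provisioning directory; if you want Grafana to auto-load dashboard JSON files dropped into it, add a provider file as well (a minimal sketch; the file name and scan path are assumptions based on the volume mount above):
# grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards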
Start everything:
docker compose up -d
Verify all containers are running:
docker compose ps
Access the UIs:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
- Alertmanager: http://localhost:9093
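As a quick smoke test (assuming the default ports above), the health endpoints should all respond:
curl -s http://localhost:9090/-/healthy      # Prometheus
curl -s http://localhost:9093/-/healthy      # Alertmanager
curl -s http://localhost:3000/api/health     # Grafana
curl -s http://localhost:9100/metrics | head # Node Exporter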
Option 2: Kubernetes Deployment
For Kubernetes, the recommended approach is using the kube-prometheus-stack Helm chart, which bundles Prometheus, Grafana, Alertmanager, and pre-configured dashboards.
Prerequisites
# Add Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Basic Installation
# Create namespace
kubectl create namespace monitoring
# Install with default settings
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring
Custom Installation with Values
Create values.yaml for customization:
# Prometheus configuration
prometheus:
prometheusSpec:
retention: 15d
storageSpec:
volumeClaimTemplate:
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 50Gi
resources:
requests:
memory: 1Gi
cpu: 500m
limits:
memory: 2Gi
cpu: 1000m
# Add additional scrape configs
additionalScrapeConfigs:
- job_name: 'my-app'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Grafana configuration
grafana:
adminPassword: "your-secure-password"
persistence:
enabled: true
size: 10Gi
resources:
requests:
memory: 256Mi
cpu: 100m
limits:
memory: 512Mi
cpu: 200m
# Alertmanager configuration
alertmanager:
config:
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'namespace']
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: 'null'
routes:
- match:
alertname: Watchdog
receiver: 'null'
receivers:
- name: 'null'
alertmanagerSpec:
storage:
volumeClaimTemplate:
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 10Gi
# Node Exporter
nodeExporter:
enabled: true
# kube-state-metrics
kubeStateMetrics:
enabled: true
Install with custom values:
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values values.yaml
Accessing the UIs
# Port forward Prometheus
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090
# Port forward Grafana
kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80
# Port forward Alertmanager
kubectl port-forward svc/prometheus-kube-prometheus-alertmanager -n monitoring 9093:9093
Making Your Pods Scrapable
Add these annotations to your pods so they are automatically discovered (this relies on the annotation-based additionalScrapeConfigs added in values.yaml above):
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
template:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: my-app
image: my-app:latest
ports:
- containerPort: 8080
Or create a ServiceMonitor (more Kubernetes-native):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: my-app
namespace: monitoring
labels:
release: prometheus # Must match Helm release name
spec:
selector:
matchLabels:
app: my-app
namespaceSelector:
matchNames:
- default
endpoints:
- port: http
interval: 15s
path: /metrics
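A ServiceMonitor selects a Service rather than the pods directly, so the Service needs matching labels and a named port for the endpoints entry to reference. A minimal sketch under those assumptions:
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app          # Matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app          # Matches the pod labels
  ports:
    - name: http         # Referenced by "port: http" in the ServiceMonitor
      port: 8080
      targetPort: 8080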
Setting Up Grafana
Once Grafana is running, let’s configure it properly.
Adding Prometheus Data Source
If not auto-provisioned:
- Go to Configuration → Data Sources
- Click Add data source
- Select Prometheus
- Set URL: http://prometheus:9090 (Docker) or http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090 (Kubernetes)
- Click Save & Test
Importing Pre-built Dashboards
Don’t build dashboards from scratch—start with community dashboards:
- Go to Dashboards → Import
- Enter the dashboard ID and click Load
- Select your Prometheus data source
- Click Import
Recommended Dashboard IDs:
| Dashboard | ID | Purpose |
|---|---|---|
| Node Exporter Full | 1860 | Host metrics (CPU, memory, disk, network) |
| Docker Container & Host | 10619 | Container metrics via cAdvisor |
| Kubernetes Cluster | 6417 | K8s cluster overview |
| Kubernetes Pods | 6336 | Pod-level metrics |
| Prometheus Stats | 2 | Prometheus self-monitoring |
Building a Custom Dashboard
Let’s create a simple application dashboard:
- Click Dashboards → New → New Dashboard
- Click Add visualization
- Select your Prometheus data source
Example panels:
Request Rate
sum(rate(http_requests_total[5m])) by (status)
- Visualization: Time series
- Legend: {{status}}
Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
- Visualization: Stat or Gauge
- Unit: Percent (0-100)
Request Latency (p99)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
- Visualization: Time series
- Unit: Seconds
Active Connections
sum(active_connections)
- Visualization: Stat
What Should You Monitor?
The Four Golden Signals
Google’s Site Reliability Engineering book defines four golden signals:
| Signal | What to Measure | Example Metric |
|---|---|---|
| Latency | Time to serve a request | http_request_duration_seconds |
| Traffic | Demand on your system | http_requests_total |
| Errors | Failed requests rate | http_requests_total{status=~"5.."} |
| Saturation | How full your system is | CPU, memory, disk usage |
USE Method (for Resources)
For infrastructure resources, use the USE method:
| Metric | CPU | Memory | Disk | Network |
|---|---|---|---|---|
| Utilization | % busy | % used | % full | bandwidth used |
| Saturation | run queue length | swap usage | I/O wait | dropped packets |
| Errors | — | OOM kills | disk errors | NIC errors |
RED Method (for Services)
For microservices:
| Metric | Description |
|---|---|
| Rate | Requests per second |
| Errors | Failed requests per second |
| Duration | Time per request (latency) |
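Using the http_requests_total and http_request_duration_seconds metrics from earlier, the three RED queries look roughly like this (PromQL itself is covered next):
# Rate: requests per second
sum(rate(http_requests_total[5m]))
# Errors: failed requests per second
sum(rate(http_requests_total{status=~"5.."}[5m]))
# Duration: p95 latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))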
PromQL: Querying Your Metrics
PromQL (Prometheus Query Language) is how you query metrics. Let’s go from basics to practical examples.
Basic Queries
# Simple metric
up
# With label filter
up{job="node-exporter"}
# Multiple label filters
http_requests_total{method="GET", status="200"}
# Regex match
http_requests_total{status=~"2.."}
# Negative match
http_requests_total{status!="500"}
Common Functions
rate() - Per-second rate of increase (for counters)
# Requests per second over the last 5 minutes
rate(http_requests_total[5m])
increase() - Total increase over time range
# Total requests in the last hour
increase(http_requests_total[1h])
sum() - Aggregate across labels
# Total requests per second across all instances
sum(rate(http_requests_total[5m]))
# Total requests per second grouped by status
sum(rate(http_requests_total[5m])) by (status)
avg(), min(), max()
# Average CPU across all instances
avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Max memory usage
max(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
histogram_quantile() - Calculate percentiles
# 99th percentile latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# 50th percentile (median)
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Practical Query Examples
CPU Usage %
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory Usage %
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Disk Usage %
(1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) * 100
Error Rate %
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
Request Latency p95
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))
Container CPU Usage
sum(rate(container_cpu_usage_seconds_total{name!=""}[5m])) by (name) * 100
Container Memory Usage
sum(container_memory_usage_bytes{name!=""}) by (name) / 1024 / 1024
Top 5 Pods by CPU
topk(5, sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (pod))
Alerting with Alertmanager
Alertmanager handles alert routing, grouping, and notification.
Alert Flow
┌────────────┐        ┌──────────────┐        ┌─────────────────┐
│ Prometheus │───────▶│ Alertmanager │───────▶│  Notification   │
│ (evaluate  │        │   (route,    │        │ (Slack, Email,  │
│   rules)   │        │   dedupe)    │        │   PagerDuty)    │
└────────────┘        └──────────────┘        └─────────────────┘
Alert Rule Structure
groups:
- name: example
rules:
- alert: HighErrorRate # Alert name
expr: | # PromQL expression
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) > 0.05
for: 5m # Must be true for this duration
labels:
severity: critical # Custom labels
annotations:
summary: "High error rate detected"
description: "Error rate is %"
Alertmanager Routing
route:
receiver: 'default'
group_by: ['alertname', 'severity']
group_wait: 30s # Wait before sending first notification
group_interval: 5m # Wait before sending subsequent notifications
repeat_interval: 4h # Resend notifications for ongoing alerts
routes:
# Critical alerts → PagerDuty
- match:
severity: critical
receiver: 'pagerduty'
# Warning alerts → Slack
- match:
severity: warning
receiver: 'slack'
Notification Examples
Slack
receivers:
- name: 'slack'
slack_configs:
- api_url: 'https://hooks.slack.com/services/xxx/xxx/xxx'
channel: '#alerts'
send_resolved: true
        title: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
PagerDuty
receivers:
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'your-service-key'
        severity: '{{ .CommonLabels.severity }}'
Email
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@example.com'
smtp_auth_username: 'alerts@example.com'
smtp_auth_password: 'app-password'
receivers:
- name: 'email'
email_configs:
- to: 'team@example.com'
send_resolved: true
Essential Alerts to Start With
Here’s a starter alert rules file covering the basics:
groups:
- name: instance-health
rules:
- alert: InstanceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Instance is down"
- name: host-resources
rules:
- alert: HighCpuUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU on "
description: "CPU usage is %"
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory on "
description: "Memory usage is %"
- alert: DiskSpaceLow
expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "Low disk space on "
description: "Disk is % full"
- name: application
rules:
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is %"
- alert: HighLatency
expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "p99 latency is s"
Production Considerations
Moving from development to production? Here’s what to think about.
Storage and Retention
# prometheus.yml or Helm values
prometheus:
prometheusSpec:
retention: 15d # How long to keep data
retentionSize: 50GB # Or limit by size
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: gp3 # Use fast storage
resources:
requests:
storage: 100Gi
Rule of thumb: each sample takes roughly 1-2 bytes of disk. With a 15-second scrape interval:
- 1000 time series × 86400 seconds/day ÷ 15 × 2 bytes ≈ 11.5 MB/day
High Availability
For production, run multiple Prometheus instances:
# Two Prometheus instances scraping the same targets
prometheus-1: ──┬── scrapes ── targets
prometheus-2: ──┘
# Query through Thanos or Prometheus federation
Or use Thanos for long-term storage and global view:
┌────────────┐     ┌────────────┐
│ Prometheus │     │ Prometheus │
│ (cluster 1)│     │ (cluster 2)│
└─────┬──────┘     └─────┬──────┘
      │                  │
      ▼                  ▼
┌─────────────────────────────────┐
│          Thanos Query           │
├─────────────────────────────────┤
│      Thanos Store (S3/GCS)      │
└─────────────────────────────────┘
Resource Requirements
| Component | CPU | Memory | Storage |
|---|---|---|---|
| Prometheus | 0.5-2 cores | 2-8 GB | 50-500 GB |
| Grafana | 0.1-0.5 cores | 256-512 MB | 1-10 GB |
| Alertmanager | 0.1 cores | 128-256 MB | 1 GB |
| Node Exporter | 0.1 cores | 50 MB | - |
Security
- Don’t expose Prometheus directly to the internet
# Bad: Prometheus on public IP
# Good: Behind reverse proxy with auth
# nginx example
location /prometheus/ {
auth_basic "Prometheus";
auth_basic_user_file /etc/nginx/.htpasswd;
proxy_pass http://localhost:9090/;
}
- Use network policies in Kubernetes
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: prometheus-ingress
namespace: monitoring
spec:
podSelector:
matchLabels:
app: prometheus
ingress:
- from:
- namespaceSelector:
matchLabels:
name: monitoring
ports:
- port: 9090
- Secure Grafana
# Environment variables
GF_SECURITY_ADMIN_PASSWORD: "strong-password"
GF_USERS_ALLOW_SIGN_UP: "false"
GF_AUTH_ANONYMOUS_ENABLED: "false"
Scaling Tips
| Problem | Solution |
|---|---|
| Too many targets | Use hierarchical federation |
| High cardinality | Reduce labels, drop unused metrics |
| Slow queries | Add recording rules for expensive queries (see the sketch below) |
| Grafana slow | Enable query caching |
| Storage full | Reduce retention, use remote storage |
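A recording rule precomputes an expensive expression on the evaluation interval and stores the result as a new series, so dashboards query the cheap precomputed metric instead. A minimal sketch using the request-rate query from earlier (the rule name follows the level:metric:operations convention; load the file via rule_files alongside alerts.yml, or through your Helm values):
groups:
  - name: recording-rules
    rules:
      # Precompute per-job request rate so dashboards don't re-evaluate it
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))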
Troubleshooting Common Issues
Prometheus Not Scraping Targets
Symptom: Targets show as DOWN in Prometheus UI
Check 1: Can Prometheus reach the target?
# From Prometheus container/pod
curl http://target:port/metrics
Check 2: Is the metrics endpoint working?
# Should return Prometheus format metrics
curl http://localhost:8080/metrics
Check 3: Firewall/Security groups
# EC2: Check security group allows port
# K8s: Check NetworkPolicy
kubectl get networkpolicy -A
Check 4: Service discovery (Kubernetes)
# Check if ServiceMonitor is picked up
kubectl get servicemonitor -n monitoring
# Check Prometheus config
kubectl get secret prometheus-kube-prometheus-prometheus -n monitoring -o jsonpath='{.data.prometheus\.yaml\.gz}' | base64 -d | gunzip
High Memory Usage
Symptom: Prometheus using too much RAM
Cause: Usually high cardinality (too many unique label combinations)
Diagnose:
# Check TSDB stats
prometheus_tsdb_head_series
# Find high cardinality metrics
topk(10, count by (__name__)({__name__=~".+"}))
Fix: Drop unused labels or metrics in scrape config:
scrape_configs:
- job_name: 'my-app'
static_configs:
- targets: ['app:8080']
metric_relabel_configs:
# Drop metrics you don't need
- source_labels: [__name__]
regex: 'go_.*'
action: drop
# Drop high-cardinality labels
- regex: 'id'
action: labeldrop
Grafana Not Showing Data
Symptom: Panels show “No data”
Check 1: Data source connection
- Go to Data Sources → Prometheus → Save & Test
Check 2: Time range
- Ensure the time range selector includes when data was collected
Check 3: Query syntax
- Test the query directly in the Prometheus UI at :9090/graph
Check 4: Metric exists
# In Prometheus, check if metric exists
{__name__=~".*your_metric.*"}
Alerts Not Firing
Symptom: Alert conditions met but no notifications
Check 1: Alert state in Prometheus
- Go to :9090/alerts and check the alert state
Check 2: Alertmanager receiving alerts
- Go to :9093 and check if the alerts appear
Check 3: Alertmanager config
# Validate config
amtool check-config alertmanager.yml
# Check routing
amtool config routes show --config.file=alertmanager.yml
Check 4: Notification channel
- Test webhook/Slack manually
- Check Alertmanager logs for errors
Quick Reference
Essential Prometheus Endpoints
| Endpoint | Purpose |
|---|---|
| /metrics | Prometheus’s own metrics |
| /targets | Scrape target status |
| /alerts | Active alerts |
| /graph | Query interface |
| /config | Current configuration |
| /rules | Loaded rules |
| /-/reload | Reload config (POST) |
| /-/healthy | Health check |
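For example, the /-/reload endpoint applies configuration changes without a restart (the Docker Compose setup above enables it with --web.enable-lifecycle):
curl -X POST http://localhost:9090/-/reload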
Common PromQL Patterns
# Rate of counter over time
rate(metric_total[5m])
# Percentage
(a / b) * 100
# Top N
topk(5, metric)
# Percentile from histogram
histogram_quantile(0.99, sum(rate(metric_bucket[5m])) by (le))
# Increase over time period
increase(metric_total[1h])
# Average across instances
avg by (label) (metric)
# Absent (for alerting on missing metrics)
absent(up{job="my-job"})
Useful Grafana Variables
Add these as dashboard variables for dynamic filtering:
# Instance selector
label_values(up, instance)
# Job selector
label_values(up, job)
# Namespace selector (Kubernetes)
label_values(kube_pod_info, namespace)
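Once defined, reference a variable in panel queries with a regex matcher so multi-select keeps working, for example:
# Filter panels by the selected instance(s)
up{instance=~"$instance"}
rate(http_requests_total{instance=~"$instance"}[5m])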
Conclusion
Monitoring is a journey, not a destination. Start simple:
- Deploy the stack (Docker Compose or Helm)
- Import community dashboards (Node Exporter, your platform)
- Set up basic alerts (instance down, high resource usage)
- Instrument your applications (add /metrics endpoints)
- Build custom dashboards for your specific needs
- Iterate based on incidents and questions
The best monitoring setup is one that helps you answer: “Is my system healthy right now?” and “What went wrong yesterday at 3 PM?”
Further Reading
- Prometheus Documentation
- Grafana Documentation
- Awesome Prometheus Alerts - Pre-built alert rules
- Grafana Dashboard Library - Community dashboards
- Google SRE Book - Monitoring Chapter