Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.
Add this skill: `npx mdskills install sickn33/prometheus-configuration`

Comprehensive Prometheus guide with detailed configs, rules, and multiple deployment patterns.
---
name: prometheus-configuration
description: Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.
---

# Prometheus Configuration

Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.

## Do not use this skill when

- The task is unrelated to Prometheus configuration
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Purpose

Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.

## Use this skill when

- Setting up Prometheus monitoring
- Configuring metric scraping
- Creating recording rules
- Designing alert rules
- Implementing service discovery

## Prometheus Architecture

```
┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
       ↓
┌──────────────┐
│  Prometheus  │ ← Scrapes metrics periodically
│    Server    │
└──────┬───────┘
       │
       ├─→ Alertmanager (alerts)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)
```

## Installation

### Kubernetes with Helm

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Persistent storage is requested through storageSpec.volumeClaimTemplate;
# without it the chart falls back to ephemeral storage.
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```

### Docker Compose

```yaml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'

volumes:
  prometheus-data:
```

## Configuration File

**prometheus.yml:**
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-west-2'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporters
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
    relabel_configs:
      # Strip the port so the instance label is just the hostname
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+)(:[0-9]+)?'
        replacement: '${1}'

  # Kubernetes pods with annotations
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Honor a custom metrics path from prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Application metrics
  - job_name: 'my-app'
    static_configs:
      - targets:
          - 'app1.example.com:9090'
          - 'app2.example.com:9090'
    metrics_path: '/metrics'
    scheme: 'https'
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      cert_file: /etc/prometheus/client.crt
      key_file: /etc/prometheus/client.key
```

**Reference:** See `assets/prometheus.yml.template`

## Scrape Configurations

### Static Targets

```yaml
scrape_configs:
  - job_name: 'static-targets'
    static_configs:
      - targets: ['host1:9100', 'host2:9100']
        labels:
          env: 'production'
          region: 'us-west-2'
```

### File-based Service Discovery

```yaml
scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m
```

**targets/production.json:**
```json
[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": {
      "env": "production",
      "service": "api"
    }
  }
]
```

### Kubernetes Service Discovery

```yaml
scrape_configs:
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```

**Reference:** See `references/scrape-configs.md`

## Recording Rules

Create pre-computed metrics for frequently queried expressions:

```yaml
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      # HTTP request rate per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # 5xx error rate per job
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      # Error rate as a percentage of all requests
      - record: job:http_requests_error_rate:percentage
        expr: |
          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

      # P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      # CPU utilization percentage
      - record: instance:node_cpu:utilization
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory utilization percentage
      - record: instance:node_memory:utilization
        expr: |
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

      # Disk usage percentage
      - record: instance:node_disk:utilization
        expr: |
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
```

**Reference:** See `references/recording-rules.md`

## Alert Rules

```yaml
# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: job:http_requests_error_rate:percentage > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: job:http_request_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"

  - name: resources
    interval: 1m
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"
```

## Validation

```bash
# Validate configuration
promtool check config prometheus.yml

# Validate rules
promtool check rules /etc/prometheus/rules/*.yml

# Test query
promtool query instant http://localhost:9090 'up'
```

**Reference:** See `scripts/validate-prometheus.sh`

## Best Practices

1. **Use consistent naming** for metrics (prefix_name_unit)
2. **Set appropriate scrape intervals** (15-60s typical)
3. **Use recording rules** for expensive queries
4. **Implement high availability** (multiple Prometheus instances)
5. **Configure retention** based on storage capacity
6. **Use relabeling** for metric cleanup
7. **Monitor Prometheus itself**
8. **Implement federation** for large deployments
9. **Use Thanos/Cortex** for long-term storage
10. **Document custom metrics**

## Troubleshooting

**Check scrape targets:**
```bash
curl http://localhost:9090/api/v1/targets
```

**Check configuration:**
```bash
curl http://localhost:9090/api/v1/status/config
```

**Test query:**
```bash
curl 'http://localhost:9090/api/v1/query?query=up'
```

## Reference Files

- `assets/prometheus.yml.template` - Complete configuration template
- `references/scrape-configs.md` - Scrape configuration patterns
- `references/recording-rules.md` - Recording rule examples
- `scripts/validate-prometheus.sh` - Validation script

## Related Skills

- `grafana-dashboards` - For visualization
- `slo-implementation` - For SLO monitoring
- `distributed-tracing` - For request tracing
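
The Best Practices list recommends federation for large deployments but does not show a configuration. A minimal sketch of what that could look like follows, assuming a global Prometheus server that pulls pre-aggregated series from per-cluster servers through the built-in `/federate` endpoint. The hostnames `prometheus-cluster-a` and `prometheus-cluster-b` are hypothetical, and the `match[]` selectors assume the `job:` and `instance:` recording-rule prefixes used in this guide; adapt both to your environment.

```yaml
# Scrape job on the global (federating) Prometheus server.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    # Keep the job/instance labels exposed by the source servers
    # instead of overwriting them with this job's own labels.
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Pull only aggregated recording-rule series, not raw metrics.
        - '{__name__=~"job:.*"}'
        - '{__name__=~"instance:.*"}'
    static_configs:
      - targets:
          # Hypothetical per-cluster Prometheus servers.
          - 'prometheus-cluster-a:9090'
          - 'prometheus-cluster-b:9090'
```

Federating only recording-rule output keeps the global server's series count small. Distinct `external_labels` on each source server (such as the `cluster` label in the main configuration) make it possible to tell the federated series apart.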