Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.
Add this skill:
npx mdskills install sickn33/incident-runbook-templates
Comprehensive incident response templates with excellent structure, commands, and communication examples.
---
name: incident-runbook-templates
description: Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.
---

# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## Do not use this skill when

- The task is unrelated to incident runbook templates
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Use this skill when

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| **SEV1** | Complete outage, data loss | 15 min | Production down |
| **SEV2** | Major degradation | 30 min | Critical feature broken |
| **SEV3** | Minor impact | 2 hours | Non-critical bug |
| **SEV4** | Minimal impact | Next business day | Cosmetic issue |

### 2. Runbook Structure

```
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
```

## Runbook Templates

### Template 1: Service Outage Runbook

````markdown
# [Service Name] Outage Runbook

## Overview
**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection
### Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards
- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates (-G with --data-urlencode keeps curl from globbing {} and [])
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks
- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history

### 3. Initial Classification
| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

## Mitigation Procedures

### 4.1 Service Completely Down
```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency
```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# Note: the "duration" alias cannot be used in WHERE, so the expression is repeated
psql -h $DB_HOST -U $DB_USER -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill a long-running query if needed (set QUERY_PID to a pid from Step 2)
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend($QUERY_PID);"

# Step 4: Check external dependency latency
# (requires a curl-format.txt write-out template in the working directory)
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

### 4.3 Partial Failures (Specific Errors)
```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
```

### 4.4 Traffic Surge
```bash
# Step 1: Check current pod CPU/memory usage as a proxy for load
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
# (allows all ingress except the listed range; other policies selecting
# these pods may still allow it, since NetworkPolicies are additive)
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```

## Verification Steps
```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```

## Rollback Procedures
```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```

## Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |

## Communication Templates

### Initial Notification (Internal)
```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```

### Status Update
```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```

### Resolution Notification
```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
````

### Template 2: Database Incident Runbook

````markdown
# Database Incident Runbook

## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(<pid>);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate connections that have been idle for more than 10 minutes
-- (state_change records when the connection entered its current state)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '10 minutes';
```

## Replication Lag
```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```

## Disk Space Critical
```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
# Caution: VACUUM FULL takes an exclusive lock on the table and needs free
# disk space for the rewritten copy, so confirm there is headroom first
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
````

## Best Practices

### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts
- **Don't assume knowledge** - Write for the 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Resources

- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)
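
The Incident Severity Levels table and the first row of the Escalation Matrix can also be encoded so that on-call tooling applies them consistently. The following is a minimal sketch in POSIX shell, not part of the skill itself: the function names `sev_response_target` and `needs_manager_escalation` are hypothetical, and only the thresholds (15 min, 30 min, 2 hours, next business day, and "> 15 min unresolved SEV1") come from the tables in this document.

```shell
#!/bin/sh
# Hypothetical helpers (illustrative names, not part of the skill).

# Print the response-time target for a severity level, in minutes,
# using the values from the Incident Severity Levels table.
# SEV4 has no fixed minute value, so it prints "next-business-day".
sev_response_target() {
  case "$1" in
    SEV1) echo 15 ;;
    SEV2) echo 30 ;;
    SEV3) echo 120 ;;
    SEV4) echo next-business-day ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Succeed (exit status 0) when a SEV1 has been unresolved for more than
# 15 minutes, which is the first condition in the Escalation Matrix.
# Arguments: severity, whole minutes the incident has been unresolved.
needs_manager_escalation() {
  severity=$1
  minutes_unresolved=$2
  [ "$severity" = "SEV1" ] && [ "$minutes_unresolved" -gt 15 ]
}
```

For example, `sev_response_target SEV2` prints `30`, and `needs_manager_escalation SEV1 20 && echo "Escalate to Engineering Manager"` prints the escalation message. Keeping the thresholds in one place means a change to the severity table only has to be mirrored in a single function.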