How do I install Incident Runbook Templates?

Install Incident Runbook Templates with a single command: npx mdskills install sickn33/incident-runbook-templates. This downloads the skill files into your project and your AI agent picks them up automatically.

What platforms support Incident Runbook Templates?

Incident Runbook Templates works with Claude Code, Claude Desktop, Cursor, VS Code Copilot, Windsurf, Continue Dev, Codex, Gemini CLI, Amp, Roo Code, Goose, OpenCode, Trae, Qodo, and Command Code. Skills use the open SKILL.md format, which is compatible with any AI coding agent that reads markdown instructions.

Incident Runbook Templates

Name: Incident Runbook Templates: AI Agent Skill
Rating: 8 (1 review)
Author: sickn33

Verified

Intermediate

Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.

by @sickn33, updated 2/20/2026

Add this skill

npx mdskills install sickn33/incident-runbook-templates

Skill Advisor: 8.0

Comprehensive incident response templates with excellent structure, commands, and communication examples

+ Provides detailed, actionable runbooks with real commands and verification steps
+ Includes communication templates, escalation paths, and severity classification
+ Covers multiple incident types with specific troubleshooting procedures
- Declares shell/network permissions, but the skill generates templates rather than executing commands

SKILL.md

---
name: incident-runbook-templates
description: Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.
---

# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## Do not use this skill when

- The task is unrelated to incident runbook templates
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Use this skill when

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| **SEV1** | Complete outage, data loss | 15 min | Production down |
| **SEV2** | Major degradation | 30 min | Critical feature broken |
| **SEV3** | Minor impact | 2 hours | Non-critical bug |
| **SEV4** | Minimal impact | Next business day | Cosmetic issue |

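
The severity table can be encoded so that paging scripts and humans agree on response times. This is a minimal sketch, not part of the skill itself: the function name `sev_response_time` is hypothetical, and the values are copied from the table above.

```bash
# Hypothetical helper: map a severity label to its target response time.
# Values mirror the severity table; the function name is illustrative only.
sev_response_time() {
  case "$1" in
    SEV1) echo "15 min" ;;
    SEV2) echo "30 min" ;;
    SEV3) echo "2 hours" ;;
    SEV4) echo "Next business day" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `sev_response_time SEV2` prints `30 min`, and an unrecognized label returns a non-zero status so a calling script can stop early.
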
### 2. Runbook Structure

```
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
```

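
One way to keep new runbooks consistent with this structure is to scaffold them from a script. The sketch below is an assumption-laden example: `new_runbook` is a hypothetical name, and the `TODO` placeholders are illustrative. It prints a markdown skeleton with the nine sections listed above.

```bash
# Hypothetical scaffold: print a runbook skeleton for the given service name.
# Section titles are taken from the structure listed above.
new_runbook() {
  local service="$1" n=0 section
  printf '# %s Runbook\n' "$service"
  while IFS= read -r section; do
    n=$((n + 1))
    printf '\n## %d. %s\n\nTODO\n' "$n" "$section"
  done <<'EOF'
Overview & Impact
Detection & Alerts
Initial Triage
Mitigation Steps
Root Cause Investigation
Resolution Procedures
Verification & Rollback
Communication Templates
Escalation Matrix
EOF
}
```

A possible use is `new_runbook "Payment Service" > payment-service-runbook.md`, after which each `TODO` is filled in by the owning team.
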
## Runbook Templates

### Template 1: Service Outage Runbook

```markdown
# [Service Name] Outage Runbook

## Overview
**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection
### Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards
- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check the 5xx error rate (-G with --data-urlencode stops curl from
# treating the {} and [] in the PromQL expression as URL globs)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks
- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history

### 3. Initial Classification
| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

## Mitigation Procedures

### 4.1 Service Completely Down
```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency
```bash
# Step 1: Check database connection pool metrics
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# A SELECT alias cannot be used in WHERE, so the expression is repeated there
psql -h "$DB_HOST" -U "$DB_USER" -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# (replace <pid> with a pid returned by Step 2)
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT pg_terminate_backend(<pid>);"

# Step 4: Check external dependency latency (prints total request time)
curl -w 'total: %{time_total}s\n' -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

### 4.3 Partial Failures (Specific Errors)
```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h "$DB_HOST" -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
```

### 4.4 Traffic Surge
```bash
# Step 1: Check current load (kubectl top reports CPU and memory per pod,
# not request rate; use the dashboards for requests per second)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```

## Verification Steps
```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" \
  | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" \
  | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```

## Rollback Procedures
```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh "$MIGRATION_VERSION"

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```

## Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |

## Communication Templates

### Initial Notification (Internal)
```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```

### Status Update
```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```

### Resolution Notification
```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
```

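
The two numeric rows of the escalation matrix in Template 1 (an unresolved SEV1 past 15 minutes, and financial impact above $10k) lend themselves to a scripted check. The following is a sketch under stated assumptions: `escalation_targets` is a hypothetical name, its three arguments are severity, whole minutes unresolved, and estimated impact in whole US dollars, and the remaining matrix rows (suspected data breach, customer communication) are left to human judgment.

```bash
# Hypothetical helper: print who to escalate to, one target per line.
# Usage: escalation_targets SEVERITY MINUTES_UNRESOLVED IMPACT_USD
# Thresholds are copied from the escalation matrix in Template 1.
escalation_targets() {
  local severity="$1" minutes="$2" impact_usd="$3"
  if [ "$severity" = "SEV1" ] && [ "$minutes" -gt 15 ]; then
    echo "Engineering Manager"
  fi
  if [ "$impact_usd" -gt 10000 ]; then
    echo "Finance + Legal"
  fi
}
```

For instance, `escalation_targets SEV1 20 0` prints `Engineering Manager`, while `escalation_targets SEV1 15 10000` prints nothing because neither threshold has been exceeded.
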
### Template 2: Database Incident Runbook

```markdown
# Database Incident Runbook

## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(<pid>);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate connections idle for more than 10 minutes
-- (state_change is when the session went idle; skip our own session)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '10 minutes'
AND pid <> pg_backend_pid();
```

## Replication Lag
```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```

## Disk Space Critical
```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM FULL returns space to the OS, but it rewrites the table: it needs
# free disk for the new copy and holds an ACCESS EXCLUSIVE lock while running
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
```

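
The 60-second replication lag threshold from Template 2 can be turned into a small check for monitoring scripts. This is a sketch, not a definitive implementation: `replication_lag_status` is a hypothetical name, and it assumes the lag arrives as a number of seconds, possibly fractional, as produced by the `lag_seconds` query. The comparison is delegated to `awk` because shell integer tests cannot handle fractional values.

```bash
# Hypothetical helper: classify replication lag against the 60s threshold.
# Usage: replication_lag_status LAG_SECONDS   (fractional values allowed)
replication_lag_status() {
  # awk exits 0 when lag is strictly greater than 60, non-zero otherwise
  if awk -v lag="$1" 'BEGIN { exit !(lag + 0 > 60) }'; then
    echo "investigate"
  else
    echo "ok"
  fi
}
```

A possible wiring, assuming `psql` can reach the replica, is to pass the query result in with `replication_lag_status "$(psql -At -c "...")"`, where `-At` makes `psql` print the bare value.
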
## Best Practices

### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts
- **Don't assume knowledge** - Write for a 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

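
The "don't skip verification" rule can be enforced mechanically by pairing every action with a check. The wrapper below is a minimal sketch: `run_step` is a hypothetical name, and it assumes each action and verification can be expressed as a single shell command string. It stops at the first action or verification that fails.

```bash
# Hypothetical wrapper: run an action, then confirm it worked.
# Usage: run_step DESCRIPTION ACTION_COMMAND VERIFY_COMMAND
run_step() {
  local desc="$1" action="$2" verify="$3"
  echo "STEP: $desc"
  if ! sh -c "$action"; then
    echo "FAILED: $desc (action returned non-zero)" >&2
    return 1
  fi
  if ! sh -c "$verify"; then
    echo "NOT VERIFIED: $desc (verification returned non-zero)" >&2
    return 1
  fi
  echo "OK: $desc"
}
```

As an illustration using commands from Template 1, a rollback step could be written as `run_step "Roll back deployment" "kubectl rollout undo deployment/payment-service -n payments" "kubectl rollout status deployment/payment-service -n payments --timeout=120s"`.
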
## Resources

- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)

Full transparency — inspect the skill content before installing.