How do I install Postmortem Writing?

Install Postmortem Writing with a single command: npx mdskills install sickn33/postmortem-writing. This downloads the skill files into your project and your AI agent picks them up automatically.

What platforms support Postmortem Writing?

Postmortem Writing works with Claude Code, Claude Desktop, Cursor, Vscode Copilot, Windsurf, Continue Dev, Codex, Gemini Cli, Amp, Roo Code, Goose, Opencode, Trae, Qodo, Command Code. Skills use the open SKILL.md format which is compatible with any AI coding agent that reads markdown instructions.

← Back to skills

Postmortem Writing

Name: Postmortem Writing: AI Agent Skill
Rating: 8 (1 reviews)
Author: sickn33

Verified

Intermediate

Write effective blameless postmortems with root cause analysis, timelines, and action items. Use when conducting incident reviews, writing postmortem documents, or improving incident response processes.

by @sickn330Updated 2/20/2026

Add this skill

npx mdskills install sickn33/postmortem-writing

Fork & Edit

Skill Advisor8.0

Provides comprehensive templates, facilitation guidance, and blameless culture principles for incident reviews

+Delivers three well-structured postmortem templates covering different incident types and depths
+Emphasizes blameless culture with clear before/after comparisons and meeting facilitation tips
+Includes practical examples with timelines, 5 Whys analysis, and anti-patterns to avoid
-Declares shell execution and network permissions without clear justification in the skill content

SKILL.md

Edit in Browser

1---
2name: postmortem-writing
3description: Write effective blameless postmortems with root cause analysis, timelines, and action items. Use when conducting incident reviews, writing postmortem documents, or improving incident response processes.
4---
5 
6# Postmortem Writing
7 
8Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
9 
10## Do not use this skill when
11 
12- The task is unrelated to postmortem writing
13- You need a different domain or tool outside this scope
14 
15## Instructions
16 
17- Clarify goals, constraints, and required inputs.
18- Apply relevant best practices and validate outcomes.
19- Provide actionable steps and verification.
20- If detailed examples are required, open `resources/implementation-playbook.md`.
21 
22## Use this skill when
23 
24- Conducting post-incident reviews
25- Writing postmortem documents
26- Facilitating blameless postmortem meetings
27- Identifying root causes and contributing factors
28- Creating actionable follow-up items
29- Building organizational learning culture
30 
31## Core Concepts
32 
33### 1. Blameless Culture
34 
35| Blame-Focused | Blameless |
36|---------------|-----------|
37| "Who caused this?" | "What conditions allowed this?" |
38| "Someone made a mistake" | "The system allowed this mistake" |
39| Punish individuals | Improve systems |
40| Hide information | Share learnings |
41| Fear of speaking up | Psychological safety |
42 
43### 2. Postmortem Triggers
44 
45- SEV1 or SEV2 incidents
46- Customer-facing outages > 15 minutes
47- Data loss or security incidents
48- Near-misses that could have been severe
49- Novel failure modes
50- Incidents requiring unusual intervention
51 
52## Quick Start
53 
54### Postmortem Timeline
55```
56Day 0: Incident occurs
57Day 1-2: Draft postmortem document
58Day 3-5: Postmortem meeting
59Day 5-7: Finalize document, create tickets
60Week 2+: Action item completion
61Quarterly: Review patterns across incidents
62```
63 
64## Templates
65 
66### Template 1: Standard Postmortem
67 
68```markdown
69# Postmortem: [Incident Title]
70 
71**Date**: 2024-01-15
72**Authors**: @alice, @bob
73**Status**: Draft | In Review | Final
74**Incident Severity**: SEV2
75**Incident Duration**: 47 minutes
76 
77## Executive Summary
78 
79On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.
80 
81**Impact**:
82- 12,000 customers unable to complete purchases
83- Estimated revenue loss: $45,000
84- 847 support tickets created
85- No data loss or security implications
86 
87## Timeline (All times UTC)
88 
89| Time | Event |
90|------|-------|
91| 14:23 | Deployment v2.3.4 completed to production |
92| 14:31 | First alert: `payment_error_rate > 5%` |
93| 14:33 | On-call engineer @alice acknowledges alert |
94| 14:35 | Initial investigation begins, error rate at 23% |
95| 14:41 | Incident declared SEV2, @bob joins |
96| 14:45 | Database connection exhaustion identified |
97| 14:52 | Decision to rollback deployment |
98| 14:58 | Rollback to v2.3.3 initiated |
99| 15:10 | Rollback complete, error rate dropping |
100| 15:18 | Service fully recovered, incident resolved |
101 
102## Root Cause Analysis
103 
104### What Happened
105 
106The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.
107 
108### Why It Happened
109 
1101. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls.
111 
1122. **Contributing Factors**:
113   - Code review did not catch the connection handling change
114   - No integration tests specifically for connection pool behavior
115   - Staging environment has lower traffic, masking the issue
116   - Database connection metrics alert threshold was too high (90%)
117 
1183. **5 Whys Analysis**:
119   - Why did the service fail? → Database connections exhausted
120   - Why were connections exhausted? → Each request opened new connection
121   - Why did each request open new connection? → Code bypassed connection pool
122   - Why did code bypass connection pool? → Developer unfamiliar with codebase patterns
123   - Why was developer unfamiliar? → No documentation on connection management patterns
124 
125### System Diagram
126 
127```
128[Client] → [Load Balancer] → [Payment Service] → [Database]
129                                    ↓
130                            Connection Pool (broken)
131                                    ↓
132                            Direct connections (cause)
133```
134 
135## Detection
136 
137### What Worked
138- Error rate alert fired within 8 minutes of deployment
139- Grafana dashboard clearly showed connection spike
140- On-call response was swift (2 minute acknowledgment)
141 
142### What Didn't Work
143- Database connection metric alert threshold too high
144- No deployment-correlated alerting
145- Canary deployment would have caught this earlier
146 
147### Detection Gap
148The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.
149 
150## Response
151 
152### What Worked
153- On-call engineer quickly identified database as the issue
154- Rollback decision was made decisively
155- Clear communication in incident channel
156 
157### What Could Be Improved
158- Took 10 minutes to correlate issue with recent deployment
159- Had to manually check deployment history
160- Rollback took 12 minutes (could be faster)
161 
162## Impact
163 
164### Customer Impact
165- 12,000 unique customers affected
166- Average impact duration: 35 minutes
167- 847 support tickets (23% of affected users)
168- Customer satisfaction score dropped 12 points
169 
170### Business Impact
171- Estimated revenue loss: $45,000
172- Support cost: ~$2,500 (agent time)
173- Engineering time: ~8 person-hours
174 
175### Technical Impact
176- Database primary experienced elevated load
177- Some replica lag during incident
178- No permanent damage to systems
179 
180## Lessons Learned
181 
182### What Went Well
1831. Alerting detected the issue before customer reports
1842. Team collaborated effectively under pressure
1853. Rollback procedure worked smoothly
1864. Communication was clear and timely
187 
188### What Went Wrong
1891. Code review missed critical change
1902. Test coverage gap for connection pooling
1913. Staging environment doesn't reflect production traffic
1924. Alert thresholds were not tuned properly
193 
194### Where We Got Lucky
1951. Incident occurred during business hours with full team available
1962. Database handled the load without failing completely
1973. No other incidents occurred simultaneously
198 
199## Action Items
200 
201| Priority | Action | Owner | Due Date | Ticket |
202|----------|--------|-------|----------|--------|
203| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
204| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
205| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
206| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
207| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
208| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |
209 
210## Appendix
211 
212### Supporting Data
213 
214#### Error Rate Graph
215[Link to Grafana dashboard snapshot]
216 
217#### Database Connection Graph
218[Link to metrics]
219 
220### Related Incidents
221- 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)
222 
223### References
224- [Connection Pool Best Practices](internal-wiki/connection-pools)
225- [Deployment Runbook](internal-wiki/deployment-runbook)
226```
227 
228### Template 2: 5 Whys Analysis
229 
230```markdown
231# 5 Whys Analysis: [Incident]
232 
233## Problem Statement
234Payment service experienced 47-minute outage due to database connection exhaustion.
235 
236## Analysis
237 
238### Why #1: Why did the service fail?
239**Answer**: Database connections were exhausted, causing all new requests to fail.
240 
241**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
242 
243---
244 
245### Why #2: Why were database connections exhausted?
246**Answer**: Each incoming request opened a new database connection instead of using the connection pool.
247 
248**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.
249 
250---
251 
252### Why #3: Why did the code bypass the connection pool?
253**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.
254 
255**Evidence**: PR #1234 shows the change, made while fixing a different bug.
256 
257---
258 
259### Why #4: Why wasn't this caught in code review?
260**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
261 
262**Evidence**: Review comments only discuss business logic.
263 
264---
265 
266### Why #5: Why isn't there a safety net for this type of change?
267**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
268 
269**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.
270 
271## Root Causes Identified
272 
2731. **Primary**: Missing automated tests for infrastructure behavior
2742. **Secondary**: Insufficient documentation of architectural patterns
2753. **Tertiary**: Code review checklist doesn't include infrastructure considerations
276 
277## Systemic Improvements
278 
279| Root Cause | Improvement | Type |
280|------------|-------------|------|
281| Missing tests | Add infrastructure behavior tests | Prevention |
282| Missing docs | Document connection patterns | Prevention |
283| Review gaps | Update review checklist | Detection |
284| No canary | Implement canary deployments | Mitigation |
285```
286 
287### Template 3: Quick Postmortem (Minor Incidents)
288 
289```markdown
290# Quick Postmortem: [Brief Title]
291 
292**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3
293 
294## What Happened
295API latency spiked to 5s due to cache miss storm after cache flush.
296 
297## Timeline
298- 10:00 - Cache flush initiated for config update
299- 10:02 - Latency alerts fire
300- 10:05 - Identified as cache miss storm
301- 10:08 - Enabled cache warming
302- 10:12 - Latency normalized
303 
304## Root Cause
305Full cache flush for minor config update caused thundering herd.
306 
307## Fix
308- Immediate: Enabled cache warming
309- Long-term: Implement partial cache invalidation (ENG-999)
310 
311## Lessons
312Don't full-flush cache in production; use targeted invalidation.
313```
314 
315## Facilitation Guide
316 
317### Running a Postmortem Meeting
318 
319```markdown
320## Meeting Structure (60 minutes)
321 
322### 1. Opening (5 min)
323- Remind everyone of blameless culture
324- "We're here to learn, not to blame"
325- Review meeting norms
326 
327### 2. Timeline Review (15 min)
328- Walk through events chronologically
329- Ask clarifying questions
330- Identify gaps in timeline
331 
332### 3. Analysis Discussion (20 min)
333- What failed?
334- Why did it fail?
335- What conditions allowed this?
336- What would have prevented it?
337 
338### 4. Action Items (15 min)
339- Brainstorm improvements
340- Prioritize by impact and effort
341- Assign owners and due dates
342 
343### 5. Closing (5 min)
344- Summarize key learnings
345- Confirm action item owners
346- Schedule follow-up if needed
347 
348## Facilitation Tips
349- Keep discussion on track
350- Redirect blame to systems
351- Encourage quiet participants
352- Document dissenting views
353- Time-box tangents
354```
355 
356## Anti-Patterns to Avoid
357 
358| Anti-Pattern | Problem | Better Approach |
359|--------------|---------|-----------------|
360| **Blame game** | Shuts down learning | Focus on systems |
361| **Shallow analysis** | Doesn't prevent recurrence | Ask "why" 5 times |
362| **No action items** | Waste of time | Always have concrete next steps |
363| **Unrealistic actions** | Never completed | Scope to achievable tasks |
364| **No follow-up** | Actions forgotten | Track in ticketing system |
365 
366## Best Practices
367 
368### Do's
369- **Start immediately** - Memory fades fast
370- **Be specific** - Exact times, exact errors
371- **Include graphs** - Visual evidence
372- **Assign owners** - No orphan action items
373- **Share widely** - Organizational learning
374 
375### Don'ts
376- **Don't name and shame** - Ever
377- **Don't skip small incidents** - They reveal patterns
378- **Don't make it a blame doc** - That kills learning
379- **Don't create busywork** - Actions should be meaningful
380- **Don't skip follow-up** - Verify actions completed
381 
382## Resources
383 
384- [Google SRE - Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
385- [Etsy's Blameless Postmortems](https://codeascraft.com/2012/05/22/blameless-postmortems/)
386- [PagerDuty Postmortem Guide](https://postmortems.pagerduty.com/)
387

Full transparency — inspect the skill content before installing.