---
name: postmortem-writing
description: Write effective blameless postmortems with root cause analysis, timelines, and action items. Use when conducting incident reviews, writing postmortem documents, or improving incident response processes.
---

# Postmortem Writing

Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.

## Do not use this skill when

- The task is unrelated to postmortem writing
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Use this skill when

- Conducting post-incident reviews
- Writing postmortem documents
- Facilitating blameless postmortem meetings
- Identifying root causes and contributing factors
- Creating actionable follow-up items
- Building organizational learning culture

## Core Concepts

### 1. Blameless Culture

| Blame-Focused | Blameless |
|---------------|-----------|
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |

### 2. Postmortem Triggers

- SEV1 or SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
- Incidents requiring unusual intervention

## Quick Start

### Postmortem Timeline
```
Day 0: Incident occurs
Day 1-2: Draft postmortem document
Day 3-5: Postmortem meeting
Day 5-7: Finalize document, create tickets
Week 2+: Action item completion
Quarterly: Review patterns across incidents
```

## Templates

### Template 1: Standard Postmortem

````markdown
# Postmortem: [Incident Title]

**Date**: 2024-01-15
**Authors**: @alice, @bob
**Status**: Draft | In Review | Final
**Incident Severity**: SEV2
**Incident Duration**: 47 minutes

## Executive Summary

On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was database connection pool exhaustion triggered by a code change shipped in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.

**Impact**:
- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
- No data loss or security implications

## Timeline (All times UTC)

| Time | Event |
|------|-------|
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert: `payment_error_rate > 5%` |
| 14:33 | On-call engineer @alice acknowledges alert |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision to roll back deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |

## Root Cause Analysis

### What Happened

The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently called endpoint. Each request opened a new database connection instead of reusing pooled connections.

### Why It Happened

1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls.

2. **Contributing Factors**:
   - Code review did not catch the connection handling change
   - No integration tests specifically for connection pool behavior
   - Staging environment has lower traffic, masking the issue
   - Database connection metrics alert threshold was too high (90%)

3. **5 Whys Analysis**:
   - Why did the service fail? → Database connections exhausted
   - Why were connections exhausted? → Each request opened a new connection
   - Why did each request open a new connection? → Code bypassed the connection pool
   - Why did the code bypass the connection pool? → Developer unfamiliar with codebase patterns
   - Why was the developer unfamiliar? → No documentation on connection management patterns

### System Diagram

```
[Client] → [Load Balancer] → [Payment Service] → [Database]
                                    ↓
                         Connection Pool (broken)
                                    ↓
                        Direct connections (cause)
```

## Detection

### What Worked
- Error rate alert fired within 8 minutes of deployment
- Grafana dashboard clearly showed connection spike
- On-call response was swift (2 minute acknowledgment)

### What Didn't Work
- Database connection metric alert threshold too high
- No deployment-correlated alerting
- No canary deployment, which would have caught this earlier

### Detection Gap
The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.

## Response

### What Worked
- On-call engineer quickly identified database as the issue
- Rollback decision was made decisively
- Clear communication in incident channel

### What Could Be Improved
- Took 10 minutes to correlate issue with recent deployment
- Had to manually check deployment history
- Rollback took 12 minutes (could be faster)

## Impact

### Customer Impact
- 12,000 unique customers affected
- Average impact duration: 35 minutes
- 847 support tickets (about 7% of affected users)
- Customer satisfaction score dropped 12 points

### Business Impact
- Estimated revenue loss: $45,000
- Support cost: ~$2,500 (agent time)
- Engineering time: ~8 person-hours

### Technical Impact
- Database primary experienced elevated load
- Some replica lag during incident
- No permanent damage to systems

## Lessons Learned

### What Went Well
1. Alerting detected the issue before customer reports
2. Team collaborated effectively under pressure
3. Rollback procedure worked smoothly
4. Communication was clear and timely

### What Went Wrong
1. Code review missed critical change
2. Test coverage gap for connection pooling
3. Staging environment doesn't reflect production traffic
4. Alert thresholds were not tuned properly

### Where We Got Lucky
1. Incident occurred during business hours with full team available
2. Database handled the load without failing completely
3. No other incidents occurred simultaneously

## Action Items

| Priority | Action | Owner | Due Date | Ticket |
|----------|--------|-------|----------|--------|
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |

## Appendix

### Supporting Data

#### Error Rate Graph
[Link to Grafana dashboard snapshot]

#### Database Connection Graph
[Link to metrics]

### Related Incidents
- 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)

### References
- [Connection Pool Best Practices](internal-wiki/connection-pools)
- [Deployment Runbook](internal-wiki/deployment-runbook)
````

### Template 2: 5 Whys Analysis

```markdown
# 5 Whys Analysis: [Incident]

## Problem Statement
Payment service experienced a 47-minute outage due to database connection exhaustion.

## Analysis

### Why #1: Why did the service fail?
**Answer**: Database connections were exhausted, causing all new requests to fail.

**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.

---

### Why #2: Why were database connections exhausted?
**Answer**: Each incoming request opened a new database connection instead of using the connection pool.

**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.

---

### Why #3: Why did the code bypass the connection pool?
**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.

**Evidence**: PR #1234 shows the change, made while fixing a different bug.

---

### Why #4: Why wasn't this caught in code review?
**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.

**Evidence**: Review comments only discuss business logic.

---

### Why #5: Why isn't there a safety net for this type of change?
**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.

**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.

## Root Causes Identified

1. **Primary**: Missing automated tests for infrastructure behavior
2. **Secondary**: Insufficient documentation of architectural patterns
3. **Tertiary**: Code review checklist doesn't include infrastructure considerations

## Systemic Improvements

| Root Cause | Improvement | Type |
|------------|-------------|------|
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs | Document connection patterns | Prevention |
| Review gaps | Update review checklist | Detection |
| No canary | Implement canary deployments | Mitigation |
```

### Template 3: Quick Postmortem (Minor Incidents)

```markdown
# Quick Postmortem: [Brief Title]

**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3

## What Happened
API latency spiked to 5s due to a cache miss storm after a cache flush.

## Timeline
- 10:00 - Cache flush initiated for config update
- 10:02 - Latency alerts fire
- 10:05 - Identified as cache miss storm
- 10:08 - Enabled cache warming
- 10:12 - Latency normalized

## Root Cause
Full cache flush for a minor config update caused a thundering herd.

## Fix
- Immediate: Enabled cache warming
- Long-term: Implement partial cache invalidation (ENG-999)

## Lessons
Don't full-flush cache in production; use targeted invalidation.
```

## Facilitation Guide

### Running a Postmortem Meeting

```markdown
## Meeting Structure (60 minutes)

### 1. Opening (5 min)
- Remind everyone of blameless culture
- "We're here to learn, not to blame"
- Review meeting norms

### 2. Timeline Review (15 min)
- Walk through events chronologically
- Ask clarifying questions
- Identify gaps in timeline

### 3. Analysis Discussion (20 min)
- What failed?
- Why did it fail?
- What conditions allowed this?
- What would have prevented it?

### 4. Action Items (15 min)
- Brainstorm improvements
- Prioritize by impact and effort
- Assign owners and due dates

### 5. Closing (5 min)
- Summarize key learnings
- Confirm action item owners
- Schedule follow-up if needed

## Facilitation Tips
- Keep discussion on track
- Redirect blame to systems
- Encourage quiet participants
- Document dissenting views
- Time-box tangents
```

## Anti-Patterns to Avoid

| Anti-Pattern | Problem | Better Approach |
|--------------|---------|-----------------|
| **Blame game** | Shuts down learning | Focus on systems |
| **Shallow analysis** | Doesn't prevent recurrence | Ask "why" 5 times |
| **No action items** | Waste of time | Always have concrete next steps |
| **Unrealistic actions** | Never completed | Scope to achievable tasks |
| **No follow-up** | Actions forgotten | Track in ticketing system |

## Best Practices

### Do's
- **Start immediately** - Memory fades fast
- **Be specific** - Exact times, exact errors
- **Include graphs** - Visual evidence
- **Assign owners** - No orphan action items
- **Share widely** - Organizational learning

### Don'ts
- **Don't name and shame** - Ever
- **Don't skip small incidents** - They reveal patterns
- **Don't make it a blame doc** - That kills learning
- **Don't create busywork** - Actions should be meaningful
- **Don't skip follow-up** - Verify actions completed

## Resources

- [Google SRE - Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
- [Etsy's Blameless Postmortems](https://codeascraft.com/2012/05/22/blameless-postmortems/)
- [PagerDuty Postmortem Guide](https://postmortems.pagerduty.com/)
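
## Worked Example: Timeline Metrics

The durations quoted in the standard postmortem template (8-minute detection gap, 2-minute acknowledgment, 12-minute rollback, 47-minute incident) can be recomputed from the timeline table rather than typed by hand, which helps keep the summary and the timeline consistent. The following is a minimal sketch, not part of the skill itself: the `TIMELINE` keys, `minutes_between`, and `timeline_metrics` are hypothetical names chosen for illustration, and it assumes all events fall on the same UTC day and that incident duration is measured from the first alert to full recovery (measuring from the deployment instead would give 55 minutes).

```python
from datetime import datetime, timedelta

# Timestamps copied from the Template 1 timeline (all times UTC, same day).
# The key names are illustrative labels, not a standard schema.
TIMELINE = {
    "deployed": "14:23",
    "first_alert": "14:31",
    "acknowledged": "14:33",
    "rollback_started": "14:58",
    "rollback_complete": "15:10",
    "resolved": "15:18",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both given as HH:MM on one day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        raise ValueError(f"{end} is earlier than {start}")
    return int(delta.total_seconds() // 60)


def timeline_metrics(timeline: dict[str, str]) -> dict[str, int]:
    """Derive the durations quoted in the postmortem from its timeline."""
    return {
        # Detection gap: deployment finished -> first alert fired.
        "time_to_detect": minutes_between(
            timeline["deployed"], timeline["first_alert"]
        ),
        # Acknowledgment: first alert -> on-call engineer acknowledged.
        "time_to_acknowledge": minutes_between(
            timeline["first_alert"], timeline["acknowledged"]
        ),
        # Rollback: rollback initiated -> rollback complete.
        "rollback_duration": minutes_between(
            timeline["rollback_started"], timeline["rollback_complete"]
        ),
        # Assumption: the incident clock starts at the first alert.
        "incident_duration": minutes_between(
            timeline["first_alert"], timeline["resolved"]
        ),
    }


if __name__ == "__main__":
    for name, minutes in timeline_metrics(TIMELINE).items():
        print(f"{name}: {minutes} min")
```

Whichever start point a team chooses for the incident clock, stating it in the postmortem header avoids disagreements between the summary and the timeline.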