Master on-call shift handoffs with context transfer, escalation procedures, and documentation. Use when transitioning on-call responsibilities, documenting shift summaries, or improving on-call processes.
---
name: on-call-handoff-patterns
description: Master on-call shift handoffs with context transfer, escalation procedures, and documentation. Use when transitioning on-call responsibilities, documenting shift summaries, or improving on-call processes.
---

# On-Call Handoff Patterns

Effective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.

## Do not use this skill when

- The task is unrelated to on-call handoff patterns
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Use this skill when

- Transitioning on-call responsibilities
- Writing shift handoff summaries
- Documenting ongoing investigations
- Establishing on-call rotation procedures
- Improving handoff quality
- Onboarding new on-call engineers

## Core Concepts

### 1. Handoff Components

| Component | Purpose |
|-----------|---------|
| **Active Incidents** | What's currently broken |
| **Ongoing Investigations** | Issues being debugged |
| **Recent Changes** | Deployments, configs |
| **Known Issues** | Workarounds in place |
| **Upcoming Events** | Maintenance, releases |

### 2. Handoff Timing

```
Recommended: 30 min overlap between shifts

Outgoing:
├── 15 min: Write handoff document
└── 15 min: Sync call with incoming

Incoming:
├── 15 min: Review handoff document
├── 15 min: Sync call with outgoing
└── 5 min: Verify alerting setup
```

## Templates

### Template 1: Shift Handoff Document

````markdown
# On-Call Handoff: Platform Team

**Outgoing**: @alice (2024-01-15 to 2024-01-22)
**Incoming**: @bob (2024-01-22 to 2024-01-29)
**Handoff Time**: 2024-01-22 09:00 UTC

---

## 🔴 Active Incidents

### None currently active
No active incidents at handoff time.

---

## 🟡 Ongoing Investigations

### 1. Intermittent API Timeouts (ENG-1234)
**Status**: Investigating
**Started**: 2024-01-20
**Impact**: ~0.1% of requests timing out

**Context**:
- Timeouts correlate with database backup window (02:00-03:00 UTC)
- Suspect backup process causing lock contention
- Added extra logging in PR #567 (deployed 01/21)

**Next Steps**:
- [ ] Review new logs after tonight's backup
- [ ] Consider moving backup window if confirmed

**Resources**:
- Dashboard: [API Latency](https://grafana/d/api-latency)
- Thread: #platform-eng (01/20, 14:32)

---

### 2. Memory Growth in Auth Service (ENG-1235)
**Status**: Monitoring
**Started**: 2024-01-18
**Impact**: None yet (proactive)

**Context**:
- Memory usage growing ~5% per day
- No memory leak found in profiling
- Suspect connection pool not releasing properly

**Next Steps**:
- [ ] Review heap dump from 01/21
- [ ] Consider restart if usage > 80%

**Resources**:
- Dashboard: [Auth Service Memory](https://grafana/d/auth-memory)
- Analysis doc: [Memory Investigation](https://docs/eng-1235)

---

## 🟢 Resolved This Shift

### Payment Service Outage (2024-01-19)
- **Duration**: 23 minutes
- **Root Cause**: Database connection exhaustion
- **Resolution**: Rolled back v2.3.4, increased pool size
- **Postmortem**: [POSTMORTEM-89](https://docs/postmortem-89)
- **Follow-up tickets**: ENG-1230, ENG-1231

---

## 📋 Recent Changes

### Deployments
| Service | Version | Time | Notes |
|---------|---------|------|-------|
| api-gateway | v3.2.1 | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0 | 01/20 10:00 | New profile features |
| auth-service | v4.1.2 | 01/19 16:00 | Security patch |

### Configuration Changes
- 01/21: Increased API rate limit from 1000 to 1500 RPS
- 01/20: Updated database connection pool max from 50 to 75

### Infrastructure
- 01/20: Added 2 nodes to Kubernetes cluster
- 01/19: Upgraded Redis from 6.2 to 7.0

---

## ⚠️ Known Issues & Workarounds

### 1. Slow Dashboard Loading
**Issue**: Grafana dashboards slow on Monday mornings
**Workaround**: Wait 5 min after 08:00 UTC for cache warm-up
**Ticket**: OPS-456 (P3)

### 2. Flaky Integration Test
**Issue**: `test_payment_flow` fails intermittently in CI
**Workaround**: Re-run failed job (usually passes on retry)
**Ticket**: ENG-1200 (P2)

---

## 📅 Upcoming Events

| Date | Event | Impact | Contact |
|------|-------|--------|---------|
| 01/23 02:00 | Database maintenance | 5 min read-only | @dba-team |
| 01/24 14:00 | Major release v5.0 | Monitor closely | @release-team |
| 01/25 | Marketing campaign | 2x traffic expected | @platform |

---

## 📞 Escalation Reminders

| Issue Type | First Escalation | Second Escalation |
|------------|------------------|-------------------|
| Payment issues | @payments-oncall | @payments-manager |
| Auth issues | @auth-oncall | @security-team |
| Database issues | @dba-team | @infra-manager |
| Unknown/severe | @engineering-manager | @vp-engineering |

---

## 🔧 Quick Reference

### Common Commands
```bash
# Check service health
kubectl get pods -A | grep -v Running

# Recent deployments
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Clear cache (emergency only)
redis-cli FLUSHDB
```

### Important Links
- [Runbooks](https://wiki/runbooks)
- [Service Catalog](https://wiki/services)
- [Incident Slack](https://slack.com/incidents)
- [PagerDuty](https://pagerduty.com/schedules)

---

## Handoff Checklist

### Outgoing Engineer
- [x] Document active incidents
- [x] Document ongoing investigations
- [x] List recent changes
- [x] Note known issues
- [x] Add upcoming events
- [x] Sync with incoming engineer

### Incoming Engineer
- [ ] Read this document
- [ ] Join sync call
- [ ] Verify PagerDuty is routing to you
- [ ] Verify Slack notifications working
- [ ] Check VPN/access working
- [ ] Review critical dashboards
````

### Template 2: Quick Handoff (Async)

```markdown
# Quick Handoff: @alice → @bob

## TL;DR
- No active incidents
- 1 investigation ongoing (API timeouts, see ENG-1234)
- Major release tomorrow (01/24) - be ready for issues

## Watch List
1. API latency around 02:00-03:00 UTC (backup window)
2. Auth service memory (restart if > 80%)

## Recent
- Deployed api-gateway v3.2.1 yesterday (stable)
- Increased rate limits to 1500 RPS

## Coming Up
- 01/23 02:00 - DB maintenance (5 min read-only)
- 01/24 14:00 - v5.0 release

## Questions?
I'll be available on Slack until 17:00 today.
```

### Template 3: Incident Handoff (Mid-Incident)

```markdown
# INCIDENT HANDOFF: Payment Service Degradation

**Incident Start**: 2024-01-22 08:15 UTC
**Current Status**: Mitigating
**Severity**: SEV2

---

## Current State
- Error rate: 15% (down from 40%)
- Mitigation in progress: scaling up pods
- ETA to resolution: ~30 min

## What We Know
1. Root cause: Memory pressure on payment-service pods
2. Triggered by: Unusual traffic spike (3x normal)
3. Contributing: Inefficient query in checkout flow

## What We've Done
- Scaled payment-service from 5 → 15 pods
- Enabled rate limiting on checkout endpoint
- Disabled non-critical features

## What Needs to Happen
1. Monitor error rate - should reach <1% in ~15 min
2. If not improving, escalate to @payments-manager
3. Once stable, begin root cause investigation

## Key People
- Incident Commander: @alice (handing off)
- Comms Lead: @charlie
- Technical Lead: @bob (incoming)

## Communication
- Status page: Updated at 08:45
- Customer support: Notified
- Exec team: Aware

## Resources
- Incident channel: #inc-20240122-payment
- Dashboard: [Payment Service](https://grafana/d/payments)
- Runbook: [Payment Degradation](https://wiki/runbooks/payments)

---

**Incoming on-call (@bob) - Please confirm you have:**
- [ ] Joined #inc-20240122-payment
- [ ] Access to dashboards
- [ ] Understand current state
- [ ] Know escalation path
```

## Handoff Sync Meeting

### Agenda (15 minutes)

```markdown
## Handoff Sync: @alice → @bob

1. **Active Issues** (5 min)
   - Walk through any ongoing incidents
   - Discuss investigation status
   - Transfer context and theories

2. **Recent Changes** (3 min)
   - Deployments to watch
   - Config changes
   - Known regressions

3. **Upcoming Events** (3 min)
   - Maintenance windows
   - Expected traffic changes
   - Releases planned

4. **Questions** (4 min)
   - Clarify anything unclear
   - Confirm access and alerting
   - Exchange contact info
```

## On-Call Best Practices

### Before Your Shift

```markdown
## Pre-Shift Checklist

### Access Verification
- [ ] VPN working
- [ ] kubectl access to all clusters
- [ ] Database read access
- [ ] Log aggregator access (Splunk/Datadog)
- [ ] PagerDuty app installed and logged in

### Alerting Setup
- [ ] PagerDuty schedule shows you as primary
- [ ] Phone notifications enabled
- [ ] Slack notifications for incident channels
- [ ] Test alert received and acknowledged

### Knowledge Refresh
- [ ] Review recent incidents (past 2 weeks)
- [ ] Check service changelog
- [ ] Skim critical runbooks
- [ ] Know escalation contacts

### Environment Ready
- [ ] Laptop charged and accessible
- [ ] Phone charged
- [ ] Quiet space available for calls
- [ ] Secondary contact identified (if traveling)
```

### During Your Shift

```markdown
## Daily On-Call Routine

### Morning (start of day)
- [ ] Check overnight alerts
- [ ] Review dashboards for anomalies
- [ ] Check for any P0/P1 tickets created
- [ ] Skim incident channels for context

### Throughout Day
- [ ] Respond to alerts within SLA
- [ ] Document investigation progress
- [ ] Update team on significant issues
- [ ] Triage incoming pages

### End of Day
- [ ] Hand off any active issues
- [ ] Update investigation docs
- [ ] Note anything for next shift
```

### After Your Shift

```markdown
## Post-Shift Checklist

- [ ] Complete handoff document
- [ ] Sync with incoming on-call
- [ ] Verify PagerDuty routing changed
- [ ] Close/update investigation tickets
- [ ] File postmortems for any incidents
- [ ] Take time off if shift was stressful
```

## Escalation Guidelines

### When to Escalate

```markdown
## Escalation Triggers

### Immediate Escalation
- SEV1 incident declared
- Data breach suspected
- Unable to diagnose within 30 min
- Customer or legal escalation received

### Consider Escalation
- Issue spans multiple teams
- Requires expertise you don't have
- Business impact exceeds threshold
- You're uncertain about next steps

### How to Escalate
1. Page the appropriate escalation path
2. Provide brief context in Slack
3. Stay engaged until escalation acknowledges
4. Hand off cleanly, don't just disappear
```

## Best Practices

### Do's
- **Document everything** - Future you will thank you
- **Escalate early** - Better safe than sorry
- **Take breaks** - Alert fatigue is real
- **Keep handoffs synchronous** - Async loses context
- **Test your setup** - Before incidents, not during

### Don'ts
- **Don't skip handoffs** - Context loss causes incidents
- **Don't hero** - Escalate when needed
- **Don't ignore alerts** - Even if they seem minor
- **Don't work sick** - Swap shifts instead
- **Don't disappear** - Stay reachable during shift

## Resources

- [Google SRE - Being On-Call](https://sre.google/sre-book/being-on-call/)
- [PagerDuty On-Call Guide](https://www.pagerduty.com/resources/learn/on-call-management/)
- [Increment On-Call Issue](https://increment.com/on-call/)
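
The escalation triggers listed under "When to Escalate" can also be written down as a small decision helper, for example to back a chat reminder or a runbook script. The following is a minimal Python sketch and not part of the skill itself: `IncidentState`, its field names, `DIAGNOSIS_LIMIT_MIN`, and `escalation_level` are all hypothetical, and treating exactly 30 undiagnosed minutes as a trigger is one possible reading of "Unable to diagnose within 30 min".

```python
from dataclasses import dataclass
from enum import Enum


class Escalation(Enum):
    NONE = "none"
    CONSIDER = "consider"
    IMMEDIATE = "immediate"


# Assumption: the 30-minute limit from the "Immediate Escalation" list.
DIAGNOSIS_LIMIT_MIN = 30


@dataclass
class IncidentState:
    # Hypothetical fields; map them to whatever your incident tooling records.
    severity: str = "SEV3"
    data_breach_suspected: bool = False
    minutes_without_diagnosis: int = 0
    customer_or_legal_escalation: bool = False
    spans_multiple_teams: bool = False
    needs_outside_expertise: bool = False
    business_impact_over_threshold: bool = False
    unsure_of_next_steps: bool = False


def escalation_level(state: IncidentState) -> Escalation:
    """Map the documented triggers to a suggested escalation level."""
    # Any single "Immediate Escalation" trigger is enough.
    if (
        state.severity.upper() == "SEV1"
        or state.data_breach_suspected
        or state.minutes_without_diagnosis >= DIAGNOSIS_LIMIT_MIN
        or state.customer_or_legal_escalation
    ):
        return Escalation.IMMEDIATE

    # Any single "Consider Escalation" trigger suggests talking to someone.
    if (
        state.spans_multiple_teams
        or state.needs_outside_expertise
        or state.business_impact_over_threshold
        or state.unsure_of_next_steps
    ):
        return Escalation.CONSIDER

    return Escalation.NONE
```

For instance, `escalation_level(IncidentState(severity="SEV2", minutes_without_diagnosis=45))` returns `Escalation.IMMEDIATE`, because the diagnosis limit has passed even though the incident is not a SEV1. The helper only suggests a level; the "How to Escalate" steps still apply once it fires.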