Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.
Add this skill: `npx mdskills install sickn33/slo-implementation`

Comprehensive SRE framework with concrete Prometheus queries, error budget policies, and multi-window alerting.
---
name: slo-implementation
description: Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.
---

# SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

## Do not use this skill when

- The task is unrelated to SLO implementation
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

## Use this skill when

- Defining service reliability targets
- Measuring user-perceived reliability
- Implementing error budgets
- Creating SLO-based alerts
- Tracking reliability goals

## SLI/SLO/SLA Hierarchy

```
SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement
```

## Defining SLIs

### Common SLI Types

#### 1. Availability SLI
```promql
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
```

#### 2. Latency SLI
```promql
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```

#### 3. Durability SLI
```promql
# Successful writes / Total writes over the SLO window.
# increase() is used so that counter resets do not distort the ratio.
sum(increase(storage_writes_successful_total[28d]))
/
sum(increase(storage_writes_total[28d]))
```

**Reference:** See `references/slo-definitions.md`

## Setting SLO Targets

### Availability SLO Examples

Monthly figures assume a 30-day month; yearly figures assume a 365-day year.

| SLO %  | Downtime/Month | Downtime/Year |
|--------|----------------|---------------|
| 99%    | 7.2 hours      | 3.65 days     |
| 99.9%  | 43.2 minutes   | 8.76 hours    |
| 99.95% | 21.6 minutes   | 4.38 hours    |
| 99.99% | 4.32 minutes   | 52.56 minutes |

### Choose Appropriate SLOs

**Consider:**
- User expectations
- Business requirements
- Current performance
- Cost of reliability
- Competitor benchmarks

**Example SLOs:**
```yaml
slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  # 99% of requests must complete within 500 ms
  - name: api_latency_500ms
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))
```

## Error Budget Calculation

### Error Budget Formula

```
Error Budget = 1 - SLO Target
```

**Example:**
- SLO: 99.9% availability
- Error Budget: 0.1% = 43.2 minutes/month
- Current Error: 0.05% = 21.6 minutes/month
- Remaining Budget: 50%

### Error Budget Policy

```yaml
error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability
```

**Reference:** See `references/error-budget.md`

## SLO Implementation

### Prometheus Recording Rules

```yaml
# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

      # Error budget burn rate: observed error ratio / allowed error ratio
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

      # The alerting rules also reference 30m, 1h, and 6h burn rates,
      # so the same expression is recorded for each of those windows.
      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )) / (1 - 0.999)
```

### SLO Alerting Rules

```yaml
groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% of a 30-day error budget in 1 hour
      # (about 2.1% of the 28-day budget used here)
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Slow burn: 6x rate, 6 hour window
      # Consumes 5% of a 30-day error budget in 6 hours
      # (about 5.4% of the 28-day budget used here)
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"
```

## SLO Dashboard

**Grafana Dashboard Structure:**

```
┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)           │
├────────────────────────────────────┤
│ Error Budget Remaining: 50%        │
│ █████░░░░░ 50%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘
```

**Example Queries:**

```promql
# Current SLO compliance
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until error budget exhausted:
# remaining fraction * window length / burn rate,
# where the burn rate is the average over the 28-day window
(slo:http_availability:error_budget_remaining / 100) * 28
/
((1 - sli:http_availability:ratio) / (1 - 0.999))
```

## Multi-Window Burn Rate Alerts

```yaml
# Combining a short and a long window reduces false positives
groups:
  - name: slo_multiwindow_alerts  # illustrative group name
    rules:
      - alert: SLOBurnRateHigh
        expr: |
          (
            slo:http_availability:burn_rate_1h > 14.4
            and
            slo:http_availability:burn_rate_5m > 14.4
          )
          or
          (
            slo:http_availability:burn_rate_6h > 6
            and
            slo:http_availability:burn_rate_30m > 6
          )
        labels:
          severity: critical
```

## SLO Review Process

### Weekly Review
- Current SLO compliance
- Error budget status
- Trend analysis
- Incident impact

### Monthly Review
- SLO achievement
- Error budget usage
- Incident postmortems
- SLO adjustments

### Quarterly Review
- SLO relevance
- Target adjustments
- Process improvements
- Tooling enhancements

## Best Practices

1. **Start with user-facing services**
2. **Use multiple SLIs** (availability, latency, etc.)
3. **Set achievable SLOs** (don't aim for 100%)
4. **Implement multi-window alerts** to reduce noise
5. **Track error budget** consistently
6. **Review SLOs regularly**
7. **Document SLO decisions**
8. **Align with business goals**
9. **Automate SLO reporting**
10. **Use SLOs for prioritization**

## Reference Files

- `assets/slo-template.md` - SLO definition template
- `references/slo-definitions.md` - SLO definition patterns
- `references/error-budget.md` - Error budget calculations

## Related Skills

- `prometheus-configuration` - For metric collection
- `grafana-dashboards` - For SLO visualization
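
## Worked Example: Error Budget Arithmetic

The error budget formula, the downtime table, and the 14.4x and 6x burn-rate thresholds used in the alerting rules all follow from the same arithmetic. The Python sketch below reproduces those numbers under a few assumptions: the `Slo` class and its method names are hypothetical and not part of this skill, and a 30-day window is assumed to match the monthly downtime figures (the Prometheus rules above use 28 days, which shifts the results slightly).

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class Slo:
    """Hypothetical helper for SLO arithmetic (illustrative only)."""

    target: float      # SLO target as a fraction, e.g. 0.999 for 99.9%
    window_days: int   # length of the SLO window in days

    @property
    def error_budget(self) -> float:
        """Allowed failure fraction: Error Budget = 1 - SLO Target."""
        return 1.0 - self.target

    def allowed_downtime_minutes(self) -> float:
        """Total downtime the budget permits over the whole window."""
        return self.error_budget * self.window_days * MINUTES_PER_DAY

    def budget_remaining(self, observed_error_rate: float) -> float:
        """Fraction of the budget left, given the error rate over the window."""
        return 1.0 - observed_error_rate / self.error_budget

    def burn_rate(self, observed_error_rate: float) -> float:
        """How many times faster than 'exactly on budget' errors accrue."""
        return observed_error_rate / self.error_budget

    def burn_rate_threshold(self, budget_fraction: float,
                            alert_window_hours: float) -> float:
        """Burn rate that spends budget_fraction within alert_window_hours."""
        return budget_fraction * self.window_days * 24 / alert_window_hours


if __name__ == "__main__":
    slo = Slo(target=0.999, window_days=30)
    print(f"{slo.allowed_downtime_minutes():.1f} minutes")    # 43.2 minutes
    print(f"{slo.budget_remaining(0.0005):.0%} remaining")    # 50% remaining
    print(f"fast burn: {slo.burn_rate_threshold(0.02, 1):.1f}x")  # 14.4x
    print(f"slow burn: {slo.burn_rate_threshold(0.05, 6):.1f}x")  # 6.0x
```

A burn rate of 1 means the budget would be spent exactly at the end of the window, so the thresholds are simply "fraction of budget you are willing to lose" scaled by how short the alert window is relative to the SLO window.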