Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
Add this skill
npx mdskills install sickn33/service-mesh-observability
Comprehensive observability guide with production-ready configs for Istio, Linkerd, and monitoring stacks
---
name: service-mesh-observability
description: Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
---

# Service Mesh Observability

Complete guide to observability patterns for Istio, Linkerd, and other service mesh deployments.

## Do not use this skill when

- The task is unrelated to service mesh observability
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Use this skill when

- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity

## Core Concepts

### 1. Three Pillars of Observability

```
┌─────────────────────────────────────────────────────┐
│                    Observability                    │
├─────────────────┬─────────────────┬─────────────────┤
│    Metrics      │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘
```

### 2. Golden Signals for Mesh

| Signal | Description | Alert Threshold |
|--------|-------------|-----------------|
| **Latency** | Request duration P50, P99 | P99 > 500ms |
| **Traffic** | Requests per second | Anomaly detection |
| **Errors** | 5xx error rate | > 1% |
| **Saturation** | Resource utilization | > 80% |

## Templates

### Template 1: Istio with Prometheus & Grafana

```yaml
# Prometheus scrape configuration (mounted into the Prometheus deployment)
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      # Control-plane metrics from istiod
      - job_name: 'istiod'
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - istio-system
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: istiod;http-monitoring
      # Data-plane metrics (istio_requests_total etc.) are exposed by the Envoy sidecars
      - job_name: 'envoy-stats'
        metrics_path: /stats/prometheus
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_container_port_name]
            action: keep
            regex: '.*-envoy-prom'
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 15s
```

### Template 2: Key Istio Metrics Queries

```promql
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

# Error rate (5xx), as a percentage
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))

# TCP connections opened per second (the metric is a counter, so take its rate)
sum(rate(istio_tcp_connections_opened_total{reporter="destination"}[5m])) by (destination_service_name)

# P99 request size
histogram_quantile(0.99,
  sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))
```

### Template 3: Jaeger Distributed Tracing

```yaml
# Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0  # 100% in dev, lower in prod
        zipkin:
          # Assumes a Service named jaeger-collector exposes the Zipkin port
          address: jaeger-collector.istio-system:9411
---
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.50
          ports:
            - containerPort: 5775   # UDP
              protocol: UDP
            - containerPort: 6831   # Thrift (UDP)
              protocol: UDP
            - containerPort: 6832   # Thrift (UDP)
              protocol: UDP
            - containerPort: 5778   # Config
            - containerPort: 16686  # UI
            - containerPort: 14268  # HTTP
            - containerPort: 14250  # gRPC
            - containerPort: 9411   # Zipkin
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
```

### Template 4: Linkerd Viz Dashboard

```bash
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -

# Access dashboard
linkerd viz dashboard

# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app

# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend

# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend

# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
```

### Template 5: Grafana Dashboard JSON

```json
{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\", response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 1, "color": "yellow"},
                {"value": 5, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}
```

### Template 6: Kiali Service Mesh Visualization

```yaml
# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous  # or openid, token
  deployment:
    accessible_namespaces:
      - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000
```

### Template 7: OpenTelemetry Integration

```yaml
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411

    processors:
      batch:
        timeout: 10s

    exporters:
      # Recent Collector releases no longer ship a dedicated jaeger exporter;
      # Jaeger ingests OTLP natively (assumes its OTLP gRPC port 4317 is exposed)
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889

    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
---
# Istio Telemetry v2 with OTel
# Assumes a tracing provider named "otel" is defined under meshConfig.extensionProviders
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel
      randomSamplingPercentage: 10
```

## Alerting Rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
    - name: mesh.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m])) by (destination_service_name)
            / sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate for {{ $labels.destination_service_name }}"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
            by (le, destination_service_name)) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency for {{ $labels.destination_service_name }}"

        # Applies when mesh certificates are issued by cert-manager
        - alert: MeshCertExpiring
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          labels:
            severity: warning
          annotations:
            summary: "Mesh certificate expiring in less than 7 days"
```

## Best Practices

### Do's
- **Sample appropriately** - 100% in dev, 1-10% in prod
- **Use trace context** - Propagate headers consistently
- **Set up alerts** - For golden signals
- **Correlate metrics/traces** - Use exemplars
- **Retain strategically** - Hot/cold storage tiers

### Don'ts
- **Don't over-sample** - Storage costs add up
- **Don't ignore cardinality** - Limit label values
- **Don't skip dashboards** - Visualize dependencies
- **Don't forget costs** - Monitor observability costs

## Resources

- [Istio Observability](https://istio.io/latest/docs/tasks/observability/)
- [Linkerd Observability](https://linkerd.io/2.14/features/dashboard/)
- [OpenTelemetry](https://opentelemetry.io/)
- [Kiali](https://kiali.io/)
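
## Worked Example: Estimating P99 from Histogram Buckets

The P99 queries in Template 2 and the HighLatency alert rely on `histogram_quantile`, which estimates a percentile from cumulative `le` buckets rather than from raw request durations. The Python sketch below approximates that estimate using linear interpolation inside the matching bucket, as the Prometheus documentation describes it. It is an illustration under stated assumptions, not the Prometheus implementation: the function name, the bucket boundaries, and the counts are all hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket, mirroring Prometheus `le`
    buckets. The lower bound of the first bucket is assumed to be 0.
    """
    if not buckets or not 0 <= q <= 1:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total  # position of the requested observation
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper) or count == prev_count:
                # Nothing to interpolate against: fall back to the
                # highest finite bound seen so far.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return math.nan


# Hypothetical latency buckets in milliseconds: (le, cumulative count)
buckets = [
    (100, 900),
    (250, 970),
    (500, 985),
    (1000, 995),
    (math.inf, 1000),
]

p99 = histogram_quantile(0.99, buckets)
print(f"Estimated P99: {p99:.1f} ms")
```

With these made-up buckets the estimate lands at about 750 ms, halfway through the 500 to 1000 ms bucket. The true P99 could be anywhere in that bucket, so bucket boundaries should sit close to any latency threshold you plan to alert on.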