Implement distributed tracing with Jaeger and Tempo to track requests across microservices and identify performance bottlenecks. Use when debugging microservices, analyzing request flows, or implementing observability for distributed systems.
Add this skill
```bash
npx mdskills install sickn33/distributed-tracing
```

Comprehensive distributed tracing guide with excellent multi-language instrumentation examples and deployment patterns.
---
name: distributed-tracing
description: Implement distributed tracing with Jaeger and Tempo to track requests across microservices and identify performance bottlenecks. Use when debugging microservices, analyzing request flows, or implementing observability for distributed systems.
---

# Distributed Tracing

Implement distributed tracing with Jaeger and Tempo for request-flow visibility across microservices.

## Do not use this skill when

- The task is unrelated to distributed tracing
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Purpose

Track requests across distributed systems to understand latency, dependencies, and failure points.

## Use this skill when

- Debugging latency issues
- Understanding service dependencies
- Identifying bottlenecks
- Tracing error propagation
- Analyzing request paths

## Distributed Tracing Concepts

### Trace Structure

```
Trace (Request ID: abc123)
  ↓
Span (frontend) [100ms]
  ↓
Span (api-gateway) [80ms]
  ├→ Span (auth-service) [10ms]
  └→ Span (user-service) [60ms]
       └→ Span (database) [40ms]
```

### Key Components

- **Trace** - The end-to-end journey of a single request
- **Span** - A single operation within a trace
- **Context** - Metadata propagated between services
- **Tags** - Key-value pairs for filtering and search
- **Logs** - Timestamped events within a span

## Jaeger Setup

### Kubernetes Deployment

```bash
# Deploy the Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

# Deploy a Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
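# Added note (not in the original): the "production" strategy below deploys
# separate collector and query components and requires an external storage
# backend; the default "allInOne" strategy needs no storage but is not durable.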
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
EOF
```

### Docker Compose

```yaml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"  # UI
      - "14268:14268"  # Collector (HTTP)
      - "14250:14250"  # Collector (gRPC)
      - "9411:9411"    # Zipkin-compatible endpoint
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
```

**Reference:** See `references/jaeger-setup.md`

## Application Instrumentation

### OpenTelemetry (Recommended)

#### Python (Flask)

```python
from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize the tracer provider with a Jaeger exporter
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Instrument Flask (creates a server span per request automatically)
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/api/users')
def get_users():
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("get_users") as span:
        span.set_attribute("user.count", 100)
        # Business logic
        users = fetch_users_from_db()
        return {"users": users}

def fetch_users_from_db():
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM users")
        # Database query
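        # Added, hedged: span events attach timestamped logs to the current
        # span; add_event is part of the OpenTelemetry API.
        span.add_event("db.query.start")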
        return query_database()
```

#### Node.js (Express)

```javascript
const { trace } = require('@opentelemetry/api');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { Resource } = require('@opentelemetry/resources');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Initialize the tracer provider
const provider = new NodeTracerProvider({
  resource: new Resource({ 'service.name': 'my-service' })
});

const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces'
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Instrument HTTP and Express automatically
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

const express = require('express');
const app = express();

app.get('/api/users', async (req, res) => {
  const tracer = trace.getTracer('my-service');
  const span = tracer.startSpan('get_users');

  try {
    const users = await fetchUsers();
    span.setAttributes({ 'user.count': users.length });
    res.json({ users });
  } finally {
    span.end();
  }
});
```

#### Go

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
		jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
	))
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
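		// Added note: WithBatcher wraps the exporter in a batching span
		// processor, so spans are queued and exported asynchronously.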
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("my-service"),
		)),
	)

	otel.SetTracerProvider(tp)
	return tp, nil
}

func getUsers(ctx context.Context) ([]User, error) {
	tracer := otel.Tracer("my-service")
	ctx, span := tracer.Start(ctx, "get_users")
	defer span.End()

	span.SetAttributes(attribute.String("user.filter", "active"))

	users, err := fetchUsersFromDB(ctx)
	if err != nil {
		span.RecordError(err)
		return nil, err
	}

	span.SetAttributes(attribute.Int("user.count", len(users)))
	return users, nil
}
```

**Reference:** See `references/instrumentation.md`

## Context Propagation

### HTTP Headers

```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
```

### Propagation in HTTP Requests

#### Python

```python
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Injects the current trace context into the headers

response = requests.get('http://downstream-service/api', headers=headers)
```

#### Node.js

```javascript
const axios = require('axios');
const { context, propagation } = require('@opentelemetry/api');

const headers = {};
propagation.inject(context.active(), headers);

axios.get('http://downstream-service/api', { headers });
```

## Tempo Setup (Grafana)

### Kubernetes Deployment

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
        otlp:
          protocols:
            http:
            grpc:

    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com

    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
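    # Added note: with the S3 backend configured above, the Tempo pod also
    # needs storage credentials (for example via environment variables or
    # IRSA on EKS); omitted here for brevity.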
    spec:
      containers:
        - name: tempo
          image: grafana/tempo:latest
          args:
            - -config.file=/etc/tempo/tempo.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/tempo
      volumes:
        - name: config
          configMap:
            name: tempo-config
```

**Reference:** See `assets/jaeger-config.yaml.template`

## Sampling Strategies

### Probabilistic Sampling

```yaml
# Sample 1% of traces
sampler:
  type: probabilistic
  param: 0.01
```

### Rate-Limiting Sampling

```yaml
# Sample at most 100 traces per second
sampler:
  type: ratelimiting
  param: 100
```

### Parent-Based Ratio Sampling (OpenTelemetry)

```python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample based on the trace ID (deterministic: every service makes the
# same decision for a given trace)
sampler = ParentBased(root=TraceIdRatioBased(0.01))
```

## Trace Analysis

### Finding Slow Requests

**Jaeger query:**

```
service=my-service
duration > 1s
```

### Finding Errors

**Jaeger query:**

```
service=my-service
error=true
tags.http.status_code >= 500
```

### Service Dependency Graph

Jaeger automatically generates service dependency graphs showing:

- Service relationships
- Request rates
- Error rates
- Average latencies

## Best Practices

1. **Sample appropriately** (1-10% in production)
2. **Add meaningful tags** (user_id, request_id)
3. **Propagate context** across all service boundaries
4. **Log exceptions** in spans
5. **Use consistent naming** for operations
6. **Monitor tracing overhead** (<1% CPU impact)
7. **Set up alerts** for trace errors
8. **Implement distributed context** (baggage)
9. **Use span events** for important milestones
10. **Document instrumentation** standards
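Practice 1 above works best when the keep/drop decision is a deterministic function of the trace ID, so every service samples the same traces without coordination. Here is a minimal sketch of that idea in plain Python; the names and bit layout are illustrative assumptions that loosely mirror the spirit of OpenTelemetry's `TraceIdRatioBased` sampler rather than reproducing it exactly:

```python
# Hedged sketch: deterministic head sampling keyed on the trace ID.
# Compare the low 64 bits of the 128-bit trace ID against a bound
# derived from the sampling ratio.

TRACE_ID_LIMIT = 1 << 64

def should_sample(trace_id: int, ratio: float) -> bool:
    """Return True if this trace falls inside the sampled fraction."""
    bound = round(ratio * TRACE_ID_LIMIT)
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound

# The same trace ID always yields the same decision on every service:
tid = 0x0af7651916cd43dd8448eb211c80319c  # trace ID from the traceparent example
assert should_sample(tid, 1.0) is True
assert should_sample(tid, 0.0) is False
assert should_sample(tid, 0.01) is False  # this trace falls outside a 1% sample
```

Because the decision is a pure function of the trace ID, a trace kept at the edge is kept by every downstream service, avoiding partial traces.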
## Integration with Logging

### Correlated Logs

```python
import logging

from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id

    # Format the trace ID as 32 hex characters so logs can be joined to traces
    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, '032x')}
    )
```

## Troubleshooting

**No traces appearing:**

- Check the collector endpoint
- Verify network connectivity
- Check the sampling configuration
- Review application logs

**High latency overhead:**

- Reduce the sampling rate
- Use a batch span processor
- Check the exporter configuration

## Reference Files

- `references/jaeger-setup.md` - Jaeger installation
- `references/instrumentation.md` - Instrumentation patterns
- `assets/jaeger-config.yaml.template` - Jaeger configuration template

## Related Skills

- `prometheus-configuration` - For metrics
- `grafana-dashboards` - For visualization
- `slo-implementation` - For latency SLOs
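As a quick end-to-end check for the troubleshooting list above, you can query the Jaeger API directly; the service and namespace names here assume the operator-managed instance named `jaeger` from the Kubernetes setup earlier:

```shell
# Forward the Jaeger query UI/API locally
kubectl port-forward svc/jaeger-query 16686:16686 -n observability &

# List the services Jaeger has received spans for; an empty list points at
# the exporter endpoint or sampling configuration rather than the UI
curl -s "http://localhost:16686/api/services"
```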