Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.
Add this skill
npx mdskills install sickn33/grafana-dashboards
Comprehensive guide with production-ready JSON examples, design principles, and provisioning patterns
---
name: grafana-dashboards
description: Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.
---

# Grafana Dashboards

Create and manage production-ready Grafana dashboards for comprehensive system observability.

## Do not use this skill when

- The task is unrelated to Grafana dashboards
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Purpose

Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.

## Use this skill when

- Visualize Prometheus metrics
- Create custom dashboards
- Implement SLO dashboards
- Monitor infrastructure
- Track business KPIs

## Dashboard Design Principles

### 1. Hierarchy of Information
```
┌─────────────────────────────────────┐
│ Critical Metrics (Big Numbers)      │
├─────────────────────────────────────┤
│ Key Trends (Time Series)            │
├─────────────────────────────────────┤
│ Detailed Metrics (Tables/Heatmaps)  │
└─────────────────────────────────────┘
```

### 2. RED Method (Services)
- **Rate** - Requests per second
- **Errors** - Error rate
- **Duration** - Latency/response time

### 3. USE Method (Resources)
- **Utilization** - % time resource is busy
- **Saturation** - Queue length/wait time
- **Errors** - Error count

## Dashboard Structure

### API Monitoring Dashboard

```json
{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Error Rate %",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error Rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {"params": [5], "type": "gt"},
              "operator": {"type": "and"},
              "query": {"params": ["A", "5m", "now"]},
              "type": "query"
            }
          ]
        },
        "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": {"x": 0, "y": 8, "w": 24, "h": 8}
      }
    ]
  }
}
```

**Reference:** See `assets/api-dashboard.json`

## Panel Types

### 1. Stat Panel (Single Value)
```json
{
  "type": "stat",
  "title": "Total Requests",
  "targets": [{
    "expr": "sum(http_requests_total)"
  }],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 80, "color": "yellow"},
          {"value": 90, "color": "red"}
        ]
      }
    }
  }
}
```

### 2. Time Series Graph
```json
{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [{
    "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
  }],
  "yaxes": [
    {"format": "percent", "max": 100, "min": 0},
    {"format": "short"}
  ]
}
```

### 3. Table Panel
```json
{
  "type": "table",
  "title": "Service Status",
  "targets": [{
    "expr": "up",
    "format": "table",
    "instant": true
  }],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": {"Time": true},
        "indexByName": {},
        "renameByName": {
          "instance": "Instance",
          "job": "Service",
          "Value": "Status"
        }
      }
    }
  ]
}
```

### 4. Heatmap
```json
{
  "type": "heatmap",
  "title": "Latency Heatmap",
  "targets": [{
    "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
    "format": "heatmap"
  }],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}
```

## Variables

### Query Variables
```json
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 1,
        "multi": true
      }
    ]
  }
}
```

### Use Variables in Queries
```
sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))
```

## Alerts in Dashboards

```json
{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "operator": {"type": "and"},
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": {"type": "avg"},
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate is above 5%",
    "noDataState": "no_data",
    "notifications": [
      {"uid": "slack-channel"}
    ]
  }
}
```

## Dashboard Provisioning

**dashboards.yml:**
```yaml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'General'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards
```

## Common Dashboard Patterns

### Infrastructure Dashboard

**Key Panels:**
- CPU utilization per node
- Memory usage per node
- Disk I/O
- Network traffic
- Pod count by namespace
- Node status

**Reference:** See `assets/infrastructure-dashboard.json`

### Database Dashboard

**Key Panels:**
- Queries per second
- Connection pool usage
- Query latency (P50, P95, P99)
- Active connections
- Database size
- Replication lag
- Slow queries

**Reference:** See `assets/database-dashboard.json`

### Application Dashboard

**Key Panels:**
- Request rate
- Error rate
- Response time (percentiles)
- Active users/sessions
- Cache hit rate
- Queue length

## Best Practices

1. **Start with templates** (Grafana community dashboards)
2. **Use consistent naming** for panels and variables
3. **Group related metrics** in rows
4. **Set appropriate time ranges** (default: Last 6 hours)
5. **Use variables** for flexibility
6. **Add panel descriptions** for context
7. **Configure units** correctly
8. **Set meaningful thresholds** for colors
9. **Use consistent colors** across dashboards
10. **Test with different time ranges**

## Dashboard as Code

### Terraform Provisioning

```hcl
resource "grafana_dashboard" "api_monitoring" {
  config_json = file("${path.module}/dashboards/api-monitoring.json")
  folder      = grafana_folder.monitoring.id
}

resource "grafana_folder" "monitoring" {
  title = "Production Monitoring"
}
```

### Ansible Provisioning

```yaml
- name: Deploy Grafana dashboards
  copy:
    src: "{{ item }}"
    dest: /etc/grafana/dashboards/
  with_fileglob:
    - "dashboards/*.json"
  notify: restart grafana
```

## Reference Files

- `assets/api-dashboard.json` - API monitoring dashboard
- `assets/infrastructure-dashboard.json` - Infrastructure dashboard
- `assets/database-dashboard.json` - Database monitoring dashboard
- `references/dashboard-design.md` - Dashboard design guide

## Related Skills

- `prometheus-configuration` - For metric collection
- `slo-implementation` - For SLO dashboards
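
## Example: Checking Panel Layout Before Provisioning

The `gridPos` values in the API Monitoring Dashboard place panels on Grafana's 24-column grid, and a hand-edited dashboard file can easily end up with panels that overlap or run past the right edge. The following is a minimal sketch, not part of this skill's assets, of a pre-provisioning check for that. The names `check_panels`, `_overlaps`, and `API_LAYOUT` are hypothetical, and the sketch assumes every panel carries a `gridPos` object with `x`, `y`, `w`, and `h` keys as in the examples above.

```python
GRID_COLUMNS = 24  # Grafana lays panels out on a 24-column grid


def _overlaps(a, b):
    """True when two gridPos rectangles share any grid cell."""
    return (a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
            and a["y"] < b["y"] + b["h"] and b["y"] < a["y"] + a["h"])


def check_panels(dashboard):
    """Return a list of human-readable layout problems (empty if none)."""
    # Accept both the wrapped form {"dashboard": {...}} and a bare dashboard.
    panels = dashboard.get("dashboard", dashboard).get("panels", [])
    problems = []
    placed = []  # (title, gridPos) of panels already checked
    for panel in panels:
        title = panel.get("title", "<untitled>")
        pos = panel.get("gridPos")
        if pos is None:
            problems.append(f"{title}: missing gridPos")
            continue
        if pos["x"] + pos["w"] > GRID_COLUMNS:
            problems.append(f"{title}: extends past column {GRID_COLUMNS}")
        for other_title, other_pos in placed:
            if _overlaps(pos, other_pos):
                problems.append(f"{title}: overlaps {other_title}")
        placed.append((title, pos))
    return problems


# Layout copied from the API Monitoring Dashboard example (queries omitted).
API_LAYOUT = {
    "dashboard": {
        "panels": [
            {"title": "Request Rate", "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}},
            {"title": "Error Rate %", "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}},
            {"title": "P95 Latency", "gridPos": {"x": 0, "y": 8, "w": 24, "h": 8}},
        ]
    }
}

if __name__ == "__main__":
    for problem in check_panels(API_LAYOUT):
        print(problem)
```

To check a file on disk, load it with `json.load` and pass the result to `check_panels`. A check like this could run in CI ahead of the Terraform or Ansible steps, so a broken layout is caught before it is provisioned.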