How do I install Data Engineering Data Pipeline?

Install Data Engineering Data Pipeline with a single command: npx mdskills install sickn33/data-engineering-data-pipeline. This downloads the skill files into your project and your AI agent picks them up automatically.

What platforms support Data Engineering Data Pipeline?

Data Engineering Data Pipeline works with Claude Code, Claude Desktop, Cursor, Vscode Copilot, Windsurf, Continue Dev, Codex, Gemini Cli, Amp, Roo Code, Goose, Opencode, Trae, Qodo, Command Code. Skills use the open SKILL.md format which is compatible with any AI coding agent that reads markdown instructions.

← Back to skills

Data Engineering Data Pipeline

Name: Data Engineering Data Pipeline: AI Agent Skill
Rating: 8 (1 reviews)
Author: sickn33

Verified

Productivity & TasksIntermediate

You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing.

by @sickn330Updated 2/20/2026

Add this skill

npx mdskills install sickn33/data-engineering-data-pipeline

Fork & Edit

Skill Advisor8.0

Comprehensive data pipeline guidance with actionable patterns, clear examples, and strong observability focus

+Provides detailed architecture patterns (ETL, ELT, Lambda, Kappa, Lakehouse) with selection criteria
+Includes concrete code examples with error handling and data quality validation
+Covers end-to-end concerns: ingestion, transformation, monitoring, and cost optimization
-Declares shell execution permission without clear justification in instructions
-Example code references non-existent modules without setup context

SKILL.md

Edit in Browser

1---
2name: data-engineering-data-pipeline
3description: "You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing."
4---
5 
6# Data Pipeline Architecture
7 
8You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing.
9 
10## Use this skill when
11 
12- Working on data pipeline architecture tasks or workflows
13- Needing guidance, best practices, or checklists for data pipeline architecture
14 
15## Do not use this skill when
16 
17- The task is unrelated to data pipeline architecture
18- You need a different domain or tool outside this scope
19 
20## Requirements
21 
22$ARGUMENTS
23 
24## Core Capabilities
25 
26- Design ETL/ELT, Lambda, Kappa, and Lakehouse architectures
27- Implement batch and streaming data ingestion
28- Build workflow orchestration with Airflow/Prefect
29- Transform data using dbt and Spark
30- Manage Delta Lake/Iceberg storage with ACID transactions
31- Implement data quality frameworks (Great Expectations, dbt tests)
32- Monitor pipelines with CloudWatch/Prometheus/Grafana
33- Optimize costs through partitioning, lifecycle policies, and compute optimization
34 
35## Instructions
36 
37### 1. Architecture Design
38- Assess: sources, volume, latency requirements, targets
39- Select pattern: ETL (transform before load), ELT (load then transform), Lambda (batch + speed layers), Kappa (stream-only), Lakehouse (unified)
40- Design flow: sources → ingestion → processing → storage → serving
41- Add observability touchpoints
42 
43### 2. Ingestion Implementation
44**Batch**
45- Incremental loading with watermark columns
46- Retry logic with exponential backoff
47- Schema validation and dead letter queue for invalid records
48- Metadata tracking (_extracted_at, _source)
49 
50**Streaming**
51- Kafka consumers with exactly-once semantics
52- Manual offset commits within transactions
53- Windowing for time-based aggregations
54- Error handling and replay capability
55 
56### 3. Orchestration
57**Airflow**
58- Task groups for logical organization
59- XCom for inter-task communication
60- SLA monitoring and email alerts
61- Incremental execution with execution_date
62- Retry with exponential backoff
63 
64**Prefect**
65- Task caching for idempotency
66- Parallel execution with .submit()
67- Artifacts for visibility
68- Automatic retries with configurable delays
69 
70### 4. Transformation with dbt
71- Staging layer: incremental materialization, deduplication, late-arriving data handling
72- Marts layer: dimensional models, aggregations, business logic
73- Tests: unique, not_null, relationships, accepted_values, custom data quality tests
74- Sources: freshness checks, loaded_at_field tracking
75- Incremental strategy: merge or delete+insert
76 
77### 5. Data Quality Framework
78**Great Expectations**
79- Table-level: row count, column count
80- Column-level: uniqueness, nullability, type validation, value sets, ranges
81- Checkpoints for validation execution
82- Data docs for documentation
83- Failure notifications
84 
85**dbt Tests**
86- Schema tests in YAML
87- Custom data quality tests with dbt-expectations
88- Test results tracked in metadata
89 
90### 6. Storage Strategy
91**Delta Lake**
92- ACID transactions with append/overwrite/merge modes
93- Upsert with predicate-based matching
94- Time travel for historical queries
95- Optimize: compact small files, Z-order clustering
96- Vacuum to remove old files
97 
98**Apache Iceberg**
99- Partitioning and sort order optimization
100- MERGE INTO for upserts
101- Snapshot isolation and time travel
102- File compaction with binpack strategy
103- Snapshot expiration for cleanup
104 
105### 7. Monitoring & Cost Optimization
106**Monitoring**
107- Track: records processed/failed, data size, execution time, success/failure rates
108- CloudWatch metrics and custom namespaces
109- SNS alerts for critical/warning/info events
110- Data freshness checks
111- Performance trend analysis
112 
113**Cost Optimization**
114- Partitioning: date/entity-based, avoid over-partitioning (keep >1GB)
115- File sizes: 512MB-1GB for Parquet
116- Lifecycle policies: hot (Standard) → warm (IA) → cold (Glacier)
117- Compute: spot instances for batch, on-demand for streaming, serverless for adhoc
118- Query optimization: partition pruning, clustering, predicate pushdown
119 
120## Example: Minimal Batch Pipeline
121 
122```python
123# Batch ingestion with validation
124from batch_ingestion import BatchDataIngester
125from storage.delta_lake_manager import DeltaLakeManager
126from data_quality.expectations_suite import DataQualityFramework
127 
128ingester = BatchDataIngester(config={})
129 
130# Extract with incremental loading
131df = ingester.extract_from_database(
132    connection_string='postgresql://host:5432/db',
133    query='SELECT * FROM orders',
134    watermark_column='updated_at',
135    last_watermark=last_run_timestamp
136)
137 
138# Validate
139schema = {'required_fields': ['id', 'user_id'], 'dtypes': {'id': 'int64'}}
140df = ingester.validate_and_clean(df, schema)
141 
142# Data quality checks
143dq = DataQualityFramework()
144result = dq.validate_dataframe(df, suite_name='orders_suite', data_asset_name='orders')
145 
146# Write to Delta Lake
147delta_mgr = DeltaLakeManager(storage_path='s3://lake')
148delta_mgr.create_or_update_table(
149    df=df,
150    table_name='orders',
151    partition_columns=['order_date'],
152    mode='append'
153)
154 
155# Save failed records
156ingester.save_dead_letter_queue('s3://lake/dlq/orders')
157```
158 
159## Output Deliverables
160 
161### 1. Architecture Documentation
162- Architecture diagram with data flow
163- Technology stack with justification
164- Scalability analysis and growth patterns
165- Failure modes and recovery strategies
166 
167### 2. Implementation Code
168- Ingestion: batch/streaming with error handling
169- Transformation: dbt models (staging → marts) or Spark jobs
170- Orchestration: Airflow/Prefect DAGs with dependencies
171- Storage: Delta/Iceberg table management
172- Data quality: Great Expectations suites and dbt tests
173 
174### 3. Configuration Files
175- Orchestration: DAG definitions, schedules, retry policies
176- dbt: models, sources, tests, project config
177- Infrastructure: Docker Compose, K8s manifests, Terraform
178- Environment: dev/staging/prod configs
179 
180### 4. Monitoring & Observability
181- Metrics: execution time, records processed, quality scores
182- Alerts: failures, performance degradation, data freshness
183- Dashboards: Grafana/CloudWatch for pipeline health
184- Logging: structured logs with correlation IDs
185 
186### 5. Operations Guide
187- Deployment procedures and rollback strategy
188- Troubleshooting guide for common issues
189- Scaling guide for increased volume
190- Cost optimization strategies and savings
191- Disaster recovery and backup procedures
192 
193## Success Criteria
194- Pipeline meets defined SLA (latency, throughput)
195- Data quality checks pass with >99% success rate
196- Automatic retry and alerting on failures
197- Comprehensive monitoring shows health and performance
198- Documentation enables team maintenance
199- Cost optimization reduces infrastructure costs by 30-50%
200- Schema evolution without downtime
201- End-to-end data lineage tracked
202

Full transparency — inspect the skill content before installing.