Comprehensive multi-agent MLOps orchestration with phase-based coordination and modern tooling.

Install with: `npx mdskills install sickn33/machine-learning-ops-ml-pipeline`

---
name: machine-learning-ops-ml-pipeline
description: "Design and implement a complete ML pipeline for: $ARGUMENTS"
---

# Machine Learning Pipeline - Multi-Agent MLOps Orchestration

Design and implement a complete ML pipeline for: $ARGUMENTS

## Use this skill when

- Working on ML pipeline tasks or multi-agent MLOps orchestration workflows
- Needing guidance, best practices, or checklists for multi-agent MLOps orchestration

## Do not use this skill when

- The task is unrelated to ML pipelines or multi-agent MLOps orchestration
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Thinking

This workflow orchestrates multiple specialized agents to build a production-ready ML pipeline following modern MLOps best practices.
The approach emphasizes:

- **Phase-based coordination**: Each phase builds upon previous outputs, with clear handoffs between agents
- **Modern tooling integration**: MLflow/W&B for experiments, Feast/Tecton for features, KServe/Seldon for serving
- **Production-first mindset**: Every component designed for scale, monitoring, and reliability
- **Reproducibility**: Version control for data, models, and infrastructure
- **Continuous improvement**: Automated retraining, A/B testing, and drift detection

The multi-agent approach ensures each aspect is handled by domain experts:
- Data engineers handle ingestion and quality
- Data scientists design features and experiments
- ML engineers implement training pipelines
- MLOps engineers handle production deployment
- Observability engineers ensure monitoring

## Phase 1: Data & Requirements Analysis

<Task>
subagent_type: data-engineer
prompt: |
  Analyze and design the data pipeline for an ML system with requirements: $ARGUMENTS

  Deliverables:
  1. Data source audit and ingestion strategy:
     - Source systems and connection patterns
     - Schema validation using Pydantic/Great Expectations
     - Data versioning with DVC or lakeFS
     - Incremental loading and CDC strategies

  2. Data quality framework:
     - Profiling and statistics generation
     - Anomaly detection rules
     - Data lineage tracking
     - Quality gates and SLAs

  3. Storage architecture:
     - Raw/processed/feature layers
     - Partitioning strategy
     - Retention policies
     - Cost optimization

  Provide implementation code for critical components and integration patterns.
</Task>

<Task>
subagent_type: data-scientist
prompt: |
  Design feature engineering and model requirements for: $ARGUMENTS
  Using data architecture from: {phase1.data-engineer.output}

  Deliverables:
  1. Feature engineering pipeline:
     - Transformation specifications
     - Feature store schema (Feast/Tecton)
     - Statistical validation rules
     - Handling strategies for missing data/outliers

  2. Model requirements:
     - Algorithm selection rationale
     - Performance metrics and baselines
     - Training data requirements
     - Evaluation criteria and thresholds

  3. Experiment design:
     - Hypothesis and success metrics
     - A/B testing methodology
     - Sample size calculations
     - Bias detection approach

  Include feature transformation code and statistical validation logic.
</Task>

## Phase 2: Model Development & Training

<Task>
subagent_type: ml-engineer
prompt: |
  Implement the training pipeline based on requirements: {phase1.data-scientist.output}
  Using data pipeline: {phase1.data-engineer.output}

  Build a comprehensive training system:
  1. Training pipeline implementation:
     - Modular training code with clear interfaces
     - Hyperparameter optimization (Optuna/Ray Tune)
     - Distributed training support (Horovod/PyTorch DDP)
     - Cross-validation and ensemble strategies

  2. Experiment tracking setup:
     - MLflow/Weights & Biases integration
     - Metric logging and visualization
     - Artifact management (models, plots, data samples)
     - Experiment comparison and analysis tools

  3. Model registry integration:
     - Version control and tagging strategy
     - Model metadata and lineage
     - Promotion workflows (dev -> staging -> prod)
     - Rollback procedures

  Provide complete training code with configuration management.
</Task>

<Task>
subagent_type: python-pro
prompt: |
  Optimize and productionize ML code from: {phase2.ml-engineer.output}

  Focus areas:
  1. Code quality and structure:
     - Refactor for production standards
     - Add comprehensive error handling
     - Implement proper logging with structured formats
     - Create reusable components and utilities

  2. Performance optimization:
     - Profile and optimize bottlenecks
     - Implement caching strategies
     - Optimize data loading and preprocessing
     - Memory management for large-scale training

  3. Testing framework:
     - Unit tests for data transformations
     - Integration tests for pipeline components
     - Model quality tests (invariance, directional)
     - Performance regression tests

  Deliver production-ready, maintainable code with full test coverage.
</Task>

## Phase 3: Production Deployment & Serving

<Task>
subagent_type: mlops-engineer
prompt: |
  Design production deployment for models from: {phase2.ml-engineer.output}
  With optimized code from: {phase2.python-pro.output}

  Implementation requirements:
  1. Model serving infrastructure:
     - REST/gRPC APIs with FastAPI/TorchServe
     - Batch prediction pipelines (Airflow/Kubeflow)
     - Stream processing (Kafka/Kinesis integration)
     - Model serving platforms (KServe/Seldon Core)

  2. Deployment strategies:
     - Blue-green deployments for zero downtime
     - Canary releases with traffic splitting
     - Shadow deployments for validation
     - A/B testing infrastructure

  3. CI/CD pipeline:
     - GitHub Actions/GitLab CI workflows
     - Automated testing gates
     - Model validation before deployment
     - ArgoCD for GitOps deployment

  4. Infrastructure as Code:
     - Terraform modules for cloud resources
     - Helm charts for Kubernetes deployments
     - Docker multi-stage builds for optimization
     - Secret management with Vault/Secrets Manager

  Provide complete deployment configuration and automation scripts.
</Task>

<Task>
subagent_type: kubernetes-architect
prompt: |
  Design Kubernetes infrastructure for ML workloads from: {phase3.mlops-engineer.output}

  Kubernetes-specific requirements:
  1. Workload orchestration:
     - Training job scheduling with Kubeflow
     - GPU resource allocation and sharing
     - Spot/preemptible instance integration
     - Priority classes and resource quotas

  2. Serving infrastructure:
     - HPA/VPA for autoscaling
     - KEDA for event-driven scaling
     - Istio service mesh for traffic management
     - Model caching and warm-up strategies

  3. Storage and data access:
     - PVC strategies for training data
     - Model artifact storage with CSI drivers
     - Distributed storage for feature stores
     - Cache layers for inference optimization

  Provide Kubernetes manifests and Helm charts for the entire ML platform.
</Task>

## Phase 4: Monitoring & Continuous Improvement

<Task>
subagent_type: observability-engineer
prompt: |
  Implement comprehensive monitoring for the ML system deployed in: {phase3.mlops-engineer.output}
  Using Kubernetes infrastructure: {phase3.kubernetes-architect.output}

  Monitoring framework:
  1. Model performance monitoring:
     - Prediction accuracy tracking
     - Latency and throughput metrics
     - Feature importance shifts
     - Business KPI correlation

  2. Data and model drift detection:
     - Statistical drift detection (KS test, PSI)
     - Concept drift monitoring
     - Feature distribution tracking
     - Automated drift alerts and reports

  3. System observability:
     - Prometheus metrics for all components
     - Grafana dashboards for visualization
     - Distributed tracing with Jaeger/Zipkin
     - Log aggregation with ELK/Loki

  4. Alerting and automation:
     - PagerDuty/Opsgenie integration
     - Automated retraining triggers
     - Performance degradation workflows
     - Incident response runbooks

  5. Cost tracking:
     - Resource utilization metrics
     - Cost allocation by model/experiment
     - Optimization recommendations
     - Budget alerts and controls

  Deliver monitoring configuration, dashboards, and alert rules.
</Task>

## Configuration Options

- **experiment_tracking**: mlflow | wandb | neptune | clearml
- **feature_store**: feast | tecton | databricks | custom
- **serving_platform**: kserve | seldon | torchserve | triton
- **orchestration**: kubeflow | airflow | prefect | dagster
- **cloud_provider**: aws | azure | gcp | multi-cloud
- **deployment_mode**: realtime | batch | streaming | hybrid
- **monitoring_stack**: prometheus | datadog | newrelic | custom

## Success Criteria

1. **Data Pipeline Success**:
   - < 0.1% data quality issues in production
   - Automated data validation passing 99.9% of the time
   - Complete data lineage tracking
   - Sub-second feature serving latency

2. **Model Performance**:
   - Meeting or exceeding baseline metrics
   - < 5% performance degradation before retraining
   - Successful A/B tests with statistical significance
   - No model drift left undetected for more than 24 hours

3. **Operational Excellence**:
   - 99.9% uptime for model serving
   - < 200 ms p99 inference latency
   - Automated rollback within 5 minutes
   - Complete observability with < 1 minute time-to-alert

4. **Development Velocity**:
   - < 1 hour from commit to production
   - Parallel experiment execution
   - Reproducible training runs
   - Self-service model deployment

5. **Cost Efficiency**:
   - < 20% infrastructure waste
   - Optimized resource allocation
   - Automatic scaling based on load
   - Spot instance utilization > 60%

## Final Deliverables

Upon completion, the orchestrated pipeline will provide:

- End-to-end ML pipeline with full automation
- Comprehensive documentation and runbooks
- Production-ready infrastructure as code
- Complete monitoring and alerting system
- CI/CD pipelines for continuous improvement
- Cost optimization and scaling strategies
- Disaster recovery and rollback procedures
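
The Phase 4 prompt asks for statistical drift detection via the KS test and PSI. As a rough, self-contained sketch of what those checks compute (the function names and thresholds below are illustrative assumptions, not part of the generated pipeline):

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training) sample and a live sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drift."""
    # Bin edges from the reference distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    actual = np.clip(actual, edges[0], edges[-1])  # keep live points in range
    exp_pct = np.histogram(expected, edges)[0] / len(expected)
    act_pct = np.histogram(actual, edges)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 1e-6, None)  # avoid log(0) on empty bins
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs."""
    grid = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return float(np.max(np.abs(cdf_x - cdf_y)))

def drift_detected(reference, live, psi_threshold=0.25):
    """Flag drift if either test trips; 1.358 is the conventional
    KS critical-value coefficient for alpha = 0.05."""
    n, m = len(reference), len(live)
    ks_crit = 1.358 * np.sqrt((n + m) / (n * m))
    return (population_stability_index(reference, live) > psi_threshold
            or ks_statistic(reference, live) > ks_crit)

rng = np.random.default_rng(7)
reference = rng.normal(0.0, 1.0, 5000)
drifted = rng.normal(0.8, 1.2, 5000)  # simulated shift in a live feature
print(drift_detected(reference, reference))  # False: identical samples
print(drift_detected(reference, drifted))    # True: mean/variance shift
```

A production system would typically run checks like these per feature on a schedule and feed the result into the automated retraining triggers described in Phase 4.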
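
Phase 3 calls for canary releases with traffic splitting. In production this is usually configured at the mesh or serving layer (for example Istio traffic weights or KServe canary rollouts) rather than in application code, but the routing idea behind it can be sketched in a few lines; the function name and user-id scheme here are illustrative:

```python
import hashlib

def route(user_id: str, canary_percent: float = 5.0) -> str:
    """Deterministically send a fixed share of users to the canary model.

    Hashing the user id (instead of random sampling) pins each user to one
    variant, which keeps canary metrics and A/B comparisons clean.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return "canary" if bucket < canary_percent * 100 else "stable"

counts = {"canary": 0, "stable": 0}
for i in range(100_000):
    counts[route(f"user-{i}")] += 1
print(counts["canary"] / 100_000)  # close to 0.05
```

Because the split is a deterministic hash of the user id, each user always sees the same model variant across requests, which is what makes the canary metrics comparable to the stable baseline.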