---
name: agent-orchestration-improve-agent
description: "Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration."
---

# Agent Performance Optimization Workflow

Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.

[Extended thinking: Agent optimization requires a data-driven approach combining performance metrics, user feedback analysis, and advanced prompt engineering techniques. Success depends on systematic evaluation, targeted improvements, and rigorous testing with rollback capabilities for production safety.]

## Use this skill when

- Improving an existing agent's performance or reliability
- Analyzing failure modes, prompt quality, or tool usage
- Running structured A/B tests or evaluation suites
- Designing iterative optimization workflows for agents

## Do not use this skill when

- You are building a brand-new agent from scratch
- There are no metrics, feedback, or test cases available
- The task is unrelated to agent performance or prompt quality

## Instructions

1. Establish baseline metrics and collect representative examples.
2. Identify failure modes and prioritize high-impact fixes.
3. Apply prompt and workflow improvements with measurable goals.
4. Validate with tests and roll out changes in controlled stages.
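The four instructions above form a loop: measure, change, validate, and only keep what passes the gate. A minimal sketch of that loop, where `Metrics` and the gating thresholds are illustrative stand-ins, not part of any real agent API:

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    """Snapshot of agent quality on a fixed evaluation set."""
    success_rate: float          # fraction of tasks completed correctly
    corrections_per_task: float  # average user corrections observed

def should_ship(baseline: Metrics, candidate: Metrics) -> bool:
    """Gate a candidate: ship only if it beats the baseline on success
    rate without increasing user corrections."""
    return (candidate.success_rate > baseline.success_rate
            and candidate.corrections_per_task <= baseline.corrections_per_task)

def improvement_cycle(baseline: Metrics, candidates: list[Metrics]) -> Metrics:
    """Steps 1-4 as a loop: keep the best candidate that passes the gate,
    otherwise stay on the current baseline (safe fallback by default)."""
    best = baseline
    for candidate in candidates:
        if should_ship(best, candidate):
            best = candidate
    return best
```

The default-to-baseline behavior mirrors the rollback posture described in the Safety section: a candidate that regresses on any gated metric is simply never adopted.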
## Safety

- Avoid deploying prompt changes without regression testing.
- Roll back quickly if quality or safety metrics regress.

## Phase 1: Performance Analysis and Baseline Metrics

Comprehensive analysis of agent performance using context-manager for historical data collection.

### 1.1 Gather Performance Data

```
Use: context-manager
Command: analyze-agent-performance $ARGUMENTS --days 30
```

Collect metrics including:

- Task completion rate (successful vs failed tasks)
- Response accuracy and factual correctness
- Tool usage efficiency (correct tools, call frequency)
- Average response time and token consumption
- User satisfaction indicators (corrections, retries)
- Hallucination incidents and error patterns

### 1.2 User Feedback Pattern Analysis

Identify recurring patterns in user interactions:

- **Correction patterns**: Where users consistently modify outputs
- **Clarification requests**: Common areas of ambiguity
- **Task abandonment**: Points where users give up
- **Follow-up questions**: Indicators of incomplete responses
- **Positive feedback**: Successful patterns to preserve

### 1.3 Failure Mode Classification

Categorize failures by root cause:

- **Instruction misunderstanding**: Role or task confusion
- **Output format errors**: Structure or formatting issues
- **Context loss**: Long conversation degradation
- **Tool misuse**: Incorrect or inefficient tool selection
- **Constraint violations**: Safety or business rule breaches
- **Edge case handling**: Unusual input scenarios

### 1.4 Baseline Performance Report

Generate quantitative baseline metrics:

```
Performance Baseline:
- Task Success Rate: [X%]
- Average Corrections per Task: [Y]
- Tool Call Efficiency: [Z%]
- User Satisfaction Score: [1-10]
- Average Response Latency: [Xms]
- Token Efficiency Ratio: [X:Y]
```
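A baseline report like the one above can be aggregated from logged task records. A minimal sketch — the record fields (`success`, `corrections`, `tool_calls`, `useful_tool_calls`, `latency_ms`) are assumed names, not a real log schema:

```python
def baseline_report(records: list[dict]) -> dict:
    """Aggregate per-task log records into baseline metrics.
    Assumed record fields: success (bool), corrections (int),
    tool_calls (int), useful_tool_calls (int), latency_ms (float)."""
    n = len(records)
    total_tools = sum(r["tool_calls"] for r in records)
    return {
        "task_success_rate": sum(r["success"] for r in records) / n,
        "avg_corrections_per_task": sum(r["corrections"] for r in records) / n,
        # Fraction of tool calls that actually contributed to the result.
        "tool_call_efficiency": (
            sum(r["useful_tool_calls"] for r in records) / total_tools
            if total_tools else 1.0
        ),
        "avg_latency_ms": sum(r["latency_ms"] for r in records) / n,
    }
```

Running this over the same 30-day window used in 1.1 gives the numbers to substitute for the `[X%]`/`[Y]` placeholders above.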
## Phase 2: Prompt Engineering Improvements

Apply advanced prompt optimization techniques using prompt-engineer agent.

### 2.1 Chain-of-Thought Enhancement

Implement structured reasoning patterns:

```
Use: prompt-engineer
Technique: chain-of-thought-optimization
```

- Add explicit reasoning steps: "Let's approach this step-by-step..."
- Include self-verification checkpoints: "Before proceeding, verify that..."
- Implement recursive decomposition for complex tasks
- Add reasoning trace visibility for debugging

### 2.2 Few-Shot Example Optimization

Curate high-quality examples from successful interactions:

- **Select diverse examples** covering common use cases
- **Include edge cases** that previously failed
- **Show both positive and negative examples** with explanations
- **Order examples** from simple to complex
- **Annotate examples** with key decision points

Example structure:

```
Good Example:
Input: [User request]
Reasoning: [Step-by-step thought process]
Output: [Successful response]
Why this works: [Key success factors]

Bad Example:
Input: [Similar request]
Output: [Failed response]
Why this fails: [Specific issues]
Correct approach: [Fixed version]
```

### 2.3 Role Definition Refinement

Strengthen agent identity and capabilities:

- **Core purpose**: Clear, single-sentence mission
- **Expertise domains**: Specific knowledge areas
- **Behavioral traits**: Personality and interaction style
- **Tool proficiency**: Available tools and when to use them
- **Constraints**: What the agent should NOT do
- **Success criteria**: How to measure task completion

### 2.4 Constitutional AI Integration

Implement self-correction mechanisms:

```
Constitutional Principles:
1. Verify factual accuracy before responding
2. Self-check for potential biases or harmful content
3. Validate output format matches requirements
4. Ensure response completeness
5. Maintain consistency with previous responses
```
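These principles can drive a self-correction loop. A sketch, assuming generic `generate`, `critique`, and `revise` callables supplied by the caller — none of these are a real agent API:

```python
from typing import Callable

def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str], list[str]],  # principle violations, empty if clean
    revise: Callable[[str, list[str]], str],
    max_rounds: int = 3,
) -> str:
    """Generate a response, self-critique it against the constitutional
    principles, and revise until no violations remain or rounds run out."""
    response = generate(prompt)
    for _ in range(max_rounds):
        issues = critique(response)
        if not issues:
            break
        response = revise(response, issues)
    return response
```

Capping the rounds keeps a stubborn violation from looping forever; the final validation pass before output still applies to whatever the loop returns.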
Add critique-and-revise loops:

- Initial response generation
- Self-critique against principles
- Automatic revision if issues detected
- Final validation before output

### 2.5 Output Format Tuning

Optimize response structure:

- **Structured templates** for common tasks
- **Dynamic formatting** based on complexity
- **Progressive disclosure** for detailed information
- **Markdown optimization** for readability
- **Code block formatting** with syntax highlighting
- **Table and list generation** for data presentation

## Phase 3: Testing and Validation

Comprehensive testing framework with A/B comparison.

### 3.1 Test Suite Development

Create representative test scenarios:

```
Test Categories:
1. Golden path scenarios (common successful cases)
2. Previously failed tasks (regression testing)
3. Edge cases and corner scenarios
4. Stress tests (complex, multi-step tasks)
5. Adversarial inputs (potential breaking points)
6. Cross-domain tasks (combining capabilities)
```
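The six categories can be encoded directly in a regression suite. A sketch with a hypothetical `run_agent` callable standing in for the agent under test:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentTestCase:
    category: str                 # one of the six categories above
    prompt: str
    check: Callable[[str], bool]  # True if the response is acceptable

def run_suite(cases: list[AgentTestCase],
              run_agent: Callable[[str], str]) -> dict:
    """Run every case and report the pass rate per category, so
    regressions in a single category (e.g. edge cases) stay visible."""
    results: dict[str, list[bool]] = {}
    for case in cases:
        outcome = case.check(run_agent(case.prompt))
        results.setdefault(case.category, []).append(outcome)
    return {cat: sum(passed) / len(passed) for cat, passed in results.items()}
```

Per-category pass rates matter because an overall average can hide a regression: an improved prompt that gains on golden-path cases while failing every previously fixed task would look flat in aggregate.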
### 3.2 A/B Testing Framework

Compare original vs improved agent:

```
Use: parallel-test-runner
Config:
  - Agent A: Original version
  - Agent B: Improved version
  - Test set: 100 representative tasks
  - Metrics: Success rate, speed, token usage
  - Evaluation: Blind human review + automated scoring
```

Statistical significance testing:

- Minimum sample size: 100 tasks per variant
- Confidence level: 95% (p < 0.05)
- Effect size calculation (Cohen's d)
- Power analysis for future tests

### 3.3 Evaluation Metrics

Comprehensive scoring framework:

**Task-Level Metrics:**

- Completion rate (binary success/failure)
- Correctness score (0-100% accuracy)
- Efficiency score (steps taken vs optimal)
- Tool usage appropriateness
- Response relevance and completeness

**Quality Metrics:**

- Hallucination rate (factual errors per response)
- Consistency score (alignment with previous responses)
- Format compliance (matches specified structure)
- Safety score (constraint adherence)
- User satisfaction prediction

**Performance Metrics:**

- Response latency (time to first token)
- Total generation time
- Token consumption (input + output)
- Cost per task (API usage fees)
- Memory/context efficiency

### 3.4 Human Evaluation Protocol

Structured human review process:

- Blind evaluation (evaluators don't know version)
- Standardized rubric with clear criteria
- Multiple evaluators per sample (inter-rater reliability)
- Qualitative feedback collection
- Preference ranking (A vs B comparison)

## Phase 4: Version Control and Deployment

Safe rollout with monitoring and rollback capabilities.

### 4.1 Version Management

Systematic versioning strategy:

```
Version Format: agent-name-v[MAJOR].[MINOR].[PATCH]
Example: customer-support-v2.3.1

MAJOR: Significant capability changes
MINOR: Prompt improvements, new examples
PATCH: Bug fixes, minor adjustments
```
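A small helper matching the version format above keeps bumps consistent; a sketch, with the bump levels mirroring the MAJOR/MINOR/PATCH scheme:

```python
def bump_version(version: str, level: str) -> str:
    """Bump an agent version string like 'customer-support-v2.3.1'.
    level is 'major' (capability changes), 'minor' (prompt improvements,
    new examples), or 'patch' (bug fixes), per the scheme above."""
    name, _, nums = version.rpartition("-v")
    major, minor, patch = (int(x) for x in nums.split("."))
    if level == "major":
        major, minor, patch = major + 1, 0, 0
    elif level == "minor":
        minor, patch = minor + 1, 0
    elif level == "patch":
        patch += 1
    else:
        raise ValueError(f"unknown level: {level}")
    return f"{name}-v{major}.{minor}.{patch}"
```

For example, shipping a curated few-shot example set would be a minor bump: `customer-support-v2.3.1` becomes `customer-support-v2.4.0`.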
Maintain version history:

- Git-based prompt storage
- Changelog with improvement details
- Performance metrics per version
- Rollback procedures documented

### 4.2 Staged Rollout

Progressive deployment strategy:

1. **Alpha testing**: Internal team validation (5% traffic)
2. **Beta testing**: Selected users (20% traffic)
3. **Canary release**: Gradual increase (20% → 50% → 100%)
4. **Full deployment**: After success criteria met
5. **Monitoring period**: 7-day observation window

### 4.3 Rollback Procedures

Quick recovery mechanism:

```
Rollback Triggers:
- Success rate drops >10% from baseline
- Critical errors increase >5%
- User complaints spike
- Cost per task increases >20%
- Safety violations detected

Rollback Process:
1. Detect issue via monitoring
2. Alert team immediately
3. Switch to previous stable version
4. Analyze root cause
5. Fix and re-test before retry
```

### 4.4 Continuous Monitoring

Real-time performance tracking:

- Dashboard with key metrics
- Anomaly detection alerts
- User feedback collection
- Automated regression testing
- Weekly performance reports

## Success Criteria

Agent improvement is successful when:

- Task success rate improves by ≥15%
- User corrections decrease by ≥25%
- No increase in safety violations
- Response time remains within 10% of baseline
- Cost per task doesn't increase >5%
- Positive user feedback increases

## Post-Deployment Review

After 30 days of production use:

1. Analyze accumulated performance data
2. Compare against baseline and targets
3. Identify new improvement opportunities
4. Document lessons learned
5. Plan next optimization cycle
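Steps 1–2 of the review — comparing accumulated data against the baseline and targets — can be expressed as a simple gate. A sketch with thresholds taken from the Success Criteria section, interpreting the percentages as relative changes; the metric field names are assumptions:

```python
def review_against_targets(baseline: dict, current: dict) -> dict:
    """Check each Success Criteria threshold: ≥15% higher success rate,
    ≥25% fewer corrections, latency within 10% of baseline, and cost
    growth capped at 5%. Returns a pass/fail flag per criterion."""
    return {
        "success_rate": current["success_rate"] >= baseline["success_rate"] * 1.15,
        "corrections": current["corrections"] <= baseline["corrections"] * 0.75,
        "latency": current["latency_ms"] <= baseline["latency_ms"] * 1.10,
        "cost": current["cost_per_task"] <= baseline["cost_per_task"] * 1.05,
    }
```

Reporting per-criterion flags rather than a single boolean makes step 3 easier: a failed criterion is itself the next improvement opportunity.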
## Continuous Improvement Cycle

Establish regular improvement cadence:

- **Weekly**: Monitor metrics and collect feedback
- **Monthly**: Analyze patterns and plan improvements
- **Quarterly**: Major version updates with new capabilities
- **Annually**: Strategic review and architecture updates

Remember: Agent optimization is an iterative process. Each cycle builds upon previous learnings, gradually improving performance while maintaining stability and safety.