When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," or "hypothesis." For tracking implementation, see analytics-tracking.
Add this skill
`npx mdskills install coreyhaines31/ab-test-setup`

Comprehensive A/B testing workflow with strong rigor gates and clear decision frameworks.
---
name: ab-test-setup
description: When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," or "hypothesis." For tracking implementation, see analytics-tracking.
metadata:
  version: 1.0.0
---

# A/B Test Setup

You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.

## Initial Assessment

**Check for product marketing context first:**
If `.claude/product-marketing-context.md` exists, read it before asking questions. Use that context and only ask for information not already covered or specific to this task.

Before designing a test, understand:

1. **Test Context** - What are you trying to improve? What change are you considering?
2. **Current State** - Baseline conversion rate? Current traffic volume?
3. **Constraints** - Technical complexity? Timeline? Tools available?

---

## Core Principles

### 1. Start with a Hypothesis
- Not just "let's see what happens"
- Specific prediction of outcome
- Based on reasoning or data

### 2. Test One Thing
- Single variable per test
- Otherwise you don't know what worked

### 3. Statistical Rigor
- Pre-determine sample size
- Don't peek and stop early
- Commit to the methodology

### 4. Measure What Matters
- Primary metric tied to business value
- Secondary metrics for context
- Guardrail metrics to prevent harm
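Principle 3 ("pre-determine sample size") can be made concrete with the standard two-proportion power approximation. This is a minimal sketch, assuming a two-sided z-test at alpha = 0.05 with 80% power; exact figures vary slightly between calculators depending on the approximation each one uses.

```python
import math

def n_per_variant(baseline, relative_lift, z_alpha=1.96, z_power=0.84):
    """Approximate sample size per variant for a two-sided
    two-proportion z-test (alpha = 0.05, power = 0.80)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    # Sum of per-arm Bernoulli variances, divided by the squared effect.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# A 3% baseline with a 20% relative lift needs roughly 14k users per variant.
print(n_per_variant(0.03, 0.20))
```

Note how quickly the requirement grows as the detectable lift shrinks: halving the lift roughly quadruples the sample size.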
---

## Hypothesis Framework

### Structure

```
Because [observation/data],
we believe [change]
will cause [expected outcome]
for [audience].
We'll know this is true when [metrics].
```

### Example

**Weak**: "Changing the button color might increase clicks."

**Strong**: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using a contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."

---

## Test Types

| Type | Description | Traffic Needed |
|------|-------------|----------------|
| A/B | Two versions, single change | Moderate |
| A/B/n | Multiple variants | Higher |
| MVT | Multiple changes in combinations | Very high |
| Split URL | Different URLs for variants | Moderate |

---

## Sample Size

### Quick Reference

| Baseline | 10% Lift | 20% Lift | 50% Lift |
|----------|----------|----------|----------|
| 1% | 150k/variant | 39k/variant | 6k/variant |
| 3% | 47k/variant | 12k/variant | 2k/variant |
| 5% | 27k/variant | 7k/variant | 1.2k/variant |
| 10% | 12k/variant | 3k/variant | 550/variant |

**Calculators:**
- [Evan Miller's](https://www.evanmiller.org/ab-testing/sample-size.html)
- [Optimizely's](https://www.optimizely.com/sample-size-calculator/)

**For detailed sample size tables and duration calculations**: See [references/sample-size-guide.md](references/sample-size-guide.md)

---

## Metrics Selection

### Primary Metric
- Single metric that matters most
- Directly tied to the hypothesis
- What you'll use to call the test

### Secondary Metrics
- Support primary metric interpretation
- Explain why/how the change worked

### Guardrail Metrics
- Things that shouldn't get worse
- Stop the test if significantly negative

### Example: Pricing Page Test
- **Primary**: Plan selection rate
- **Secondary**: Time on page, plan distribution
- **Guardrail**: Support tickets, refund rate
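The primary/secondary/guardrail split above can be encoded in a test spec so the stop condition is mechanical rather than ad hoc. A sketch using the pricing-page example; the metric names, baselines, and thresholds are illustrative placeholders, not values from this document.

```python
# Hypothetical spec for the pricing-page test; all numbers are placeholders.
pricing_test = {
    "primary": "plan_selection_rate",
    "secondary": ["time_on_page", "plan_distribution"],
    "guardrails": {
        # Stop the test if a guardrail worsens more than 25% vs. baseline.
        "support_ticket_rate": {"baseline": 0.020, "max_relative_increase": 0.25},
        "refund_rate": {"baseline": 0.010, "max_relative_increase": 0.25},
    },
}

def breached_guardrails(spec, observed):
    """Return the guardrail metrics whose observed rate exceeds its limit."""
    breached = []
    for name, rule in spec["guardrails"].items():
        limit = rule["baseline"] * (1 + rule["max_relative_increase"])
        if observed.get(name, 0.0) > limit:
            breached.append(name)
    return breached
```

For example, `breached_guardrails(pricing_test, {"support_ticket_rate": 0.030, "refund_rate": 0.011})` flags only the support-ticket rate (limit 0.025); writing the thresholds down before launch removes the temptation to rationalize them afterward.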
---

## Designing Variants

### What to Vary

| Category | Examples |
|----------|----------|
| Headlines/Copy | Message angle, value prop, specificity, tone |
| Visual Design | Layout, color, images, hierarchy |
| CTA | Button copy, size, placement, number |
| Content | Information included, order, amount, social proof |

### Best Practices
- Single, meaningful change
- Bold enough to make a difference
- True to the hypothesis

---

## Traffic Allocation

| Approach | Split | When to Use |
|----------|-------|-------------|
| Standard | 50/50 | Default for A/B |
| Conservative | 90/10, 80/20 | Limit risk of a bad variant |
| Ramping | Start small, increase | Technical risk mitigation |

**Considerations:**
- Consistency: Users see the same variant on return
- Balanced exposure across time of day/week

---

## Implementation

### Client-Side
- JavaScript modifies the page after load
- Quick to implement, can cause flicker
- Tools: PostHog, Optimizely, VWO

### Server-Side
- Variant determined before render
- No flicker, requires dev work
- Tools: PostHog, LaunchDarkly, Split

---

## Running the Test

### Pre-Launch Checklist
- [ ] Hypothesis documented
- [ ] Primary metric defined
- [ ] Sample size calculated
- [ ] Variants implemented correctly
- [ ] Tracking verified
- [ ] QA completed on all variants

### During the Test

**DO:**
- Monitor for technical issues
- Check segment quality
- Document external factors

**DON'T:**
- Peek at results and stop early
- Make changes to variants
- Add traffic from new sources

### The Peeking Problem
Looking at results before reaching the sample size and stopping early leads to false positives and wrong decisions. Pre-commit to sample size and trust the process.
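The peeking problem is easy to demonstrate with an A/A simulation: both arms share an identical conversion rate, yet checking significance at several interim looks "finds" far more winners than a single pre-committed look. A self-contained sketch; the look schedule and simulation counts are illustrative.

```python
import math
import random

def is_significant(conv_a, conv_b, n):
    """Two-sided two-proportion z-test at alpha = 0.05, equal n per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    return abs(conv_a / n - conv_b / n) / se > 1.96

def false_positive_rates(looks=5, n_per_look=500, sims=400, p=0.05, seed=7):
    """Share of A/A tests called significant at any look vs. only the final look."""
    rng = random.Random(seed)
    any_look = final_look = 0
    for _ in range(sims):
        a = b = 0
        hit_any = hit_final = False
        for look in range(1, looks + 1):
            a += sum(rng.random() < p for _ in range(n_per_look))
            b += sum(rng.random() < p for _ in range(n_per_look))
            sig = is_significant(a, b, look * n_per_look)
            hit_any = hit_any or sig
            hit_final = sig  # only the last assignment (the final look) survives
        any_look += hit_any
        final_look += hit_final
    return any_look / sims, final_look / sims

peeking, committed = false_positive_rates()
print(f"peeking: {peeking:.0%}, pre-committed: {committed:.0%}")
```

With no real difference between arms, the pre-committed rate sits near the nominal 5%, while stopping at the first significant look inflates it well beyond that; this is the mechanism behind "don't peek and stop early."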
---

## Analyzing Results

### Statistical Significance
- 95% confidence = p-value < 0.05
- Means there's less than a 5% chance of seeing a difference this large if the variants truly performed the same
- Not a guarantee, just a threshold

### Analysis Checklist

1. **Reach sample size?** If not, the result is preliminary
2. **Statistically significant?** Check confidence intervals
3. **Effect size meaningful?** Compare to MDE, project impact
4. **Secondary metrics consistent?** Do they support the primary?
5. **Guardrail concerns?** Did anything get worse?
6. **Segment differences?** Mobile vs. desktop? New vs. returning?

### Interpreting Results

| Result | Conclusion |
|--------|------------|
| Significant winner | Implement variant |
| Significant loser | Keep control, learn why |
| No significant difference | Need more traffic or a bolder test |
| Mixed signals | Dig deeper, maybe segment |

---

## Documentation

Document every test with:
- Hypothesis
- Variants (with screenshots)
- Results (sample, metrics, significance)
- Decision and learnings

**For templates**: See [references/test-templates.md](references/test-templates.md)

---

## Common Mistakes

### Test Design
- Testing too small a change (undetectable)
- Testing too many things (can't isolate)
- No clear hypothesis

### Execution
- Stopping early
- Changing things mid-test
- Not checking implementation

### Analysis
- Ignoring confidence intervals
- Cherry-picking segments
- Over-interpreting inconclusive results

---

## Task-Specific Questions

1. What's your current conversion rate?
2. How much traffic does this page get?
3. What change are you considering and why?
4. What's the smallest improvement worth detecting?
5. What tools do you have for testing?
6. Have you tested this area before?
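The significance check from "Analyzing Results" above can be run without a dedicated tool. A minimal sketch of the pooled two-proportion z-test; the conversion counts are illustrative, and the error function stands in for a normal CDF so no SciPy is needed.

```python
import math

def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided normal-tail p-value via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 5.00% vs. 5.75% conversion with 10k users per variant.
z, p = two_proportion_test(500, 10_000, 575, 10_000)
```

Here p is roughly 0.019, below the 0.05 threshold; per the checklist above, that call only stands if the pre-committed sample size was actually reached.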
---

## Related Skills

- **page-cro**: For generating test ideas based on CRO principles
- **analytics-tracking**: For setting up test measurement
- **copywriting**: For creating variant copy