Technical problems invalidate tests. Monitor for issues daily (this is different from checking results; you're checking that the test is running correctly).
Tracking validation: Verify tracking codes fire correctly on both control and variant. Use browser developer tools or tag manager preview. Common issues: tracking code only fires on control, variant page loads but doesn't track conversions, mobile tracking breaks whilst desktop works. Check all devices and browsers.
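If you can pull a raw event export from your analytics tool, a short script catches the most common gap: pageviews recorded but zero conversions on one arm or one device. A minimal sketch, assuming a hypothetical events.csv with variant, device and event columns (your export will differ):

```python
# Sanity-check that conversion events are recorded for both arms.
# Assumes a hypothetical export (events.csv) with columns:
# variant ('control'/'variant'), device, event ('pageview'/'conversion').
import pandas as pd

events = pd.read_csv("events.csv")

# Count each event type per variant/device combination.
counts = (
    events.groupby(["variant", "device", "event"])
    .size()
    .unstack("event", fill_value=0)
)
print(counts)

# Flag any combination with pageviews but no conversions --
# a classic sign that the conversion tag isn't firing there.
pageviews = counts.get("pageview", pd.Series(0, index=counts.index))
conversions = counts.get("conversion", pd.Series(0, index=counts.index))
broken = counts[(pageviews > 0) & (conversions == 0)]
if not broken.empty:
    print("Possible tracking gaps:\n", broken)
```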
Traffic distribution: Verify the 50/50 split is actually 50/50. Check daily for the first 3 days. If the split is 60/40 or worse, something's wrong with the randomisation logic. Fix it before continuing. Also verify traffic volume matches expectations. If normal traffic is 2,000 visitors/month but you're only seeing 500 a month into the test, investigate (campaign paused? budget exhausted? seasonality?).
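A quick way to check whether an observed split is consistent with 50/50 is a chi-square goodness-of-fit test (often called a sample ratio mismatch check). A minimal sketch using scipy; the visitor counts below are placeholders for your own numbers:

```python
# Sample ratio mismatch (SRM) check: is the observed split consistent with 50/50?
from scipy.stats import chisquare

control_visitors = 1_040  # placeholder counts -- substitute your own
variant_visitors = 960
total = control_visitors + variant_visitors

stat, p_value = chisquare(
    f_obs=[control_visitors, variant_visitors],
    f_exp=[total / 2, total / 2],  # expected under a true 50/50 split
)

print(f"Observed split: {control_visitors / total:.1%} / {variant_visitors / total:.1%}")
if p_value < 0.01:
    print(f"Likely sample ratio mismatch (p = {p_value:.4f}) - check the randomisation logic.")
else:
    print(f"Split looks consistent with 50/50 (p = {p_value:.4f}).")
```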
Page load time: The variant sometimes loads slower than the control (especially if you added heavy elements like video). Monitor load time. If the variant averages 5 seconds and the control averages 2 seconds, a higher bounce rate might be down to load time, not messaging. Load time differences that large invalidate the test.
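If your tool exports per-visit load times, a few lines of pandas will tell you whether the arms differ enough to worry about. A sketch assuming a hypothetical load_times.csv with variant and load_time_ms columns; the 500 ms threshold is an arbitrary example, not a rule:

```python
# Compare page load times between arms from a hypothetical export.
import pandas as pd

timings = pd.read_csv("load_times.csv")  # columns: variant, load_time_ms
medians = timings.groupby("variant")["load_time_ms"].median()
print(medians)

# Flag a meaningful gap between the two arms.
gap_ms = medians["variant"] - medians["control"]
if abs(gap_ms) > 500:  # arbitrary threshold; tune to your own tolerance
    print(f"Variant median differs by {gap_ms:+.0f} ms - load time, not messaging, may explain the results.")
```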
External factors: Monitor for events that affect results. A competitor launches a major campaign, your pricing changes, a pandemic hits, a compliance regulation changes. Document these. They don't necessarily invalidate the test, but they provide context. If conversion drops 30% during the test and you know a competitor launched an aggressive promotion, you can account for that.
Browser and device splits: Verify variant and control get similar browser/device distributions. If control gets 60% mobile and variant gets 40% mobile (due to poor randomisation), results are biased. Randomisation should produce nearly identical distributions.
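A chi-square test of independence on the device (or browser) counts per arm gives a quick read on whether the mixes are comparable. A sketch with placeholder counts:

```python
# Check that control and variant see similar device mixes.
from scipy.stats import chi2_contingency

# Placeholder visitor counts per device: mobile, desktop, tablet.
control_counts = [600, 350, 50]
variant_counts = [590, 360, 50]

stat, p_value, dof, expected = chi2_contingency([control_counts, variant_counts])
if p_value < 0.01:
    print(f"Device mix differs between arms (p = {p_value:.4f}) - randomisation may be biased.")
else:
    print(f"Device mix looks comparable (p = {p_value:.4f}).")
```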
Don't just look at blended results. Segment to understand which audiences responded and which didn't.
Segment by traffic source: LinkedIn ads versus Google search versus remarketing. The variant headline might improve conversion for Google search (+15%) but hurt conversion for LinkedIn ads (-5%). Blended result shows +5% (mild improvement). But the learning is: this headline works for high-intent search traffic, doesn't work for cold outbound. Use it for search, test something different for LinkedIn.
Segment by audience type: Compliance-driven versus proactive versus breach-reactive. The variant might improve compliance-driven conversion (+20%) and have no effect on proactive (0%). Learning: the change addresses compliance-specific belief gaps, not universal ones. Roll it out to compliance traffic, keep testing for other segments.
Segment by device: Desktop versus mobile. Variant might improve desktop conversion (+12%) but hurt mobile conversion (-8%). Learning: variant doesn't work on mobile (maybe headline is too long, maybe proof element doesn't fit on small screen). Implement for desktop only, or fix mobile issues before rolling out.
Segment by time: Week 1 versus week 2 versus week 3 versus week 4. If the variant wins strongly in week 1 (+15%) but the effect diminishes over time (week 4 only +3%), you've got a novelty effect or ad fatigue setting in. Learning: this creative has a short lifespan, so plan to refresh after 4 weeks.
Create segment reports showing: overall blended result, then results for each key segment. This reveals patterns invisible in blended data.
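One way to produce that report, assuming you can export visitor-level data (here a hypothetical visitors.csv with variant, source, device and a 0/1 converted flag), is a simple pandas pivot per segment:

```python
# Segment report: conversion rate per arm for each segment, plus relative lift.
# Assumes a hypothetical visitor-level export with columns:
# variant ('control'/'variant'), source, device, converted (0 or 1).
import pandas as pd

visitors = pd.read_csv("visitors.csv")

def segment_report(df, segment_col):
    """Conversion rate per arm for each value of segment_col, plus relative lift."""
    rates = df.pivot_table(index=segment_col, columns="variant",
                           values="converted", aggfunc="mean")
    rates["lift"] = (rates["variant"] - rates["control"]) / rates["control"]
    return rates

# Overall blended result first, then each key segment.
print("Overall:")
print(segment_report(visitors.assign(all="all"), "all"))

for col in ["source", "device"]:
    print(f"\nBy {col}:")
    print(segment_report(visitors, col))
```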
Before every experiment, confirm the following:
- Hypothesis is written down with specific change, predicted outcome, and reasoning
- Sample size calculated and realistic given current traffic (see the calculation sketch after this checklist)
- Test duration estimated and blocked in your calendar
- Variations built and QA'd across browsers and devices
- Tracking confirmed working for both control and variation
- No conflicting tests running on the same page or flow
- Stakeholders informed so nobody panics when they see the change
This checklist isn't complicated, but skipping any item can invalidate your results. Build the habit of running through it every time.
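For the sample size item, a rough power calculation is enough to tell you whether the test is realistic for your traffic. A sketch using statsmodels; the baseline rate, target lift, power and alpha are placeholder assumptions to adapt:

```python
# Rough sample size check: visitors per arm needed to detect a given lift.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.03   # placeholder: current conversion rate
expected_rate = 0.039  # placeholder: smallest lift worth detecting (+30% relative)

effect = proportion_effectsize(expected_rate, baseline_rate)
n_per_arm = NormalIndPower().solve_power(effect_size=effect, power=0.8, alpha=0.05)

print(f"~{n_per_arm:.0f} visitors per arm needed")
# If your traffic can't realistically supply twice that figure within a few
# weeks, the test as designed will never reach a trustworthy result.
```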