Setting up experiments

A winning test means nothing if the setup was flawed. Learn how to configure experiments properly in VWO, ad platforms, and email tools so your results are actually valid.

Introduction

Running experiments properly requires discipline. Most companies peek at results daily, stop tests early when winning, change targeting mid-test when losing, or declare winners based on insufficient data.

This chapter shows you how to maintain test discipline (don't peek, don't stop early), monitor for implementation issues (broken tracking, traffic shifts), segment results to understand which audiences responded, and extract qualitative insights beyond conversion rates.

Test discipline

Don't peek at results. Checking results daily creates bias. You see the variant is up 12% after three days, get excited, and want to declare a winner. But daily variation is normal: what looks like a 12% win after three days might be a 2% loss after 14 days. Set a calendar reminder for when the test reaches the required sample size, then check results once.
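
If you want a sanity check on what "required sample size" actually means for your traffic, a rough power calculation is enough. The sketch below uses statsmodels for a two-proportion test; the 4% baseline rate, 20% minimum detectable lift, and 1,000 visitors per week are illustrative assumptions, not figures from this chapter.

```python
# Rough sample-size and duration estimate for a two-proportion A/B test.
# Baseline rate, minimum lift, and weekly traffic are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.04       # current conversion rate (assumed)
minimum_lift = 0.20        # smallest relative improvement worth detecting
weekly_traffic = 1000      # total visitors entering the test each week (assumed)

target_rate = baseline_rate * (1 + minimum_lift)
effect_size = abs(proportion_effectsize(baseline_rate, target_rate))

visitors_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,            # 95% significance threshold
    power=0.8,             # 80% chance of detecting a real lift of that size
    ratio=1.0,             # 50/50 split
    alternative="two-sided",
)

print(f"Visitors needed per variant: {visitors_per_variant:,.0f}")
print(f"Estimated duration: {2 * visitors_per_variant / weekly_traffic:.1f} weeks")
```

With these assumptions the answer comes out at roughly 5,000 visitors per variant, which at 1,000 visitors a week is about a ten-week test. That number, not your patience, sets the calendar reminder.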

If you absolutely must check (for example, to catch implementation errors), check weekly at most and set a strict "no decisions until completion" rule. Looking is fine; acting on what you see is not.

Don't stop tests early based on results. "The variant is losing after 5 days, let's stop the test." Unless you set pre-defined stopping rules (for example, stop if the variant is more than 20% worse at 50% of the sample size), run the full test. Early results are noise. The only exception: technical problems (broken page, failed tracking) warrant stopping.

Don't change tests mid-run. "Variant isn't winning, let's tweak the headline slightly." Now you've invalidated the test. Results are meaningless because you don't know which version drove outcomes. If you think of an improvement mid-test, document it as the next test, don't change the current one.

Don't add post-hoc success metrics. "Conversion didn't improve but scroll depth did, so it's a success." No. You defined primary metric before testing. Stick to it. You can note secondary metrics for learning, but don't retroactively redefine success.

Don't ignore statistical significance. "Variant is up 8% but only 90% confidence, close enough." Either set your threshold at 90% before testing, or wait for 95%. Don't change standards based on what's convenient.
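
If your testing tool doesn't report confidence directly, you can compute it from the raw counts. A minimal sketch using a two-proportion z-test follows; the visitor and conversion counts are made up, and "confidence" here is the common shorthand of one minus the two-sided p-value that many tools report.

```python
# Significance check for an A/B test from raw counts (illustrative numbers).
from statsmodels.stats.proportion import proportions_ztest

conversions = [400, 432]          # control, variant (assumed)
visitors = [10_000, 10_000]

z_stat, p_value = proportions_ztest(conversions, visitors)

control_rate = conversions[0] / visitors[0]
variant_rate = conversions[1] / visitors[1]

print(f"Observed lift: {variant_rate / control_rate - 1:+.1%}")
print(f"Confidence: {1 - p_value:.1%}")
print("Significant at 95%" if p_value < 0.05
      else "Not significant at 95% -- wait for more data or call it inconclusive")
```

With these made-up counts the observed lift is 8% but the confidence lands around 74%: exactly the situation the rule above says not to act on.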

The discipline: write down your test plan (hypothesis, success criteria, sample size, duration) before starting. When you want to stop early or change something, re-read your test plan. Follow it.
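
One way to make the plan binding is to capture it as a small structured record that nobody edits once the test starts. The sketch below is just one possible shape; the field names and example values are my own, not part of VWO or any other tool.

```python
# Illustrative test-plan record. Field names and values are hypothetical.
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: the plan should not change mid-test
class TestPlan:
    hypothesis: str
    primary_metric: str
    significance_threshold: float
    sample_size_per_variant: int
    start_date: date
    planned_end_date: date
    stopping_rule: str


plan = TestPlan(
    hypothesis="A compliance-focused headline lifts demo requests from search traffic",
    primary_metric="demo_request_conversion_rate",
    significance_threshold=0.95,
    sample_size_per_variant=5000,
    start_date=date(2025, 3, 3),
    planned_end_date=date(2025, 5, 12),
    stopping_rule="Stop early only for technical failures, or if the variant is >20% worse at 50% of sample size",
)
```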

Monitoring for implementation issues

Technical problems invalidate tests. Monitor for issues daily (this is different from checking results: you're checking that the test is running correctly, not how it's performing).

Tracking validation: Verify tracking codes fire correctly on both control and variant. Use browser developer tools or tag manager preview. Common issues: tracking code only fires on control, variant page loads but doesn't track conversions, mobile tracking breaks whilst desktop works. Check all devices and browsers.
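
Dev tools and tag manager previews catch most of this, but it is also worth sanity-checking the raw event data once traffic starts flowing. A rough sketch, assuming your analytics tool can export one row per event with variant and event columns (the file name and column names are hypothetical):

```python
# Check that both arms record page views AND conversions.
# "experiment_events.csv", "variant", and "event" are assumed names for your export.
import pandas as pd

events = pd.read_csv("experiment_events.csv")

counts = pd.crosstab(events["variant"], events["event"])
counts = counts.reindex(index=["control", "variation"],
                        columns=["page_view", "conversion"], fill_value=0)
print(counts)

for variant in ("control", "variation"):
    for event in ("page_view", "conversion"):
        if counts.loc[variant, event] == 0:
            print(f"WARNING: no '{event}' events for '{variant}' -- tracking may be broken on that arm")
```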

Traffic distribution: Verify the 50/50 split is actually 50/50. Check daily for the first three days. If the split is 60/40 or worse, something's wrong with the randomisation logic; fix it before continuing. Also verify traffic volume matches expectations. If normal traffic is 2,000 visitors/month but you're seeing 500 after a week, investigate (campaign paused? budget exhausted? seasonality?).
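
A 60/40 split on real traffic volumes is rarely bad luck. A quick way to confirm is a chi-square goodness-of-fit test against the intended 50/50 allocation, often called a sample ratio mismatch (SRM) check; the counts below are illustrative.

```python
# Sample ratio mismatch check: is the observed split consistent with 50/50?
from scipy.stats import chisquare

observed = [1180, 1020]               # visitors in control and variant (illustrative)
expected = [sum(observed) / 2] * 2    # what a true 50/50 split would give

stat, p_value = chisquare(observed, f_exp=expected)

if p_value < 0.01:
    print(f"Likely sample ratio mismatch (p = {p_value:.4f}) -- fix randomisation before trusting results")
else:
    print(f"Split looks consistent with 50/50 (p = {p_value:.4f})")
```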

Page load time: The variant sometimes loads slower than the control (especially if you added heavy elements like video). Monitor load time. If the variant averages 5 seconds and the control averages 2 seconds, a higher bounce rate might be down to load time, not messaging. Load time differences like that invalidate the test.

External factors: Monitor for events that affect results. Competitor launches major campaign, your pricing changes, pandemic hits, compliance regulation changes. Document these. They don't necessarily invalidate the test, but they provide context. If conversion drops 30% during test and you know competitor launched aggressive promotion, you can account for that.

Browser and device splits: Verify variant and control get similar browser/device distributions. If control gets 60% mobile and variant gets 40% mobile (due to poor randomisation), results are biased. Randomisation should produce nearly identical distributions.
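
The same approach works here: compare the device (or browser) mix across arms with a chi-square test of independence. The counts below are illustrative.

```python
# Does the device mix differ between control and variant?
from scipy.stats import chi2_contingency

#                 mobile  desktop  tablet   (illustrative counts)
control_counts = [640,    480,     80]
variant_counts = [510,    610,     80]

stat, p_value, dof, expected = chi2_contingency([control_counts, variant_counts])

if p_value < 0.01:
    print(f"Device mix differs between arms (p = {p_value:.4f}) -- randomisation is likely skewed")
else:
    print(f"Device mix looks similar across arms (p = {p_value:.4f})")
```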

Segmenting results

Don't just look at blended results. Segment to understand which audiences responded and which didn't.

Segment by traffic source: LinkedIn ads versus Google search versus remarketing. The variant headline might improve conversion for Google search (+15%) but hurt conversion for LinkedIn ads (-5%). With an even traffic mix, the blended result shows roughly +5% (a mild improvement). But the learning is: this headline works for high-intent search traffic and doesn't work for cold outbound. Use it for search, test something different for LinkedIn.

Segment by audience type: Compliance-driven versus proactive versus breach-reactive. The variant might improve compliance-driven conversion (+20%) and have no effect on proactive (0%). Learning: the change addresses compliance-specific belief gaps, not universal ones. Roll it out to compliance traffic, keep testing for other segments.

Segment by device: Desktop versus mobile. Variant might improve desktop conversion (+12%) but hurt mobile conversion (-8%). Learning: variant doesn't work on mobile (maybe headline is too long, maybe proof element doesn't fit on small screen). Implement for desktop only, or fix mobile issues before rolling out.

Segment by time: Week 1 versus week 2 versus week 3 versus week 4. If variant wins strongly in week 1 (+15%) but effect diminishes over time (week 4 only +3%), you've got novelty effect or ad fatigue setting in. Learning: this creative has short lifespan, plan to refresh after 4 weeks.

Create segment reports that show the overall blended result first, then results for each key segment. This reveals patterns invisible in blended data.
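
A short script over the raw visit data is usually enough for this. The sketch below assumes an export with variant, source, device, and converted (0/1) columns; those file and column names are assumptions about your setup, not a standard.

```python
# Segment report: blended result first, then the same breakdown per segment.
import pandas as pd

visits = pd.read_csv("experiment_visits.csv")   # hypothetical export


def results(df, by=None):
    """Conversion rate and visitor count per variant, optionally split by a segment column."""
    group_cols = ([by] if by else []) + ["variant"]
    table = df.groupby(group_cols)["converted"].agg(["mean", "count"])
    return table.rename(columns={"mean": "conversion_rate", "count": "visitors"})


print("Blended result")
print(results(visits))

for segment in ("source", "device"):
    print(f"\nBy {segment}")
    print(results(visits, by=segment))
```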

Pre-launch checklist

Before every experiment, confirm the following:

  • Hypothesis is written down with specific change, predicted outcome, and reasoning
  • Sample size calculated and realistic given current traffic
  • Test duration estimated and blocked in your calendar
  • Variations built and QA'd across browsers and devices
  • Tracking confirmed working for both control and variation
  • No conflicting tests running on the same page or flow
  • Stakeholders informed so nobody panics when they see the change

This checklist isn't complicated, but skipping any item can invalidate your results. Build the habit of running through it every time.

Conclusion

Maintain test discipline: don't peek at results daily, don't stop early based on outcomes, don't change tests mid-run, don't add post-hoc success metrics, don't ignore significance thresholds. Write down test plan before starting and follow it.

Monitor for implementation issues: verify tracking works correctly, confirm 50/50 traffic split, check page load times, watch for external factors, ensure browser/device distributions match. Technical problems invalidate tests.

Segment results by traffic source, audience type, device, and time. Blended results hide patterns. A variant might work for some segments and hurt others. Segment analysis reveals these patterns and informs rollout decisions.

Extract qualitative insights beyond conversion rates: exit surveys (why people didn't convert), session recordings (how behaviour differed), sales call feedback (lead quality differences), follow-up surveys to converters (what convinced them). Qualitative data explains why tests worked.

Next chapter: document learnings and update operations based on experiments.

Related tools

VWO (from 393 per month)

VWO provides A/B testing, personalisation, and behaviour analytics to optimise website conversion rates through data-driven experimentation.

Hotjar (from 39 per month)

Hotjar captures user behaviour through heatmaps, session recordings, and feedback polls to understand how visitors use your website.

Microsoft Clarity (free)

Microsoft Clarity provides free session recordings, heatmaps, and user behaviour analytics without traffic limits or time restrictions.

Notion (from 12 per month)

Flexible workspace for docs, wikis, and lightweight databases, ideal when you need custom systems without heavy project management overhead.

Related wiki articles

A/B testing

Compare two versions of a page, email, or feature to determine which performs better using statistical methods that isolate the impact of specific changes.

Hypothesis testing

Structure experiments around clear predictions to focus efforts on learning rather than random changes and make results easier to interpret afterward.

Prioritisation

Systematically rank projects and opportunities using objective frameworks, ensuring scarce resources flow to highest-impact work.

Constraint

Identify and leverage limitations as forcing functions that drive creative problem-solving and strategic focus.

Further reading

Experimentation

A winning test means nothing if the setup was flawed. Learn how to configure experiments properly in VWO, ad platforms, and email tools so your results are actually valid.