Write hypotheses and design proper experiments

Don't just "try something". Write a hypothesis predicting what will happen and why. Design the experiment with proper controls so you actually learn whether your hypothesis was right.

Introduction

Most experiments fail not because the idea was wrong, but because the experiment was poorly designed. Companies change something, conversion improves, and they assume the change worked. But maybe it was seasonality, or a successful PR campaign, or a competitor raising prices. Without proper controls, you don't know what caused the change.

This chapter shows you how to write testable hypotheses (predicting outcome and mechanism), design experiments with proper controls (isolate variables), set success criteria before running tests (define "winning"), and choose appropriate test structures (A/B, multivariate, holdout groups).

Write hypotheses with prediction and mechanism

A proper hypothesis has three parts: current belief (what we think is true now), predicted outcome (what we expect to happen), and mechanism (why we think it'll happen).

Bad hypothesis: "Let's test adding testimonials to the landing page." No prediction, no mechanism. This isn't a hypothesis, it's just an action.

Good hypothesis: "Compliance-driven segment doubts that engaging training satisfies auditors. Adding testimonials from compliance officers at similar companies will reduce this doubt and improve lead conversion from 4% to 5%. The mechanism is social proof reducing risk perception."

Now you've stated: what you believe (compliance segment doubts auditor acceptance), what you expect (5% improvement in lead conversion), and why (social proof reduces risk perception). When you run the test, you can evaluate not just whether conversion improved, but whether the mechanism was correct.

If conversion improves but exit surveys show people still doubt auditor acceptance (mechanism was wrong), you've learned something different than if conversion improves and exit surveys show increased confidence in auditor acceptance (mechanism was right). Both results inform future experiments.

Example hypotheses for cybersecurity training:

1. "Proactive segment needs ROI proof to get budget approval. Adding an ROI calculator to the demo page will improve demo booking rate from 8% to 10% by giving them a tool to build the business case internally."

2. "Breach-reactive segment is in crisis mode and needs immediate deployment. Emphasising '30-minute setup' in ad headlines will improve CTR from 1.5% to 2% by addressing their urgency concern."

3. "SQLs aren't becoming opportunities because implementation seems complex. Offering a free pilot (3 users, 30 days) will improve SQL → opportunity conversion from 33% to 40% by reducing perceived risk."

Each hypothesis predicts an outcome, specifies the mechanism, and sets a measurable target. This structure forces clarity about what you're testing and why.
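This three-part structure can even be enforced mechanically. As a sketch (the `Hypothesis` class and its field names are illustrative, not from any particular tool), a template that rejects entries missing a prediction or mechanism:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """Three-part hypothesis: current belief, predicted outcome, mechanism."""
    current_belief: str      # what we think is true now
    predicted_outcome: str   # what we expect to happen, with a number
    mechanism: str           # why we think it'll happen

    def __post_init__(self):
        # Reject "let's test X" entries that skip the prediction or mechanism.
        for field_name in ("current_belief", "predicted_outcome", "mechanism"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"{field_name} must not be empty")

h = Hypothesis(
    current_belief="Compliance-driven segment doubts auditors accept engaging training",
    predicted_outcome="Compliance-officer testimonials lift lead conversion 4% -> 5%",
    mechanism="Social proof reduces risk perception",
)
```

Forcing every field to be filled in is the point: an empty mechanism means you haven't written a hypothesis yet, just an action.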

Design experiments with proper controls

Controls isolate variables so you know what caused the change. Without controls, you're just guessing.

A/B test structure: Split traffic 50/50 between control (current version) and variant (your change). Random assignment ensures no bias. Run simultaneously so time-based factors (seasonality, external events) affect both groups equally. Example: 50% of visitors see current landing page headline (control), 50% see new headline (variant). Measure conversion rate for each group.
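Random, stable assignment is easy to sketch: hash a persistent visitor ID so the same visitor always lands in the same group on every visit. The bucketing scheme below is illustrative, not a prescribed implementation:

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str) -> str:
    """Deterministically bucket a visitor into control or variant (50/50).

    Hashing visitor_id together with the experiment name keeps assignment
    stable across visits and independent across different experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return "variant" if int(digest, 16) % 2 else "control"

# The same visitor keeps the same bucket on every visit:
assert assign_variant("user-123", "headline-test") == assign_variant("user-123", "headline-test")
```

Because assignment depends only on the ID and experiment name, no per-visitor state needs to be stored, and both groups run simultaneously by construction.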

Common mistake: testing the control first (week 1) and the variant after (week 2). Sequential testing means you can't tell whether the results come from your change or from the weeks being different. Always run control and variant simultaneously.

Multivariate test structure: Test multiple elements simultaneously but track every combination. If testing headline (A or B) and CTA (A or B), you need four variants: headline A + CTA A, headline A + CTA B, headline B + CTA A, headline B + CTA B. This reveals interactions (maybe headline B only works with CTA B). But requires 4× the traffic and complexity. Only use for high-traffic pages.
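Enumerating every combination is exactly a Cartesian product, which is a one-liner to sketch:

```python
from itertools import product

headlines = ["headline A", "headline B"]
ctas = ["CTA A", "CTA B"]

# Every combination must be a tracked variant: 2 x 2 = 4 cells.
variants = list(product(headlines, ctas))
for headline, cta in variants:
    print(f"{headline} + {cta}")
```

Adding a third element with two options doubles the cell count again (2 x 2 x 2 = 8), which is why multivariate tests are reserved for high-traffic pages.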

Holdout group structure: For experiments that affect everyone (like changing your pricing model or launching automation), you can't A/B test. Use holdout groups: 10% of customers don't get the change (holdout), 90% do (test). Compare outcomes. Example: you implement automated email nurture for 90% of leads. Hold back 10% as control (manual nurture only). Measure conversion rates. If automated group converts better, the automation worked.

Holdout groups have ethical considerations. Don't withhold valuable improvements from customers just to maintain a control group. But for uncertain experiments, holdouts are acceptable temporarily.
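A minimal holdout comparison reduces to two conversion rates and their relative lift. The numbers below are made up for illustration:

```python
def lift_vs_holdout(test_conversions: int, test_total: int,
                    holdout_conversions: int, holdout_total: int) -> float:
    """Relative lift of the test group (e.g. automated nurture)
    over the holdout group (e.g. manual nurture)."""
    test_rate = test_conversions / test_total
    holdout_rate = holdout_conversions / holdout_total
    return (test_rate - holdout_rate) / holdout_rate

# 90% of leads on automated nurture, 10% held back on manual nurture:
print(lift_vs_holdout(540, 9000, 50, 1000))  # ~0.2, i.e. a 20% relative lift
```

The holdout sample is small by design (here 1,000 leads), so its conversion rate is noisy; check that the observed lift clears statistical significance before crediting the automation.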

Set success criteria before running experiments

Decide what "winning" means before you see results. This prevents bias ("well, conversion didn't improve but engagement did, so it's a win"). Pre-commit to success criteria.

Primary metric: The one metric that determines success. For landing page test, it's lead conversion rate. For ad creative test, it's cost per lead. For sales process test, it's opportunity conversion rate. Choose one primary metric, not five. Otherwise you're cherry-picking whichever metric looks good.

Secondary metrics: Metrics you'll monitor but that don't determine success. For landing page test, primary metric is lead conversion but secondary metrics are bounce rate, time on page, demo booking rate. These provide context. If lead conversion improves but bounce rate increases, you've attracted wrong-fit leads. If lead conversion improves and bounce rate stays flat, you've genuinely improved the page.

Guardrail metrics: Metrics that must not get worse. For pricing test, primary metric is revenue but guardrail is customer satisfaction. If revenue increases but satisfaction drops below threshold, the test "fails" even though primary metric improved. Guardrails prevent short-term wins that cause long-term damage.

Minimum detectable effect: The smallest improvement worth implementing. If current lead conversion is 4%, is 4.1% worth the effort of implementing the change (2.5% lift)? Probably not. Is 4.4% worth it (10% lift)? Probably yes. Set your threshold (typically 5-10% minimum) before testing. If results fall below the threshold, the test is "neutral" not a win, and you don't bother implementing.

Document success criteria in advance. Write it down: "Primary metric: lead conversion rate. Success threshold: 5% improvement (4% → 4.2%). Secondary metrics: bounce rate, time on page (monitoring only). Guardrail: demo show rate must stay above 70%. Minimum detectable effect: 5% (0.2 percentage points)."
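Written-down criteria can double as a checklist your analysis script applies mechanically. A sketch using the chapter's example numbers (the dictionary keys are my own naming):

```python
# Pre-committed success criteria, written down before the test starts.
criteria = {
    "primary_metric": "lead_conversion_rate",
    "baseline": 0.04,
    "success_threshold": 0.042,   # 5% relative improvement over baseline
    "guardrail_metric": "demo_show_rate",
    "guardrail_minimum": 0.70,
}

def evaluate(results: dict) -> str:
    """Apply the pre-committed criteria; a breached guardrail overrides a win."""
    if results[criteria["guardrail_metric"]] < criteria["guardrail_minimum"]:
        return "fail (guardrail breached)"
    if results[criteria["primary_metric"]] >= criteria["success_threshold"]:
        return "win"
    return "neutral"

print(evaluate({"lead_conversion_rate": 0.044, "demo_show_rate": 0.72}))  # win
```

Because the thresholds are fixed in the config before results exist, there is no room to reinterpret "winning" after the fact.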

Choose appropriate test duration and sample size

Don't stop tests too early or run them too long. Calculate required duration and sample size before starting.

Sample size calculation: Use a sample size calculator (many are free online). Inputs: baseline conversion rate (current performance), minimum detectable effect (smallest improvement you care about), statistical power (typically 80%), significance level (typically 95%). Output: required sample size per variant. Example: baseline 4% conversion, want to detect a 10% lift (4% → 4.4%): you need roughly 39,000 visitors per variant (about 78,000 total). If your page gets 2,000 visitors/month, the test will take over three years. Not feasible. Either test a higher-traffic page or test a larger effect size.
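Those calculators implement the standard two-proportion sample-size formula, which can be sketched directly (1.96 and 0.84 are the usual z-values for 95% two-sided significance and 80% power):

```python
from math import ceil, sqrt

def sample_size_per_variant(p1: float, p2: float,
                            z_alpha: float = 1.96,  # 95% significance, two-sided
                            z_beta: float = 0.84) -> int:  # 80% power
    """Visitors needed per variant to detect a shift from rate p1 to rate p2."""
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# 4% baseline, 10% relative lift (4% -> 4.4%):
print(sample_size_per_variant(0.04, 0.044))
```

Note how the required sample scales with the inverse square of the effect size: halving the detectable lift roughly quadruples the traffic you need, which is why small effects are so expensive to test.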

Test duration calculation: Minimum 2 weeks, to account for day-of-week variation (B2B traffic follows weekly patterns). Run through at least one full business cycle (if you're B2B with monthly sales cycles, run the test through a full month). Maximum 8 weeks: beyond that, external factors change too much to attribute results cleanly. If you can't reach the required sample size within 8 weeks, either accept a lower confidence level or don't run the test.

Early stopping rules: Generally, don't stop tests early. "We're up 15% after 3 days!" is often regression to the mean. But you can set pre-defined stopping rules: if variant is worse by >20% after reaching 50% of required sample size, stop for safety (you're harming conversion). If variant is better by >30% after reaching 75% of sample, you can stop early (result is clear). These rules must be set before starting, not decided during the test.
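Pre-defined rules are easiest to honour when they're written as code before the test starts. A sketch of the two rules above (thresholds taken from the text, function shape my own):

```python
def stopping_decision(relative_lift: float, sample_fraction: float) -> str:
    """Apply pre-defined early stopping rules, set before the test starts.

    relative_lift: variant vs control, e.g. -0.25 means variant is 25% worse.
    sample_fraction: share of the required sample size collected so far.
    """
    if sample_fraction >= 0.5 and relative_lift <= -0.20:
        return "stop for safety"
    if sample_fraction >= 0.75 and relative_lift >= 0.30:
        return "stop early, clear result"
    return "keep running"

print(stopping_decision(-0.25, 0.6))  # stop for safety
print(stopping_decision(0.15, 0.1))   # keep running: too early, too small
```

Notice that a 15% lift at 10% of the sample returns "keep running": the rules deliberately ignore early spikes, which is exactly the discipline the text asks for.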

Simultaneous tests: Can you run multiple tests at once? Yes, but be careful of interactions. Testing homepage headline and pricing page CTA simultaneously is fine (different pages, different visitors). Testing headline and CTA on the same page simultaneously requires multivariate approach (test all combinations). Testing two headline variants on the same page (A/B/C test) splits traffic three ways, requires 50% more traffic to reach significance.

Conclusion

Write hypotheses with three parts: current belief (what's true now), predicted outcome (what you expect), mechanism (why it'll happen). This structure forces clarity and lets you learn even when tests fail.

Design experiments with proper controls. A/B tests: 50/50 split, simultaneous run, random assignment. Multivariate tests: track all combinations, requires more traffic. Holdout groups: 10% unchanged, 90% receive change, useful when A/B testing isn't possible.

Set success criteria before testing: primary metric (determines win/loss), secondary metrics (provide context), guardrail metrics (must not worsen), minimum detectable effect (smallest improvement worth implementing). Document criteria in advance to prevent bias.

Calculate required test duration and sample size before starting. Minimum 2 weeks, minimum 1 business cycle, maximum 8 weeks. Use sample size calculator to determine if test is feasible given your traffic. Set early stopping rules in advance, not during test.

Next chapter: run experiments with discipline and analyse results properly.

Related tools

VWO (from 393 per month)

VWO provides A/B testing, personalisation, and behaviour analytics to optimise website conversion rates through data-driven experimentation.

Hotjar (from 39 per month)

Hotjar captures user behaviour through heatmaps, session recordings, and feedback polls to understand how visitors use your website.

Microsoft Clarity (from 0 per month)

Microsoft Clarity provides free session recordings, heatmaps, and user behaviour analytics without traffic limits or time restrictions.

Notion (from 12 per month)

Flexible workspace for docs, wikis, and lightweight databases ideal when you need custom systems without heavy project management overhead.

Related wiki articles

A/B testing

Compare two versions of a page, email, or feature to determine which performs better using statistical methods that isolate the impact of specific changes.

Hypothesis testing

Structure experiments around clear predictions to focus efforts on learning rather than random changes and make results easier to interpret afterward.

Control group

Maintain an unchanged version in experiments to isolate the impact of your changes and prove causation rather than correlation with external factors.

Statistical significance

Determine whether experiment results reflect real differences or random chance to avoid making expensive decisions based on noise instead of signal.

Lead capture rate

The percentage of engaged website visitors who submit their contact information and become leads.

Further reading

Experimentation

Don't just "try something". Write a hypothesis predicting what will happen and why. Design the experiment with proper controls so you actually learn whether your hypothesis was right.