Creating strong hypotheses

Most experiments fail before they start because the hypothesis is vague or untestable. Learn how to write hypotheses that are specific enough to prove or disprove and tied to metrics that matter.

Introduction

Most experiments fail not because the idea was wrong, but because the experiment was poorly designed. A company changes something, conversion improves, and everyone assumes the change worked. But maybe it was seasonality, a successful PR campaign, or a competitor raising prices. Without proper controls, you don't know what caused the change.

This chapter shows you how to write testable hypotheses (predicting outcome and mechanism), design experiments with proper controls (isolate variables), set success criteria before running tests (define "winning"), and choose appropriate test structures (A/B, multivariate, holdout groups).

What makes a good hypothesis

A testable hypothesis has three components: a specific change, a predicted outcome, and a reasoning for why you expect that outcome.

The format I use is: "If we [specific change], then [metric] will [direction of change] because [reasoning]."

For example: "If we move the pricing table above the fold on the landing page, then demo requests will increase because session recordings show visitors scrolling past the CTA without seeing our pricing."

Each part matters. The specific change tells you what to build. The predicted outcome tells you what to measure. The reasoning tells you what you'll learn regardless of whether the test wins or loses.

A hypothesis without reasoning is just a guess. If the test wins, you don't know why. If it loses, you don't know what was wrong with your thinking. The reasoning is what turns an experiment into learning.
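One way to keep all three components present is to capture each hypothesis as a structured record before building anything. A minimal sketch in Python, purely as a template (the field names are illustrative, not tied to any particular tool):

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """One testable hypothesis: a specific change, a predicted outcome, and reasoning."""
    change: str     # the specific change you will make
    metric: str     # the metric you expect to move
    direction: str  # "increase" or "decrease"
    reasoning: str  # why you expect the metric to move

    def statement(self) -> str:
        return (f"If we {self.change}, then {self.metric} will "
                f"{self.direction} because {self.reasoning}.")

# The example from the text
h = Hypothesis(
    change="move the pricing table above the fold on the landing page",
    metric="demo requests",
    direction="increase",
    reasoning=("session recordings show visitors scrolling past the CTA "
               "without seeing our pricing"),
)
print(h.statement())
```

If any field is hard to fill in, that is usually a sign the hypothesis is still a guess rather than something you can learn from.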

Being specific enough

Vague hypotheses produce vague learnings. "If we improve the landing page, conversions will increase" tells you nothing. Improve how? Increase by how much? Why would that change behaviour?

The dog training analogy is useful here. When training dogs to detect drugs at airports, handlers once made the mistake of using rubber gloves to handle the training materials. The dogs learned to detect the smell of rubber gloves, not drugs. A small, unconscious detail threw off the entire training because the handlers weren't specific enough about what they were actually training for.

The same thing happens in A/B testing. You test a new headline and a new button colour and a new image all at once. The test wins. What did you learn? You have no idea which change mattered, or whether they all mattered, or whether they cancelled each other out and some fourth factor drove the result.

Be specific about what you're changing and why. One change per test where possible. If you must test multiple changes together, at least document what you're bundling and acknowledge that you won't know which element drove the result.

Connecting hypotheses to metrics

Every hypothesis should connect to a metric you actually care about. This sounds obvious, but it's easy to optimise for intermediate metrics that don't translate to revenue.

You might hypothesise that a new email subject line will increase open rates. That's fine as far as it goes. But if open rates go up and click rates stay flat, did you actually improve anything? The metric that matters is further down the funnel.

When writing hypotheses, think through the chain. If this test wins, what happens next? Does that lead to revenue? If the connection is indirect or uncertain, you might be optimising for the wrong thing.

This doesn't mean you can only test bottom-of-funnel metrics. But it means you should be explicit about the assumptions linking your test metric to revenue. "If we increase email open rates, more people will see our offer, which should increase demo requests" makes the chain visible. You can then check whether the chain actually holds.

Common mistakes

Don't stop tests too early or run them too long. Calculate required duration and sample size before starting.

Sample size calculation: use a sample size calculator (plenty are free online). The inputs are your baseline conversion rate (current performance), the minimum detectable effect (the smallest improvement you care about), statistical power (typically 80%), and significance level (typically 95%). The output is the required sample size per variant. For example, with a 4% baseline conversion rate and a 10% relative lift to detect (4% → 4.4%), you need roughly 40,000 visitors per variant (around 80,000 total). If your page gets 2,000 visitors a month, the test would take over three years. Not feasible. Either test a higher-traffic page or test a larger effect size.
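If you'd rather check a calculator's output than take it on trust, the standard two-proportion formula is easy to compute yourself. A minimal sketch in Python (assumes a two-sided z-test with the typical 80% power and 95% significance; calculators differ slightly in their assumptions, so outputs vary a little):

```python
from statistics import NormalDist

def sample_size_per_variant(baseline: float, relative_lift: float,
                            power: float = 0.80, alpha: float = 0.05) -> int:
    """Visitors needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)            # e.g. 4% -> 4.4%
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% significance
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return round(n)

print(sample_size_per_variant(baseline=0.04, relative_lift=0.10))
# roughly 39,000-40,000 visitors per variant under these assumptions
```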

Test duration calculation: run for a minimum of 2 weeks to account for day-of-week variation (B2B traffic follows weekly patterns), and through at least 1 full business cycle (if you're B2B with monthly sales cycles, run the test through a full month). Cap tests at 8 weeks; after that, external factors change too much to attribute results cleanly. If you can't reach the required sample size within 8 weeks, either accept a lower confidence level or don't run the test.
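To sanity-check feasibility before launching, you can convert the required sample size into weeks of traffic and apply the two-week minimum and eight-week cap above. A small sketch, with an illustrative weekly traffic figure:

```python
def estimate_duration_weeks(total_sample_size: int, weekly_visitors: int) -> float:
    """Weeks needed for the test to reach its total required sample size."""
    return total_sample_size / weekly_visitors

# Illustrative numbers: ~80,000 visitors across both variants, 12,000 visitors per week
weeks = estimate_duration_weeks(total_sample_size=80_000, weekly_visitors=12_000)

if weeks > 8:
    print(f"~{weeks:.1f} weeks needed: too long. Test a higher-traffic page, "
          "a larger effect, or accept lower confidence.")
elif weeks < 2:
    print("Sample size reached quickly, but still run for at least 2 weeks.")
else:
    print(f"Plan for ~{weeks:.1f} weeks, rounded up to whole business cycles.")
```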

Early stopping rules: generally, don't stop tests early. "We're up 15% after 3 days!" is often just regression to the mean. But you can set pre-defined stopping rules: if the variant is worse by more than 20% after reaching 50% of the required sample size, stop for safety (you're harming conversion). If the variant is better by more than 30% after reaching 75% of the sample, you can stop early (the result is clear). These rules must be set before the test starts, not decided during it.
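Those pre-agreed thresholds translate directly into a check you can run at each interim look. A minimal sketch using the numbers from the text (the function and argument names are illustrative):

```python
def stopping_decision(progress: float, relative_lift: float) -> str:
    """Apply pre-agreed stopping rules at an interim look.

    progress: fraction of the required sample size reached (0.0 to 1.0)
    relative_lift: variant versus control, e.g. -0.25 means 25% worse
    """
    if progress >= 0.50 and relative_lift <= -0.20:
        return "stop for safety: the variant is clearly harming conversion"
    if progress >= 0.75 and relative_lift >= 0.30:
        return "stop early: the result is clear"
    return "keep running until the required sample size is reached"

print(stopping_decision(progress=0.50, relative_lift=-0.25))  # stop for safety
print(stopping_decision(progress=0.10, relative_lift=0.15))   # keep running
```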

Simultaneous tests: can you run multiple tests at once? Yes, but watch for interactions. Testing the homepage headline and the pricing page CTA at the same time is fine (different pages, different visitors). Testing a headline and a CTA on the same page at the same time requires a multivariate approach (test all combinations). Testing two headline variants against the control on the same page (an A/B/C test) splits traffic three ways and requires 50% more traffic to reach significance.
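The traffic cost of that three-way split is just arithmetic, as the quick calculation below shows (the visitor numbers are illustrative):

```python
required_per_variant = 40_000   # from the sample size calculation above
weekly_visitors = 12_000        # illustrative page traffic

for variants in (2, 3):         # A/B versus A/B/C
    weekly_per_variant = weekly_visitors / variants
    weeks = required_per_variant / weekly_per_variant
    print(f"{variants} variants: ~{weeks:.0f} weeks to reach the required sample size")

# Each arm of an A/B/C test gets a third of the traffic instead of half,
# so reaching the same per-variant sample size takes 50% longer.
```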

Conclusion

A strong hypothesis is specific, measurable, and grounded in reasoning about why the change will affect behaviour. Writing it down before you test is non-negotiable.

The goal isn't to be right. It's to learn. A well-written hypothesis teaches you something whether the test wins or loses. A vague hypothesis teaches you nothing either way.

Related tools

VWO (from 393 per month)

VWO provides A/B testing, personalisation, and behaviour analytics to optimise website conversion rates through data-driven experimentation.

Hotjar (from 39 per month)

Hotjar captures user behaviour through heatmaps, session recordings, and feedback polls to understand how visitors use your website.

Microsoft Clarity (free)

Microsoft Clarity provides free session recordings, heatmaps, and user behaviour analytics without traffic limits or time restrictions.

Notion (from 12 per month)

Flexible workspace for docs, wikis, and lightweight databases ideal when you need custom systems without heavy project management overhead.

Related wiki articles

A/B testing

Compare two versions of a page, email, or feature to determine which performs better using statistical methods that isolate the impact of specific changes.

Hypothesis testing

Structure experiments around clear predictions to focus efforts on learning rather than random changes and make results easier to interpret afterward.

Control group

Maintain an unchanged version in experiments to isolate the impact of your changes and prove causation rather than correlation with external factors.

Statistical significance

Determine whether experiment results reflect real differences or random chance to avoid making expensive decisions based on noise instead of signal.

Lead capture rate

The percentage of engaged website visitors who submit their contact information and become leads.

Further reading

Experimentation

Most experiments fail before they start because the hypothesis is vague or untestable. Learn how to write hypotheses that are specific enough to prove or disprove and tied to metrics that matter.