Vague hypotheses produce vague learnings. "If we improve the landing page, conversions will increase" tells you nothing. Improve how? Increase by how much? Why would that change behaviour?
The dog training analogy is useful here. When training dogs to detect drugs at airports, handlers once made the mistake of using rubber gloves to handle the training materials. The dogs learned to detect the smell of rubber gloves, not drugs. A small, unconscious detail threw off the entire training because the handlers weren't specific enough about what they were actually training for.
The same thing happens in A/B testing. You test a new headline and a new button colour and a new image all at once. The test wins. What did you learn? You have no idea which change mattered, or whether they all mattered, or whether they cancelled each other out and some fourth factor drove the result.
Be specific about what you're changing and why. One change per test where possible. If you must test multiple changes together, at least document what you're bundling and acknowledge that you won't know which element drove the result.
Every hypothesis should connect to a metric you actually care about. This sounds obvious, but it's easy to optimise for intermediate metrics that don't translate to revenue.
You might hypothesise that a new email subject line will increase open rates. That's fine as far as it goes. But if open rates go up and click rates stay flat, did you actually improve anything? The metric that matters is further down the funnel.
When writing hypotheses, think through the chain. If this test wins, what happens next? Does that lead to revenue? If the connection is indirect or uncertain, you might be optimising for the wrong thing.
This doesn't mean you can only test bottom-of-funnel metrics. But it means you should be explicit about the assumptions linking your test metric to revenue. "If we increase email open rates, more people will see our offer, which should increase demo requests" makes the chain visible. You can then check whether the chain actually holds.
Don't stop tests too early or run them too long. Calculate required duration and sample size before starting.
Sample size calculation: Use a sample size calculator (many are free online). Inputs: baseline conversion rate (current performance), minimum detectable effect (the smallest improvement you care about), statistical power (typically 80%), and significance level (typically 5%, i.e. 95% confidence). Output: required sample size per variant. Example: with a 4% baseline and a 10% relative lift to detect (4% → 4.4%), the standard two-proportion calculation gives roughly 39,000 visitors per variant (78,000 total). If your page gets 2,000 visitors/month, the test will take over three years. Not feasible. Either test a higher-traffic page or test a larger effect size.
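You can check this arithmetic yourself. Below is a minimal sketch of the standard two-proportion sample-size approximation using only the Python standard library; the function name and defaults are illustrative, not taken from any particular calculator.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_lift, power=0.80, alpha=0.05):
    """Approximate n per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Worked example: 4% baseline, detect a 10% relative lift (4% -> 4.4%)
n = sample_size_per_variant(0.04, 0.10)
```

With these defaults the answer lands near 39,000 per variant, which is why low-traffic pages can only realistically test large effects.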
Test duration calculation: Run for a minimum of 2 weeks to account for day-of-week variation (B2B traffic follows weekly patterns), and through at least one full business cycle (if you're B2B with monthly sales cycles, run the test through a full month). Cap tests at 8 weeks; beyond that, external factors change too much to attribute results cleanly. If you can't reach the required sample size within 8 weeks, either accept a lower confidence level or don't run the test.
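These duration rules reduce to simple arithmetic. A hypothetical helper, assuming the 2-week floor and 8-week cap described above:

```python
import math

def duration_weeks(n_per_variant, variants, weekly_visitors):
    """Weeks to reach the total sample; floor of 2 weeks, None if over the 8-week cap."""
    weeks = math.ceil(n_per_variant * variants / weekly_visitors)
    if weeks > 8:
        return None           # can't reach significance inside the 8-week cap
    return max(weeks, 2)      # always run at least two full weeks

weeks = duration_weeks(5000, 2, 2500)  # 10,000 total visitors at 2,500/week
```

A `None` result is the signal to rethink the test rather than to quietly lower the bar mid-flight.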
Early stopping rules: Generally, don't stop tests early. "We're up 15% after 3 days!" is often regression to the mean. But you can set pre-defined stopping rules: if variant is worse by >20% after reaching 50% of required sample size, stop for safety (you're harming conversion). If variant is better by >30% after reaching 75% of sample, you can stop early (result is clear). These rules must be set before starting, not decided during the test.
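Pre-registering the rules means they can be written down as plain code before the test starts. A sketch using the thresholds above; the function name and return strings are hypothetical:

```python
def stopping_decision(n_so_far, n_required, control_rate, variant_rate):
    """Apply pre-registered stopping rules; thresholds are fixed before the test starts."""
    progress = n_so_far / n_required
    relative_change = (variant_rate - control_rate) / control_rate
    if progress >= 0.50 and relative_change <= -0.20:
        return "stop for safety"      # variant is harming conversion
    if progress >= 0.75 and relative_change >= 0.30:
        return "stop early: winner"   # result is clear
    return "continue"
```

Because the thresholds live in code written before launch, "we're up 15% after 3 days" can't tempt anyone into an ad hoc early call.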
Simultaneous tests: Can you run multiple tests at once? Yes, but watch for interactions. Testing a homepage headline and a pricing page CTA simultaneously is fine (different pages, different visitors). Testing a headline and a CTA on the same page simultaneously requires a multivariate approach (test all combinations). Testing two headline variants against control on the same page (an A/B/C test) splits traffic three ways and requires 50% more total traffic to reach significance.
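The traffic cost of each design is easy to quantify. A sketch with a hypothetical per-variant requirement of 10,000 visitors:

```python
from itertools import product

per_variant = 10_000  # hypothetical required sample per variant

ab_total = per_variant * 2    # simple A/B: two variants
abc_total = per_variant * 3   # A/B/C: three-way split, 50% more total traffic

# Multivariate headline x CTA: every combination is its own variant
cells = list(product(["headline A", "headline B"], ["CTA A", "CTA B"]))
mvt_total = per_variant * len(cells)  # 4 cells
```

Each extra variant or combination adds a full per-variant sample to the bill, which is why multivariate tests are usually reserved for high-traffic pages.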