P-value

Interpret experiment results by asking how likely a difference at least as large as the one you observed would be if your changes had no real effect.

Introduction

A p-value is a statistical measure used in A/B testing to judge whether the results you've observed could plausibly be explained by random chance alone. Formally, it is the probability of seeing a difference at least as large as the one you observed if your change actually had no effect. If you test whether changing your email subject line increases open rate, the p-value tells you how surprising the improvement would be if the new subject line made no difference. A p-value of 0.05 means that, if the change had no effect, a result this extreme would occur about 5% of the time; most researchers treat p-values below 0.05 as statistically significant, meaning random variation alone is an unlikely explanation for the result.

Understanding p-values helps you avoid false positives in your growth experiments. Without statistical rigour, many experiments appear to show positive results that actually represent normal variation. For example, if you run 20 experiments on changes that have no real effect and use a 0.05 threshold, you'd expect roughly 1 to look like a win by random chance alone - a 5% false positive rate. By requiring statistical significance (p-value below 0.05) and adequate sample size, you avoid investing in changes that don't actually improve performance.
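The arithmetic behind the "1 in 20" figure can be sketched in a few lines. This is a minimal illustration that assumes the experiments are independent and that none of the tested changes has a real effect; the function names are made up for this example.

```python
def expected_false_positives(n_experiments, alpha=0.05):
    """Average number of no-effect experiments that still reach p < alpha."""
    return n_experiments * alpha


def prob_at_least_one_false_positive(n_experiments, alpha=0.05):
    """Chance that at least one no-effect experiment reaches p < alpha.

    Assumes the experiments are independent of each other.
    """
    return 1 - (1 - alpha) ** n_experiments


print(expected_false_positives(20))
print(round(prob_at_least_one_false_positive(20), 2))
```

Under these assumptions, 20 no-effect experiments produce one false positive on average, and the chance of seeing at least one is roughly 64%.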

Key concepts related to p-values

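  • Null hypothesis: the assumption that your change has no effect; the p-value measures how strongly your data contradict this assumption
  • Sample size: larger samples give more precise estimates; small samples have low power, so they often miss real effects, and the significant results they do produce tend to overstate the size of the effect
  • Statistical power: the probability of detecting a real effect of the size you designed for; typically you want 80% power (20% probability of missing a real effect) in your experiment design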
  • Effect size: the magnitude of the improvement; even statistically significant improvements might be too small to be practically valuable

P-values are commonly misunderstood. A p-value of 0.05 does not mean there's a 95% probability your hypothesis is correct; it means that if your change truly had no effect and you repeated the experiment 100 times, you'd see results at least this extreme about 5 times by chance. Correct interpretation is essential to avoid wasting resources on experiments with statistical significance but no practical business impact.
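In notation, the distinction looks like this for a two-sided test. The symbols are assumptions introduced for this sketch: Z is the test statistic, z_obs is its observed value, and H_0 is the null hypothesis.

```latex
p = \Pr\bigl(\,|Z| \ge |z_{\mathrm{obs}}| \;\bigm|\; H_0 \bigr)
\qquad \text{which is not the same as} \qquad
\Pr\bigl(H_0 \mid \text{data}\bigr)
```

The left-hand quantity is what a significance test reports; the right-hand quantity is what people often assume it reports.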

Why it matters

For B2B growth teams, proper statistical analysis prevents wasting time and resources on changes that don't matter. A sales team might test a new email template and observe a 3% increase in reply rate; without statistical testing, they'd implement the change across all outreach. If the improvement isn't statistically significant (determined by p-value testing), they've changed processes for a result that could be random variation. Growth teams that require statistical significance before implementing changes maintain higher conversion quality and avoid false positives.

P-value understanding also improves experiment design. Before running an experiment, you should calculate how large a sample you need to reliably detect the effect size you care about. A conversion rate improvement from 2% to 2.1% can reach statistical significance if the sample is large enough (on the order of hundreds of thousands of visitors per variant), but it may still be practically irrelevant - your business might only care about improvements of 0.5 percentage points or more that justify the testing effort and implementation cost.

Investors and scaling companies increasingly scrutinise the statistical rigour of your growth claims. Companies that can articulate their experiment design, sample size, and statistical significance appear more credible than those making growth claims based on observed correlations. This is particularly important when explaining disappointing results - if an experiment shows no statistical significance, explaining the methodology helps stakeholders understand you've learned something valuable, not just failed.

How to apply it

Before running an experiment, define your success metric clearly and calculate your sample size requirement. Use a sample size calculator (most are freely available online) to determine how many visitors or observations you need to reliably detect the effect size your business cares about. For a sales email test, if you want to detect a 2 percentage point improvement in reply rate (from 15% to 17%), you'd need roughly 5,300 emails in each test group to achieve 80% statistical power with a 0.05 significance threshold in a two-sided test.
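If you'd rather see what a sample size calculator does under the hood, here is a minimal sketch using the common normal-approximation formula for comparing two proportions. It assumes a two-sided test and equal group sizes, and it reads the reply-rate example as a move from 15% to 17%; the function name is hypothetical, and online calculators may use slightly different approximations.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline, target, alpha=0.05, power=0.80):
    """Approximate observations needed in each group to detect
    a change from `baseline` to `target` (both as proportions)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = baseline * (1 - baseline) + target * (1 - target)
    n = (z_alpha + z_beta) ** 2 * variance / (target - baseline) ** 2
    return math.ceil(n)


# Reply rate example: 15% today, hoping to detect 17%
print(sample_size_per_group(0.15, 0.17))
```

Halving the difference you want to detect roughly quadruples the required sample, which is why it pays to decide on the smallest effect you care about before you start.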

Run your experiment for a complete cycle (one week, one sales cycle) rather than stopping early when you see initial positive results. Early stopping creates bias - you're more likely to stop when results favour your hypothesis. This selection bias inflates your false positive rate. If you must stop early due to time constraints, calculate your p-value using sequential analysis methods designed for this purpose, not standard p-value calculations.
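To see why early stopping is a problem, you can simulate experiments where both versions convert at the same rate and compare two habits: checking once at the end, or checking at regular intervals and declaring a winner the first time p drops below 0.05. This is a rough sketch with assumed parameters (10% conversion rate, 1,000 visitors per version, 10 checks) and a pooled two-proportion z-test; it is not one of the sequential analysis methods mentioned above.

```python
import math
import random


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_sims=1000, visitors=1000, peeks=10, rate=0.10,
             alpha=0.05, seed=7):
    """Return (false positive rate with one final check,
    false positive rate when stopping at the first significant peek)."""
    rng = random.Random(seed)
    step = visitors // peeks
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        significant_at_any_peek = False
        for seen in range(1, visitors + 1):
            # Both versions share the same true rate: no real effect.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if seen % step == 0 and p_value(conv_a, seen, conv_b, seen) < alpha:
                significant_at_any_peek = True
        fixed_hits += p_value(conv_a, visitors, conv_b, visitors) < alpha
        peeking_hits += significant_at_any_peek
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate()
print(f"single final check: {fixed_rate:.1%}")
print(f"stop at first significant peek: {peeking_rate:.1%}")
```

The single final check should land near the 5% you signed up for, while the peeking habit produces a noticeably higher false positive rate even though no version is actually better.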

After your experiment concludes, calculate your p-value and effect size using your data. If your p-value is below 0.05 and the effect size is meaningful to your business, implement the change. If your p-value is above 0.05, the change shows no statistically significant improvement - don't implement it on this evidence, but treat the result as inconclusive rather than as proof the change does nothing. If your p-value is below 0.05 but the effect size is smaller than what your business needs and implementation is expensive, assess whether the practical benefit justifies the effort.
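The decision rule above can be written down as a small function so the whole team applies it the same way. This is only a sketch: the function name, the `min_meaningful_lift` parameter, and the returned labels are invented for illustration, and the smallest lift worth acting on is something your business has to define.

```python
def decide(p_value, observed_lift, min_meaningful_lift, alpha=0.05):
    """Combine statistical and practical significance into one call."""
    if p_value >= alpha:
        return "do not implement: not statistically significant"
    if observed_lift < min_meaningful_lift:
        return "assess: significant but small, weigh benefit against cost"
    return "implement: significant and meaningful"


# Lifts expressed in percentage points (assumed unit for this example)
print(decide(p_value=0.03, observed_lift=2.0, min_meaningful_lift=1.0))
print(decide(p_value=0.03, observed_lift=0.3, min_meaningful_lift=1.0))
print(decide(p_value=0.20, observed_lift=2.0, min_meaningful_lift=1.0))
```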

SaaS landing page test shows false positive without statistical rigour

A SaaS company tested a new landing page headline and observed a 4% increase in signups after one week (50 signups vs 48 signups). The product team wanted to implement the change immediately. The growth team calculated the p-value and sample size. With only 1,200 visitors per version, they lacked statistical power to confirm the improvement was real. They continued the test for three more weeks and found the improvement had disappeared - the initial result was random variation, not a genuine effect. Without p-value analysis, they would have implemented a change that doesn't work.
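You can reproduce the first-week check with a pooled two-proportion z-test, which is one common way to get a p-value for conversion data. The choice of test is an assumption here, since the example doesn't say which method the team used.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal
    return math.erfc(abs(z) / math.sqrt(2))


# Week one: 48 signups (old headline) vs 50 (new), 1,200 visitors each
p = two_proportion_p_value(48, 1200, 50, 1200)
print(f"p-value: {p:.2f}")
```

The p-value comes out at around 0.84, far above 0.05, so a gap of two signups at this sample size is entirely consistent with random variation.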

Sales consultant finds a significant p-value without a meaningful effect

A sales consulting firm tested a new sales call structure with a large enough sample to reach a p-value of 0.03 (statistically significant). However, the significant result was for call length, which increased by 2 minutes on average, while close rates didn't improve. While statistically significant, the practical benefit (longer calls with no higher closes) didn't justify the training effort required. By looking beyond p-value to effect size, they avoided implementing a change that was statistically significant but not meaningful to their business.

Email marketer confirms real effect with proper sample size

An email marketing agency tested a subject line variation and observed a 2.5% increase in open rate. Rather than deploying immediately, the growth team checked the experiment design: with 50,000 emails sent to each version, the test had 80% statistical power to detect a 1.5% improvement. The p-value for the observed difference was 0.04. The improvement was both statistically significant (p<0.05) and practically meaningful (2.5% improvement in open rates). They rolled the new subject line approach out across all campaigns, and the improvement held across the next six campaigns.

