P-value

Interpret experiment results to understand the probability that observed differences occurred by chance rather than because your changes actually work.


Introduction

A p-value is a statistical measure used in A/B testing to assess how likely it is that results like yours would occur by random chance if your change actually had no effect. If you test whether changing your email subject line increases open rate, the p-value tells you how confident you can be that the improvement is real and not a coincidence. A p-value of 0.05 means that if your change had no effect, there's a 5% probability of seeing a difference this large by chance; most researchers consider p-values below 0.05 statistically significant, meaning you can be reasonably confident the effect is real.

Understanding p-values prevents false positives in your growth experiments. Without statistical rigour, many experiments appear to show positive results that actually represent normal variation. For example, if you run 20 small experiments, you'd expect roughly 1 to show positive results by random chance alone - a 5% false positive rate. By requiring statistical significance (p-value below 0.05) and adequate sample size, you avoid investing in changes that don't actually improve performance.
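The arithmetic behind that false positive count is simple enough to sanity-check yourself. A minimal sketch in Python (the figures assume every experiment tests a change with no real effect):

```python
# Expected false positives across independent experiments when
# no change has a real effect, at a 0.05 significance threshold.
alpha = 0.05        # significance threshold (p < 0.05)
experiments = 20    # number of independent experiments

expected_false_positives = experiments * alpha        # 1.0
prob_at_least_one = 1 - (1 - alpha) ** experiments    # ~64%

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"Chance of at least one: {prob_at_least_one:.0%}")
```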

Key concepts related to p-values

  • Null hypothesis: the assumption that your change has no effect; p-value measures evidence against this assumption
  • Sample size: larger samples provide more reliable p-values; small samples are prone to false positives
  • Statistical power: the probability your test detects a real effect when one exists; aim for 80% power (a 20% chance of missing a real effect) in your experiment design
  • Effect size: the magnitude of the improvement; even statistically significant improvements might be too small to be practically valuable

P-values are commonly misunderstood. A p-value of 0.05 does not mean there's a 95% probability your hypothesis is correct; it means that if your change truly had no effect and you repeated your experiment 100 times, you'd see results this extreme about 5 times by chance. Correct interpretation is essential to avoid wasting resources on experiments with statistical significance but no practical business impact.
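One way to internalise this is to simulate experiments where the change genuinely does nothing and watch how often they still look "significant". A sketch, assuming Python with numpy and statsmodels installed (the 15% rate and group sizes are arbitrary):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
true_rate = 0.15      # both groups share the same rate: the change does nothing
n_per_group = 2000
n_simulations = 5000

false_positives = 0
for _ in range(n_simulations):
    conversions_a = rng.binomial(n_per_group, true_rate)
    conversions_b = rng.binomial(n_per_group, true_rate)
    _, p_value = proportions_ztest([conversions_a, conversions_b],
                                   [n_per_group, n_per_group])
    if p_value < 0.05:
        false_positives += 1

# With no real effect, roughly 5% of runs still come out "significant".
print(f"False positive rate: {false_positives / n_simulations:.1%}")
```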

Why it matters

For B2B growth teams, proper statistical analysis prevents wasting time and resources on changes that don't matter. A sales team might test a new email template and observe a 3% increase in reply rate; without statistical testing, they'd implement the change across all outreach. If the improvement isn't statistically significant (determined by p-value testing), they've changed processes for a result that could be random variation. Growth teams that require statistical significance before implementing changes maintain higher conversion quality and avoid false positives.

P-value understanding also improves experiment design. Before running an experiment, you should calculate how large a sample you need to reliably detect the effect size you care about. A conversion rate improvement from 2% to 2.1% might reach statistical significance with hundreds of thousands of visitors per variant, but be practically irrelevant - your business might care more about improvements of 0.5%+ that justify the testing effort and implementation cost.

Investors and scaling companies increasingly scrutinise the statistical rigour of your growth claims. Companies that can articulate their experiment design, sample size, and statistical significance appear more credible than those making growth claims based on observed correlations. This is particularly important when explaining disappointing results - if an experiment shows no statistical significance, explaining the methodology helps stakeholders understand you've learned something valuable, not just failed.

How to apply it

Before running an experiment, define your success metric clearly and calculate your sample size requirement. Use a sample size calculator (most are freely available online) to determine how many visitors or observations you need to reliably detect the effect size your business cares about. For a sales email test, if you want to detect a 2 percentage point improvement in reply rate and you're currently at 15%, you'd need roughly 5,300 emails in each test group to achieve 80% statistical power with a 0.05 p-value threshold.
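You can check a calculator's output yourself. Here's a minimal sketch using statsmodels, assuming the improvement is absolute (15% to 17%) and a standard two-sided test:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Detect a reply-rate lift from 15% to 17% at a 0.05 threshold with 80% power.
effect_size = proportion_effectsize(0.17, 0.15)  # Cohen's h for two proportions

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # p-value threshold
    power=0.8,               # 80% chance of detecting a real effect
    alternative="two-sided",
)
print(f"Emails needed per group: {n_per_group:.0f}")  # ~5,270
```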

Run your experiment for a complete cycle (one week, one sales cycle) rather than stopping early when you see initial positive results. Early stopping creates bias - you're more likely to stop when results favour your hypothesis. This selection bias inflates your false positive rate. If you must stop early due to time constraints, calculate your p-value using sequential analysis methods designed for this purpose, not standard p-value calculations.

After your experiment concludes, calculate your p-value and effect size using your data. If your p-value is below 0.05 and the effect size is meaningful to your business, implement the change. If your p-value is above 0.05, you don't have evidence the change improves anything - don't implement it. If your p-value is below 0.05 but the effect size is small (like a 0.5% improvement) and implementation is expensive, assess whether the practical benefit justifies the effort.
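For a conversion-style metric, both numbers fall out of a two-proportion z-test. A sketch with hypothetical final results for the email test above (the reply counts are made up for illustration):

```python
from statsmodels.stats.proportion import proportions_ztest

replies = [901, 795]     # hypothetical replies: variant, control
emails = [5300, 5300]    # emails sent per group

z_stat, p_value = proportions_ztest(replies, emails)
effect = replies[0] / emails[0] - replies[1] / emails[1]

print(f"p-value: {p_value:.4f}")      # ~0.005, below the 0.05 threshold
print(f"Effect size: {effect:+.1%}")  # +2.0 percentage points in reply rate
```

If the p-value clears the threshold but the effect is tiny, that's where the business judgement described above takes over.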

SaaS landing page test shows false positive without statistical rigour

A SaaS company tested a new landing page headline and observed a 4% increase in signups after one week (50 signups vs 48 signups). The product team wanted to implement the change immediately. The growth team calculated the p-value and sample size. With only 1,200 visitors per version, they lacked statistical power to confirm the improvement was real. They continued the test for three more weeks and found the improvement had disappeared - the initial result was random variation, not a genuine effect. Without p-value analysis, they would have implemented a change that doesn't work.
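Running this example's numbers through the same kind of two-proportion z-test (the test choice is an assumption; the case doesn't name one) shows just how weak the initial evidence was:

```python
from statsmodels.stats.proportion import proportions_ztest

signups = [50, 48]        # new headline vs original
visitors = [1200, 1200]   # visitors per version

z_stat, p_value = proportions_ztest(signups, visitors)
print(f"p-value: {p_value:.2f}")  # ~0.84: nowhere near significance
```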

Sales consultancy looks beyond a significant p-value to effect size

A sales consulting firm tested a new sales call structure with a sample large enough that the result reached a p-value of 0.03 (statistically significant). However, the effect size was small: call length increased by 2 minutes on average, but close rates didn't improve. While statistically significant, the practical benefit (longer calls without more closed deals) didn't justify the training effort required. By looking beyond the p-value to effect size, they avoided implementing a change that was statistically significant but not meaningful to their business.

Email marketer confirms real effect with proper sample size

An email marketing agency tested a subject line variation and observed a 2.5% increase in open rate. Rather than deploying immediately, the growth team confirmed that with 50,000 emails sent to each version they had 80% statistical power to detect a 1.5% improvement; the test produced a p-value of 0.04. The improvement was both statistically significant (p < 0.05) and practically meaningful. They rolled the new subject line approach out across all campaigns, and the improvement sustained across the next six campaigns.

