Which statistical test should I use?
A practical guide to choosing the right test for your A/B experiment.
Quick decision guide
Answer one question to find the right tool: is your metric a yes/no outcome (a proportion) or a numeric measurement (an average)?
Chi-squared test vs t-test
These tests answer the same fundamental question — "is there a real difference between my groups?" — but they work with different types of data.
Chi-squared test
For proportions. Each subject either did or didn't do something.
- Conversion rates
- Click-through rates
- Signup or retention rates
- Any yes/no outcome
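As a concrete illustration, here is a minimal sketch of a chi-squared test on conversion counts, assuming scipy is available. The counts below are hypothetical placeholders, not real experiment data.

```python
# Sketch: chi-squared test on a 2x2 table of conversion counts (hypothetical numbers).
from scipy.stats import chi2_contingency

# Rows: variant A and variant B; columns: converted, did not convert.
contingency = [
    [120, 2380],   # A: 120 conversions out of 2,500 users
    [150, 2350],   # B: 150 conversions out of 2,500 users
]

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

A small p-value suggests the two conversion rates differ by more than chance alone would explain.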
Two-sample t-test
For averages. Each subject has a numeric measurement.
- Revenue per user
- Time on page
- Number of items purchased
- Any measured quantity
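And a matching sketch for a continuous metric such as revenue per user, again assuming scipy; the revenue arrays are simulated stand-ins for real per-user values.

```python
# Sketch: two-sample t-test on revenue per user (simulated data as a placeholder).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
revenue_a = rng.exponential(scale=20.0, size=2500)  # control group
revenue_b = rng.exponential(scale=21.0, size=2500)  # treatment group

# equal_var=False uses Welch's t-test, which doesn't assume equal variances.
t_stat, p_value = ttest_ind(revenue_a, revenue_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

Welch's version (equal_var=False) is a safer default when the two groups may have different spreads.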
How to plan an A/B test
Before running an experiment, you need to decide three things:
1. What is your minimum detectable effect (MDE)?
The smallest improvement worth detecting. If your conversion rate is 5%, you might care about a 1 percentage point lift (to 6%) but not a 0.01pp lift. Smaller effects need larger sample sizes to detect.
2. What statistical power do you need?
Power is the probability of detecting a real effect when one exists. The industry standard is 80%. This means if there truly is a difference, you'll detect it 80% of the time. Higher power means larger sample sizes.
3. What significance level (α) will you use?
The probability of a false positive — declaring a winner when there's no real difference. The standard is 5%. Lower significance levels require larger samples.
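Once you have picked these three numbers, they determine the sample size. A minimal sketch of that calculation, assuming statsmodels is available and using the guide's example values (5% baseline, 1 percentage point MDE, 80% power, 5% significance level):

```python
# Sketch: sample size needed per group for a conversion-rate test.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect_size = proportion_effectsize(0.05, 0.06)  # Cohen's h for 5% -> 6%
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,        # significance level
    power=0.80,        # desired power
    alternative="two-sided",
)
print(f"Users needed per group: {n_per_group:.0f}")
```

Shrinking the MDE, raising the power, or lowering α all push the required sample size up, which is why these decisions need to happen before the test starts.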
Common mistakes in A/B testing
Peeking at results early
Checking results before collecting enough data inflates your false positive rate dramatically. If you peek 10 times during a test, your real significance level could be 30% instead of 5%. Decide your sample size in advance and stick to it.
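A rough simulation makes the inflation concrete. In this hypothetical setup both variants have the same 5% conversion rate, yet stopping at the first "significant" peek rejects far more often than the nominal 5%.

```python
# Sketch: simulate peeking 10 times per experiment when there is NO real difference.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n_experiments, n_peeks, batch = 2000, 10, 500
false_positives = 0

for _ in range(n_experiments):
    conv_a = conv_b = total = 0
    for _ in range(n_peeks):
        total += batch
        conv_a += rng.binomial(batch, 0.05)  # both groups convert at 5%
        conv_b += rng.binomial(batch, 0.05)
        table = [[conv_a, total - conv_a], [conv_b, total - conv_b]]
        if chi2_contingency(table)[1] < 0.05:  # stop at first "significant" peek
            false_positives += 1
            break

print(f"False positive rate with peeking: {false_positives / n_experiments:.1%}")
```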
Using the wrong test
Running a t-test on raw conversion flags or a chi-squared test on revenue data can give misleading results. Match the test to your data type: proportions get the chi-squared test, continuous measurements get the t-test.
Ignoring practical significance
A result can be statistically significant but practically meaningless. A 0.01% conversion lift might be "significant" with millions of users, but not worth the engineering effort to ship. Always check the effect size alongside the p-value.
Running too many tests
Testing 20 metrics at a 5% significance level means you'll likely get at least one false positive. If you need to test multiple metrics, adjust your significance level (e.g., using a Bonferroni correction) or pre-register your primary metric.
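If you do test several metrics, the adjustment itself is straightforward. A minimal sketch of a Bonferroni correction, assuming statsmodels is available; the p-values below are hypothetical.

```python
# Sketch: Bonferroni correction across several metrics (hypothetical p-values).
from statsmodels.stats.multitest import multipletests

p_values = [0.004, 0.03, 0.045, 0.20, 0.61]  # one p-value per metric
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p = {p:.3f}  adjusted p = {p_adj:.3f}  significant: {sig}")
```

Bonferroni is deliberately conservative; the simpler and often better option is to pre-register one primary metric and treat everything else as exploratory.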