StatTest

Which statistical test should I use?

A practical guide to choosing the right test for your A/B experiment.

Quick decision guide

Answer the questions below to find the right tool:

Haven't run the experiment yet?
Use the Sample Size Calculator to figure out how many subjects you need before starting.
Is your metric a rate or proportion?
Things like conversion rate, click-through rate, or signup rate. Use the Chi-Squared Test.
Is your metric a continuous number?
Things like revenue per user, time on page, or session duration. Use the Two-Sample T-Test.

Chi-squared test vs t-test

These tests answer the same fundamental question — "is there a real difference between my groups?" — but they work with different types of data.

Chi-squared test

For proportions. Each subject either did or didn't do something. A code sketch follows the list.

  • Conversion rates
  • Click-through rates
  • Signup or retention rates
  • Any yes/no outcome
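
A minimal sketch in Python, assuming you've already tallied converted and non-converted subjects per group (the counts here are made up for illustration):

  from scipy.stats import chi2_contingency

  # Hypothetical counts: [converted, did not convert] per group.
  table = [[480, 9520],    # control: 480 of 10,000 converted
           [540, 9460]]    # treatment: 540 of 10,000 converted

  chi2, p_value, dof, expected = chi2_contingency(table)
  print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")

A p-value below your significance level (typically 0.05) suggests the difference in rates is unlikely to be chance.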

Two-sample t-test

For averages. Each subject has a numeric measurement. A code sketch follows the list.

  • Revenue per user
  • Time on page
  • Number of items purchased
  • Any measured quantity
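
A minimal sketch, assuming one revenue figure per user in each group. Welch's t-test is usually the safer default because it doesn't assume the two groups have equal variances:

  from scipy.stats import ttest_ind

  # Hypothetical revenue-per-user samples.
  control = [12.0, 0.0, 34.5, 8.0, 0.0, 21.0]
  treatment = [15.5, 0.0, 40.0, 11.0, 4.0, 25.0]

  # equal_var=False runs Welch's t-test.
  t_stat, p_value = ttest_ind(treatment, control, equal_var=False)
  print(f"t = {t_stat:.3f}, p = {p_value:.4f}")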

How to plan an A/B test

Before running an experiment, you need to decide three things (together they determine your required sample size; a worked sketch follows the list):

1. What is your minimum detectable effect (MDE)?

The smallest improvement worth detecting. If your conversion rate is 5%, you might care about a 1 percentage point lift (to 6%) but not a 0.01pp lift. Smaller effects need larger sample sizes to detect.

2. What statistical power do you need?

Power is the probability of detecting a real effect when one exists. The industry standard is 80%. This means that if there truly is a difference at least as large as your MDE, you'll detect it 80% of the time. Higher power means larger sample sizes.

3. What significance level (α) will you use?

The probability of a false positive — declaring a winner when there's no real difference. The standard is 5%. Lower significance levels require larger samples.
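
Plugging all three into a power calculation gives the required sample size. A minimal sketch using statsmodels, assuming a 5% baseline conversion rate and a 1 percentage point MDE:

  from statsmodels.stats.power import NormalIndPower
  from statsmodels.stats.proportion import proportion_effectsize

  baseline = 0.05   # current conversion rate
  mde = 0.01        # minimum detectable effect (absolute lift)

  # Cohen's h effect size for the two proportions.
  effect_size = proportion_effectsize(baseline + mde, baseline)

  n_per_group = NormalIndPower().solve_power(
      effect_size=effect_size,
      alpha=0.05,               # significance level
      power=0.80,               # statistical power
      ratio=1.0,                # equal group sizes
      alternative="two-sided",
  )
  print(f"Need about {n_per_group:.0f} subjects per group")

With these numbers the answer comes out to roughly 4,000 subjects per group.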

Common mistakes in A/B testing

Peeking at results early

Checking results before collecting enough data inflates your false positive rate dramatically. If you peek 10 times during a test, your real false positive rate can approach 20% instead of the nominal 5%. Decide your sample size in advance and stick to it.
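
You can see the inflation with a quick simulation: run many A/A tests (no real difference), check each at 10 interim looks, and count how often any look crosses p < 0.05:

  import numpy as np
  from scipy.stats import ttest_ind

  rng = np.random.default_rng(0)
  n_experiments = 2_000
  n_per_group = 1_000
  looks = np.linspace(100, n_per_group, 10, dtype=int)  # 10 interim looks

  false_positives = 0
  for _ in range(n_experiments):
      # Both groups drawn from the same distribution: no real effect.
      a = rng.normal(size=n_per_group)
      b = rng.normal(size=n_per_group)
      # Flag the experiment if ANY interim look is "significant".
      if any(ttest_ind(a[:n], b[:n]).pvalue < 0.05 for n in looks):
          false_positives += 1

  print(f"False positive rate: {false_positives / n_experiments:.1%}")

Instead of the nominal 5%, this prints a rate of around 20%.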

Using the wrong test

Using a t-test on conversion rates or a chi-squared test on revenue data will give unreliable results. Match the test to your data type: proportions get chi-squared, continuous measurements get t-test.

Ignoring practical significance

A result can be statistically significant but practically meaningless. A 0.01% conversion lift might be "significant" with millions of users, but not worth the engineering effort to ship. Always check the effect size alongside the p-value.
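
A minimal sketch of checking both together, with a hypothetical threshold (min_worthwhile_lift) standing in for whatever lift would justify shipping for your team:

  from scipy.stats import chi2_contingency

  # Hypothetical large-scale results: 5.00% vs 5.15% conversion.
  control_conv, control_n = 50_000, 1_000_000
  treatment_conv, treatment_n = 51_500, 1_000_000

  table = [[control_conv, control_n - control_conv],
           [treatment_conv, treatment_n - treatment_conv]]
  _, p_value, _, _ = chi2_contingency(table)

  lift = treatment_conv / treatment_n - control_conv / control_n
  min_worthwhile_lift = 0.002   # hypothetical shipping bar: 0.2pp

  print(f"p = {p_value:.2g}, lift = {lift:.2%}")
  print("Worth shipping:", p_value < 0.05 and lift >= min_worthwhile_lift)

Here the p-value is tiny, but the 0.15pp lift falls short of the 0.2pp bar: significant, yet not worth shipping.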

Running too many tests

Testing 20 metrics at a 5% significance level means you'll likely get at least one false positive. If you need to test multiple metrics, adjust your significance level (e.g., using a Bonferroni correction) or pre-register your primary metric.
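
A minimal sketch of a Bonferroni correction using statsmodels, with made-up p-values from five metrics:

  from statsmodels.stats.multitest import multipletests

  # Hypothetical raw p-values from testing five metrics.
  p_values = [0.003, 0.021, 0.048, 0.260, 0.700]

  reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05,
                                           method="bonferroni")
  for raw, adj, sig in zip(p_values, p_adjusted, reject):
      verdict = "significant" if sig else "not significant"
      print(f"p = {raw:.3f} -> adjusted p = {adj:.3f} ({verdict})")

Only the smallest p-value survives the correction; the 0.021 and 0.048 results, which looked significant on their own, no longer do.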