How the Chi-Squared Test Works for A/B Testing
The chi-squared test of independence evaluates whether two groups have different success rates. In A/B testing, this means comparing the conversion rate of your control group against a variant to determine if any observed difference is statistically significant or simply due to random noise.
The Chi-Squared Formula
For a 2×2 contingency table (two groups, each with success/failure counts), the chi-squared statistic is:
χ² = N(ad - bc)² / ((a+c)(b+d)(a+b)(c+d))
Where the contingency table is structured as:
| | Success | Failure |
|---|---|---|
| Group 1 | a | b |
| Group 2 | c | d |
N is the total number of observations (a + b + c + d). The resulting χ² value is compared against the chi-squared distribution with 1 degree of freedom to obtain a p-value.
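The formula translates almost directly into code. The following is a minimal sketch rather than this calculator's actual implementation: the function name `chi_squared_2x2` and the sample counts are illustrative assumptions, and no continuity correction is applied. It relies on the fact that a chi-squared variable with 1 degree of freedom is the square of a standard normal, so the p-value can be computed with `math.erfc` from the standard library.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Return (chi2, p_value) for the 2x2 table [[a, b], [c, d]].

    Hypothetical helper for illustration; no continuity correction is applied.
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        raise ValueError("every row and column total must be non-zero")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # With 1 degree of freedom, chi2 is the square of a standard normal,
    # so the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts: 30 of 100 succeed in group 1, 45 of 100 in group 2.
chi2, p_value = chi_squared_2x2(30, 70, 45, 55)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

When both groups have identical success rates, `ad - bc` is zero, so the statistic is 0 and the p-value is 1.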
Worked Example
Suppose you run an A/B test on a checkout page. The control group has 120 conversions out of 1,000 visitors (12%), and the variant has 150 conversions out of 1,000 visitors (15%).
- Build the contingency table: a = 120, b = 880, c = 150, d = 850, N = 2,000
- Calculate: χ² = 2000 × (120 × 850 - 880 × 150)² / (270 × 1730 × 1000 × 1000)
- Result: χ² ≈ 3.85, p-value ≈ 0.0496
- At the 5% significance level, this is statistically significant, but only just: a difference this large would be unusual if the two pages truly converted at the same rate.
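The arithmetic in the steps above can be rechecked with a short script. This is a sketch using only the counts from the example. The variable names are illustrative, and 3.841 is the standard 5% critical value for 1 degree of freedom.

```python
import math

# Counts from the worked example: control 120/1000, variant 150/1000.
a, b = 120, 880  # control: conversions, non-conversions
c, d = 150, 850  # variant: conversions, non-conversions
n = a + b + c + d

numerator = n * (a * d - b * c) ** 2
denominator = (a + c) * (b + d) * (a + b) * (c + d)
chi2 = numerator / denominator

# Upper-tail probability of a chi-squared variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))

# Compare against the 5% critical value for 1 degree of freedom.
significant = chi2 > 3.841

print(f"chi2 = {chi2:.4f}, p = {p_value:.4f}, significant at 5%: {significant}")
```

The statistic lands only slightly above the critical value, so a handful of conversions either way would change the verdict.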
Understanding the Results
P-Value
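The p-value answers the question: “If there were truly no difference between the groups, how likely would we be to see a difference at least this large?” A p-value below your significance threshold (commonly 0.05) means the data would be surprising if the groups were identical, so you reject the null hypothesis of no difference. It is not the probability that the groups are the same, and it says nothing about how large the difference is.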
Confidence Intervals
This calculator reports Wilson score confidence intervals for each group's conversion rate. Unlike the basic normal approximation, Wilson intervals work well even with small sample sizes or extreme proportions. The confidence interval gives you a plausible range for each group's true conversion rate.
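For reference, the Wilson score interval can be computed in a few lines. This is a sketch of the standard formula, not necessarily the exact code behind this calculator. The function name `wilson_interval` and the default `z=1.96` (an approximate 95% level) are assumptions.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Return (low, high) Wilson score bounds for a proportion.

    Hypothetical helper; z=1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    # The interval is centered on a value pulled slightly toward 0.5.
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Control group from the worked example: 120 conversions out of 1,000 visitors.
low, high = wilson_interval(120, 1000)
print(f"{low:.4f} to {high:.4f}")
```

Unlike the normal approximation, the interval stays inside 0 to 1 and keeps a non-zero width even when a group has no conversions at all.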
Effect Size (Cohen's h)
Statistical significance alone does not tell you whether a difference matters in practice. Cohen's h quantifies the magnitude of the difference between two proportions on a standardized scale:
- Small (h ≈ 0.2) — detectable but may not be practically meaningful
- Medium (h ≈ 0.5) — a noticeable, meaningful difference
- Large (h ≈ 0.8) — a substantial, obvious difference
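Cohen's h is defined as the difference between the arcsine-transformed proportions, h = 2·arcsin(√p₁) − 2·arcsin(√p₂). The sketch below computes it and maps the result onto the benchmarks listed above. Treating those benchmarks as hard cutoffs is a simplifying assumption, since they are conventional reference points rather than strict boundaries. The function names are illustrative.

```python
import math


def cohens_h(p1, p2):
    """Cohen's h for two proportions, each between 0 and 1."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def describe_effect(h):
    """Label |h| using the conventional benchmarks as cutoffs (an assumption)."""
    magnitude = abs(h)
    if magnitude < 0.2:
        return "below small"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"


# Conversion rates from the checkout-page example: 15% variant vs. 12% control.
h = cohens_h(0.15, 0.12)
print(f"h = {h:.3f} ({describe_effect(h)})")
```

For the checkout-page example, h comes out well under 0.2, which illustrates how a result can be statistically significant while the standardized effect is still small.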
When the Chi-Squared Test Breaks Down
The chi-squared test relies on a large-sample approximation. It becomes unreliable when any expected cell count in the contingency table falls below 5. This typically happens with very small samples or very extreme proportions (close to 0% or 100%). In these cases, Fisher's exact test is the preferred alternative. This calculator warns you automatically when expected counts are too low.
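The expected count for a cell is its row total multiplied by its column total, divided by N. A minimal check along these lines might look like the following. The function names and the default threshold parameter are illustrative assumptions, not this calculator's code.

```python
def expected_counts(a, b, c, d):
    """Expected cell counts for the 2x2 table [[a, b], [c, d]] under independence."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    return [[row * col / n for col in col_totals] for row in row_totals]


def chi_squared_is_reliable(a, b, c, d, minimum=5):
    """True when every expected count meets the rule-of-thumb minimum."""
    smallest = min(min(row) for row in expected_counts(a, b, c, d))
    return smallest >= minimum


# Worked example: expected counts are large, so the approximation is fine.
print(expected_counts(120, 880, 150, 850), chi_squared_is_reliable(120, 880, 150, 850))

# Made-up small test: 3 of 40 vs. 1 of 40 conversions gives expected counts of 2.
print(expected_counts(3, 37, 1, 39), chi_squared_is_reliable(3, 37, 1, 39))
```

Note that the check uses expected counts, not observed ones: a cell with few observed successes can still pass if the combined success rate and group sizes are large enough.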
Frequently Asked Questions
What is a chi-squared test used for in A/B testing?
A chi-squared test compares the conversion rates (or any success/failure proportions) between two groups to determine if the difference is statistically significant. In A/B testing, it tells you whether your variant's conversion rate is genuinely different from the control, or if the observed difference could be due to random chance.
When should I use a chi-squared test vs. a t-test?
Use a chi-squared test when your metric is a proportion or rate — things like conversion rate, click-through rate, signup rate, or any yes/no outcome. Use a t-test when your metric is a continuous number — things like revenue per user, time on page, average order value, or session duration.
What does the p-value mean in a chi-squared test?
The p-value is the probability of observing a difference as large as (or larger than) the one in your data, assuming there is no real difference between the groups. A small p-value (typically below 0.05) suggests the difference is unlikely to be due to chance alone, and you can reject the null hypothesis that the groups are the same.
What is Cohen's h and why does it matter?
Cohen's h is a measure of effect size for comparing two proportions. While the p-value tells you whether a difference is statistically significant, Cohen's h tells you whether it is practically meaningful. A small h (around 0.2) indicates a small effect, 0.5 is medium, and 0.8 or above is large. A statistically significant result with a tiny effect size may not be worth acting on.
What are the assumptions of the chi-squared test?
The chi-squared test assumes that observations are independent (each user is counted only once), that you are comparing categorical outcomes (success/failure), and that the expected count in each cell of the contingency table is at least 5. When expected counts are too low, the chi-squared approximation becomes unreliable and Fisher's exact test should be used instead.
What is a Wilson score confidence interval?
The Wilson score interval is a method for calculating confidence intervals for proportions that performs better than the basic normal approximation, especially with small sample sizes or proportions near 0% or 100%. This calculator uses Wilson score intervals because they provide more accurate coverage than the simpler Wald interval.