How Welch's T-Test Works for A/B Testing
Welch's t-test is used to compare the average values of a continuous metric between two groups. In A/B testing, this means comparing metrics like revenue per user, average session duration, or average order value between your control and variant groups. Unlike the chi-squared test (which compares proportions), the t-test handles continuous numerical data.
The T-Test Formula
The t-statistic for Welch's two-sample t-test is:
t = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)
Where:
- x̄1, x̄2 — sample means of group 1 and group 2
- s1, s2 — sample standard deviations
- n1, n2 — sample sizes
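As a minimal sketch of the formula above, the t-statistic can be computed from two raw samples using only Python's standard library. The function name `welch_t` and the toy order values are hypothetical and chosen purely for illustration.

```python
import math
from statistics import mean, variance


def welch_t(sample1, sample2):
    """Welch's t-statistic: difference in means divided by its standard error."""
    n1, n2 = len(sample1), len(sample2)
    # statistics.variance uses the n - 1 denominator, i.e. the sample variance s^2
    standard_error = math.sqrt(variance(sample1) / n1 + variance(sample2) / n2)
    return (mean(sample1) - mean(sample2)) / standard_error


# Hypothetical order values, for illustration only
control = [10.0, 20.0, 30.0, 40.0, 50.0]
variant = [20.0, 40.0, 60.0, 80.0, 100.0]
t_stat = welch_t(control, variant)
print(f"t = {t_stat:.4f}")
```

A negative value here means group 1 (the control) has the lower mean, because the numerator is x̄1 - x̄2.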
Welch-Satterthwaite Degrees of Freedom
Unlike the standard Student's t-test, Welch's version does not assume equal variances. It adjusts the degrees of freedom using the Welch-Satterthwaite equation:
df = (s1²/n1 + s2²/n2)² / ((s1²/n1)²/(n1-1) + (s2²/n2)²/(n2-1))
This adjustment means the degrees of freedom are typically not a whole number. When the two groups have similar variances and similar sample sizes, the adjusted value approaches Student's n1 + n2 - 2, and Welch's test gives results very close to Student's t-test.
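The Welch-Satterthwaite equation translates directly into a few lines of code. The sketch below works from summary statistics; the function name `welch_df` and the example inputs are assumptions made for illustration.

```python
def welch_df(sd1, n1, sd2, n2):
    """Welch-Satterthwaite degrees of freedom from standard deviations and sizes."""
    v1 = sd1 ** 2 / n1  # variance of the mean, group 1
    v2 = sd2 ** 2 / n2  # variance of the mean, group 2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


# Equal spread and equal sizes: the result matches Student's n1 + n2 - 2
df_equal = welch_df(10.0, 200, 10.0, 200)

# Very unequal spread: the degrees of freedom shrink, making the test more conservative
df_unequal = welch_df(5.0, 200, 50.0, 200)

print(df_equal, df_unequal)
```

The result always lies between the smaller of n1 - 1 and n2 - 1 and the pooled value n1 + n2 - 2.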
Worked Example
Suppose you are testing whether a new checkout flow increases average order value. The control group (n = 200) has a mean of $48.50 with a standard deviation of $12.30. The variant group (n = 200) has a mean of $52.10 with a standard deviation of $14.80.
- Calculate the standard error: SE = √(12.30²/200 + 14.80²/200) = √(0.756 + 1.095) = 1.361
- Calculate the t-statistic: t = (48.50 - 52.10) / 1.361 = -2.645
- Calculate degrees of freedom using Welch-Satterthwaite: df = (0.756 + 1.095)² / ((0.756² + 1.095²) / 199) ≈ 385.1
- Look up the two-tailed p-value: p ≈ 0.008
- Result: Because p < 0.05, the difference is statistically significant at the 5% level. The variant's higher average order value is unlikely to be due to chance alone.
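The steps of the worked example can be reproduced with the standard library alone. The sketch below is one possible approach, not how any particular calculator works: it obtains the two-tailed p-value by numerically integrating the Student's t density with Simpson's rule, where a statistics library would normally call a dedicated distribution function. The helper names `t_pdf` and `two_tailed_p` are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * (h / 3) * total  # probability that -|t| < T < |t|
    return 1 - central_mass


# Summary statistics from the worked example
m1, s1, n1 = 48.50, 12.30, 200  # control
m2, s2, n2 = 52.10, 14.80, 200  # variant

v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
se = math.sqrt(v1 + v2)
t = (m1 - m2) / se
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
p = two_tailed_p(t, df)

print(f"SE = {se:.3f}, t = {t:.3f}, df = {df:.1f}, p = {p:.4f}")
```

Small differences in the last printed digit compared with the hand calculation come from rounding the intermediate values.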
Understanding the Results
P-Value
The p-value is the probability of observing a difference in means at least as extreme as yours, assuming there is no real difference between the groups. A p-value below your chosen threshold (typically 0.05) indicates statistical significance. Which differences count as "extreme" depends on your hypothesis type:
- Two-tailed — tests if the means are different in either direction
- Left-tailed — tests if group 1's mean is lower than group 2's
- Right-tailed — tests if group 1's mean is higher than group 2's
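The three hypothesis types differ only in which tail of the t distribution is counted. As a sketch, suppose you already have the cumulative probability of the t distribution at your observed t-statistic (for instance from a statistics library), with t defined as group 1 minus group 2. The function name `p_values_from_cdf` and the input value 0.004 are hypothetical.

```python
def p_values_from_cdf(cdf_at_t):
    """Map P(T <= t_observed) to the p-value for each hypothesis type."""
    left = cdf_at_t          # evidence that group 1's mean is lower
    right = 1.0 - cdf_at_t   # evidence that group 1's mean is higher
    two = 2.0 * min(left, right)  # evidence of a difference in either direction
    return {"left": left, "right": right, "two": two}


# Example: a clearly negative t-statistic puts very little probability to its left
result = p_values_from_cdf(0.004)
print(result)
```

Note that the two-tailed p-value is twice the smaller one-tailed value, which is why a one-tailed test reaches significance more easily in its chosen direction.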
Confidence Intervals
The confidence interval for each group's mean gives you a plausible range for the true population mean. If the confidence intervals of the two groups do not overlap, the difference is significant at the corresponding level. The reverse does not hold: the intervals can overlap while the difference is still significant, so rely on the test result (or a confidence interval for the difference in means) rather than on visual overlap alone.
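The worked example illustrates the overlap caveat. The sketch below computes a 95% interval for each group's mean using the normal critical value as a large-sample approximation (an assumption made here to stay within the standard library; with 200 observations per group the t-based value is only slightly larger). The function name `mean_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def mean_ci(mean, sd, n, confidence=0.95):
    """Approximate confidence interval for a mean (normal critical value)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


control_ci = mean_ci(48.50, 12.30, 200)
variant_ci = mean_ci(52.10, 14.80, 200)

# Two intervals overlap when each one starts before the other ends
intervals_overlap = control_ci[0] <= variant_ci[1] and variant_ci[0] <= control_ci[1]

print(control_ci, variant_ci, intervals_overlap)
```

Here the two intervals overlap slightly, even though the t-test on the same data gives p ≈ 0.008.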
Effect Size (Cohen's d)
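Cohen's d standardizes the difference between two means by dividing it by the pooled standard deviation. This puts the effect on a unit-free scale, so effects can be compared across metrics. The conventional benchmarks are rules of thumb rather than strict cutoffs: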
- Small (d ≈ 0.2) — the difference is real but subtle
- Medium (d ≈ 0.5) — a clearly noticeable difference
- Large (d ≈ 0.8) — a substantial, obvious difference
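A short sketch of the calculation, applied to the summary statistics from the worked example. The function name `cohens_d` is hypothetical, and the sign convention (variant minus control) is a choice made here so that an improvement shows up as a positive value.

```python
import math


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d: (mean2 - mean1) divided by the pooled standard deviation."""
    pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean2 - mean1) / math.sqrt(pooled_variance)


d = cohens_d(48.50, 12.30, 200, 52.10, 14.80, 200)
print(f"d = {d:.3f}")
```

The result is a little above 0.26, which falls in the small range by the benchmarks above even though the difference is statistically significant.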
Assumptions and Limitations
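Welch's t-test assumes that the data within each group is approximately normally distributed. With larger samples (roughly 30 or more per group), the Central Limit Theorem makes this assumption less critical, because the sample means themselves become approximately normal. With small samples, or with heavily skewed data and extreme outliers, non-normality can distort results. The test also assumes that observations are independent: each data point should come from a different user, not from repeated sessions by the same user.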
Frequently Asked Questions
What is Welch's t-test and when should I use it?
Welch's t-test compares the means (averages) of two groups when the groups may have different variances and different sample sizes. Use it in A/B testing when your metric is a continuous number — like revenue per user, average order value, time on page, or session duration. It is more robust than Student's t-test because it does not assume equal variances.
What is the difference between Welch's t-test and Student's t-test?
Student's t-test assumes that both groups have the same variance (spread). Welch's t-test relaxes this assumption, making it safer to use in practice because A/B test groups often have different variances. Welch's test uses the Welch-Satterthwaite equation to adjust the degrees of freedom, which corrects for unequal variances. This calculator uses Welch's t-test by default.
What does the t-statistic mean?
The t-statistic measures how many standard errors the difference between the two group means is away from zero. A larger absolute t-value means the groups are more different relative to the variability in the data. The t-statistic is used together with the degrees of freedom to calculate the p-value.
Should I use a one-tailed or two-tailed t-test?
Use a two-tailed test in most A/B testing scenarios. A two-tailed test checks whether the variant is different from the control in either direction (higher or lower). Only use a one-tailed test if you are specifically testing whether one group is greater than or less than the other, and you do not care about detecting effects in the opposite direction.
What is Cohen's d and how do I interpret it?
Cohen's d is a standardized measure of the difference between two group means, expressed in units of the pooled standard deviation. It tells you the practical size of the effect, regardless of sample size. Benchmarks: d = 0.2 is a small effect, d = 0.5 is medium, and d = 0.8 is large. A statistically significant result with a very small Cohen's d may not be worth acting on.
How many observations do I need for a reliable t-test?
As a general rule, each group should have at least 30 observations for the t-test to work well. With fewer than 30 per group, the test becomes sensitive to non-normality in the data. With fewer than 5, results are unreliable. For A/B tests, use a sample size calculator to determine the exact number needed based on the effect size you want to detect.
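To give a sense of what a sample size calculator does, the sketch below uses a common normal-approximation formula for a two-sided test, n = 2(z₁₋α/₂ + z_power)² / d² per group. It assumes equal group sizes and similar variances in both groups, and it expresses the target effect as Cohen's d. The function name `n_per_group` and the default settings (5% significance, 80% power) are assumptions for illustration; dedicated calculators may use a slightly different method and return slightly larger numbers.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate observations needed per group to detect effect size d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


for effect in (0.2, 0.5, 0.8):
    print(effect, n_per_group(effect))
```

Under these assumptions, detecting a small effect (d = 0.2) takes about 393 observations per group, while a medium effect (d = 0.5) takes about 63 and a large effect (d = 0.8) about 25.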