
Hypothesis testing is a cornerstone of inferential statistics that helps researchers make decisions about populations based on sample data. Whether you’re analyzing the effectiveness of a new medication, comparing marketing strategies, or evaluating machine learning models, hypothesis testing provides a rigorous framework for drawing conclusions from data. In this post, I’ll explain the concepts and steps involved in hypothesis testing with practical examples in Python.
What is Hypothesis Testing?
Hypothesis testing is a method for testing a claim or hypothesis about a parameter in a population, using data measured in a sample. It’s a systematic way to decide whether we have enough evidence to reject a default position (the null hypothesis) in favor of an alternative hypothesis.
The Hypothesis Testing Framework
The hypothesis testing framework follows these steps:
- State the hypotheses:
- Null Hypothesis (H₀): A statement of no effect, relationship, or difference
- Alternative Hypothesis (H₁ or Hₐ): A statement that contradicts the null hypothesis
-
Choose a significance level (α): The probability of rejecting the null hypothesis when it’s actually true (Type I error). Common values are 0.05, 0.01, and 0.001.
-
Calculate the test statistic: Based on sample data, compute a value that measures how far the sample data deviates from what we’d expect under the null hypothesis.
-
Determine the p-value: The probability of observing a test statistic as extreme as, or more extreme than, the one calculated, assuming the null hypothesis is true.
- Make a decision:
- If p-value ≤ α, reject the null hypothesis (evidence supports the alternative hypothesis)
- If p-value > α, fail to reject the null hypothesis (insufficient evidence to support the alternative hypothesis)
- Draw a conclusion: Interpret the results in the context of the original problem.
Let’s explore this framework with examples using Python.
Setting Up the Environment
First, let’s import the necessary libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy.stats import chi2_contingency
# Set the random seed for reproducibility
np.random.seed(42)
# Set Seaborn style
sns.set(style="whitegrid")
Types of Hypothesis Tests
There are many types of hypothesis tests, each designed for specific situations. Here, we’ll cover some of the most common ones:
1. One-Sample T-Test
A one-sample t-test is used to determine whether a sample mean is statistically different from a known or hypothesized population mean.
Example: Testing whether the average height of a sample of students differs from the national average of or 170 cm.
# Generate sample data: heights of 30 students with mean=172 and std=5
student_heights = np.random.normal(172, 5, 30)
# Hypotheses:
# H₀: μ = 170 (sample mean equals national average)
# H₁: μ ≠ 170 (sample mean differs from national average)
# Perform the one-sample t-test
t_stat, p_value = stats.ttest_1samp(student_heights, 170)
# Print results
print(f"One-Sample T-Test Results:")
print(f"Sample Mean: {np.mean(student_heights):.2f} cm")
print(f"Hypothesized Mean: 170.00 cm")
print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_value:.4f}")
# Make decision (using α = 0.05)
if p_value <= 0.05:
conclusion = "Reject the null hypothesis. The sample mean differs significantly from 170 cm."
else:
conclusion = "Fail to reject the null hypothesis. No significant difference from 170 cm."
print(f"Conclusion: {conclusion}")
# Visualize
plt.figure(figsize=(10, 6))
sns.histplot(student_heights, kde=True)
plt.axvline(170, color='red', linestyle='--', label="Hypothesized Mean (170 cm)")
plt.axvline(np.mean(student_heights), color='green', linestyle='-', label=f"Sample Mean ({np.mean(student_heights):.2f} cm)")
plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
plt.title("Distribution of Student Heights")
plt.legend()
plt.show()
2. Two-Sample T-Test (Independent Samples)
The two-sample t-test is used to determine whether the means of two independent groups differ.
Example: Testing whether there’s a difference in test scores between two teaching methods.
# Generate data for two teaching methods
method_A_scores = np.random.normal(75, 10, 40) # Method A: mean=75, std=10, n=40
method_B_scores = np.random.normal(80, 10, 35) # Method B: mean=80, std=10, n=35
# Hypotheses:
# H₀: μ₁ = μ₂ (the means are equal)
# H₁: μ₁ ≠ μ₂ (the means are different)
# Perform the two-sample t-test (assuming equal variances)
t_stat, p_value = stats.ttest_ind(method_A_scores, method_B_scores, equal_var=True)
# Print results
print("Two-Sample T-Test Results:")
print(f"Method A Mean: {np.mean(method_A_scores):.2f}")
print(f"Method B Mean: {np.mean(method_B_scores):.2f}")
print(f"Difference: {np.mean(method_B_scores) - np.mean(method_A_scores):.2f}")
print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_value:.4f}")
# Make decision (using α = 0.05)
if p_value <= 0.05:
conclusion = "Reject the null hypothesis. There is a significant difference between the means."
else:
conclusion = "Fail to reject the null hypothesis. No significant difference between the means."
print(f"Conclusion: {conclusion}")
# Visualize
plt.figure(figsize=(10, 6))
sns.kdeplot(method_A_scores, fill=True, label="Method A")
sns.kdeplot(method_B_scores, fill=True, label="Method B")
plt.axvline(np.mean(method_A_scores), color='blue', linestyle='--', label=f"Method A Mean ({np.mean(method_A_scores):.2f})")
plt.axvline(np.mean(method_B_scores), color='orange', linestyle='--', label=f"Method B Mean ({np.mean(method_B_scores):.2f})")
plt.xlabel("Test Scores")
plt.ylabel("Density")
plt.title("Distribution of Test Scores by Teaching Method")
plt.legend()
plt.show()
3. Paired T-Test
A paired t-test is used when you have two related samples and want to compare their means.
Example: Testing the effectiveness of a training program by comparing scores before and after training.
# Generate before-after data for 25 participants
before_scores = np.random.normal(65, 15, 25)
improvement = np.random.normal(10, 5, 25) # Average improvement of 10 points
after_scores = before_scores + improvement
# Hypotheses:
# H₀: μd = 0 (no difference between before and after)
# H₁: μd ≠ 0 (there is a difference between before and after)
# Perform the paired t-test
t_stat, p_value = stats.ttest_rel(after_scores, before_scores)
# Print results
print("Paired T-Test Results:")
print(f"Mean Before: {np.mean(before_scores):.2f}")
print(f"Mean After: {np.mean(after_scores):.2f}")
print(f"Mean Difference: {np.mean(after_scores - before_scores):.2f}")
print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_value:.4f}")
# Make decision (using α = 0.05)
if p_value <= 0.05:
conclusion = "Reject the null hypothesis. The training program has a significant effect."
else:
conclusion = "Fail to reject the null hypothesis. No significant effect of the training program."
print(f"Conclusion: {conclusion}")
# Visualize
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
sns.boxplot(data=[before_scores, after_scores], width=0.4)
plt.xticks([0, 1], ['Before', 'After'])
plt.ylabel("Scores")
plt.title("Before vs. After Training Scores")
plt.subplot(1, 2, 2)
differences = after_scores - before_scores
sns.histplot(differences, kde=True)
plt.axvline(0, color='red', linestyle='--', label="No Difference")
plt.axvline(np.mean(differences), color='green', linestyle='-', label=f"Mean Difference ({np.mean(differences):.2f})")
plt.xlabel("Difference (After - Before)")
plt.ylabel("Frequency")
plt.title("Distribution of Score Differences")
plt.legend()
plt.tight_layout()
plt.show()
4. Chi-Square Test for Independence
The chi-square test for independence is used to determine whether there is a significant relationship between two categorical variables.
Example: Testing whether there’s a relationship between gender and preference for a product.
# Create a contingency table
contingency_table = pd.DataFrame({
'Prefer Product A': [40, 30],
'Prefer Product B': [30, 50],
'No Preference': [10, 20]
}, index=['Male', 'Female'])
print("Contingency Table:")
print(contingency_table)
# Hypotheses:
# H₀: Gender and product preference are independent
# H₁: Gender and product preference are not independent
# Perform the chi-square test
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
# Print results
print("\nChi-Square Test for Independence Results:")
print(f"Chi-Square Statistic: {chi2:.4f}")
print(f"P-Value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print("\nExpected Frequencies:")
print(pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns))
# Make decision (using α = 0.05)
if p_value <= 0.05:
conclusion = "Reject the null hypothesis. There is a significant relationship between gender and product preference."
else:
conclusion = "Fail to reject the null hypothesis. No significant relationship between gender and product preference."
print(f"\nConclusion: {conclusion}")
# Visualize
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
contingency_table.plot(kind='bar', stacked=True)
plt.title("Observed Frequencies")
plt.ylabel("Count")
plt.subplot(1, 2, 2)
pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns).plot(kind='bar', stacked=True)
plt.title("Expected Frequencies (if independent)")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
5. ANOVA (Analysis of Variance)
ANOVA is used to compare means across three or more groups.
Example: Testing whether plant growth differs significantly among three different fertilizers.
# Generate data for three fertilizer types
fertilizer_A = np.random.normal(20, 5, 20) # mean growth: 20 cm
fertilizer_B = np.random.normal(25, 5, 20) # mean growth: 25 cm
fertilizer_C = np.random.normal(22, 5, 20) # mean growth: 22 cm
# Create a DataFrame
df = pd.DataFrame({
'Growth': np.concatenate([fertilizer_A, fertilizer_B, fertilizer_C]),
'Fertilizer': np.repeat(['A', 'B', 'C'], 20)
})
# Hypotheses:
# H₀: μ₁ = μ₂ = μ₃ (all means are equal)
# H₁: At least one mean is different
# Perform one-way ANOVA
model = ols('Growth ~ Fertilizer', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
# Print results
print("One-Way ANOVA Results:")
print(anova_table)
# Extract p-value
p_value = anova_table['PR(>F)']['Fertilizer']
# Make decision (using α = 0.05)
if p_value <= 0.05:
conclusion = "Reject the null hypothesis. There are significant differences in plant growth among fertilizers."
else:
conclusion = "Fail to reject the null hypothesis. No significant differences in plant growth among fertilizers."
print(f"\nConclusion: {conclusion}")
# If ANOVA is significant, perform post-hoc tests (Tukey's HSD)
if p_value <= 0.05:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
tukey = pairwise_tukeyhsd(df['Growth'], df['Fertilizer'], alpha=0.05)
print("\nTukey's HSD Post-hoc Test:")
print(tukey)
# Visualize
plt.figure(figsize=(10, 6))
sns.boxplot(x='Fertilizer', y='Growth', data=df)
sns.swarmplot(x='Fertilizer', y='Growth', data=df, color='black', alpha=0.6)
plt.title(f"Plant Growth by Fertilizer Type (ANOVA p-value: {p_value:.4f})")
plt.xlabel("Fertilizer Type")
plt.ylabel("Growth (cm)")
plt.show()
Understanding P-Values and Significance Testing
The p-value is often misunderstood. Here’s a clearer explanation:
- The p-value is the probability of observing a test statistic at least as extreme as the one calculated, assuming the null hypothesis is true.
- It is NOT the probability that the null hypothesis is true.
- A small p-value (≤ α) indicates that the observed data is unlikely under the null hypothesis, providing evidence against it.
- A large p-value (> α) does not prove the null hypothesis is true; it only indicates insufficient evidence to reject it.
Visualizing the P-Value
Let’s visualize the p-value concept for a one-sample t-test:
# Generate a sample from a normal distribution
sample = np.random.normal(172, 5, 30)
sample_mean = np.mean(sample)
sample_std = np.std(sample, ddof=1)
n = len(sample)
# Hypothesized mean
mu0 = 170
# Calculate t-statistic
t_stat = (sample_mean - mu0) / (sample_std / np.sqrt(n))
# Create a range of t-values
t_values = np.linspace(-4, 4, 1000)
df = n - 1 # degrees of freedom
t_dist = stats.t.pdf(t_values, df)
# Calculate p-value (two-tailed test)
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
# Visualize
plt.figure(figsize=(12, 6))
plt.plot(t_values, t_dist, label='t-distribution')
plt.fill_between(t_values, t_dist, where=(t_values >= abs(t_stat)), color='red', alpha=0.3, label=f'p-value = {p_value:.4f}')
plt.fill_between(t_values, t_dist, where=(t_values <= -abs(t_stat)), color='red', alpha=0.3)
plt.axvline(t_stat, color='blue', linestyle='--', label=f't-statistic = {t_stat:.4f}')
plt.axvline(-t_stat, color='blue', linestyle='--')
plt.axvline(0, color='black', linestyle='-', label='Null hypothesis')
plt.xlabel('t-value')
plt.ylabel('Probability Density')
plt.title('Visualizing the P-Value for a Two-Tailed T-Test')
plt.legend()
plt.show()
Type I and Type II Errors
In hypothesis testing, two types of errors can occur:
- Type I Error (False Positive): Rejecting a null hypothesis when it’s actually true.
- Probability = α (significance level)
- Type II Error (False Negative): Failing to reject a null hypothesis when it’s actually false.
- Probability = β
- Related to statistical power (1 - β), which is the probability of correctly rejecting a false null hypothesis.
Let’s visualize these concepts:
# Define parameters
mu0 = 0 # Mean under null hypothesis
mu1 = 1 # Mean under alternative hypothesis
sigma = 1 # Standard deviation
n = 20 # Sample size
alpha = 0.05 # Significance level
# Calculate the critical value for a one-sided test
z_critical = stats.norm.ppf(1 - alpha)
critical_value = mu0 + z_critical * sigma / np.sqrt(n)
# Calculate beta (Type II error probability)
beta = stats.norm.cdf(critical_value, loc=mu1, scale=sigma/np.sqrt(n))
power = 1 - beta
# Create a range of values for plotting
x = np.linspace(-3, 4, 1000)
y_null = stats.norm.pdf(x, loc=mu0, scale=sigma/np.sqrt(n))
y_alt = stats.norm.pdf(x, loc=mu1, scale=sigma/np.sqrt(n))
# Visualize
plt.figure(figsize=(12, 6))
plt.plot(x, y_null, 'blue', label='Null Distribution')
plt.plot(x, y_alt, 'green', label='Alternative Distribution')
plt.axvline(critical_value, color='red', linestyle='--', label=f'Critical Value = {critical_value:.2f}')
# Fill the areas representing errors
plt.fill_between(x, 0, y_null, where=(x >= critical_value), color='red', alpha=0.3, label=f'Type I Error (α = {alpha:.2f})')
plt.fill_between(x, 0, y_alt, where=(x <= critical_value), color='gray', alpha=0.3, label=f'Type II Error (β = {beta:.2f})')
plt.xlabel('Sample Mean')
plt.ylabel('Probability Density')
plt.title('Type I and Type II Errors in Hypothesis Testing')
plt.legend()
plt.show()
Statistical Power and Sample Size
Statistical power is the probability of detecting an effect when there is one. It’s influenced by:
- Sample size (n)
- Effect size (magnitude of the difference)
- Significance level (α)
- Variability in the data (standard deviation)
Let’s explore how sample size affects power:
# Define parameters
mu0 = 0 # Mean under null hypothesis
mu1 = 0.5 # Mean under alternative hypothesis (effect size)
sigma = 1 # Standard deviation
alpha = 0.05 # Significance level
# Sample sizes to consider
sample_sizes = np.arange(5, 100, 5)
powers = []
# Calculate power for each sample size
for n in sample_sizes:
# Critical value for a one-sided test
z_critical = stats.norm.ppf(1 - alpha)
critical_value = mu0 + z_critical * sigma / np.sqrt(n)
# Type II error probability
beta = stats.norm.cdf(critical_value, loc=mu1, scale=sigma/np.sqrt(n))
power = 1 - beta
powers.append(power)
# Visualize
plt.figure(figsize=(10, 6))
plt.plot(sample_sizes, powers, marker='o')
plt.axhline(0.8, color='red', linestyle='--', label='Conventional Power = 0.8')
plt.xlabel('Sample Size')
plt.ylabel('Statistical Power')
plt.title('Effect of Sample Size on Statistical Power')
plt.grid(True)
plt.legend()
plt.show()
Common Pitfalls in Hypothesis Testing
-
Misinterpreting the p-value: A small p-value doesn’t prove your alternative hypothesis; it only provides evidence against the null.
- Multiple testing problem: Conducting many tests increases the chance of finding significant results by chance. Solutions include:
- Bonferroni correction: Adjust α to α/m where m is the number of tests
- False Discovery Rate (FDR) control
- Holm-Bonferroni method
-
Confusing statistical significance with practical significance: A result can be statistically significant but trivial in practical terms.
- Using hypothesis testing when it’s not appropriate: Not all research questions require hypothesis testing.
Conclusion
Hypothesis testing is a powerful framework for making inferences from data, but it must be applied carefully and with understanding of its assumptions and limitations. In this post, we’ve covered:
- The fundamental concepts of hypothesis testing
- The most common types of hypothesis tests
- How to interpret p-values correctly
- Type I and Type II errors and their relationship to statistical power
- Common pitfalls to avoid
In future posts, we’ll explore more advanced statistical techniques that build upon these foundations, such as regression analysis, non-parametric tests, and Bayesian approaches to hypothesis testing.