Central Limit Theorem: Part 2

Jan 15 · 10min

Quick Recap

New to the Central Limit Theorem? Start with Part 1 first!

For everyone else: The CLT tells us that the distribution of sample means approaches a normal distribution as samples grow, regardless of the shape of the original data. With samples of n ≥ 30 (the usual rule of thumb), we can make quantified, confident statements about whole populations from modest samples. Now let’s put this power to work!
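Want a 30-second refresher? Here’s a minimal simulation (using an exponential population purely as an illustrative assumption): the individual values are heavily skewed, yet the means of repeated samples of n = 30 cluster into a tight, near-normal shape with spread close to the CLT’s σ/√n prediction.

import numpy as np

np.random.seed(0)

# A clearly skewed population (exponential, mean = 5)
population = np.random.exponential(scale=5, size=100_000)

# Draw 10,000 samples of size n = 30 and record each sample mean
sample_means = np.random.choice(population, size=(10_000, 30)).mean(axis=1)

print(f"Population std: {np.std(population):.2f} (heavily skewed data)")
print(f"Std of sample means: {np.std(sample_means):.2f} "
      f"(CLT prediction: {np.std(population) / np.sqrt(30):.2f})")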

Real-World Case Study: Battery Factory Quality Control

The Scenario

You’re a quality control manager at a smartphone battery factory. Your boss wants batteries that last at least 20 hours on average, but testing every single battery would be expensive and time-consuming. Plus, some tests are destructive!

The Challenge:

  • 🏭 Population: Millions of batteries produced daily
  • Unknown: True population mean battery life
  • 🎯 Goal: Determine if a batch meets the 20-hour requirement
  • 💰 Constraint: Can only test a small sample due to cost

Enter the CLT Hero

The Central Limit Theorem saves the day! Here’s how:

  1. Sample: Test 50 batteries from a batch (n = 50 > 30 ✓)
  2. Results: Sample mean = 20.3 hours, sample std = 2.1 hours
  3. Apply CLT: The sampling distribution of means is approximately normal
  4. Make Decision: Use confidence intervals to assess the entire batch

The Mathematical Solution

Using the CLT, we can construct a 95% confidence interval:

CI = x̄ ± t* × (s / √n) = 20.3 ± 2.01 × (2.1 / √50) ≈ 20.3 ± 0.60

Result: We’re 95% confident the true population mean is between 19.70 and 20.90 hours.

Decision: Since the entire confidence interval is above 20 hours, we can confidently approve this batch for shipment! 🚀

Why This Works (The CLT Magic)

  1. Large Enough Sample: n = 50 is sufficient for CLT to kick in
  2. Normal Distribution: Sample means follow a normal distribution regardless of how individual battery lives are distributed
  3. Predictable Precision: The standard error, s / √n, decreases as sample size increases (see the sketch after this list)
  4. Quantified Uncertainty: We know exactly how confident we can be
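
To make points 3 and 4 concrete, here’s a short sketch (reusing the 2.1-hour sample standard deviation from the battery example as an assumption) of how the standard error and the 95% margin of error shrink as n grows:

import numpy as np
from scipy import stats

sample_std = 2.1  # sample std from the battery example (assumed here)

# Standard error s/sqrt(n) and the 95% margin of error both shrink like 1/sqrt(n)
for n in [30, 50, 100, 200, 500]:
    se = sample_std / np.sqrt(n)
    margin = stats.t.ppf(0.975, n - 1) * se
    print(f"n = {n:3d}: standard error = {se:.3f} h, 95% margin = ±{margin:.3f} h")

Note the diminishing returns: quadrupling the sample size only halves the margin.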

Complete Code Implementation

import numpy as np
from scipy import stats
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Simulate the battery testing scenario
np.random.seed(42)

# Create a realistic battery population (slightly skewed, mean ~20.3)
true_population = np.random.gamma(shape=4, scale=5.075, size=100000)

# Sample 50 batteries for testing
sample_data = np.random.choice(true_population, 50)

# Calculate sample statistics
sample_mean = np.mean(sample_data)
sample_std = np.std(sample_data, ddof=1)  # ddof=1 for sample std
n = len(sample_data)

# 95% confidence interval using CLT
# Using t-distribution since we don't know population std
margin_of_error = stats.t.ppf(0.975, n-1) * (sample_std / np.sqrt(n))
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error

# Display results
print("🔋 Battery Quality Control Results")
print("=" * 40)
print(f"Sample size: {n} batteries")
print(f"Sample mean: {sample_mean:.2f} hours")
print(f"Sample std: {sample_std:.2f} hours")
print(f"95% Confidence Interval: [{ci_lower:.2f}, {ci_upper:.2f}] hours")
print(f"Meets 20-hour requirement: {'✅ YES' if ci_lower > 20 else '❌ NO'}")
print(f"Margin of error: ±{margin_of_error:.2f} hours")

# Create visualization
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=["Population Distribution", "Sample Data",
                   "Confidence Interval", "Sampling Distribution"],
    specs=[[{"type": "histogram"}, {"type": "histogram"}],
           [{"type": "scatter"}, {"type": "histogram"}]]
)

# Population distribution
fig.add_trace(
    go.Histogram(x=true_population[:1000], name="Population",
                histnorm="probability density", showlegend=False),
    row=1, col=1
)

# Sample data
fig.add_trace(
    go.Histogram(x=sample_data, name="Sample",
                histnorm="probability density", showlegend=False),
    row=1, col=2
)

# Confidence interval visualization
ci_x = [ci_lower, ci_upper, ci_upper, ci_lower, ci_lower]
ci_y = [0, 0, 1, 1, 0]
fig.add_trace(
    go.Scatter(x=ci_x, y=ci_y, fill="toself", name="95% CI",
              fillcolor="lightblue", line=dict(color="blue")),
    row=2, col=1
)
fig.add_vline(x=sample_mean, line=dict(color="red", width=3),
              annotation_text=f"Sample Mean: {sample_mean:.2f}h",
              row=2, col=1)
fig.add_vline(x=20, line=dict(color="green", width=2, dash="dash"),
              annotation_text="Requirement: 20h",
              row=2, col=1)

# Sampling distribution (theoretical)
x_range = np.linspace(sample_mean - 3*margin_of_error,
                     sample_mean + 3*margin_of_error, 100)
sampling_dist = stats.norm.pdf(x_range, sample_mean, sample_std/np.sqrt(n))
fig.add_trace(
    go.Scatter(x=x_range, y=sampling_dist, mode="lines",
              name="Sampling Distribution", line=dict(color="purple")),
    row=2, col=2
)

fig.update_layout(height=600, title="CLT in Action: Battery Quality Control")
fig.show()

Confidence Intervals: Your Statistical Superpower

Once you understand the CLT, confidence intervals become your go-to tool for making decisions under uncertainty. The general formula is:

CI = point estimate ± margin of error = x̄ ± (critical value) × (s / √n)

Where the margin of error depends on:

  • Confidence Level: How sure do you want to be? (90%, 95%, 99%)
  • Sample Size: Larger samples = smaller margin of error
  • Variability: More spread in data = larger margin of error

Confidence Level Trade-offs

# Compare different confidence levels
confidence_levels = [0.90, 0.95, 0.99]
colors = ['green', 'blue', 'red']

fig = go.Figure()

for i, conf_level in enumerate(confidence_levels):
    alpha = 1 - conf_level
    t_critical = stats.t.ppf(1 - alpha/2, n-1)
    margin = t_critical * (sample_std / np.sqrt(n))

    # Add confidence interval
    fig.add_shape(
        type="rect",
        x0=sample_mean - margin, x1=sample_mean + margin,
        y0=i*0.3, y1=(i+1)*0.3,
        fillcolor=colors[i], opacity=0.3,
        line=dict(color=colors[i], width=2)
    )

    fig.add_annotation(
        x=sample_mean, y=i*0.3 + 0.15,
        text=f"{conf_level*100:.0f}%: ±{margin:.2f}",
        showarrow=False
    )

fig.add_vline(x=sample_mean, line=dict(color="black", width=2))
fig.add_vline(x=20, line=dict(color="orange", width=2, dash="dash"))

fig.update_layout(
    title="Confidence Level Trade-offs",
    xaxis_title="Battery Life (hours)",
    yaxis_title="Confidence Level",
    height=400
)
fig.show()

Key Insight: Higher confidence = wider intervals. There’s always a trade-off between certainty and precision!

Interactive Challenge: Test Your Understanding

Scenario: You’re analyzing customer satisfaction scores (1-10 scale) for a new app. You survey 40 users and get a mean of 7.2 with a standard deviation of 1.8.

Questions:

  1. Can you apply the CLT here? (Check: n ≥ 30? ✓)
  2. What’s the 95% confidence interval for the true mean satisfaction?
  3. If you wanted a margin of error of only ±0.2, how many users would you need to survey?

Try it yourself, then check the solution below!

Solution
import numpy as np
import plotly.graph_objects as go
from scipy import stats

# Given data
n = 40
sample_mean = 7.2
sample_std = 1.8

# 95% confidence interval
margin_of_error = stats.t.ppf(0.975, n-1) * (sample_std / np.sqrt(n))
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error

print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")

# For margin of error = 0.2
desired_margin = 0.2
z_score = 1.96  # for 95% confidence
required_n = ((z_score * sample_std) / desired_margin) ** 2

print(f"Required sample size for ±0.2 margin: {int(np.ceil(required_n))} users")

# Visualization
fig = go.Figure()

# Current confidence interval
fig.add_shape(
    type="rect",
    x0=ci_lower, x1=ci_upper, y0=0, y1=1,
    fillcolor="lightblue", opacity=0.5,
    line=dict(color="blue", width=2)
)

# Desired confidence interval
desired_ci_lower = sample_mean - desired_margin
desired_ci_upper = sample_mean + desired_margin
fig.add_shape(
    type="rect",
    x0=desired_ci_lower, x1=desired_ci_upper, y0=1.2, y1=2.2,
    fillcolor="lightgreen", opacity=0.5,
    line=dict(color="green", width=2)
)

fig.add_vline(x=sample_mean, line=dict(color="red", width=3))

fig.add_annotation(x=sample_mean, y=0.5, text=f"Current: n={n}", showarrow=False)
fig.add_annotation(x=sample_mean, y=1.7, text=f"Desired: n={int(np.ceil(required_n))}", showarrow=False)

fig.update_layout(
    title="Sample Size vs. Precision Trade-off",
    xaxis_title="Satisfaction Score",
    yaxis_title="",
    height=300
)
fig.show()

Answers:

  1. Yes! n = 40 > 30, so CLT applies
  2. 95% CI: [6.62, 7.78]
  3. You’d need about 312 users for that precision!

Key Lesson: Precision is expensive! Going from ±0.58 to ±0.2 requires almost 8x more data.
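
Why so expensive? The margin of error shrinks like 1/√n, so the required n grows with the square of the desired precision. A quick sketch using the same 1.8 standard deviation:

import numpy as np

sample_std = 1.8  # satisfaction-score std from the challenge
z = 1.96          # critical value for 95% confidence

# Required n grows quadratically as the target margin shrinks
for margin in [0.6, 0.4, 0.3, 0.2, 0.1]:
    n_required = int(np.ceil((z * sample_std / margin) ** 2))
    print(f"margin ±{margin}: n = {n_required}")

Halving the margin quadruples the required sample size.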

What CLT Doesn’t Do (Important Limitations)

While CLT is powerful, it’s not magic. Here’s what it can’t help with:

🚫 Biased Samples

Problem: If your sample isn’t representative, the CLT won’t fix that.

Example: Surveying only iPhone users about phone preferences won’t tell you about Android users!

Solution: Get the sampling methodology right first.

🚫 Very Small Samples

Problem: The CLT needs "sufficiently large" samples, and how large depends on the shape of the data.

Example: For very skewed data, n = 5 won’t cut it.

Solution: Use bootstrap methods or exact distributions for small samples.

🚫 Dependent Data

Problem: The CLT assumes independent observations.

Example: Stock prices over time influence each other.

Solution: Use time series analysis or account for the correlation structure; the sketch below shows how badly the naive formula can fail.
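
Here’s a minimal sketch of what goes wrong, with a hypothetical AR(1) series (each value dragged toward the previous one) standing in for correlated data like stock prices. The usual s/√n formula badly understates the true spread of the sample mean:

import numpy as np

np.random.seed(7)

def ar1_series(n, phi=0.8):
    """Positively autocorrelated AR(1) series (hypothetical example)."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + np.random.normal()
    return x

n = 50
means = np.array([np.mean(ar1_series(n)) for _ in range(2000)])
pooled = np.concatenate([ar1_series(n) for _ in range(200)])

# What the iid formula claims vs. what actually happens
naive_se = np.std(pooled, ddof=1) / np.sqrt(n)
print(f"Naive iid standard error:      {naive_se:.3f}")
print(f"Actual spread of sample means: {np.std(means, ddof=1):.3f}  # far larger")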

🚫 Infinite Variance

Problem: Some theoretical distributions have infinite variance, so the CLT simply doesn’t apply.

Example: The Cauchy distribution (rare in practice).

Solution: Use robust statistics (e.g., the median) or a different theoretical framework; the sketch below shows the failure.
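
A tiny sketch of that failure: for Cauchy data, the sample mean of n draws is itself standard Cauchy, so averaging never tightens the distribution. Compare the interquartile range (IQR) of the sample means, which for finite-variance data would shrink like 1/√n:

import numpy as np
from scipy import stats

np.random.seed(11)

# For a Cauchy population (infinite variance), averaging doesn't help:
# the IQR of sample means stays around 2 no matter how large n gets
for n in [10, 100, 10_000]:
    means = np.array([np.mean(np.random.standard_cauchy(n)) for _ in range(2000)])
    print(f"n = {n:5d}: IQR of sample means = {stats.iqr(means):.2f}")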

Case Studies Across Industries

🏥 Medical Research: Drug Trial

Scenario: Testing a new blood pressure medication.

  • Population: All patients with hypertension
  • Sample: 200 patients in clinical trial
  • Measurement: Change in systolic blood pressure
  • CLT Application: Confidence interval for mean improvement
# Simulate drug trial data
np.random.seed(123)
bp_reduction = np.random.normal(12, 8, 200)  # Mean reduction: 12 mmHg

n = len(bp_reduction)
sample_mean = np.mean(bp_reduction)
sample_std = np.std(bp_reduction, ddof=1)

# 95% confidence interval
margin_of_error = stats.t.ppf(0.975, n-1) * (sample_std / np.sqrt(n))
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error

print(f"Drug Trial Results:")
print(f"Mean BP reduction: {sample_mean:.1f} mmHg")
print(f"95% CI: [{ci_lower:.1f}, {ci_upper:.1f}] mmHg")
print(f"Significant improvement: {'✅ YES' if ci_lower > 0 else '❌ NO'}")

🗳️ Political Polling: Election Prediction

Scenario: Predicting election results.

  • Population: All eligible voters
  • Sample: 1,000 survey respondents
  • Measurement: Proportion supporting candidate A
  • CLT Application: Confidence interval for vote share
# Simulate polling data
np.random.seed(456)
support_rate = 0.52  # True support rate: 52%
poll_responses = np.random.binomial(1, support_rate, 1000)

n = len(poll_responses)
sample_prop = np.mean(poll_responses)
sample_std = np.sqrt(sample_prop * (1 - sample_prop))  # Binomial std

# 95% confidence interval for proportion
margin_of_error = 1.96 * (sample_std / np.sqrt(n))
ci_lower = sample_prop - margin_of_error
ci_upper = sample_prop + margin_of_error

print(f"Polling Results:")
print(f"Support rate: {sample_prop:.1%}")
print(f"95% CI: [{ci_lower:.1%}, {ci_upper:.1%}]")
print(f"Margin of error: ±{margin_of_error:.1%}")

🌐 A/B Testing: Website Optimization

Scenario: Testing two website designs.

  • Population: All website visitors
  • Sample: 5,000 visitors per variant
  • Measurement: Conversion rate
  • CLT Application: Compare confidence intervals
# Simulate A/B test data
np.random.seed(789)
conversion_a = np.random.binomial(1, 0.08, 5000)  # Control: 8%
conversion_b = np.random.binomial(1, 0.095, 5000)  # Variant: 9.5%

def analyze_conversion(data, name):
    n = len(data)
    rate = np.mean(data)
    std = np.sqrt(rate * (1 - rate))
    margin = 1.96 * (std / np.sqrt(n))

    print(f"{name}:")
    print(f"  Conversion rate: {rate:.2%}")
    print(f"  95% CI: [{rate-margin:.2%}, {rate+margin:.2%}]")
    return rate, margin

rate_a, margin_a = analyze_conversion(conversion_a, "Control (A)")
rate_b, margin_b = analyze_conversion(conversion_b, "Variant (B)")

# Test for a significant difference (two-proportion z-test):
# combine the standard errors, not the margins -- the margins already
# include the 1.96 factor, so multiplying by 1.96 again would double-count
diff = rate_b - rate_a
se_a = np.sqrt(rate_a * (1 - rate_a) / len(conversion_a))
se_b = np.sqrt(rate_b * (1 - rate_b) / len(conversion_b))
se_diff = np.sqrt(se_a**2 + se_b**2)
significant = abs(diff) > 1.96 * se_diff

print(f"\nDifference: {diff:.2%}")
print(f"Statistically significant: {'✅ YES' if significant else '❌ NO'}")

Advanced Topics: When CLT Gets Interesting

Sample Size Calculation

Question: How many samples do you need for a given precision?

Formula:

n = (z × σ / E)²

Where:

  • z = critical value (1.96 for 95% confidence)
  • σ = population standard deviation (estimated)
  • E = desired margin of error
def calculate_sample_size(confidence_level, margin_of_error, std_dev):
    """Calculate required sample size for given precision."""
    alpha = 1 - confidence_level
    z_critical = stats.norm.ppf(1 - alpha/2)

    n = (z_critical * std_dev / margin_of_error) ** 2
    return int(np.ceil(n))

# Example: Customer satisfaction survey
required_n = calculate_sample_size(
    confidence_level=0.95,
    margin_of_error=0.2,
    std_dev=1.8
)

print(f"Required sample size: {required_n}")

Bootstrap vs. CLT

Bootstrap: A computer-intensive alternative to the CLT that approximates the sampling distribution by resampling your data with replacement. It’s handy when samples are small or skewed and the normal approximation is shaky.

def bootstrap_ci(data, n_bootstrap=10000, confidence_level=0.95):
    """Calculate confidence interval using bootstrap."""
    bootstrap_means = []
    n = len(data)

    for _ in range(n_bootstrap):
        bootstrap_sample = np.random.choice(data, size=n, replace=True)
        bootstrap_means.append(np.mean(bootstrap_sample))

    alpha = 1 - confidence_level
    lower_percentile = (alpha/2) * 100
    upper_percentile = (1 - alpha/2) * 100

    ci_lower = np.percentile(bootstrap_means, lower_percentile)
    ci_upper = np.percentile(bootstrap_means, upper_percentile)

    return ci_lower, ci_upper

# Compare CLT vs Bootstrap
small_sample = np.random.exponential(2, 15)  # Small, skewed sample

# CLT approach
clt_mean = np.mean(small_sample)
clt_std = np.std(small_sample, ddof=1)
clt_margin = stats.t.ppf(0.975, len(small_sample)-1) * (clt_std / np.sqrt(len(small_sample)))
clt_ci = (clt_mean - clt_margin, clt_mean + clt_margin)

# Bootstrap approach
bootstrap_ci_result = bootstrap_ci(small_sample)

print(f"CLT CI: [{clt_ci[0]:.2f}, {clt_ci[1]:.2f}]")
print(f"Bootstrap CI: [{bootstrap_ci_result[0]:.2f}, {bootstrap_ci_result[1]:.2f}]")

Practice Problems with Solutions

Problem 1: Coffee Shop Revenue

Question: A coffee shop’s daily revenue has a mean of $1,200 and standard deviation of $300. If you calculate the average revenue over 25 days, what’s the probability this average exceeds $1,300?

Solution
# Given information
mu = 1200  # Population mean
sigma = 300  # Population std
n = 25  # Sample size
target = 1300  # Target value

# Sampling distribution parameters
sampling_mean = mu
sampling_std = sigma / np.sqrt(n)  # Standard error

# Calculate probability
z_score = (target - sampling_mean) / sampling_std
prob = 1 - stats.norm.cdf(z_score)

print(f"Sampling distribution: N({sampling_mean}, {sampling_std:.1f})")
print(f"Z-score: {z_score:.2f}")
print(f"P(sample mean > $1300) = {prob:.4f} or {prob:.2%}")

Answer: About 4.78% chance

Problem 2: Manufacturing Quality

Question: Light bulbs have lifespans with mean 1000 hours and standard deviation 200 hours. In a sample of 64 bulbs, what’s the probability the sample mean is between 950 and 1050 hours?

Solution
# Given information
mu = 1000
sigma = 200
n = 64
lower_bound = 950
upper_bound = 1050

# Sampling distribution
sampling_std = sigma / np.sqrt(n)

# Calculate z-scores
z_lower = (lower_bound - mu) / sampling_std
z_upper = (upper_bound - mu) / sampling_std

# Calculate probability
prob = stats.norm.cdf(z_upper) - stats.norm.cdf(z_lower)

print(f"Sampling distribution: N({mu}, {sampling_std:.1f})")
print(f"Z-scores: {z_lower:.2f} to {z_upper:.2f}")
print(f"P(950 < sample mean < 1050) = {prob:.4f} or {prob:.2%}")

Answer: About 95.45% chance

Key Takeaways for Practitioners

🎯 CLT is your foundation: Most statistical inference relies on it

📊 Confidence intervals > point estimates: Always quantify uncertainty

🔍 Sample size matters: But there are diminishing returns

⚠️ Check your assumptions: Independence, sufficient sample size, representative sampling

🛠️ Multiple tools available: CLT, bootstrap, exact methods - choose appropriately

💡 Context is king: Statistical significance ≠ practical significance

What’s Next?

Now that you’ve mastered CLT applications, you’re ready to explore:

  • Hypothesis Testing: Using CLT to test specific claims about populations
  • Regression Analysis: How CLT underlies the assumptions in linear models
  • ANOVA: Comparing multiple groups using CLT principles
  • Bayesian Statistics: A different approach to uncertainty quantification

Further Reading

  • Advanced: "Mathematical Statistics with Applications" by Wackerly, Mendenhall, and Scheaffer
  • Practical: "Practical Statistics for Data Scientists" by Bruce & Bruce
  • Online: Duke’s Inferential Statistics course on Coursera
  • Interactive: Try the examples in this article with your own data!

Remember: The Central Limit Theorem isn’t just theory - it’s the practical foundation for making confident decisions with data in the real world! 🚀
