How to Calculate Prediction Interval in Python
Prediction intervals are essential in statistics for estimating the range within which future observations are likely to fall. This guide explains how to calculate prediction intervals in Python, including the formula, implementation steps, and practical applications.
What is a Prediction Interval?
A prediction interval is a range that is expected to contain a single future observation with a stated probability (for example, 95%). Unlike a confidence interval, which bounds a population parameter such as the mean response, a prediction interval accounts for both the uncertainty in the estimated model parameters and the inherent variability of individual observations.
Prediction intervals are particularly useful in regression analysis, time series forecasting, and machine learning applications where estimating future values is critical.
Prediction Interval Formula
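For simple linear regression (one independent variable), the prediction interval for a new observation at $x_0$ is:
$$\hat{y}_0 \pm t_{\alpha/2,\,n-2} \cdot s \cdot \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
Where:
- $\hat{y}_0$ is the predicted value at $x_0$
- $t_{\alpha/2,\,n-2}$ is the critical value of the t-distribution with $n-2$ degrees of freedom
- $s$ is the residual standard error, $\sqrt{\sum_{i}(y_i - \hat{y}_i)^2 / (n-2)}$
- $n$ is the sample size
- $x_0$ is the value of the independent variable for which we want to predict
- $\bar{x}$ is the mean of the observed values $x_i$ of the independent variable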
This formula accounts for both the uncertainty in the regression line and the variability of individual data points.
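To make the formula concrete, the sketch below evaluates each term by hand with NumPy and takes the critical value from SciPy. It uses the same five data points as the example later in this guide. The use of `scipy.stats` and the helper name `prediction_interval` are assumptions for illustration, not part of the original walkthrough.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
n = x.size

# Least-squares slope and intercept for y = b0 + b1 * x
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x_bar

# Residual standard error with n - 2 degrees of freedom
residuals = y - (b0 + b1 * x)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Two-sided critical value for a 95% interval
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)


def prediction_interval(x0):
    """Return (lower, upper) bounds for a new observation at x0."""
    y_hat = b0 + b1 * x0
    half_width = t_crit * s * np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    return y_hat - half_width, y_hat + half_width


lower, upper = prediction_interval(3.0)
print(f"x0 = 3: [{lower:.2f}, {upper:.2f}]")
```

Because the last term under the square root grows with the distance of `x0` from `x_bar`, calling `prediction_interval` at points further from 3 should return wider intervals.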
Calculating Prediction Interval in Python
In Python, statsmodels computes prediction intervals for OLS models directly. scikit-learn's LinearRegression does not report them, although its quantile-based estimators can be used to approximate them. Below is a step-by-step guide using statsmodels:
Step 1: Install Required Libraries
pip install statsmodels numpy pandas
Step 2: Import Libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import summary_table
Step 3: Prepare Data
# Example data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
# Add constant for intercept term
X = sm.add_constant(x)
Step 4: Fit Linear Regression Model
model = sm.OLS(y, X).fit()
Step 5: Calculate Prediction Interval
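# Get prediction intervals
# summary_table returns a formatted table, its values as an array, and the column names
st, data, ss2 = summary_table(model, alpha=0.05)
fitted_values = data[:, 2]  # predicted value for each observation
predict_mean_se = data[:, 3]  # standard error of the mean prediction
# Columns 4-5: confidence interval for the mean response
predict_mean_ci_low, predict_mean_ci_upp = data[:, 4:6].T
# Columns 6-7: prediction interval for an individual observation
predict_ci_low, predict_ci_upp = data[:, 6:8].T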
# Print results
print("Predicted Values:", fitted_values)
print("Prediction Interval Lower:", predict_ci_low)
print("Prediction Interval Upper:", predict_ci_upp)
Note: alpha is the significance level, so alpha=0.05 gives 95% intervals. A smaller alpha (for example, 0.01 for 99%) widens the intervals, and a larger alpha narrows them.
Example Calculation
Let's calculate a prediction interval for a simple linear regression model with the following data:
| X (Independent Variable) | Y (Dependent Variable) |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 5 |
| 4 | 4 |
| 5 | 5 |
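The fitted line is ŷ = 2.2 + 0.6x. Using the Python code above, we get the following 95% prediction intervals for each x value (rounded to two decimals). The intervals are wide because, with only five observations, the residual variance is estimated on just three degrees of freedom:
| X | Predicted Y | Lower Bound | Upper Bound |
|---|---|---|---|
| 1 | 2.80 | -0.80 | 6.40 |
| 2 | 3.40 | 0.15 | 6.65 |
| 3 | 4.00 | 0.88 | 7.12 |
| 4 | 4.60 | 1.35 | 7.85 |
| 5 | 5.20 | 1.60 | 8.80 |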
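This means that, for each x value taken on its own, a single future observation at that x is expected to fall within the corresponding interval about 95% of the time, provided the linear model with normally distributed, constant-variance errors holds. The 95% applies to each interval individually, not to all five at once.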
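One way to check this interpretation is by simulation: generate many datasets from a known model, build a prediction interval each time, and count how often a fresh observation lands inside it. The sketch below does this with NumPy and SciPy. The true line, noise level, sample size, and seed are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
n = x.size
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
x0 = 7.0  # point at which we predict
t_crit = stats.t.ppf(0.975, df=n - 2)

trials = 2000
hits = 0
for _ in range(trials):
    # Assumed truth: y = 1 + 2x plus normal noise with standard deviation 1.5
    y = 1.0 + 2.0 * x + rng.normal(0.0, 1.5, size=n)
    b1, b0 = np.polyfit(x, y, 1)  # slope first, then intercept
    s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
    half = t_crit * s * np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    y_hat = b0 + b1 * x0
    # Draw one fresh observation at x0 and check whether the interval caught it
    y_new = 1.0 + 2.0 * x0 + rng.normal(0.0, 1.5)
    hits += int(y_hat - half <= y_new <= y_hat + half)

coverage = hits / trials
print(f"Empirical coverage: {coverage:.3f}")
```

If the model assumptions hold, the printed coverage should be close to 0.95. Noticeably lower coverage on real data would suggest the assumptions are violated.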
Interpreting the Results
When interpreting prediction intervals:
- Wider intervals indicate more uncertainty in the prediction.
- Narrower intervals suggest more precise predictions.
- At the same confidence level and x value, a prediction interval is always wider than the confidence interval for the mean response, because it also includes the variability of an individual observation.
Prediction intervals are particularly valuable in applications like:
- Quality control in manufacturing
- Financial forecasting
- Weather prediction
- Medical diagnosis