How to Calculate Confidence Intervals for Linear Regression in R
Linear regression is a fundamental statistical technique used to model the relationship between a dependent variable and one or more independent variables. When performing linear regression, it's essential to understand the confidence intervals associated with the estimated coefficients. This guide will walk you through calculating confidence intervals for linear regression in R, including step-by-step instructions, formulas, and practical examples.
What is Linear Regression?
Linear regression is a statistical method that models the relationship between a dependent variable (response) and one or more independent variables (predictors) by fitting a linear equation to observed data. The simplest form of linear regression is the simple linear regression model, which involves one independent variable:
Simple Linear Regression Model:
Y = β₀ + β₁X + ε
Where:
- Y = dependent variable
- β₀ = intercept (value of Y when X=0)
- β₁ = slope (change in Y for a one-unit change in X)
- X = independent variable
- ε = random error term
The goal of linear regression is to estimate the coefficients β₀ and β₁ that minimize the sum of squared residuals between the observed values and the values predicted by the linear model.
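As a minimal sketch of what least-squares estimation produces in the one-predictor case, the closed-form estimates can be computed by hand and compared with lm(). The data values are the ones used in the example later in this guide; the variable names (b0, b1, fit) are illustrative.

```r
# Illustrative data (same values as the example later in this guide)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(5, 7, 8, 9, 10, 12, 13, 15, 16, 18)

# Closed-form least-squares estimates for simple linear regression
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# lm() minimizes the same sum of squared residuals
fit <- lm(y ~ x)

print(c(intercept = b0, slope = b1))
print(coef(fit))
```

Both print statements should show the same pair of values, since lm() solves the same minimization problem.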
Understanding Confidence Intervals
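A confidence interval is a range of values, computed from the sample, that is constructed to contain the true population parameter with a stated level of confidence. For a 95% interval, this means that if the study were repeated many times, about 95% of the intervals built this way would contain the true value. In the context of linear regression, confidence intervals are typically calculated for the regression coefficients (β₀ and β₁) to assess how precisely they have been estimated.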
The general formula for the confidence interval of a regression coefficient is:
Confidence Interval for Regression Coefficients:
β̂ ± t*(α/2, n-p-1) * SE(β̂)
Where:
- β̂ = estimated coefficient
- t*(α/2, n-p-1) = critical t-value (upper α/2 quantile) from the t-distribution with n-p-1 degrees of freedom
- SE(β̂) = standard error of the coefficient
- α = significance level (e.g., 0.05 for 95% confidence)
- n = sample size
- p = number of predictors
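The formula above can be sketched directly in R by taking the estimates and standard errors from the model summary and the critical value from qt(). This is an illustrative check rather than the recommended workflow; the data frame dat and the object names are assumptions made for the example.

```r
# Illustrative data
dat <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  y = c(5, 7, 8, 9, 10, 12, 13, 15, 16, 18)
)
fit <- lm(y ~ x, data = dat)

# Estimates and standard errors from the coefficient table
coef_table <- coef(summary(fit))
est <- coef_table[, "Estimate"]
se  <- coef_table[, "Std. Error"]

# Critical t-value with n - p - 1 residual degrees of freedom
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df = df.residual(fit))

# Estimate plus/minus critical value times standard error
manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

print(manual_ci)
print(confint(fit, level = 1 - alpha))  # should agree with manual_ci
```

Here df.residual(fit) returns n-p-1 (8 for ten observations and one predictor), so the manual bounds should match what confint() reports.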
The standard error of the coefficient can be calculated as:
Standard Error of Coefficient:
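SE(β̂ⱼ) = √(σ̂² * [(X'X)⁻¹]ⱼⱼ)
Where:
- σ̂² = estimated variance of the error term (residual sum of squares divided by n-p-1)
- X = design matrix (a column of ones for the intercept plus one column per independent variable)
- [(X'X)⁻¹]ⱼⱼ = j-th diagonal element of the inverse of the cross-product matrix X'X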
In practice, R calculates these values automatically when you fit a linear regression model using the lm() function.
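To see where those standard errors come from, the sketch below rebuilds them from the design matrix: it takes the square root of the diagonal of the estimated error variance times the inverse cross-product matrix, then compares the result with the summary() output. The data and object names are illustrative assumptions.

```r
# Illustrative data
dat <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  y = c(5, 7, 8, 9, 10, 12, 13, 15, 16, 18)
)
fit <- lm(y ~ x, data = dat)

# Design matrix, including the column of ones for the intercept
X <- model.matrix(fit)

# Estimated error variance: residual sum of squares / residual degrees of freedom
sigma2_hat <- sum(residuals(fit)^2) / df.residual(fit)

# Standard errors: square root of the diagonal of sigma2_hat * (X'X)^-1
se_manual <- sqrt(sigma2_hat * diag(solve(t(X) %*% X)))

print(se_manual)
print(coef(summary(fit))[, "Std. Error"])  # should agree with se_manual
```

The same diagonal is also available as diag(vcov(fit)), which is usually the more convenient route in real analyses.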
Calculating Confidence Intervals in R
R provides several functions to calculate confidence intervals for linear regression coefficients. The most common approach is to use the confint() function on a fitted linear model object.
Step-by-Step Instructions
- Load your data into R.
- Fit a linear regression model using the lm() function.
- Use the confint() function to calculate confidence intervals.
- Interpret the results.
Example Code
Here's a complete example of how to calculate confidence intervals for a linear regression model in R:
# Create the example data
data <- data.frame(
  y = c(5, 7, 8, 9, 10, 12, 13, 15, 16, 18),
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
# Fit linear regression model
model <- lm(y ~ x, data = data)
# Calculate 95% confidence intervals
confidence_intervals <- confint(model)
# Print results
print(confidence_intervals)
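The output is a matrix with one row per coefficient. The first column (labeled 2.5 %) gives the lower bound and the second column (labeled 97.5 %) gives the upper bound of the 95% confidence interval. The point estimates themselves are not part of this output; use coef(model) or summary(model) to see them.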
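Because confint() returns only the interval bounds, one convenient way to view the estimates next to their intervals is to bind the two together. The following is a small sketch of that idea; the object name result is illustrative.

```r
# Illustrative data and model
data <- data.frame(
  y = c(5, 7, 8, 9, 10, 12, 13, 15, 16, 18),
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
model <- lm(y ~ x, data = data)

# Combine point estimates with the 95% interval bounds in one table
result <- cbind(Estimate = coef(model), confint(model))

print(round(result, 3))
```

Each row of result then holds a coefficient's estimate followed by its lower and upper bound.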
Worked Example
Let's consider a simple example where we want to model the relationship between hours studied (x) and exam scores (y). We'll use the following data:
| Hours Studied (x) | Exam Score (y) |
|---|---|
| 1 | 5 |
| 2 | 7 |
| 3 | 8 |
| 4 | 9 |
| 5 | 10 |
| 6 | 12 |
| 7 | 13 |
| 8 | 15 |
| 9 | 16 |
| 10 | 18 |
Using the R code provided earlier, we fit a linear regression model and calculate the 95% confidence intervals for the coefficients. For this data, the results (rounded to two decimal places) are:
| Coefficient | Estimate | Lower CI | Upper CI |
|---|---|---|---|
| (Intercept) | 3.67 | 3.06 | 4.27 |
| x | 1.39 | 1.29 | 1.49 |
This means we can be 95% confident that the true intercept value lies between 3.06 and 4.27, and the true slope value lies between 1.29 and 1.49.
Interpreting Results
When interpreting confidence intervals for linear regression coefficients, consider the following:
- If the confidence interval for a coefficient includes zero, the coefficient is not statistically significant at the corresponding significance level (α = 0.05 for a 95% interval).
- A narrower confidence interval indicates greater precision in the estimate of the coefficient.
- Confidence intervals provide a range of plausible values for the true population parameter, accounting for sampling variability.
In our example, both the intercept and slope coefficients have confidence intervals that do not include zero, indicating that both are statistically significant at the 95% confidence level.
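The zero check described above can also be done programmatically, which is handy when a model has many coefficients. This is a minimal sketch; the name excludes_zero is illustrative.

```r
# Illustrative data and model
data <- data.frame(
  y = c(5, 7, 8, 9, 10, 12, 13, 15, 16, 18),
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
model <- lm(y ~ x, data = data)

ci <- confint(model)

# TRUE when the whole interval lies above zero or below zero
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0

# Interval width as a simple measure of precision
ci_width <- ci[, 2] - ci[, 1]

print(excludes_zero)
print(ci_width)
```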
FAQ
What is the difference between a confidence interval and a prediction interval in linear regression?
A confidence interval estimates the range of plausible values for the true population parameter (e.g., regression coefficients), while a prediction interval estimates the range of plausible values for a new observation. Prediction intervals are typically wider than confidence intervals because they account for additional uncertainty in predicting future values.
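In R, both kinds of interval at a given predictor value are available through the interval argument of predict(). Note that the "confidence" option here gives an interval for the mean response at that value of x, not for a coefficient. The sketch below assumes a hypothetical new value x = 5.5, chosen only for illustration.

```r
# Illustrative data and model
data <- data.frame(
  y = c(5, 7, 8, 9, 10, 12, 13, 15, 16, 18),
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
model <- lm(y ~ x, data = data)

# Hypothetical new predictor value (an assumption for this example)
new_obs <- data.frame(x = 5.5)

# Interval for the mean response at x = 5.5
ci_mean <- predict(model, newdata = new_obs, interval = "confidence", level = 0.95)

# Interval for a single new observation at x = 5.5
pi_new <- predict(model, newdata = new_obs, interval = "prediction", level = 0.95)

print(ci_mean)  # columns: fit, lwr, upr
print(pi_new)   # same fit, wider lwr/upr
```

Both results share the same fitted value, but the prediction interval is wider because it also includes the variability of an individual observation around the mean.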
How do I change the confidence level in R?
By default, the confint() function in R calculates 95% confidence intervals. To change the confidence level, you can use the level argument. For example, to calculate 90% confidence intervals, you would use confint(model, level = 0.9).
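A short sketch of the level argument in use, together with the parm argument, which restricts the output to selected coefficients. The object names are illustrative.

```r
# Illustrative data and model
data <- data.frame(
  y = c(5, 7, 8, 9, 10, 12, 13, 15, 16, 18),
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
model <- lm(y ~ x, data = data)

# 90% intervals for all coefficients
ci90 <- confint(model, level = 0.90)

# 99% interval for the slope only
ci99_slope <- confint(model, parm = "x", level = 0.99)

print(ci90)
print(ci99_slope)
```

A higher confidence level produces a wider interval, so the 99% interval for the slope should be wider than the 90% one.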
What assumptions are required for linear regression confidence intervals to be valid?
Linear regression confidence intervals rely on several key assumptions, including linearity, independence of errors, homoscedasticity (constant variance of errors), and normality of errors. Violations of these assumptions can affect the validity of the confidence intervals.
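A rough first check of some of these assumptions can be made from the residuals. The sketch below is one possible starting point, not a complete diagnostic procedure: it summarizes the residuals and runs a Shapiro-Wilk test for normality, which has limited power with a sample as small as this one.

```r
# Illustrative data and model
data <- data.frame(
  y = c(5, 7, 8, 9, 10, 12, 13, 15, 16, 18),
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
model <- lm(y ~ x, data = data)

res <- residuals(model)

# With an intercept in the model, least-squares residuals average to zero
print(summary(res))

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
sw <- shapiro.test(res)
print(sw)
```

For a visual check, plot(model, which = 1:2) draws the residuals-versus-fitted plot (useful for spotting non-linearity or non-constant variance) and the normal Q-Q plot.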