Simple Linear Regression

Quantitative Methods

Estimation of the Simple Linear Regression Model

Learning Outcome Statement:

describe a simple linear regression model, how the least squares criterion is used to estimate regression coefficients, and the interpretation of these coefficients

Summary:

This LOS covers the estimation of simple linear regression models, focusing on the relationship between dependent and independent variables, the estimation of regression coefficients using least squares, and the interpretation of these coefficients. It also discusses the assumptions necessary for valid regression analysis and introduces methods for transforming non-linear data to fit linear models.

Key Concepts:

Standard Error of the Estimate

A measure of the distance between observed values of the dependent variable and those predicted by the regression model. A smaller value indicates a better fit.

Standard Error of the Forecast

Used to provide an interval estimate around the regression line, acknowledging that the line does not perfectly describe the relationship between variables.

Functional Forms for Non-linear Data

Includes transformations like log-lin, lin-log, and log-log models to adjust simple linear regression models to fit non-linear data effectively.

Goodness-of-Fit Measures

Includes the coefficient of determination (R²), the F-statistic, and the standard error of the estimate, used to evaluate the fit of the regression model.

Least Squares Criterion

A method used to estimate the regression coefficients by minimizing the sum of the squared vertical distances (residuals) between observed and predicted values.

Interpretation of Regression Coefficients

The intercept represents the expected value of the dependent variable when the independent variable is zero. The slope indicates the change in the dependent variable for a one-unit change in the independent variable.

Formulas:

Sum of Squares Total (SST)

\text{SST} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2

Total variability in the dependent variable around its mean.

Variables:
Y_i: Observation of the dependent variable
\bar{Y}: Mean of the dependent variable
n: Number of observations
Units: Units of Y
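
A minimal numpy sketch of this computation; the data values below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical observations of the dependent variable (illustrative only)
y = np.array([4.2, 5.1, 3.8, 6.0, 5.5])

# SST: total variability of Y around its mean
sst = np.sum((y - y.mean()) ** 2)
print(sst)
```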

Slope Coefficient

\hat{b}_1 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}

Estimates the change in the dependent variable per unit change in the independent variable.

Variables:
Y_i: Observation of the dependent variable
X_i: Observation of the independent variable
\bar{Y}: Mean of the dependent variable
\bar{X}: Mean of the independent variable
n: Number of observations
Units: Units of Y per unit of X

Intercept Coefficient

\hat{b}_0 = \bar{Y} - \hat{b}_1 \bar{X}

Predicted value of the dependent variable when the independent variable is zero.

Variables:
\bar{Y}: Mean of the dependent variable
\hat{b}_1: Estimated slope coefficient
\bar{X}: Mean of the independent variable
Units: Units of Y
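
The two estimators can be computed directly from these formulas. Below is a minimal sketch, assuming numpy is available; the x and y values are hypothetical, and the result is cross-checked against numpy's built-in least-squares fit:

```python
import numpy as np

# Hypothetical paired observations (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 4.2, 5.1])

# Slope: covariance term divided by the sum of squared deviations of X
b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: forces the fitted line through the point (Xbar, Ybar)
b0 = y.mean() - b1 * x.mean()

# Cross-check against numpy's least-squares polynomial fit
b1_np, b0_np = np.polyfit(x, y, deg=1)
print(b0, b1)        # manual estimates
print(b0_np, b1_np)  # should match
```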

Hypothesis Tests in the Simple Linear Regression Model

Learning Outcome Statement:

calculate and interpret measures of fit and formulate and evaluate tests of fit and of regression coefficients in a simple linear regression

Summary:

This LOS covers the evaluation of the simple linear regression model through hypothesis testing, analysis of variance (ANOVA), and measures of goodness of fit. It includes breaking down the total variation into explained and unexplained parts, calculating the coefficient of determination (R-squared), and testing hypotheses about regression coefficients using t-tests and F-tests.

Key Concepts:

Sum of Squares

Total variation in the dependent variable (SST) is decomposed into the sum of squares due to regression (SSR) and the sum of squares due to error (SSE), with SST = SSR + SSE.

Coefficient of Determination (R-squared)

This is the proportion of the variation in the dependent variable that is explained by the independent variable. It is calculated as SSR/SST.

F-test

Used to determine the overall fit of the regression model. It compares the variance explained by the model to the variance unexplained, using the ratio MSR/MSE.

t-test for Regression Coefficients

Used to test the significance of individual regression coefficients in the model. The null hypothesis typically states that the coefficient is equal to zero (no effect).

Formulas:

Coefficient of Determination

R^2 = \frac{SSR}{SST}

Measures the proportion of variability in the dependent variable that is explained by the regression model.

Variables:
R^2: Coefficient of determination
SSR: Sum of squares due to regression
SST: Total sum of squares
Units: dimensionless

F-statistic

F = \frac{MSR}{MSE}

Used to test if the regression model provides a better fit to the data than a model with no independent variables.

Variables:
F: F-statistic for testing the overall regression equation
MSR: Mean square due to regression
MSE: Mean square error
Units: dimensionless

t-statistic for Regression Coefficients

t = \frac{\hat{b}_1 - B_1}{s_{\hat{b}_1}}

Used to determine if individual regression coefficients are significantly different from zero or another value.

Variables:
t: t-statistic for testing individual regression coefficients
\hat{b}_1: Estimated regression coefficient
B_1: Hypothesized value of the regression coefficient
s_{\hat{b}_1}: Standard error of the estimated regression coefficient
Units: dimensionless
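
A sketch tying these measures together, assuming numpy and scipy; the data are hypothetical. Note that in simple linear regression the F-statistic equals the square of the t-statistic for the slope:

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 2.8, 3.1, 4.5, 5.2, 5.9])
n = len(y)

# Least-squares estimates and fitted values
b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# ANOVA decomposition: SST = SSR + SSE
sst = np.sum((y - y.mean()) ** 2)      # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained by the regression
sse = np.sum((y - y_hat) ** 2)         # unexplained (residual)

r2 = ssr / sst
# One independent variable: MSR has 1 df, MSE has n - 2
f_stat = (ssr / 1) / (sse / (n - 2))

# t-test of H0: b1 = 0
se_est = np.sqrt(sse / (n - 2))
s_b1 = se_est / np.sqrt(np.sum((x - x.mean()) ** 2))
t_stat = (b1 - 0) / s_b1
p_val = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 2))

print(r2, f_stat, t_stat, p_val)  # f_stat should equal t_stat ** 2
```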

Assumptions of the Simple Linear Regression Model

Learning Outcome Statement:

explain the assumptions underlying the simple linear regression model, and describe how residuals and residual plots indicate if these assumptions may have been violated

Summary:

The simple linear regression model relies on four key assumptions: linearity, homoskedasticity, independence, and normality of residuals. Violations of these assumptions can be detected through analysis of residuals and residual plots, which can show patterns indicating issues such as non-linearity, heteroskedasticity, autocorrelation, or non-normal distribution of residuals.

Key Concepts:

Linearity

The relationship between the dependent variable Y and the independent variable X must be linear. Non-linear relationships, when modeled as linear, can lead to biased estimates.

Homoskedasticity

The variance of the residuals should be constant across all observations. If the variance changes (heteroskedasticity), it can lead to inefficiencies in the estimation.

Independence

Observations must be independent of each other, meaning there should be no autocorrelation among residuals. Autocorrelation can lead to underestimation of the standard errors and misleading statistical tests.

Normality

Residuals should be normally distributed. This assumption is crucial for small sample sizes, as it underpins the validity of hypothesis tests concerning the regression coefficients.

Formulas:

Homoskedasticity Assumption

E(\epsilon_i^2) = \sigma_{\epsilon}^2, \quad i = 1, \ldots, n

This formula states that the expected value of the squared residuals is constant across all observations.

Variables:
\epsilon_i: Residuals
\sigma_{\epsilon}^2: Constant variance of residuals
Units: squared units of the dependent variable
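
A minimal sketch of a residual-plot diagnostic, assuming numpy and matplotlib; the simulated data are hypothetical:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate hypothetical data satisfying the assumptions (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 1.5 + 0.8 * x + rng.normal(0, 0.5, size=x.size)

# Fit the line and compute residuals
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

# Residuals vs. fitted values: a random scatter around zero is consistent
# with linearity and homoskedasticity; a funnel shape suggests
# heteroskedasticity, and a curved pattern suggests non-linearity.
plt.scatter(b0 + b1 * x, residuals)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```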

Prediction in the Simple Linear Regression Model

Learning Outcome Statement:

describe the use of analysis of variance (ANOVA) in regression analysis, interpret ANOVA results, and calculate and interpret the standard error of estimate in a simple linear regression; calculate and interpret the predicted value for the dependent variable, and a prediction interval for it, given an estimated linear regression model and a value for the independent variable

Summary:

This LOS covers the use of ANOVA and standard error of estimate in simple linear regression for assessing model fit, and the calculation and interpretation of predicted values and prediction intervals for the dependent variable based on a given linear regression model and an independent variable value.

Key Concepts:

ANOVA in Simple Linear Regression

ANOVA (Analysis of Variance) is used in regression analysis to decompose the variability in the dependent variable into variability explained by the model (regression) and unexplained variability (error). It provides a framework to test the significance of the model.

Standard Error of Estimate

The standard error of the estimate quantifies the average distance that the observed values fall from the regression line. A smaller standard error indicates a better fit of the model to the data.

Prediction Using Simple Linear Regression

Predictions in regression are made using the estimated regression coefficients. The predicted value of the dependent variable is computed for given values of the independent variable.

Prediction Intervals

Prediction intervals provide a range within which the actual value of the dependent variable is expected to fall for a given value of the independent variable, with a stated level of confidence. They account for the uncertainty in the prediction.

Formulas:

Standard Error of the Estimate (s_e)

s_e = \sqrt{MSE} = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2}}

This formula calculates the standard error of the estimate, which measures the accuracy of predictions made by the regression model.

Variables:
s_e: Standard error of the estimate
MSE: Mean square error
Y_i: Observed values
\hat{Y}_i: Predicted values
n: Number of observations
Units: units of the dependent variable
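
A minimal sketch of this calculation, assuming numpy; the observed and fitted values are hypothetical:

```python
import numpy as np

# Hypothetical observed and fitted values (illustrative only)
y = np.array([3.1, 4.0, 4.8, 6.2, 7.1])
y_hat = np.array([3.0, 4.1, 5.0, 6.0, 7.0])
n = len(y)

# Standard error of the estimate: sqrt(SSE / (n - 2))
se = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))
print(se)
```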

Predicted Value of the Dependent Variable

\hat{Y}_f = \hat{b}_0 + \hat{b}_1 X_f

This formula is used to predict the value of the dependent variable based on the estimated regression coefficients and a specific value of the independent variable.

Variables:
\hat{Y}_f: Predicted value of the dependent variable
\hat{b}_0: Estimated intercept
\hat{b}_1: Estimated slope
X_f: Value of the independent variable for prediction
Units: units of the dependent variable

Standard Error of the Forecast (s_f)

s_f = s_e \sqrt{1 + \frac{1}{n} + \frac{(X_f - \bar{X})^2}{(n-1)s_X^2}}

This formula calculates the standard error of the forecast, which measures the uncertainty in the predicted value of the dependent variable.

Variables:
s_f: Standard error of the forecast
s_e: Standard error of the estimate
n: Number of observations
X_f: Value of the independent variable at which the forecast is made
\bar{X}: Mean of the independent variable
s_X^2: Sample variance of the independent variable
Units: units of the dependent variable
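
The three prediction formulas combine into a prediction interval. A minimal sketch, assuming numpy and scipy; the data and the forecast point X_f are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 2.8, 3.1, 4.5, 5.2, 5.9])
n = len(x)

# Fit the line and compute the standard error of the estimate
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
se = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))

# Point prediction at a chosen X_f
x_f = 4.5
y_f = b0 + b1 * x_f

# Standard error of the forecast: wider than se because it also
# reflects uncertainty in the estimated line
s_x2 = np.var(x, ddof=1)  # sample variance of X
s_f = se * np.sqrt(1 + 1 / n + (x_f - x.mean()) ** 2 / ((n - 1) * s_x2))

# 95% prediction interval with n - 2 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 2)
print(y_f - t_crit * s_f, y_f + t_crit * s_f)
```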

Functional Forms for Simple Linear Regression

Learning Outcome Statement:

describe different functional forms of simple linear regressions

Summary:

This LOS explores various functional forms for simple linear regression beyond the basic linear model to better fit non-linear relationships in economic and financial data. It covers the log-lin, lin-log, and log-log models, each involving logarithmic transformations of the dependent and/or independent variables. The choice of the correct functional form is crucial for improving the model's fit, as evidenced by goodness-of-fit measures and residual analysis.

Key Concepts:

Log-Lin Model

In the log-lin model, the dependent variable is transformed using the natural logarithm, while the independent variable remains in its original units. The slope coefficient is the relative (percentage) change in the dependent variable for an absolute one-unit change in the independent variable, so this model is useful when changes in the independent variable have multiplicative effects on the dependent variable.

Lin-Log Model

The lin-log model involves transforming the independent variable using the natural logarithm while keeping the dependent variable linear. It is suitable when relative changes in the independent variable lead to absolute changes in the dependent variable.

Log-Log Model

Both the dependent and independent variables are transformed using logarithms in the log-log model. This model is particularly useful for estimating elasticities, as the coefficients represent the elasticity of the dependent variable with respect to the independent variable.

Selecting the Correct Functional Form

The correct functional form is determined by examining goodness-of-fit measures such as R-squared, F-statistic, and the standard error of the estimate. Residual plots are also used to check for patterns that might suggest a poor fit.

Formulas:

Log-Lin Model Equation

\ln Y_i = b_0 + b_1 X_i

This equation represents a regression model where the dependent variable is logarithmically transformed.

Variables:
Y_i: dependent variable
X_i: independent variable
b_0: intercept
b_1: slope coefficient
Units: none

Lin-Log Model Equation

Y_i = b_0 + b_1 \ln X_i

This equation represents a regression model where the independent variable is logarithmically transformed.

Variables:
Y_i: dependent variable
X_i: independent variable
b_0: intercept
b_1: slope coefficient
Units: none

Log-Log Model Equation

\ln Y_i = b_0 + b_1 \ln X_i

This equation represents a regression model where both the dependent and independent variables are logarithmically transformed.

Variables:
YiY_i:
dependent variable
XiX_i:
independent variable
b0b_0:
intercept
b1b_1:
slope coefficient
Units: none
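
A sketch comparing the four functional forms on the same hypothetical data, assuming numpy. Each model is fitted by least squares after the appropriate log transformation; note that R-squared values are directly comparable only between models that share the same dependent variable:

```python
import numpy as np

# Hypothetical strictly positive data (illustrative only)
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
y = np.array([2.1, 3.0, 4.4, 6.1, 9.0, 13.2])

def fit_r2(xv, yv):
    """Fit yv on xv by least squares and return the R-squared."""
    b1, b0 = np.polyfit(xv, yv, deg=1)
    resid = yv - (b0 + b1 * xv)
    return 1 - np.sum(resid ** 2) / np.sum((yv - yv.mean()) ** 2)

# Compare the fit of each functional form on its transformed scale
print("linear :", fit_r2(x, y))
print("log-lin:", fit_r2(x, np.log(y)))
print("lin-log:", fit_r2(np.log(x), y))
print("log-log:", fit_r2(np.log(x), np.log(y)))
```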