Copyright © Ali Jenzarli, Ph.D. All Rights Reserved
Linear Regression Models

Simple Linear Regression
Simple linear regression is the study of the linear relationship between two random variables X and Y. We call X the independent or explanatory variable, and we call Y the dependent, predicted, or forecast variable. We assume that we can represent the relationship between population values of X and Y using the equation of the linear model Yi = β0 + β1Xi + εi. We call β0 the Y-intercept; it represents the expected value of Y that is independent of X, or the expected value of Y when X equals zero, if appropriate. We call β1 the slope; it represents the expected change in Y per unit change in X, i.e., the expected marginal change in Y with respect to X. Finally, we call εi the random error in Y for each observation i. Given a sample of X and Y values, we use the method of least squares to estimate sample values for β0 and β1, which we call b0 and b1, respectively. We represent the predicted value of Y using the prediction line equation, or simple linear regression equation, Ŷ = b0 + b1X.
We call b0 the sample Y-intercept; it represents the expected value of Y that is independent of X (or the expected value of Y when X = 0, if appropriate). We call b1 the sample slope; it represents the expected change in Y per unit change in X, i.e., the expected marginal change in Y with respect to X. Finally, we call Ŷ the predicted value of Y.

The Coefficient of Determination and the Correlation Coefficient
The coefficient of determination is the statistic r², and it measures the proportion of the linear variation in Y that is explained by X using the regression model. The correlation coefficient is the statistic r, and it measures the strength of the linear association between X and Y.
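As a sketch of these definitions, the least-squares estimates b0 and b1, together with r² and r, can be computed directly from their textbook formulas. The data below are invented purely for illustration:

```python
# Least-squares estimates for the prediction line Y-hat = b0 + b1*X,
# plus r^2 and r, computed from their textbook formulas.
# The x and y values below are made up for illustration only.

def simple_ols(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b1 = SSxy / SSxx;  b0 = y_bar - b1 * x_bar
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    # r^2 = SSR / SST: proportion of the linear variation in Y explained by X
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    r_sq = ssr / sst
    # r carries the sign of the slope
    r = (1 if b1 >= 0 else -1) * r_sq ** 0.5
    return b0, b1, r_sq, r

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, r_sq, r = simple_ols(x, y)
```

For these values the fit is Ŷ ≈ 0.05 + 1.99X, with r² close to 1, i.e., nearly all of the linear variation in Y is explained by X.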
t Test for the Slope β1
H0: β1 = 0 (there is no linear relationship)
H1: β1 ≠ 0 (there is a linear relationship)

For α = .05, the p-value for the slope (the coefficient of the explanatory variable) should be less than .05 for the sample slope b1 to be statistically significantly different from zero, indicating the presence of a statistically significant linear relationship between X and Y.
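The test statistic behind that p-value is t = b1 / s_b1, where s_b1 = sqrt(MSE / SSxx) and MSE = SSE / (n − 2). A minimal sketch, using invented data and a two-sided critical value read from a standard t table:

```python
import math

# t test for H0: beta1 = 0 in simple linear regression.
# Data are illustrative; the critical value t_{.025, df=3} = 3.182
# comes from a standard t table (n = 5, so df = n - 2 = 3).

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_xx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)               # error variance estimate, df = n - 2
s_b1 = math.sqrt(mse / ss_xx)     # standard error of the slope
t_stat = b1 / s_b1

t_crit = 3.182                    # t_{.025} with 3 degrees of freedom
reject_h0 = abs(t_stat) > t_crit  # True -> slope significant at alpha = .05
```

Rejecting H0 here (|t| far exceeds the critical value) is equivalent to the slope's p-value falling below α = .05.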
Important Notice: The p-value for the slope and the Significance F are the same in simple linear regression, leading to the same conclusion. Acceptable values of α are less than 0.10, with preferred values less than or equal to 0.05.

Multiple Linear Regression
Multiple linear regression, or multiple regression, is the study of the linear relationship between more than two random variables X1, X2, …, Xk, and Y. We call the Xi, i = 1, 2, …, k, the independent or explanatory variables, and we call Y the dependent, predicted, or forecast variable. We assume that we can represent the relationship between population values of the Xi and Y using the equation of the linear model Yi = β0 + β1X1i + β2X2i + … + βkXki + εi. We call β0 the Y-intercept; it represents the expected value of Y that is independent of the Xi, or the expected value of Y when each Xi equals zero, if appropriate. We call βi the slope of Y with variable Xi, holding each Xj, j ≠ i, constant; it represents the expected change in Y per unit change in Xi, i.e., the expected marginal change in Y with respect to Xi, holding each Xj, j ≠ i, constant. Finally, we call εi the random error in Y for each observation i. Given a sample of X and Y values, we use the method of least squares to estimate sample values for β0 and the βi, which we call b0 and bi for all i = 1, …, k, respectively. We represent the predicted value of Y using the prediction equation, or multiple regression equation, Ŷ = b0 + b1X1 + … + bkXk.
We call b0 the sample Y-intercept; it represents the expected value of Y that is independent of all the Xi (i = 1, 2, …, k) in the model, or the expected value of Y when each Xi equals zero, if appropriate. We call bi the sample slope of Y with variable Xi, holding each Xj, j ≠ i, constant; it represents the expected change in Y per unit change in Xi, i.e., the expected marginal change in Y with respect to Xi, holding each Xj, j ≠ i, constant. Finally, we call Ŷ the predicted value of Y.

The Adjusted r²
The adjusted r² measures the proportion of the linear variation in Y that is explained by all the Xi (i = 1, 2, …, k) in the multiple-regression model, adjusted for the number of independent variables (the Xi) and the sample size.

The Coefficient of Partial Determination
The coefficient of partial determination measures the proportion of the linear variation in Y that is explained by a particular Xi, holding each Xj, j ≠ i, constant.
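A minimal sketch of these ideas for a two-predictor model, with invented data: the least-squares coefficients come from solving the normal equations (XᵀX)b = Xᵀy, the adjusted r² applies the (n − 1)/(n − k − 1) correction, and the coefficient of partial determination for X1 is computed as (SSE(X2) − SSE(X1, X2)) / SSE(X2):

```python
# Two-predictor multiple regression fitted by solving the normal equations,
# then R^2, adjusted R^2, and the coefficient of partial determination
# for X1 (holding X2 constant). All data are illustrative only.

def solve(a, v):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(v)
    m = [row[:] + [v[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [m[r][j] - f * m[col][j] for j in range(n + 1)]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """rows: list of predictor tuples; returns ([b0, b1, ...], SSE)."""
    X = [[1.0] + list(r) for r in rows]   # prepend the intercept column
    p = len(X[0])
    xtx = [[sum(X[i][u] * X[i][v] for i in range(len(X))) for v in range(p)]
           for u in range(p)]
    xty = [sum(X[i][u] * y[i] for i in range(len(X))) for u in range(p)]
    b = solve(xtx, xty)
    sse = sum((y[i] - sum(b[j] * X[i][j] for j in range(p))) ** 2
              for i in range(len(X)))
    return b, sse

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y  = [3.1, 2.9, 7.2, 6.8, 11.1, 10.8]

b, sse_full = ols(list(zip(x1, x2)), y)
n, k = len(y), 2
y_bar = sum(y) / n
sst = sum((yi - y_bar) ** 2 for yi in y)
r_sq = 1 - sse_full / sst
adj_r_sq = 1 - (1 - r_sq) * (n - 1) / (n - k - 1)

# Partial determination for X1: the share of the SSE left by X2 alone
# that is removed when X1 enters the model.
_, sse_x2_only = ols([(v,) for v in x2], y)
r_sq_y1_2 = (sse_x2_only - sse_full) / sse_x2_only
```

Note that adding a variable can never increase SSE, so the partial-determination ratio always falls between 0 and 1, and the adjusted r² is never larger than r² itself.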
Significance Testing
In a multiple-regression model with two or more independent variables, we recommend that the independent variables that fail the t test of significance for their slope coefficients be removed from the model, one independent variable at a time, and that the model be run again without them. We also recommend that the independent variable with the highest insignificant p-value be removed first and the model be run again without this variable. This process should be repeated until we are left with only those independent variables that pass the significance test. Recall that this t test of significance is the same as the slope test in a simple linear regression model (outlined above). Caution: Removing all independent variables that fail the t test after the first run is not advisable, because two or more of these variables might be collinear, or highly correlated, i.e., they explain the same variability in the dependent variable Y.

Interactions
Interaction terms, or cross-product terms, are introduced into a multiple-regression model when the effect of an independent variable Xi on the dependent variable Y changes according to the values of the other independent variables Xj, j ≠ i. In such cases we recommend running the model with all the relevant interaction terms, then removing the interaction terms that are not statistically significant, one term at a time, starting with the one that has the highest insignificant p-value, while keeping those that are statistically significant. The model should then be run again, followed by significance testing, to confirm the effects of all remaining interaction terms.
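A cross-product term is simply the element-wise product of the two predictor columns. The sketch below uses assumed (not fitted) coefficients to show why the interaction matters: once b3·X1X2 is in the model, the marginal effect of X1 on Ŷ is b1 + b3·X2, which changes with X2:

```python
# Building a cross-product (interaction) column X1*X2. The coefficients
# below are assumed for illustration, not estimated from data.

x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [10.0, 20.0, 10.0, 20.0]
x1_x2 = [a * b for a, b in zip(x1, x2)]   # the interaction column

# Suppose a fitted model gave Y-hat = b0 + b1*X1 + b2*X2 + b3*X1*X2:
b0, b1, b2, b3 = 1.0, 2.0, 0.5, 0.3

def predict(v1, v2):
    return b0 + b1 * v1 + b2 * v2 + b3 * v1 * v2

# Marginal effect of X1 is b1 + b3*X2 -- it depends on the value of X2:
slope_at_x2_10 = predict(2, 10) - predict(1, 10)   # b1 + b3*10 = 5.0
slope_at_x2_20 = predict(2, 20) - predict(1, 20)   # b1 + b3*20 = 8.0
```

Without the interaction term (b3 = 0), the marginal effect of X1 would be the constant b1 regardless of X2.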