MULTIPLE REGRESSION – PART 1

Topics Outline
Multiple Regression Model
Inferences about Regression Coefficients
F Test for the Overall Fit
Residual Analysis
Collinearity

Multiple Regression Model
Multiple regression models use two or more explanatory (independent) variables to predict the value of a response (dependent) variable. With k explanatory variables, the multiple regression model is expressed as follows:

y = α + β1x1 + β2x2 + ... + βkxk + ε
Here α, β1, β2, ..., βk are the parameters and the error term ε is a random variable which accounts for the variability in y that cannot be explained by the linear effect of the k explanatory variables. The assumptions about the error term ε in the multiple regression model parallel those for the simple regression model.
Regression Assumptions
1. Linearity
The error term ε is a random variable with a mean of 0.
Implication: For given values of x1, x2, ..., xk, the expected, or average, value of y is given by

E(y) = α + β1x1 + β2x2 + ... + βkxk

(The relationship is "linear" because each term on the right-hand side of the equation is additive, and the regression parameters do not enter the equation in a nonlinear manner, such as βi²xi. The graph of the relationship is no longer a line, however, because there are more than two variables involved.)

2. Independence
The values of ε are statistically independent.
Implication: The value of y for a particular set of values for the explanatory variables is not related to the value of y for any other set of values.

3. Normality
The error term ε is a normally distributed random variable (with mean 0 and standard deviation σ).
Implication: Because α, β1, β2, ..., βk are constants for the given values of x1, x2, ..., xk, the response variable y is also a normally distributed random variable (with mean α + β1x1 + β2x2 + ... + βkxk and standard deviation σ).

4. Equal spread
The standard deviation σ of ε is the same for all values of the explanatory variables x1, x2, ..., xk.
Implication: The standard deviation of y about the regression line equals σ and is the same for all values of x1, x2, ..., xk.
Assumption 1 implies that the true population surface ("plane", "line") is

E(y) = α + β1x1 + β2x2 + ... + βkxk

Sometimes we refer to this surface as the surface (plane, line) of means. In simple linear regression, the slope represents the change in the mean of y per unit change in x and does not take into account any other variables. In the multiple regression model, the slope β1 represents the change in the mean of y per unit change in x1, taking into account the effect of x2, x3, ..., xk.

The estimation process for multiple regression is shown in Figure 1. As in the case of simple linear regression, you use a simple random sample and the least squares method – that is, minimizing the sum of squared residuals

min Σ (yi – ŷi)²   (the sum running over i = 1, ..., n)

– to compute sample regression coefficients a, b1, ..., bk as estimates of the population parameters α, β1, ..., βk. (In multiple regression, the presentation of the formulas for the regression coefficients involves the use of matrix algebra and is beyond the scope of this course.)
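For readers curious about that matrix computation, here is a minimal sketch of the least squares calculation on a small made-up data set (the toy numbers and variable names are purely illustrative, not from the OmniPower example, and the course itself does not require this):

    import numpy as np

    # Toy data: n = 5 observations, k = 2 explanatory variables (illustrative values only)
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
    y  = np.array([3.1, 3.9, 7.2, 7.8, 10.1])

    # Design matrix with a leading column of 1s for the intercept a
    X = np.column_stack([np.ones_like(x1), x1, x2])

    # Least squares estimates: solve the normal equations (X'X) b = X'y
    coeffs = np.linalg.solve(X.T @ X, X.T @ y)
    a, b1, b2 = coeffs

    residuals = y - X @ coeffs
    print(a, b1, b2, (residuals ** 2).sum())  # coefficients and the minimized sum of squared residuals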
[Figure 1. The estimation process for multiple regression: sample data on x1, x2, ..., xk and y are combined with the multiple regression model y = α + β1x1 + ... + βkxk + ε (with regression parameters α, β1, β2, ..., βk and σ, the standard deviation of ε) to compute the sample statistics a, b1, b2, ..., bk, s and the estimated regression equation ŷ = a + b1x1 + b2x2 + ... + bkxk. The values of a, b1, b2, ..., bk, s provide the estimates of α, β1, β2, ..., βk, σ.]
The sample statistics a, b1, ..., bk provide the following estimated multiple regression equation

ŷ = a + b1x1 + b2x2 + ... + bkxk
where a is again the y-intercept, and b1 through bk are the "slopes". This is the equation of the fitted surface, also known as the least squares surface (plane, line). Graphically, you are no longer fitting a line to a set of points. If there are exactly two explanatory variables, you are fitting a plane to the data in three-dimensional space. There is one dimension for the response variable and one for each of the two explanatory variables.
If there are more than two explanatory variables, then you can only imagine the regression surface; drawing in four or more dimensions is impossible.
Interpretation of Regression Coefficients
The intercept a is the predicted value of y when all of the x's equal zero. (Of course, this makes sense only if it is practical for all of the x's to equal zero, which is seldom the case.)

Each slope coefficient is the predicted change in y per unit change in a particular x, holding constant the effect of the other x variables. For example, b1 is the predicted change in y when x1 increases by one unit and the other x's in the equation, x2 through xk, remain constant.
Example 1 OmniFoods
OmniFoods is a large food products company. The company is planning a nationwide introduction of OmniPower, a new high-energy bar. Originally marketed to runners, mountain climbers, and other athletes, high-energy bars are now popular with the general public. OmniFoods is anxious to capture a share of this thriving market. The business objective facing the marketing manager at OmniFoods is to develop a model to predict monthly sales volume per store of OmniPower bars and to determine what variables influence sales. Two explanatory variables are considered here: x1 – the price of an OmniPower bar, measured in cents, and x2 – the monthly budget for in-store promotional expenditures, measured in dollars.
In-store promotional expenditures typically include signs and displays, in-store coupons, and free samples. The response variable y is the number of OmniPower bars sold in a month. Data are collected (and stored in OmniPower.xlsx) from a sample of 34 stores in a supermarket chain selected for a test-market study of OmniPower.
Store   Number of Bars   Price (cents)   Promotion ($)
  1          4141              59             200
  2          3842              59             200
 ...          ...             ...             ...
 33          3354              99             600
 34          2927              99             600
Here is the regression output.

Regression Statistics
Multiple R            0.8705
R Square              0.7577
Adjusted R Square     0.7421
Standard Error        638.0653
Observations          34

ANOVA
              df    SS              MS             F        Significance F
Regression     2    39472730.77     19736365.39    48.48    0.0000
Residual      31    12620946.67     407127.31
Total         33    52093677.44

              Coefficients   Standard Error   t Stat     P-value   Lower 95%    Upper 95%
Intercept     5837.5208      628.1502          9.2932    0.0000    4556.3999    7118.6416
Price          -53.2173        6.8522         -7.7664    0.0000     -67.1925     -39.2421
Promotion        3.6131        0.6852          5.2728    0.0000       2.2155       5.0106
The computed values of the regression coefficients are a = 5,837.5208, b1 = –53.2173, b2 = 3.6131. Therefore, the multiple regression equation (representing the fitted regression plane) is

ŷ = 5,837.5208 – 53.2173 x1 + 3.6131 x2

or

Predicted Bars = 5,837.5208 – 53.2173 Price + 3.6131 Promotion
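As a practical aside (not part of the original output), these coefficients can be reproduced in Python with statsmodels. The sketch below assumes the workbook OmniPower.xlsx contains columns named Bars, Price, and Promotion; adjust the names to match the actual file.

    import pandas as pd
    import statsmodels.api as sm

    # Assumed column names; the actual spreadsheet headers may differ
    data = pd.read_excel("OmniPower.xlsx")
    X = sm.add_constant(data[["Price", "Promotion"]])   # adds the intercept column
    y = data["Bars"]

    model = sm.OLS(y, X).fit()
    print(model.params)     # a, b1, b2 (should be about 5837.52, -53.22, 3.61)
    print(model.summary())  # full table: R Square, ANOVA F, t tests, confidence intervals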
Interpretation of intercept
The sample y-intercept (a = 5,837.5208 ≈ 6,000) estimates the number of OmniPower bars sold in a month if the price is $0.00 and the total amount spent on promotional expenditures is also $0.00. Because these values of price and promotion are outside the range of price and promotion used in the test-market study, and because they make no sense in the context of the problem, the value of a has little or no practical interpretation.

Interpretation of slope coefficients
The slope of price with OmniPower sales (b1 = –53.2173) indicates that, for a given amount of monthly promotional expenditures, the predicted sales of OmniPower are estimated to decrease by 53.2173 ≈ 53 bars per month for each 1-cent increase in the price. The slope of monthly promotional expenditures with OmniPower sales (b2 = 3.6131) indicates that, for a given price, the estimated sales of OmniPower are predicted to increase by 3.6131 ≈ 4 bars for each additional $1 spent on promotions.

These estimates allow you to better understand the likely effect that price and promotion decisions will have in the marketplace. For example, a 10-cent decrease in price is predicted to increase sales by 532.173 ≈ 532 bars, with a fixed amount of monthly promotional expenditures. A $100 increase in promotional expenditures is predicted to increase sales by 361.31 ≈ 361 bars, for a given price.
Predicting the Response Variable
What are the predicted sales for a store charging 79 cents per bar during a month in which promotional expenditures are $400? Using the multiple regression equation with x1 = 79 and x2 = 400,

ŷ = 5,837.5208 – 53.2173(79) + 3.6131(400) = 3,078.57

Thus, stores charging 79 cents per bar and spending $400 in promotional expenditures will sell 3,078.57 ≈ 3,079 OmniPower bars per month.
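A quick way to check this arithmetic (a sketch that plugs in the rounded coefficients reported above, so the result may differ slightly from the value obtained with full-precision coefficients):

    # Prediction from the fitted equation, using the coefficients reported in the output
    a, b1, b2 = 5837.5208, -53.2173, 3.6131
    price, promotion = 79, 400           # cents, dollars
    predicted_bars = a + b1 * price + b2 * promotion
    print(round(predicted_bars, 2))      # about 3078.6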
Interpretation of se, r², and r

The interpretation of these quantities is almost exactly the same as in simple regression. The standard error of estimate se is essentially the standard deviation of residuals, but it is now given by the following equation

se = √( Σ ei² / (n – k – 1) )

where n is the number of observations and k is the number of explanatory variables in the equation. Fortunately, you can interpret se exactly as before. It is a measure of the typical prediction error when the multiple regression equation is used to predict the response variable.

The coefficient of determination r² is again the proportion of variation in the response variable y explained by the combined set of explanatory variables x1, x2, ..., xk. In fact, it even has the same formula as before:

r² = Regression Sum of Squares / Total Sum of Squares = SSR / SST

In the OmniPower example (see Excel output), SSR = 39,472,730.77 and SST = 52,093,677.44. Thus,

r² = SSR / SST = 39,472,730.77 / 52,093,677.44 = 0.7577

The coefficient of determination indicates that 75.77% ≈ 76% of the variation in sales is explained by the variation in the price and in the promotional expenditures.

The square root of r² is the correlation r between the fitted values ŷ and the observed values y of the response variable – in both simple and multiple regression.
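As an illustrative check (a sketch using only the sums of squares reported in the ANOVA table, with n = 34 and k = 2):

    import math

    SSR, SST = 39472730.77, 52093677.44
    SSE = SST - SSR                       # 12,620,946.67
    n, k = 34, 2

    r_squared = SSR / SST                 # about 0.7577
    s_e = math.sqrt(SSE / (n - k - 1))    # about 638.07, the standard error of estimate
    r = math.sqrt(r_squared)              # about 0.87, correlation between fitted and observed y
    print(r_squared, s_e, r)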
A graphical indication of the correlation can be seen in the plot of fitted (predicted) ŷ values versus observed y values. If the regression equation gave perfect predictions, all of the points in this plot would lie on a 45° line – each fitted value would equal the corresponding observed value. Although a perfect fit virtually never occurs, the closer the points are to a 45° line, the better the fit is.

The correlation in the OmniPower example is r = √0.7577 ≈ 0.87, indicating a strong relationship between the two explanatory variables and the response variable. This is confirmed by the scatterplot of ŷ values versus y values:
[Scatterplot: Predicted versus Observed Bars – Predicted Bars on the vertical axis and Observed Bars on the horizontal axis, both running from 0 to 6,000, with the points lying close to a 45° line.]
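A hedged sketch of how such a plot could be produced in Python (again assuming OmniPower.xlsx with columns named Bars, Price, and Promotion):

    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    data = pd.read_excel("OmniPower.xlsx")   # assumed file and column names
    model = sm.OLS(data["Bars"], sm.add_constant(data[["Price", "Promotion"]])).fit()

    plt.scatter(data["Bars"], model.fittedvalues)
    plt.plot([0, 6000], [0, 6000])           # 45-degree reference line
    plt.xlabel("Observed Bars")
    plt.ylabel("Predicted Bars")
    plt.show()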
Inferences about Regression Coefficients

t Tests for Significance

In a simple linear regression model, to test a hypothesis H0: β = 0 concerning the population slope β, we used the test statistic t = b / SE(b) with df = n – 2 degrees of freedom.

Similarly, in multiple regression, to test the hypotheses

H0: βj = 0
Ha: βj ≠ 0

concerning the population slope βj for variable xj (holding constant the effects of all other explanatory variables), we use the test statistic

t = bj / SE(bj)

with df = n – k – 1, where k is the number of explanatory variables in the regression equation.

In our example, to determine whether variable x2 (amount of promotional expenditures) has a significant effect on sales, taking into account the price of OmniPower bars, the null and alternative hypotheses are

H0: β2 = 0
Ha: β2 ≠ 0

The test statistic is

t = b2 / SE(b2) = 3.6131 / 0.6852 = 5.2728   with df = n – k – 1 = 34 – 2 – 1 = 31

The P-value is extremely small. Therefore, we reject the null hypothesis that there is no significant relationship between x2 (promotional expenditures) and y (sales) and conclude that there is a strong significant relationship between promotional expenditures and sales, taking into account the price x1.

For the slope of sales with price, the respective test statistic and P-value are t = –7.7664 and P-value ≈ 0. Thus, there is a significant relationship between price x1 and sales, taking into account the promotional expenditures x2.
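As a check on the reported P-values (a sketch; the two-sided P-value of a t statistic can be obtained from scipy's t distribution):

    from scipy import stats

    t_stat, df = 5.2728, 31                      # promotion slope: t = b2 / SE(b2), df = n - k - 1
    p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-sided P-value
    print(p_value)                               # on the order of 1e-05, i.e. 0.0000 to four decimals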
If we fail to reject the null hypothesis for a multiple regression coefficient, it does not mean that the corresponding explanatory variable has no linear relationship to y. It means that the corresponding explanatory variable contributes nothing to modeling y after allowing for all the other explanatory variables. The parameter βj in a multiple regression model can be quite different from zero even when there is no simple linear relationship between xj and y. The coefficient of xj in a multiple regression depends as much on the other explanatory variables as it does on xj. It is even possible that the multiple regression slope changes sign when a new variable enters the regression model.
Confidence Intervals
To estimate the value of a population slope βj in multiple regression, we can use the following confidence interval:

bj ± t* SE(bj)

where t* is the critical value for a t distribution with df = n – k – 1 degrees of freedom.

To construct a 95% confidence interval estimate of the population slope β1 (the effect of price x1 on sales y, holding constant the effect of promotional expenditures x2), the critical value of t at the 95% confidence level with 31 degrees of freedom is t* = 2.0395. (Note: For df = 30, the t-Table gives t* = 2.042.) Then, using the information from the Excel output,

b1 ± t* SE(b1) = –53.2173 ± 2.0395(6.8522) = –53.2173 ± 13.9752 = –67.1925 to –39.2421

Taking into account the effect of promotional expenditures, the estimated effect of a 1-cent increase in price is to reduce mean sales by approximately 39.2 to 67.2 bars. You have 95% confidence that this interval correctly estimates the relationship between these variables. From a hypothesis-testing viewpoint, because this confidence interval does not include 0, you conclude that the regression coefficient β1 has a significant effect.

The 95% confidence interval for the slope of sales with promotional expenditures is

b2 ± t* SE(b2) = 3.6131 ± 2.0395(0.6852) = 3.6131 ± 1.3975 = 2.2156 to 5.0106

Thus, taking into account the effect of price, the estimated effect of each additional dollar of promotional expenditures is to increase mean sales by approximately 2.22 to 5.01 bars. You have 95% confidence that this interval correctly estimates the relationship between these variables. From a hypothesis-testing viewpoint, because this confidence interval does not include 0, you can conclude that the regression coefficient β2 has a significant effect.
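A small sketch of the same calculation (using scipy for the critical value, and the coefficient estimates and standard errors from the output above):

    from scipy import stats

    df = 31                                     # n - k - 1 = 34 - 2 - 1
    t_star = stats.t.ppf(0.975, df)             # about 2.0395

    for name, b, se in [("Price", -53.2173, 6.8522), ("Promotion", 3.6131, 0.6852)]:
        margin = t_star * se
        print(name, b - margin, b + margin)     # 95% confidence interval for the slope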
F Test for the Overall Fit
In simple linear regression, the t test and the F test provide the same conclusion; that is, if the null hypothesis is rejected, we conclude that β ≠ 0. In multiple regression, the t test and the F test have different purposes. The t test of significance for a specific regression coefficient in multiple regression is a test for the significance of adding that variable into a regression model, given that the other variables are included. In other words, the t test for the regression coefficient is actually a test for the contribution of each explanatory variable. The overall F test is used to determine whether there is a significant relationship between the response variable and the entire set of explanatory variables. We also say that it determines the explanatory power of the model.

The null and alternative hypotheses for the F test are:

H0: β1 = β2 = ... = βk = 0   (There is no significant relationship between the response variable and the explanatory variables.)
Ha: At least one βj ≠ 0      (There is a significant relationship between the response variable and at least one of the explanatory variables.)
Failing to reject the null hypothesis implies that the explanatory variables are of little or no use in explaining the variation in the response variable; that is, the regression model predicts no better than just using the mean. Rejection of the null hypothesis implies that at least one of the explanatory variables helps explain the variation in y and therefore, the regression model is useful.

The ANOVA table for multiple regression has the following form.

Source of     Degrees of    Sum of     Mean Squares (Variance)    F statistic     P-value
Variation     Freedom       Squares
Regression    k             SSR        MSR = SSR / k              F = MSR / MSE   Prob > F
Error         n – k – 1     SSE        MSE = SSE / (n – k – 1)
Total         n – 1         SST
The F test statistic follows an F-distribution with k and (n – k – 1) degrees of freedom.

For our example, the hypotheses are:

H0: β1 = β2 = 0
Ha: β1 and/or β2 is not equal to zero

The corresponding F distribution has df1 = 2 and df2 = n – 2 – 1 = 34 – 3 = 31 degrees of freedom. The test statistic is F = 48.4771 and the corresponding P-value is

P-value = FDIST(48.4771, 2, 31) = 0.00000000029 ≈ 0

We reject H0 and conclude that at least one of the explanatory variables (price and/or promotional expenditures) is related to sales.
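The Excel FDIST value can be cross-checked with a short sketch (scipy.stats.f.sf returns the upper-tail area of the F distribution):

    from scipy import stats

    F, df1, df2 = 48.4771, 2, 31
    p_value = stats.f.sf(F, df1, df2)   # upper-tail area, same quantity as Excel's FDIST
    print(p_value)                      # about 2.9e-10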
Residual Analysis
Three types of residual plots are appropriate for multiple regression.

1. Residuals versus ŷ (the predicted values of y)
This plot should look patternless. If the residuals show a pattern (e.g. a trend, bend, clumping), there is evidence of a possible curvilinear effect in at least one explanatory variable, a possible violation of the assumption of equal variance, and/or the need to transform the y variable.

2. Residuals versus each x
Patterns in the plot of the residuals versus an explanatory variable may indicate the existence of a curvilinear effect and, therefore, the need to add a curvilinear explanatory variable to the multiple regression model.

3. Residuals versus time
This plot is used to investigate patterns in the residuals in order to validate the independence assumption when one of the x-variables is related to time or is itself time.

Below are the residual plots for the OmniPower sales example. There is very little or no pattern in the relationship between the residuals and the predicted value of y, the value of x1 (price), or the value of x2 (promotional expenditures). Thus, you can conclude that the multiple regression model is appropriate for predicting sales. There is no need to plot the residuals versus time because the data were not collected in time order.

[Plot: Residuals versus Predicted Bars – residuals (roughly –2,000 to 1,500) plotted against predicted bars (0 to 6,000), showing no obvious pattern.]
[Plots: Price Residual Plot and Promotion Residual Plot – residuals (roughly –2,000 to 1,500) plotted against Price (0 to 150 cents) and against Promotion ($0 to $800), neither showing any obvious pattern.]
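A hedged sketch of how these residual plots could be generated in Python (same assumed file and column names as before):

    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    data = pd.read_excel("OmniPower.xlsx")   # assumed file and column names
    model = sm.OLS(data["Bars"], sm.add_constant(data[["Price", "Promotion"]])).fit()

    for x, label in [(model.fittedvalues, "Predicted Bars"),
                     (data["Price"], "Price"),
                     (data["Promotion"], "Promotion")]:
        plt.scatter(x, model.resid)
        plt.axhline(0)                       # reference line at zero residual
        plt.xlabel(label)
        plt.ylabel("Residuals")
        plt.show()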
The third regression assumption states that the errors are normally distributed. We can check it the same way as we did in simple regression – by forming a histogram or a normal probability (Q-Q) plot of the residuals. If the third assumption holds, the histogram should be approximately symmetric and bell-shaped, and the points in the normal probability plot should be close to a 45° line. But if there is an obvious skewness, too many residuals more than, say, two standard deviations from the mean, or some other nonnormal property, this indicates a violation of the third assumption. Neither the histogram nor the normal probability plot for the OmniPower example shows any severe signs of departure from normality.

[Plots: Histogram of Residuals (frequencies of residuals ranging from about –1,465 to 1,126) and Q-Q Normal Plot of Residuals (standardized residual values versus normal Z-values from –3.5 to 3.5), both consistent with approximately normal residuals.]
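A sketch of the corresponding checks in Python (scipy.stats.probplot draws the normal Q-Q plot; file and column names are assumed as before):

    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    from scipy import stats

    data = pd.read_excel("OmniPower.xlsx")   # assumed file and column names
    model = sm.OLS(data["Bars"], sm.add_constant(data[["Price", "Promotion"]])).fit()

    plt.hist(model.resid, bins=7)            # histogram of residuals
    plt.xlabel("Residual")
    plt.ylabel("Frequency")
    plt.show()

    stats.probplot(model.resid, dist="norm", plot=plt)   # normal Q-Q plot of residuals
    plt.show()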
Collinearity
Most explanatory variables in a multiple regression problem are correlated to some degree with one another. For example, in the OmniPower case the correlation matrix is

                  Price (x1)   Promotion (x2)   Bars (y)
Price (x1)         1.0000
Promotion (x2)    –0.0968       1.0000
Bars (y)          –0.7351       0.5351          1.0000

The correlation between price and promotion is –0.0968. Thus, we find some degree of linear association between the two explanatory variables. Low correlations among the explanatory variables generally do not result in serious deterioration of the quality of the least squares estimates. However, when the explanatory variables are highly correlated, it becomes difficult to determine the separate effect of any particular explanatory variable on the response variable. We interpret the regression coefficients as measuring the change in the response variable when the corresponding explanatory variable increases by 1 unit while all the other explanatory variables are held constant. The interpretation may be impossible when the explanatory variables are highly correlated, because when one explanatory variable changes by 1 unit, some or all of the other explanatory variables will change.

Collinearity (also called multicollinearity or intercorrelation) is a condition that exists when two or more of the explanatory variables are highly correlated with each other. When highly correlated explanatory variables are included in the regression model, they can adversely affect the regression results. Two of the most serious problems that can arise are:
1. The estimated regression coefficients may be far from the population parameters, including the possibility that the statistic and the parameter being estimated may have opposite signs. For example, the true slope β2 might actually be +10 and b2, its estimate, might turn out to be –3.

2. You might find a regression that is very highly significant based on the F test but for which not even one of the t tests of the individual x variables is significant. Thus, variables that are really related to the response variable can look like they aren't related, based on their P-values. In other words, the regression result is telling you that the x variables taken as a group explain a lot about y, but it is impossible to single out any particular x variable as being responsible.

Statisticians have developed several routines for determining whether collinearity is high enough to cause problems. Here are the three most widely used techniques:

1. Pairwise correlations between x's
The rule of thumb suggests that collinearity is a potential problem if the absolute value of the correlation between any two explanatory variables exceeds 0.7. (Note: Some statisticians suggest a cutoff of 0.5 instead of 0.7.)

2. Pairwise correlations between y and x's
The rule of thumb suggests that collinearity may be a serious problem if any of the pairwise correlations among the x variables is larger than the largest of the correlations between the y variable and the x variables.
3. Variance inflation factors
The statistic that measures the degree of collinearity of the j-th explanatory variable with the other explanatory variables is called the variance inflation factor (VIF) and is found as:

VIFj = 1 / (1 – rj²)

where rj² is the coefficient of determination for a regression model using variable xj as the response variable and all other x variables as explanatory variables. The VIF tells how much the variance of the regression coefficient has been inflated due to collinearity. The higher the VIF, the higher the standard error of its coefficient and the less it can contribute to the regression model. More specifically, rj² shows how well the j-th explanatory variable can be predicted by the other explanatory variables. The 1 – rj² term measures what that explanatory variable has left to bring to the model. If rj² is high, then not only is that variable superfluous, but it can damage the regression model.

Since rj² cannot be less than zero, the minimum value of the VIF is 1. If a set of explanatory variables is uncorrelated, then each rj² = 0.0 and each VIFj is equal to 1. As rj² increases, VIFj increases also. For example, if rj² = 0.9, then VIFj = 1/(1 – 0.9) = 10; if rj² = 0.99, then VIFj = 1/(1 – 0.99) = 100.

How large the VIFs must be to suggest a serious problem with collinearity is not completely clear. In general, any individual VIFj larger than 10 is considered an indication of a potential collinearity problem. (Note: Some statisticians suggest using a cutoff of 5 instead of 10.)

In the OmniPower sales data, the correlation between the two explanatory variables, price and promotional expenditures, is –0.0968. Because there are only two explanatory variables in the model,

VIF1 = VIF2 = 1 / (1 – (–0.0968)²) = 1.009

Since all VIFs (two in this example) are less than 10 (or, less than the more conservative value of 5), you can conclude that there is no problem with collinearity for the OmniPower sales data.

One solution to the collinearity problem is to delete the variable with the largest VIF value. The reduced model is often free of collinearity problems. Another solution is to redefine some of the variables so that each x variable has a clear, unique role in explaining y. For example, if x1 and x2 are collinear, you might try using x1 and the ratio x2/x1 instead. If possible, every attempt should be made to avoid including explanatory variables that are highly correlated. In practice, however, strict adherence to this policy is rarely achievable.
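For models with more explanatory variables, VIFs are usually computed with software. A sketch using statsmodels' variance_inflation_factor (same assumed file and column names as before):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    data = pd.read_excel("OmniPower.xlsx")               # assumed file and column names
    X = sm.add_constant(data[["Price", "Promotion"]])

    print(data[["Price", "Promotion", "Bars"]].corr())   # pairwise correlations
    for j, name in enumerate(X.columns):
        if name != "const":                              # skip the intercept column
            print(name, variance_inflation_factor(X.values, j))   # about 1.009 for both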