REGRESSION AND CORRELATION ANALYSIS
Learning Objectives
At the end of this lesson, the participants should be able to:
• discuss the basic concepts of the regression and correlation techniques in characterizing associations between variables;
• enumerate the assumptions in regression analysis and the consequences of a violation of any of the assumptions;
• derive a regression model for a given set of data; and
• check for model adequacy through residual analysis, the goodness-of-fit test and the coefficient of determination.
Introduction
When two or more characteristics are measured from each experimental unit, statistical inference frequently involves regression and correlation analysis. For example, when both grain yield and plant height are collected from a rice experiment, one may wish to determine how, and to what degree, the two characteristics are related to one another. The relationship between these characters may be expressed either in a functional form or in terms of the degree of their association with one another. The techniques used in determining such relationships are known as regression and correlation. In crop research, associations between responses, treatments, and environmental factors are frequently evaluated. Associations of particular interest are:

1. Associations between response variables (e.g., between weed density and tiller number or panicle weight).
2. Associations between response and treatment (e.g., between grain yield and nitrogen rate).
3. Associations between response and environment (e.g., between grain yield and rainfall).
Correlation Analysis
Correlation is concerned with the study of linear dependency between variables. The correlation coefficient is a measure of the intensity of the linear relationship between two variables; it is applicable when there is no clear-cut cause-and-effect relationship between them. Correlation reflects the extent to which deviations from the mean in one variable are accompanied by proportional deviations in the other, in either direction according to the sign of the correlation.

The population correlation coefficient, ρ, measures the linear relationship between all possible values of two variables X and Y. It is defined as

$$\rho = \frac{\mathrm{Cov}(X, Y)}{\mathrm{SD}(X)\,\mathrm{SD}(Y)}$$

The sample correlation coefficient is computed from the observed data as

$$r = \frac{\sum X_i Y_i - (\sum X_i)(\sum Y_i)/n}{\sqrt{\left[\sum X_i^2 - (\sum X_i)^2/n\right]\left[\sum Y_i^2 - (\sum Y_i)^2/n\right]}}$$

r is an estimate of ρ and can be used to test the null hypothesis Ho: ρ = ρo. The test statistic used is

$$t_c = \frac{r - \rho_o}{\sqrt{(1 - r^2)/(n - 2)}}$$
tc has a t-distribution with (n − 2) degrees of freedom when the population correlation is ρo. A test for linear independence is made by setting ρo = 0, i.e., testing Ho: ρ = 0 using this test statistic.

The correlation coefficient r measures the strength of the linear relationship between observations on two variables. If two variables are independent, they have zero correlation. However, the converse is not necessarily true: a value of r equal to zero does not necessarily mean that there is no relationship between the variables; it may mean that the relationship is not linear. The value of r ranges from −1 to +1, with the extreme values indicating a close linear association and the mid-value, zero, indicating no linear association between the variables. A positive or negative value of r indicates the direction of change in one variable relative to the change in the other. That is, the value of r is negative when a positive change in one variable is associated with a negative change in the other, and positive when the two variables change in the same direction.
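As a minimal illustration of these computations, the sketch below (in Python, using numpy and scipy) computes r from the formula above and tests Ho: ρ = 0. The paired plant-height and grain-yield values are invented for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations (e.g., plant height in cm, grain yield in t/ha);
# the numbers are illustrative only.
x = np.array([85.0, 92.0, 101.0, 110.0, 118.0, 125.0, 132.0, 140.0])
y = np.array([4.1, 4.6, 5.0, 5.3, 5.9, 6.2, 6.8, 7.1])
n = len(x)

# Sample correlation coefficient from the computational formula above.
sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
sxx = np.sum(x**2) - np.sum(x)**2 / n
syy = np.sum(y**2) - np.sum(y)**2 / n
r = sxy / np.sqrt(sxx * syy)

# Test Ho: rho = 0 with t_c = r / sqrt((1 - r^2)/(n - 2)), df = n - 2.
t_c = r / np.sqrt((1 - r**2) / (n - 2))
p_value = 2 * stats.t.sf(abs(t_c), df=n - 2)
print(f"r = {r:.4f}, t = {t_c:.3f}, p = {p_value:.4g}")
```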
Figure 1 illustrates the various degrees of association between two variables as reflected in the r values.
Figure 1. Graphical representation of various values of the simple correlation coefficient, r.

Regression Analysis
Regression analysis is a statistical technique for investigating and modeling the relationship of a dependent (response) variable to a set of independent or explanatory variables. The relationship is expressed in the form of an equation connecting the response variable Y and one or more independent variables X1, X2, ..., Xp. The regression equation involves some parameters which must be estimated from the data. A regression equation containing only one independent variable is called a simple regression equation; an equation containing more than one variable is referred to as a multiple regression equation.

The regression equation or model defines the structural form of the relationship by linking the dependent variable with the independent variables through parameters, for example, Y = α + β1X1 + β2X1², where Y and X1 are variables and α, β1, β2 are parameters. If the relationship between the dependent and the independent variables is a linear function of the parameters, then it is termed a linear regression, such as the equation above, or Y = β0 + β1X1 + β2X2. On the other hand, a non-linear regression exists if the relationship between the dependent and the independent variables is non-linear in the parameters, such as in Y = αβ^X, where the parameters α and β are multiplied. Note that the linearity refers to the parameters and not to the variables. For example, Y = α + βX1 + γX1X2 is nonlinear in the variables but is a linear regression model since it is linear in the parameters α, β, γ.
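The practical consequence of linearity in the parameters is that a model such as Y = α + βX1 + γX1X2 can be fitted by ordinary least squares simply by treating each term as a column of a design matrix. A minimal numpy sketch with simulated data (the coefficient values and the variable layout are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, 30)
x2 = rng.uniform(0, 5, 30)
# Simulated response from Y = 2 + 0.5*X1 + 0.3*X1*X2 + error (illustrative values).
y = 2 + 0.5 * x1 + 0.3 * x1 * x2 + rng.normal(0, 0.5, 30)

# Nonlinear in the variables (the X1*X2 term) but linear in the parameters,
# so each term is simply one column of the design matrix.
X = np.column_stack([np.ones_like(x1), x1, x1 * x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # estimates of alpha, beta, gamma
```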
Most often, regression analysis is done to:
* obtain estimates of the parameters,
* estimate the variance of the error term,
* estimate the standard errors of the parameter estimates,
* test hypotheses about the parameters,
* calculate predicted values using the estimated equation, and
* evaluate the fit or lack of fit of the model.
Linear Regression
The linear regression equation for response Y and p independent variables, Xj, j = 1, 2, ..., p, takes the form:

$$Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + e$$

where Xj, j = 1, 2, ..., p, are the independent variables, β0 and the βj's are the regression coefficients, and e is a random error. The linear model assumes that:
• the expected values of the error terms are zero;
• the variances of the error terms are unrelated to the magnitude of Y or Xj (homoscedasticity);
• the error terms are uncorrelated; and
• the error terms are normally distributed.

The parameter estimates, β̂j, are derived from a set of n observations, Yi, Xi1, Xi2, ..., Xip, i = 1, 2, ..., n, by the method of least squares. This involves choosing values β̂j for βj which minimize the sum of squares of the residuals:

$$\sum \hat{e}_i^2 = \sum (Y_i - \hat{Y}_i)^2,$$

where $\hat{Y}_i = \hat{\beta}_0 + \sum_j \hat{\beta}_j X_{ij}$.
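This minimization has the well-known closed-form solution β̂ = (X'X)⁻¹X'Y, where X is the n × (p + 1) matrix whose first column is all ones. A minimal sketch of the computation (the function name and data layout are illustrative, not from the original text):

```python
import numpy as np

def least_squares(X_raw, y):
    """Least-squares estimates for Y = b0 + b1*X1 + ... + bp*Xp + e.

    X_raw : (n, p) array of independent variables.
    y     : (n,) vector of responses.
    Returns (beta_hat, residuals), with the intercept estimate first.
    """
    n = len(y)
    X = np.column_stack([np.ones(n), X_raw])   # prepend the intercept column
    # Solving the normal equations (X'X) beta = X'y minimizes sum(e_hat_i^2).
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    residuals = y - X @ beta_hat               # e_hat_i = Y_i - Y_hat_i
    return beta_hat, residuals
```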
With the assumption of normality of the error terms, the null hypothesis Ho: βj = βjo can be tested using the statistic

$$t = \frac{\hat{\beta}_j - \beta_j^o}{\mathrm{s.e.}(\hat{\beta}_j)}$$
where s.e.(β̂j) is computed from the residual SS, $\sum \hat{e}_i^2$, and the Xi values. The computation is complicated except for simple linear regression, when

$$\mathrm{s.e.}(\hat{\beta}_1) = \sqrt{\frac{\sum \hat{e}_i^2/(n-2)}{\sum (X_{i1} - \bar{X}_1)^2}}.$$

The s.e.(β̂j)'s are always provided by computer programs. Under the null hypothesis, the test statistic t follows Student's t distribution with (n − p − 1) degrees of freedom. Setting βjo = 0 tests the hypothesis that Y and Xj are not related by the stated regression model.

The F-test of the ANOVA procedure is used to test the significance of the regression equation, that is, the null hypothesis Ho: β1 = β2 = ... = βp = 0 (note that β0 is excluded). In the ANOVA for linear regression, the total SS of Y, SSY, is partitioned into two components, namely, the sum of squares due to regression (SSReg) and the sum of squares due to deviations from regression (SSE).

Table 1. ANOVA for linear regression for testing Ho: βj = 0 for all j = 1, 2, ..., p vs. Ha: βj ≠ 0 for some j.

SV           DF           SS       MS     F
Regression   p            SSReg    MSR    MSR/MSE
Error        n − p − 1    SSE      MSE
Total        n − 1        SSY
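The sketch below assembles the quantities of Table 1 together with the per-coefficient t statistics; the function name and return layout are illustrative, not from the original text.

```python
import numpy as np
from scipy import stats

def regression_tests(X_raw, y):
    """F and t tests for a linear regression fitted by least squares.

    Builds the quantities of Table 1: SSY = SSReg + SSE, MSR = SSReg/p,
    MSE = SSE/(n-p-1), F = MSR/MSE, plus per-coefficient t statistics.
    """
    n, p = X_raw.shape
    X = np.column_stack([np.ones(n), X_raw])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat

    ssy = np.sum((y - y.mean()) ** 2)     # total SS
    sse = np.sum(resid ** 2)              # deviations from regression
    ssreg = ssy - sse                     # due to regression
    msr, mse = ssreg / p, sse / (n - p - 1)

    f_stat = msr / mse                    # tests Ho: beta_1 = ... = beta_p = 0
    f_p = stats.f.sf(f_stat, p, n - p - 1)

    se = np.sqrt(mse * np.diag(XtX_inv))  # s.e.(beta_hat_j), intercept included
    t_stats = beta_hat / se               # tests Ho: beta_j = 0 for each j
    t_p = 2 * stats.t.sf(np.abs(t_stats), n - p - 1)
    return {"F": f_stat, "F p": f_p, "t": t_stats, "t p": t_p, "se": se}
```

The same standard errors give the confidence limits discussed next, as beta_hat ± stats.t.ppf(1 − alpha/2, n − p − 1) * se.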
Confidence limits for β̂j with confidence coefficient (1 − α) are given by

$$\hat{\beta}_j \pm t(n-p-1,\,\alpha/2)\,[\mathrm{s.e.}(\hat{\beta}_j)]$$

where t(n − p − 1, α/2) is the upper-tail 100(α/2)% critical value of Student's t distribution with (n − p − 1) df.

A useful measure of the importance of the independent variables is the coefficient of determination, or squared multiple correlation, R², where

$$R^2 = \frac{SS\mathrm{Reg}}{SSY}.$$

R² is the squared ordinary correlation between Yi and Ŷi and so ranges from 0 to 1. It is also the proportion of the variability in the response variable which is accounted for by the regression analysis. A value of R² close to unity indicates that the model fits the data well. With a good fit, the observed and predicted values will be close to each other, so SSE will be small. On the other hand, if the model does not fit well, SSE will be large and R² will be near zero. The value of R² is therefore used as a summary measure to judge the fit of the linear model to a given set of data.
There are two dangers in using R² alone as a measure of goodness of fit for a regression model. First, it gives no information on the appropriateness of the model: a curve may be well approximated by a straight line according to R², but the distinction may be very important from a practical point of view (Figure 2a); or the data may be segregated into groups, but still appear well described according to R² (Figure 2b).
Figure 2a. Curvilinear data fitted by a straight line with high R².
Figure 2b. Segregated data fitted by a straight line with high R².
For detecting these kinds of departures from the regression model there is no substitute for plotting the data.

The second problem with R² is that as additional variables are included in a regression equation, R² tends to increase, regardless of the true importance of these variables in determining the values of the dependent variable. A related statistic, the adjusted R² (R²a), is used to compensate for this effect. It is defined as

$$R_a^2 = 1 - \frac{n-1}{n-p-1}\,(1 - R^2).$$

The purpose of R²a is to assist goodness-of-fit comparisons between regression equations which differ with respect to either the number of explanatory variables or the number of observations.
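Both measures follow directly from the sums of squares; a minimal sketch (the function name is illustrative, and it assumes fitted values from a least-squares fit such as the one sketched earlier):

```python
import numpy as np

def r_squared(y, y_hat, p):
    """R^2 = SSReg/SSY and the adjusted version R^2_a.

    p is the number of independent variables in the fitted equation.
    """
    n = len(y)
    ssy = np.sum((y - np.mean(y)) ** 2)            # total SS
    sse = np.sum((y - y_hat) ** 2)                 # residual SS
    r2 = 1.0 - sse / ssy                           # equals SSReg / SSY
    r2_adj = 1.0 - (n - 1) / (n - p - 1) * (1.0 - r2)
    return r2, r2_adj
```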
Residual Analysis
A simple and effective method for checking model adequacy in regression analysis is to examine the standardized residuals,

$$\hat{e}_i^{\,*} = \frac{Y_i - \hat{Y}_i}{\mathrm{s.e.}(\hat{e}_i)}$$

If the fitted model is correct, the residuals should conform with the assumptions made on the error terms, that is, normality, homoscedasticity, and independence. In general, when the model is correct, the standardized residuals tend to fall between −2 and +2 and are randomly distributed about zero. The process of checking for model violations by analyzing residuals is useful for uncovering hidden structures in the data. If any standardized residual is outside the range −2.5 to +2.5, it should be checked for possible data errors or other reasons for deviating from the model.

Residual plots provide a visual indication of model adequacy and suggest possible modifications if the model is inadequate. Some of the more commonly used plots are those in which the standardized residuals are plotted against the fitted values, Ŷ; the independent variables Xj; and the time order, t, in which the observations occur (if relevant). Essentially, a good regression model should result in residuals whose graphs against Ŷ, each Xj, or t do not exhibit any distinct pattern of variation (Figure 3). Any distinct patterns depicted in the residual plots (Figures 4a-c) indicate inadequacy of the fitted model or violations of the assumptions in the regression procedure.
Figure 3. A satisfactory residual plot should give this overall impression.
Figure 4. Plots indicating unsatisfactory residual behavior (panels a-c).

1. Plot against Ŷi. Plot forms as in Figure 4 indicate:
   a) Variance not constant; need for weighted least squares or a transformation of the observations Yi before making a regression analysis.
   b) Systematic departure from the fitted equation (negative residuals corresponding to low Ŷ's, positive residuals to high Ŷ's). This indicates an incorrect model, such as use of linear regression when a curve is appropriate. It can also be caused by wrongly omitting a β0 term from the model.
   c) Model inadequate; need for extra terms in the model (e.g., square or cross-product terms), or for a transformation of the observations Yi before analysis.

2. If the data were collected in time order, a plot against time can be used to detect autocorrelation or time trends. The plot forms in Figure 4 indicate:
   a) The variance is not constant, implying the need for weighted least squares analysis.
   b) A linear term in time should have been included in the model.
   c) Linear and quadratic terms in time should have been included in the model.

3. Plot against the predictor variables Xji. Plot forms as in Figure 4 indicate:
   a) Variance not constant; need for weighted least squares or a preliminary transformation of the Y's.
   b) Error in calculations; linear effect of Xj not removed.
   c) Need for extra terms in powers of Xj in the model, or a transformation of the Y's.
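The diagnostics above can be produced with a few lines of code. The sketch below (function name illustrative; it assumes estimates from a least-squares fit such as the one sketched earlier) plots internally studentized residuals, one common way of standardizing residuals, against the fitted values and each Xj, with reference lines at 0 and ±2:

```python
import numpy as np
import matplotlib.pyplot as plt

def residual_plots(X_raw, y, beta_hat):
    """Standardized residuals plotted against the fitted values and each X_j."""
    n, p = X_raw.shape
    X = np.column_stack([np.ones(n), X_raw])
    y_hat = X @ beta_hat
    e = y - y_hat

    # Internally studentized residuals: e_i / sqrt(MSE * (1 - h_ii)),
    # where h_ii are the diagonal elements of the hat matrix.
    mse = np.sum(e**2) / (n - p - 1)
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e_std = e / np.sqrt(mse * (1 - np.diag(H)))

    fig, axes = plt.subplots(1, p + 1, figsize=(4 * (p + 1), 3.5))
    panels = [("fitted values", y_hat)] + [(f"X{j+1}", X_raw[:, j]) for j in range(p)]
    for ax, (label, xvals) in zip(np.atleast_1d(axes), panels):
        ax.scatter(xvals, e_std)
        ax.axhline(0, linestyle="--")
        ax.axhline(2, color="grey")
        ax.axhline(-2, color="grey")
        ax.set_xlabel(label)
        ax.set_ylabel("standardized residual")
    plt.tight_layout()
    plt.show()
```

A satisfactory fit shows points scattered randomly about zero with no funnel, curve, or trend, as in Figure 3.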