Lesson 7: Principal Components Analysis (PCA)

Introduction

Sometimes data are collected on a large number of variables from a single population. As an example, consider the Places Rated dataset below.

Example: Places Rated
In the Places Rated Almanac, Boyer and Savageau rated 329 communities according to the following nine criteria:

1. Climate and Terrain
2. Housing
3. Health Care & the Environment
4. Crime
5. Transportation
6. Education
7. The Arts
8. Recreation
9. Economics
Note that within the dataset, except for housing and crime, the higher the score the better. For housing and crime, the lower the score the better. While some communities might do better in the arts, other communities might be rated better in other areas, such as having a lower crime rate and good educational opportunities.

Objective
With a large number of variables, the dispersion matrix may be too large to study and interpret properly. There would be too many pairwise correlations between the variables to consider. Graphical display of the data may also not be of particular help if the data set is very large. With 12 variables, for example, there would be more than 200 three-dimensional scatterplots to study! To interpret the data in a more meaningful form, it is therefore necessary to reduce the number of variables to a few, interpretable linear combinations of the data. Each linear combination will correspond to a principal component. (There is another very useful data reduction technique called Factor Analysis, which will be taken up in a subsequent lesson.)
Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

Carry out a principal components analysis using SAS and Minitab;
Assess how many principal components should be considered in an analysis;
Interpret principal component scores and describe a subject with a high or low score;
Determine when a principal component analysis may be based on the variance-covariance matrix, and when the correlation matrix should be used;
Understand how principal component scores may be used in further analyses.
7.1 - Principal Component Analysis (PCA) Procedure

Suppose that we have a random vector X,

\(\textbf{X} = \left(\begin{array}{c} X_1\\ X_2\\ \vdots \\X_p\end{array}\right)\)

with population variance-covariance matrix

\(\text{var}(\textbf{X}) = \Sigma = \left(\begin{array}{cccc}\sigma^2_1 & \sigma_{12} & \dots &\sigma_{1p}\\ \sigma_{21} & \sigma^2_2 & \dots &\sigma_{2p}\\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \dots & \sigma^2_p\end{array}\right)\)

Consider the linear combinations

\(\begin{array}{lll} Y_1 & = & e_{11}X_1 + e_{12}X_2 + \dots + e_{1p}X_p \\ Y_2 & = & e_{21}X_1 + e_{22}X_2 + \dots + e_{2p}X_p \\ & & \vdots \\ Y_p & = & e_{p1}X_1 + e_{p2}X_2 + \dots + e_{pp}X_p\end{array}\)

Each of these can be thought of as a linear regression, predicting \(Y_i\) from \(X_1, X_2, \dots, X_p\). There is no intercept, but \(e_{i1}, e_{i2}, \dots, e_{ip}\) can be viewed as regression coefficients. Note that \(Y_i\) is a function of our random data, and so is also random. Therefore it has a population variance

\[\text{var}(Y_i) = \sum_{k=1}^{p}\sum_{l=1}^{p}e_{ik}e_{il}\sigma_{kl} = \mathbf{e}'_i\Sigma\mathbf{e}_i\]

Moreover, \(Y_i\) and \(Y_j\) will have a population covariance

\[\text{cov}(Y_i, Y_j) = \sum_{k=1}^{p}\sum_{l=1}^{p}e_{ik}e_{jl}\sigma_{kl} = \mathbf{e}'_i\Sigma\mathbf{e}_j\]

Here the coefficients \(e_{ij}\) are collected into the vector

\(\mathbf{e}_i = \left(\begin{array}{c} e_{i1}\\ e_{i2}\\ \vdots \\ e_{ip}\end{array}\right)\)

First Principal Component (PCA1): \(Y_1\)

The first principal component is the linear combination of the x-variables that has maximum variance (among all linear combinations), so it accounts for as much variation in the data as possible.
Specifically, we will define the coefficients \(e_{11}, e_{12}, \dots, e_{1p}\) for that component in such a way that its variance is maximized, subject to the constraint that the sum of the squared coefficients is equal to one. This constraint is required so that a unique answer may be obtained. More formally, select \(e_{11}, e_{12}, \dots, e_{1p}\) to maximize

\[\text{var}(Y_1) = \sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{1l}\sigma_{kl} = \mathbf{e}'_1\Sigma\mathbf{e}_1\]

subject to the constraint that

\[\mathbf{e}'_1\mathbf{e}_1 = \sum_{j=1}^{p}e^2_{1j} = 1\]

Second Principal Component (PCA2): \(Y_2\)

The second principal component is the linear combination of the x-variables that accounts for as much of the remaining variation as possible, with the constraint that the correlation between the first and second components is 0.
Select \(e_{21}, e_{22}, \dots, e_{2p}\) to maximize the variance of this new component,

\[\text{var}(Y_2) = \sum_{k=1}^{p}\sum_{l=1}^{p}e_{2k}e_{2l}\sigma_{kl} = \mathbf{e}'_2\Sigma\mathbf{e}_2\]

subject to the constraint that the sum of the squared coefficients equals one,

\[\mathbf{e}'_2\mathbf{e}_2 = \sum_{j=1}^{p}e^2_{2j} = 1\]

along with the additional constraint that these two components are uncorrelated with one another,

\[\text{cov}(Y_1, Y_2) = \sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{2l}\sigma_{kl} = \mathbf{e}'_1\Sigma\mathbf{e}_2 = 0\]

All subsequent principal components have this same property: they are linear combinations that account for as much of the remaining variation as possible, and they are not correlated with the other principal components.
We proceed in the same way with each additional component. For instance:

ith Principal Component (PCAi): \(Y_i\)
We select \(e_{i1}, e_{i2}, \dots, e_{ip}\) to maximize

\[\text{var}(Y_i) = \sum_{k=1}^{p}\sum_{l=1}^{p}e_{ik}e_{il}\sigma_{kl} = \mathbf{e}'_i\Sigma\mathbf{e}_i\]

subject to the constraint that the sum of the squared coefficients equals one, along with the additional constraint that this new component is uncorrelated with all the previously defined components:

\(\mathbf{e}'_i\mathbf{e}_i = \sum_{j=1}^{p}e^2_{ij} = 1\)
\(\text{cov}(Y_1, Y_i) = \sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{il}\sigma_{kl} = \mathbf{e}'_1\Sigma\mathbf{e}_i = 0\), \(\text{cov}(Y_2, Y_i) = \sum_{k=1}^{p}\sum_{l=1}^{p}e_{2k}e_{il}\sigma_{kl} = \mathbf{e}'_2\Sigma\mathbf{e}_i = 0\), \(\vdots\) \(\text{cov}(Y_{i-1}, Y_i) = \sum_{k=1}^{p}\sum_{l=1}^{p}e_{i-1,k}e_{il}\sigma_{kl} = \mathbf{e}'_{i-1}\Sigma\mathbf{e}_i = 0\) Therefore all principal components are uncorrelated with one another.
7.2 - How do we find the coefficients?

How do we find the coefficients \(e_{ij}\) for a principal component? The solution involves the eigenvalues and eigenvectors of the variance-covariance matrix Σ.

Solution: We are going to let \(\lambda_1\) through \(\lambda_p\) denote the eigenvalues of the variance-covariance matrix Σ. These are ordered so that \(\lambda_1\) is the largest eigenvalue and \(\lambda_p\) is the smallest:

\(\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p\)

We are also going to let the vectors \(\mathbf{e}_1, \mathbf{e}_2, \dots, \mathbf{e}_p\)
denote the corresponding eigenvectors. It turns out that the elements of these eigenvectors will be the coefficients of our principal components. The variance of the ith principal component is equal to the ith eigenvalue,

\(\text{var}(Y_i) = \text{var}(e_{i1}X_1 + e_{i2}X_2 + \dots + e_{ip}X_p) = \lambda_i\)

Moreover, the principal components are uncorrelated with one another:

\(\text{cov}(Y_i, Y_j) = 0\)

The variance-covariance matrix may be written as a function of the eigenvalues and their corresponding eigenvectors. This is determined by using the Spectral Decomposition Theorem. This will become useful later when we investigate topics under factor analysis.

Spectral Decomposition Theorem
The variance-covariance matrix can be written as the sum over the p eigenvalues, each multiplied by the product of the corresponding eigenvector and its transpose, as shown in the first expression below:
\[\begin{array}{lll} \Sigma & = & \sum_{i=1}^{p}\lambda_i \mathbf{e}_i \mathbf{e}_i' \\ & \cong & \sum_{i=1}^{k}\lambda_i \mathbf{e}_i\mathbf{e}_i'\end{array}\]

The second expression is a useful approximation if \(\lambda_{k+1}, \lambda_{k+2}, \dots , \lambda_{p}\) are small: we might approximate Σ by

\[\sum_{i=1}^{k}\lambda_i\mathbf{e}_i\mathbf{e}_i'\]

Again, this will become more useful when we talk about factor analysis. Earlier in the course we defined the total variation of X as the trace of the variance-covariance matrix, or if you like, the sum of the variances of the individual variables. This is also equal to the sum of the eigenvalues, as shown below:

\(\begin{array}{lll}trace(\Sigma) & = & \sigma^2_1 + \sigma^2_2 + \dots +\sigma^2_p \\ & = & \lambda_1 + \lambda_2 + \dots + \lambda_p\end{array}\)

This gives us an interpretation of the components in terms of the amount of the full variation explained by each component. The proportion of variation explained by the ith principal component is defined to be the eigenvalue for that component divided by the sum of the eigenvalues. In other words, the ith principal component explains the following proportion of the total variation:

\[\frac{\lambda_i}{\lambda_1 + \lambda_2 + \dots + \lambda_p}\]

A related quantity is the proportion of variation explained by the first k principal components. This is the sum of the first k eigenvalues divided by the total variation:

\[\frac{\lambda_1 + \lambda_2 + \dots + \lambda_k}{\lambda_1 + \lambda_2 + \dots + \lambda_p}\]

Naturally, if the proportion of variation explained by the first k principal components is large, then not much information is lost by considering only the first k principal components.

Why It May Be Possible to Reduce Dimensions

When we have correlations (multicollinearity) between the x-variables, the data may more or less fall on a line or plane in a lower number of dimensions.
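The identities above (the spectral decomposition, the trace identity, and the proportion of variation explained) can be checked numerically. Below is a sketch in plain NumPy; the small covariance matrix is hypothetical, with values invented purely for illustration:

```python
import numpy as np

# A small hypothetical covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 2.0, 0.6],
                  [2.0, 3.0, 0.4],
                  [0.6, 0.4, 1.0]])

# eigh returns the eigenvalues of a symmetric matrix in ascending order;
# reverse so that lambda_1 >= lambda_2 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Spectral decomposition: Sigma = sum_i lambda_i * e_i e_i'.
reconstructed = sum(lam * np.outer(e, e) for lam, e in zip(eigvals, eigvecs.T))
assert np.allclose(Sigma, reconstructed)

# Total variation: trace(Sigma) equals the sum of the eigenvalues.
assert np.isclose(np.trace(Sigma), eigvals.sum())

# Proportion of the total variation explained by each component.
proportion = eigvals / eigvals.sum()
```

Truncating the sum at k < p components gives the approximation in the second expression above.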
For instance, imagine a plot of two x-variables that have a nearly perfect correlation. The data points will fall close to a straight line. That line could be used as a new (one-dimensional) axis to represent the variation among data points. As another example, suppose that we have verbal, math, and total SAT scores for a sample of students. We have three variables, but really (at most) two dimensions to the data because total = verbal + math, meaning the third variable is completely determined by the first two. The reason for saying "at most" two dimensions is that if there is a strong correlation between verbal and math, then it may be possible that there is only one true dimension to the data.
Note
All of this is defined in terms of the population variance-covariance matrix Σ, which is unknown. However, we may estimate Σ by the sample variance-covariance matrix given by the standard formula:

\[\textbf{S} = \frac{1}{n-1} \sum_{i=1}^{n}(\mathbf{X}_i-\bar{\textbf{x}})(\mathbf{X}_i-\bar{\textbf{x}})'\]
Procedure
Compute the eigenvalues \(\hat{\lambda}_1, \hat{\lambda}_2, \dots, \hat{\lambda}_p\) of the sample variance-covariance matrix S, and the corresponding eigenvectors \(\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \dots, \hat{\mathbf{e}}_p\). Then we define our estimated principal components using the eigenvectors as our coefficients:

\(\begin{array}{lll} \hat{Y}_1 & = & \hat{e}_{11}X_1 + \hat{e}_{12}X_2 + \dots + \hat{e}_{1p}X_p \\ \hat{Y}_2 & = & \hat{e}_{21}X_1 + \hat{e}_{22}X_2 + \dots + \hat{e}_{2p}X_p \\&&\vdots\\ \hat{Y}_p & = & \hat{e}_{p1}X_1 + \hat{e}_{p2}X_2 + \dots + \hat{e}_{pp}X_p \\ \end{array}\)

Generally, we only retain the first k principal components. Here we must balance two conflicting desires:

1. To obtain the simplest possible interpretation, we want k to be as small as possible. If we can explain most of the variation with just two principal components, then this gives us a much simpler description of the data. However, the smaller k is, the less of the variation is explained by the first k components.
2. To avoid loss of information, we want the proportion of variation explained by the first k principal components to be large, ideally as close to one as possible; i.e., we want

\[\frac{\hat{\lambda}_1 + \hat{\lambda}_2 + \dots + \hat{\lambda}_k}{\hat{\lambda}_1 + \hat{\lambda}_2 + \dots + \hat{\lambda}_p} \cong 1\]
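This estimation procedure can be sketched in a few lines of NumPy on simulated data (not the Places Rated data). The key facts to check are that the sample variance of the ith estimated component equals \(\hat{\lambda}_i\) and that the component scores are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))      # 100 sample units, p = 4 variables
X[:, 1] += 0.8 * X[:, 0]           # induce some correlation

# Sample variance-covariance matrix S (the 1/(n-1) formula above).
S = np.cov(X, rowvar=False)

# Eigenvalues and eigenvectors of S, ordered from largest to smallest.
lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]

# Estimated principal component scores from the centered data.
scores = (X - X.mean(axis=0)) @ E

# Variance of the ith score equals lambda_i-hat; scores are uncorrelated.
assert np.allclose(np.var(scores, axis=0, ddof=1), lam)
assert abs(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]) < 1e-8

# Proportion of variation explained by the first k = 2 components.
explained = lam[:2].sum() / lam.sum()
```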
7.3 - Example: Places Rated

We will use the Places Rated Almanac data (Boyer and Savageau), which rates 329 communities according to nine criteria:

1. Climate and Terrain
2. Housing
3. Health Care & Environment
4. Crime
5. Transportation
6. Education
7. The Arts
8. Recreation
9. Economics
Notes: The data for many of the variables are strongly skewed to the right. The log transformation was used to normalize the data.
The SAS program places.sas will implement the principal component procedures:
When you examine the output, the first thing that SAS does is give us summary information. There are 329 observations, representing the 329 communities in our dataset, and 9 variables. This is followed by simple statistics that report the means and standard deviations for each variable. Below this is the variance-covariance matrix for the data. You should be able to see that the variance reported for climate is 0.01289. What we really need to draw our attention to here are the eigenvalues of the variance-covariance matrix. In the SAS output, the eigenvalues are given in ranked order from largest to smallest. These values have been copied into Table 1 below for discussion.
Data Analysis: Step 1: We examine the eigenvalues to determine how many principal components should be considered:
Table 1. Eigenvalues, and the proportion of variation explained by the principal components.
Component   Eigenvalue   Proportion   Cumulative
1           0.3775       0.7227       0.7227
2           0.0511       0.0977       0.8204
3           0.0279       0.0535       0.8739
4           0.0230       0.0440       0.9178
5           0.0168       0.0321       0.9500
6           0.0120       0.0229       0.9728
7           0.0085       0.0162       0.9890
8           0.0039       0.0075       0.9966
9           0.0018       0.0034       1.0000
Total       0.5225
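The Proportion and Cumulative columns of Table 1, and the successive differences between eigenvalues discussed below, can be reproduced directly from the tabled eigenvalues. A short Python sketch:

```python
# Eigenvalues copied from Table 1 (covariance-matrix analysis).
eigenvalues = [0.3775, 0.0511, 0.0279, 0.0230, 0.0168,
               0.0120, 0.0085, 0.0039, 0.0018]

total = sum(eigenvalues)                      # total variation, 0.5225
proportions = [lam / total for lam in eigenvalues]

# Running total of the proportions gives the Cumulative column.
cumulative, running = [], 0.0
for p in proportions:
    running += p
    cumulative.append(running)

# Successive differences between eigenvalues; a sharp drop suggests
# how many components to retain.
differences = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]
```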
If you take all of these eigenvalues and add them up, you get the total variance of 0.5225. The proportion of variation explained by each eigenvalue is given in the third column. For example, 0.3775 divided by 0.5225 gives about 0.72, so about 72% of the variation is explained by this first eigenvalue. The cumulative percentage explained is obtained by adding the successive proportions of variation explained to obtain the running total. For instance, 0.7227 plus 0.0977 equals 0.8204, and so forth. Therefore, about 82% of the variation is explained by the first two eigenvalues together. Next we need to look at successive differences between the eigenvalues. Subtracting the second eigenvalue, 0.051, from the first eigenvalue, 0.377, we get a difference of 0.326. The difference between the second and third eigenvalues is 0.0232; the next difference is 0.0049. Subsequent differences are even smaller. A sharp drop from one eigenvalue to the next may serve as another indicator of how many eigenvalues to consider. The first three principal components explain 87% of the variation. This is an acceptably large percentage.

An alternative method to determine the number of principal components is to look at a scree plot. With the eigenvalues ordered from largest to smallest, a scree plot is the plot of \(\hat{\lambda}_i\) versus i. The number of components is determined at the point beyond which the remaining eigenvalues are all relatively small and of comparable size. The following plot is made in Minitab.
The scree plot for the variables without standardization (covariance matrix).

As you can see, we could have stopped at the second principal component, but we continued through the third component. Relatively speaking, the contribution of the third component is small compared to the second component.

Step 2: Next, we will compute the principal component scores. For example, the first principal component can be computed using the elements of the first eigenvector:
\(\begin{array}{lll}\hat{Y}_1 & = & 0.0351 \times (\text{climate}) + 0.0933 \times (\text{housing}) + 0.4078 \times (\text{health})\\ & & + 0.1004 \times (\text{crime}) + 0.1501 \times (\text{transportation}) + 0.0321 \times (\text{education}) \\ && + 0.8743 \times (\text{arts}) + 0.1590 \times (\text{recreation}) + 0.0195 \times (\text{economy})\end{array}\)

In order to complete this formula and compute the principal component score for an individual community of interest, plug in that community's values for each of these variables. A fairly standard procedure is to use the difference between each variable and its sample mean, rather than the raw data. This is known as a translation of the random variables. Translation does not affect the interpretations because the variances of the original variables are the same as those of the translated variables. The magnitudes of the coefficients give the contributions of each variable to that component. However, the magnitudes of the coefficients also depend on the variances of the corresponding variables.
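As a sketch, the score computation for one community might look like the following. The coefficients are the first-eigenvector elements reported above, but the community's variable values and the sample means are hypothetical numbers invented purely for illustration:

```python
import numpy as np

# First eigenvector (coefficients reported above).
e1 = np.array([0.0351, 0.0933, 0.4078, 0.1004, 0.1501,
               0.0321, 0.8743, 0.1590, 0.0195])

# Hypothetical (log-transformed) values for one community and the
# sample means of the nine variables -- illustrative numbers only.
x     = np.array([0.55, 3.45, 3.05, 2.88, 3.35, 3.36, 3.20, 3.33, 3.87])
x_bar = np.array([0.52, 3.36, 3.00, 2.91, 3.29, 3.41, 3.05, 3.28, 3.74])

# Translate (center) the data, then take the linear combination.
y1 = float(e1 @ (x - x_bar))
```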
7.4 - Interpretation of the Principal Components

Step 3: To interpret each component, we must compute the correlations between the original data for each variable and each principal component.
These correlations are obtained using the correlation procedure. In the variable statement we will include the first three principal components, "prin1, prin2, and prin3", in addition to all nine of the original variables. We will use these correlations between the principal components and the original variables to interpret these principal components. Because of standardization, all principal components will have mean 0. The standard deviation is also
given for each of the components, and these will be the square roots of the eigenvalues. More important for our current purposes are the correlations between the principal components and the original variables. These have been copied into the following table. You will also note that if you look at the principal components themselves, there is zero correlation between the components.

                 Principal Component
Variable             1        2        3
Climate          0.190    0.017    0.207
Housing          0.544    0.020    0.204
Health           0.782   -0.605    0.144
Crime            0.365    0.294    0.585
Transportation   0.585    0.085    0.234
Education        0.394   -0.273    0.027
Arts             0.985    0.126   -0.111
Recreation       0.520    0.402    0.519
Economy          0.142    0.150    0.239
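For a covariance-based PCA, these correlations do not have to be computed from the scores; they also follow from the eigen-decomposition as \(\text{corr}(Y_i, X_j) = \hat{e}_{ij}\sqrt{\hat{\lambda}_i}/s_j\). A quick check of that identity on simulated data (not the Places Rated data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X[:, 2] += X[:, 0]                     # make two variables correlated

S = np.cov(X, rowvar=False)
lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]
scores = (X - X.mean(axis=0)) @ E

# Correlation between component i and variable j, two ways:
i, j = 0, 2
direct  = np.corrcoef(scores[:, i], X[:, j])[0, 1]
formula = E[j, i] * np.sqrt(lam[i]) / np.sqrt(S[j, j])
assert np.isclose(direct, formula)
```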
Interpretation of the principal components is based on finding which variables are most strongly correlated with each component, i.e., which of these numbers are large in magnitude, farthest from zero in either the positive or negative direction. Which numbers we consider to be large or small is, of course, a subjective decision. You need to determine at what level the correlation value will be of importance. Here a correlation value above 0.5 is deemed important. We will now interpret the principal component results with respect to the value that we have deemed significant.

First Principal Component Analysis - PCA1
The first principal component is strongly correlated with five of the original variables. The first principal component increases with increasing Arts, Health, Transportation, Housing and Recreation scores. This suggests that these five criteria vary together: if one increases, then the remaining ones tend to increase as well. This component can be viewed as a measure of the quality of Arts, Health, Transportation, and Recreation, and the lack of quality in Housing (recall that high values for Housing are bad). Furthermore, we see that the first principal component correlates most strongly with the Arts. In fact, based on the correlation of 0.985, we could state that this principal component is primarily a measure of the Arts. It would follow that communities with high scores would tend to have a lot of arts available, in terms of theaters, orchestras, etc., whereas communities with small scores would have very few of these types of opportunities.

Second Principal Component Analysis - PCA2
The second principal component is strongly correlated with only one of the variables: it increases as Health decreases. This component can be viewed as a measure of how unhealthy the location is in terms of available health care, including doctors, hospitals, etc.

Third Principal Component Analysis - PCA3
The third principal component increases with increasing Crime and Recreation. This suggests that places with high crime also tend to have better recreation facilities. To complete the analysis, we would often like to produce a scatter plot of the component scores. In looking at the program, you will see a gplot procedure at the bottom, where we plot the second component against the first component. A similar plot can also be prepared in Minitab, but is not shown here.
Each dot in this plot represents one community. If you were looking at the red dot out by itself to the right, you might conclude that this particular community has a very high value for the first principal component, and we would expect it to have high values for the Arts, Health, Housing, Transportation and Recreation. Whereas if you look at the red dot at the left of the spectrum, you would expect low values for each of those variables. The top dot in blue has a high value for the second component, so you would expect that this community would be lousy for Health. Conversely, the community corresponding to the blue dot on the bottom would have high values for Health.

Further analyses may include:

Scatter plots of principal component scores. In the present context, we may wish to identify the locations of each point in the plot to see if places with high levels of a given component tend to be clustered in a particular region of the country, while sites with low levels of that component are clustered in another region of the country.

Principal components are often treated as dependent variables for regression and analysis of variance.
7.5 - Alternative: Standardize the Variables

In the previous example we looked at a principal components analysis applied to the raw data. In our earlier discussion we noted that if the raw data are used, principal component analysis will tend to give more emphasis to variables that have higher variances than to variables with very low variances. In effect, the results of the analysis will depend on the units of measurement used for each variable. That implies that a principal component analysis should only be used with the raw data if all variables have the same units of measure, and even then only if you wish to give those variables with higher variances more weight in the analysis.

A unique example of this type of implementation might be an ecological setting where you are looking at counts of different species of organisms at a number of different sample sites. Here, one may want to give more weight to the more common species observed. By analysing the raw data, you will tend to find that the more common species also show higher variances and will be given more emphasis. If you were to do a principal component analysis on standardized counts, all species would be weighted equally regardless of how abundant they are, and hence you may find some very rare species entering in as significant contributors in the analysis. This may or may not be desirable. These types of decisions need to be made with the scientific foundation and questions in mind.

Summary
The results of principal component analysis depend on the scales at which the variables are measured. Variables with the highest sample variances will tend to be emphasized in the first few principal components. Principal component analysis using the covariance matrix should only be considered if all of the variables have the same units of measurement. If the variables either have different units of measurement (i.e., pounds, feet, gallons, etc.), or if we wish each variable to receive equal weight in the analysis, then the variables should be standardized before a principal components analysis is carried out. Standardize each variable by subtracting its mean and dividing by its standard deviation:

\[Z_{ij} = \frac{X_{ij}-\bar{x}_j}{s_j}\]

where

\(X_{ij}\) = data for variable j in sample unit i
\(\bar{x}_{j}\) = sample mean for variable j
\(s_j\) = sample standard deviation for variable j

We will now perform the principal component analysis using the standardized data. Note: the variance-covariance matrix of the standardized data is equal to the correlation matrix for the unstandardized data. Therefore, principal component analysis using the standardized data is equivalent to principal component analysis using the correlation matrix.

Principal Component Analysis Procedure
The principal components are calculated by first obtaining the eigenvalues of the sample correlation matrix R,

\(\hat{\lambda}_1, \hat{\lambda}_2, \dots, \hat{\lambda}_p\)

and the corresponding eigenvectors

\(\mathbf{\hat{e}}_1, \mathbf{\hat{e}}_2, \dots, \mathbf{\hat{e}}_p\)

Then the estimated principal component scores are calculated using formulas similar to before, but with the standardized data in place of the raw data:

\(\begin{array}{lll} \hat{Y}_1 & = & \hat{e}_{11}Z_1 + \hat{e}_{12}Z_2 + \dots + \hat{e}_{1p}Z_p \\ \hat{Y}_2 & = & \hat{e}_{21}Z_1 + \hat{e}_{22}Z_2 + \dots + \hat{e}_{2p}Z_p \\&&\vdots\\ \hat{Y}_p & = & \hat{e}_{p1}Z_1 + \hat{e}_{p2}Z_2 + \dots + \hat{e}_{pp}Z_p \\ \end{array}\)

The rest of the procedure and the interpretations are as discussed before.
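The equivalence between PCA on standardized data and PCA on the correlation matrix can be sketched in NumPy, using simulated data with deliberately different scales:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3)) * np.array([1.0, 10.0, 100.0])  # mixed scales

# Standardize: subtract each variable's mean, divide by its std deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The covariance matrix of Z equals the correlation matrix R of X ...
R = np.corrcoef(X, rowvar=False)
assert np.allclose(np.cov(Z, rowvar=False), R)

# ... so PCA on Z is just the eigenanalysis of R.
lam, E = np.linalg.eigh(R)
lam, E = lam[::-1], E[:, ::-1]
scores = Z @ E

# Total variation of the standardized variables is p, the number of variables.
assert np.isclose(lam.sum(), X.shape[1])
```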
7.6 - Example: Places Rated after Standardization

The previous analysis is repeated after standardizing the variables.
The SAS program places1.sas will implement the principal component procedures using the standardized data:
The output begins with descriptive information, including the means and standard deviations for the individual variables. This is followed by the correlation matrix for the data. For example, the correlation between the housing and climate data is only 0.273. No hypothesis tests that these correlations equal zero are presented. We will use this correlation matrix instead to obtain our eigenvalues and eigenvectors.
We need to focus on the eigenvalues of the correlation matrix that correspond to each of the principal components. In this case, total variation of the standardized variables is going to be equal to p, the number of variables. After standardization each variable has variance equal to one, and the total variation is the sum of these variations, in this case the total variation will be 9. The eigenvalues of the correlation matrix are given in the second column in the table below. Note also the proportion of variation explained by each of the principal components, as well as the cumulative proportion of the variation explained.
Step 1
Examine the eigenvalues to determine how many principal components should be considered:

Component   Eigenvalue   Proportion   Cumulative
1           3.2978       0.3664       0.3664
2           1.2136       0.1348       0.5013
3           1.1055       0.1228       0.6241
4           0.9073       0.1008       0.7249
5           0.8606       0.0956       0.8205
6           0.5622       0.0625       0.8830
7           0.4838       0.0538       0.9368
8           0.3181       0.0353       0.9721
9           0.2511       0.0279       1.0000
The first principal component explains about 37% of the variation. Furthermore, the first four principal components explain 72%, while the first five explain 82% of the variation. Compare these proportions with those obtained using the non-standardized variables. This analysis requires a larger number of components to explain the same amount of variation as the original analysis using the variance-covariance matrix. This is not unusual. In most cases, the required cut-off is pre-specified; i.e., how much of the variation to be explained is predetermined. For instance, I might state that I would be satisfied if I could explain 70% of the variation. If we do this, then we would select components until we reach 70% of the variation. This would be one approach. This type of judgment is arbitrary and hard to make if you are not experienced with these types of analyses. The goal, to some extent, also depends on the type of problem at hand.

Another approach would be to plot the differences between the ordered values and look for a break or a sharp drop. The only noticeable sharp drop in this case is after the first component. One might, based on this, select only one component. However, one component is probably too few, particularly because we have only explained 37% of the variation. Consider the scree plot based on the standardized variables.
The scree plot for standardized variables (correlation matrix).

Step 2
Next, we can compute the principal component scores using the eigenvectors. This is the formula for the first principal component:

\(\begin{array}{lll} \hat{Y}_1 & = & 0.158 \times Z_{\text{climate}} + 0.384 \times Z_{\text{housing}} + 0.410 \times Z_{\text{health}}\\ & & + 0.259 \times Z_{\text{crime}} + 0.375 \times Z_{\text{transportation}} + 0.274 \times Z_{\text{education}} \\ && + 0.474 \times Z_{\text{arts}} + 0.353 \times Z_{\text{recreation}} + 0.164 \times Z_{\text{economy}}\end{array}\)

Remember, this is now a function not of the raw data but of the standardized data. The magnitudes of the coefficients give the contributions of each variable to that component. Since the data have been standardized, they do not depend on the variances of the corresponding variables.
Next, we can look at the coefficients for the principal components. In this case, since the data are standardized, the relative magnitudes of the coefficients within a column can be directly assessed. Each column here corresponds with a column in the output of the program labeled Eigenvectors.

                 Principal Component
Variable             1        2        3        4        5
Climate          0.158    0.069    0.800    0.377    0.041
Housing          0.384    0.139    0.080    0.197   -0.580
Health           0.410   -0.372   -0.019    0.113    0.030
Crime            0.259    0.474    0.128   -0.042    0.692
Transportation   0.375   -0.141   -0.141   -0.430    0.191
Education        0.274   -0.452   -0.241    0.457    0.224
Arts             0.474   -0.104    0.011   -0.147    0.012
Recreation       0.353    0.292    0.042   -0.404   -0.306
Economy          0.164    0.540   -0.507    0.476   -0.037
Interpretation of the principal components is based on finding which variables are most strongly correlated with each component. In other words, we need to decide which numbers are large within each column. In the first column we will decide that Health and Arts are large. This is very arbitrary. Other variables might have also been included as part of this first principal component. Component Summaries First Principal Component Analysis - PCA1
The first principal component is a measure of the quality of Health and the Arts, and to some extent Housing, Transportation and Recreation. All of these variables have positive coefficients, so they are positively related: if any one of them goes up, the remaining ones tend to as well.

Second Principal Component Analysis - PCA2
The second principal component is a measure of the severity of crime, the quality of the economy, and the lack of quality in education. Crime and Economy increase with decreasing Education. Here we can see that cities with high levels of crime and good economies also tend to have poor educational systems. Third Principal Component Analysis - PCA3
The third principal component is a measure of the quality of the climate and poorness of the economy. Climate increases with decreasing Economy. The inclusion of economy within this component will add a bit of redundancy within our results. This component is primarily a measure of climate, and to a lesser extent the economy. Fourth Principal Component Analysis - PCA4
The fourth principal component is a measure of the quality of education and the economy and the poorness of the transportation network and recreational opportunities. Education and Economy increase with decreasing Transportation and Recreation. Fifth Principal Component Analysis - PCA5
The fifth principal component is a measure of the severity of crime and the quality of housing. Crime increases with decreasing housing.
7.7 - Once the Components Have Been Calculated
One can interpret these component by component. One method of deciding how many components to include is to retain only those that give unambiguous results, i.e., where no variable appears in two different columns as a significant contributor. Note that the primary purpose of this analysis is descriptive: it is not hypothesis testing! So your decision in many respects needs to be made based on what provides you with a good, concise description of the data. We have to decide what is an important correlation, not necessarily from a statistical hypothesis testing perspective, but from, in this case, an urban-sociological perspective. You have to decide what is important in the context of the problem at hand. This decision may differ from discipline to discipline. In some disciplines, such as sociology and ecology, the data tend to be inherently 'noisy', and in this case you would expect 'messier' interpretations. In a discipline such as engineering, where everything has to be precise, you might put higher demands on the analysis and want very high correlations. Principal components analysis is mostly implemented in sociological and ecological types of applications, as well as in marketing research.

As before, you can plot the principal components against one another and explore where the data for certain observations lie. Sometimes the principal component scores will be used as explanatory variables in a regression. In regression settings you might have a very large number of potential explanatory variables and not much of an idea as to which ones are important. You might first perform a principal components analysis and then perform a regression predicting the variable of interest from the principal components themselves.
The nice thing about this analysis is that the regression coefficients will be independent of one another, since the components are independent of one another. In this case, you can actually say how much of the variation in the variable of interest is explained by each of the individual components. This is something that you cannot normally do in multiple regression.

One of the problems with this analysis is that, because of all of the numbers involved, the analysis is not as 'clean' as one would like. For example, looking at the second and third components, the Economy is considered to be significant for both. As you can see, this leads to an ambiguous interpretation in our analysis. An alternative method of data reduction is Factor Analysis, where factor rotations are used to reduce the complexity and obtain a cleaner interpretation of the data.
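A minimal sketch of this use of principal component scores as regression predictors, on simulated data with invented coefficients: PCA of the predictors, then ordinary least squares of the response on the first k component scores.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)   # collinear predictors
y = X[:, 0] - X[:, 2] + rng.normal(size=n)     # hypothetical response

# Step 1: PCA of the predictors.
Xc = X - X.mean(axis=0)
lam, E = np.linalg.eigh(np.cov(Xc, rowvar=False))
lam, E = lam[::-1], E[:, ::-1]
scores = Xc @ E

# Step 2: regress y on the first k component scores (plus an intercept).
# Because the scores are uncorrelated, each coefficient's contribution
# to the explained variation can be assessed separately.
k = 3
design = np.column_stack([np.ones(n), scores[:, :k]])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
```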
7.8 - Summary

In this lesson we learned about:

The definition of a principal components analysis;
How to interpret the principal components;
How to select the number of principal components to be considered;
How to choose between doing the analysis based on the variance-covariance matrix or the correlation matrix.

Look for this lesson's homework problems that will give you a chance to put what you have learned to use.