     Principal Components Analysis...
Instructor:          Jeremy Jackson   |    Winter, 2015

Office:                NW 3431    |   New Westminster
Theodor Adorno: "Love is the power to see similarity in the dissimilar."

Selected sections from: Yaya Keho, Ecole Nationale Supérieure de Statistique et d’Economie Appliquée (ENSEA), Abidjan, Côte d’Ivoire

Introduction

Introduced by Pearson (1901) and Hotelling (1933), Principal Components Analysis (PCA) has become a popular data-processing and dimension-reduction technique, with numerous applications in engineering, biology, economics and social science. Today, PCA can be implemented through statistical software by students and professionals, but it is often poorly understood. The goal of this Chapter is to dispel the magic behind this statistical tool. The Chapter presents the basic intuitions for how and why principal component analysis works, and provides guidelines regarding the interpretation of the results. The mathematical aspects will be limited. At the end of this Chapter, readers of all levels will have a better understanding of PCA as well as the when, the why and the how of applying this technique. They will be able to determine the number of meaningful components to retain from PCA, create factor scores and interpret the components. More emphasis will be placed on examples explaining in detail the steps of implementing PCA in practice.

The basic prerequisite – Variance and correlation

PCA is useful when you have data on a large number of quantitative variables and wish to reduce them to a smaller number of constructed variables that account for most of the variance in the data. The method is mainly concerned with representing covariance and/or correlation in the data, so let us focus on the meaning of these concepts. Consider the dataset given in Table 1, which will serve to illustrate how PCA works in practice. The variance of a given variable x is defined as the average of the squared differences from the mean:

$$\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

where $n$ is the number of observations and $\bar{x}$ is the mean of x. The square root of the variance is the standard deviation, symbolized by the small Greek letter sigma, $\sigma_x$. It is a measure of how spread out the numbers are. The variance and the standard deviation are important in data analysis because of their relationships to correlation and the normal curve.

Correlation between a pair of variables measures the extent to which their values co-vary, which naturally brings the term covariance to mind. A simultaneous change in values can follow many patterns, such as linear or exponential. PCA uses linear correlation. The linear correlation coefficient for two variables x and y is given by:

$$r_{xy} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y}$$

where $\sigma_x$ and $\sigma_y$ denote the standard deviations of x and y, respectively. This definition is the most widely used type of correlation coefficient in statistics and is also called the Pearson correlation or product-moment correlation. Correlation coefficients lie between -1.00 and +1.00. A value of -1.00 represents a perfect negative correlation, while a value of +1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation. Correlation coefficients are used to assess the degree of collinearity or similarity among variables.
Notice that the value of the correlation coefficient does not depend on the specific measurement units used. When correlations among several variables are computed, they are typically summarized in the form of a correlation matrix. For the five variables in Table 1, we obtain the results reported in Table 2. In this table, the intersection of a given row and column shows the correlation between the two corresponding variables. For example, the correlation between variables X1 and X2 is 0.94. As can be seen from the correlations, the five variables seem to hang together in two distinct groups. First, notice that variables X1, X2 and X3 show relatively strong correlations with one another. In the same way, variables X4 and X5 correlate strongly with each other. Notice that those two variables show very weak correlations with the rest of the variables.
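
The definitions above can be checked numerically. The following Python sketch assumes NumPy is available and uses a few made-up measurements (they are not the values of Table 1). The helper names `variance` and `pearson_r` are introduced here for illustration only.

```python
import numpy as np

def variance(x):
    """Average of the squared differences from the mean (divides by n)."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 2)

def pearson_r(x, y):
    """Linear correlation: covariance divided by the two standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / np.sqrt(variance(x) * variance(y))

# Hypothetical measurements, NOT the values of Table 1.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

print(variance(x))             # variance of x
print(np.sqrt(variance(x)))    # standard deviation of x
print(pearson_r(x, y))         # linear correlation between x and y
print(pearson_r(x * 100, y))   # same value: units of measurement do not matter

# Correlation matrix of several variables (one variable per column).
data = np.column_stack([x, y, -x])
R = np.corrcoef(data, rowvar=False)
print(np.round(R, 2))
```

The matrix `R` is symmetric with ones on its diagonal, like Table 2, and the third column (the negative of `x`) shows what a perfect negative correlation looks like.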

Given that the five variables contain some "redundant" information, it is likely that they do not really represent five different independent factors, but two uncorrelated factors. What are these factors? To what extent does each variable measure each of these factors? One purpose of PCA is to provide answers to these questions. Let’s see how a two-dimensional PCA works with the data in Table 1.

In linear PCA, each of the two artificial variables (the two dimensions) is computed as a linear combination of the original variables:

$$Z = \alpha_1 X_1 + \alpha_2 X_2 + \dots + \alpha_p X_p$$

where $\alpha_j$ is the weight for variable j in creating the component Z and $p$ is the number of variables. The value of Z for a subject represents the subject’s score on the principal component.

Using our dataset, the first component is:

$$Z_1 = 0.579\,X_1 + 0.577\,X_2 + 0.554\,X_3 + 0.126\,X_4 + 0.098\,X_5$$

where each variable is taken in standardized form. The second component Z2 is built in the same way from its own set of weights. Notice that different coefficients are assigned to the original variables in computing subject scores on the two components. In Z1, the variables X1, X2 and X3 are assigned relatively large weights that range from 0.554 to 0.579, while variables X4 and X5 are assigned very small weights ranging from 0.098 to 0.126. As a result, component Z1 should account for much of the variability in the first three variables. In creating subject scores on the second component, much weight is given to X4 and X5, while little weight is given to X1, X2 and X3. Subject scores on each component are computed by adding together weighted scores on the observed variables. For example, the value of a subject along the first component Z1 is 0.579 times the standardized value of X1 plus 0.577 times the standardized value of X2 plus 0.554 times the standardized value of X3 plus 0.126 times the standardized value of X4 plus 0.098 times the standardized value of X5. The weights in the preceding equations are optimal in the sense that no other set of weights could produce components that account for more of the variance in the dataset.
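
To make the weighted sum concrete, here is a minimal Python sketch (assuming NumPy). The weights are those quoted in the text for Z1. The raw data are randomly generated stand-ins for six subjects, not the values of Table 1, so the resulting scores are purely illustrative.

```python
import numpy as np

# Weights of the first component as quoted in the text (for X1..X5).
w1 = np.array([0.579, 0.577, 0.554, 0.126, 0.098])

# Hypothetical raw data for 6 subjects on 5 variables (NOT the values of Table 1).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(6, 5))

# Standardize each column: subtract its mean, divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Score of each subject on Z1: the weighted sum of its standardized values.
z1 = X_std @ w1

print(np.round(z1, 3))            # one score per subject
print(round(float(w1 @ w1), 3))   # the squared weights sum to about 1
```

The last line shows that the quoted weights form a vector of (approximately) unit length, which is the usual normalization under which the weights are said to be optimal.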

Criteria for Determining the Number of Components to Retain

In principal component analysis the number of components extracted is equal to the number of variables being analyzed. This means that an analysis of our 5 variables would actually result in 5 components, not two. However, since PCA aims at reducing dimensionality, only the first few components will be important enough to be retained for interpretation. It is therefore reasonable to wonder how many independent components are necessary to best describe the data.

Eigenvalues can be thought of as a quantitative assessment of how well a component represents the data. The higher the eigenvalue of a component, the more representative it is of the data. Eigenvalues are therefore used to determine the importance of components. Table 3 provides the eigenvalues from the PCA applied to our dataset. In the column headed “Eigenvalue”, the eigenvalue for each component is presented. Each row in the table presents information about one of the 5 components: row “1” provides information about the first component (PCA1) extracted, row “2” provides information about the second component (PCA2) extracted, and so forth. Eigenvalues are ranked from the highest to the lowest. It can be seen that the eigenvalue for component 1 is 2.653, while the eigenvalue for component 2 is 1.98. This means that the first component accounts for 2.653 units of total variance while the second component accounts for 1.98 units. The third component accounts for about 0.27 units of variance. Note that the sum of the eigenvalues is 5, which is also the number of variables.

How do we determine how many components are worth retaining? Several criteria have been proposed for determining how many components should be retained. This section will describe three criteria: the Kaiser eigenvalue-one criterion, the Cattell Scree test, and the cumulative percent of variance accounted for.
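
As a rough illustration of where such eigenvalues come from, the sketch below (Python with NumPy) computes the eigenvalues of a correlation matrix. The data are simulated under an assumed two-factor structure that mimics the pattern described for Table 2. They are not the actual data, so the printed eigenvalues will differ from those in Table 3.

```python
import numpy as np

# Hypothetical data built from two latent factors (NOT Table 1):
# columns 0-2 follow factor f1, columns 3-4 follow factor f2.
rng = np.random.default_rng(42)
n = 200
f1, f2 = rng.normal(size=n), rng.normal(size=n)
noise = 0.3 * rng.normal(size=(n, 5))
X = np.column_stack([f1, f1, f1, f2, f2]) + noise

R = np.corrcoef(X, rowvar=False)       # 5 x 5 correlation matrix
eigenvalues = np.linalg.eigvalsh(R)    # returned in ascending order
eigenvalues = eigenvalues[::-1]        # rank from highest to lowest

print(np.round(eigenvalues, 3))
print(eigenvalues.sum())               # equals the number of variables, up to rounding
```

Because each standardized variable contributes one unit of variance, the eigenvalues of a correlation matrix always add up to the number of variables, here 5.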

The Kaiser Method

The Kaiser (1960) method provides a handy rule of thumb that can be used to retain components. This rule suggests keeping only components with eigenvalues greater than 1. This method is also known as the eigenvalue-one criterion. The rationale for this criterion is straightforward. Each observed variable contributes one unit of variance to the total variance in the data set. Any component that displays an eigenvalue greater than 1 accounts for a greater amount of variance than does any single variable. On the other hand, a component with an eigenvalue of less than 1 accounts for less variance than does one variable. The purpose of principal component analysis is to reduce variables into a relatively smaller number of components; this cannot be effectively achieved if we retain components that account for less variance than do individual variables. For this reason, components with eigenvalues less than 1 are of little use and are not retained.

However, this method can lead to retaining the wrong number of components under circumstances that are often encountered in research. The thoughtless application of this rule can lead to errors of interpretation when differences in the eigenvalues of successive components are trivial. For example, if component 2 displays an eigenvalue of 1.01 and component 3 displays an eigenvalue of 0.99, then component 2 will be retained but component 3 will not; this may mislead us into believing that the third component does not represent an important aspect of the variance in the set when, in fact, it accounts for almost exactly the same amount of variance as the second component. It is possible to use statistical tests to test for difference between successive eigenvalues. In fact, the Kaiser criterion ignores error associated with each eigenvalue due to sampling. Lambert, Wildt and Durand (1990) have proposed a bootstrapped version of the Kaiser approach to determine the interpretability of eigenvalues but we will not cover this method in this course.

Table 3 shows that the first component has an eigenvalue substantially greater than 1. It therefore explains more variance than a single variable, in fact 2.653 times as much. The second component displays an eigenvalue of 1.98, which is substantially greater than 1, and the third component displays an eigenvalue of 0.269, which is clearly lower than 1. The application of the Kaiser criterion, therefore, leads us to retain unambiguously the first two principal components.
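
The rule is simple enough to write down in a few lines of Python. The function name `kaiser_retain` is made up for this illustration. The first call uses the three eigenvalues quoted in the text, and the second reproduces the borderline case discussed earlier, with an invented leading eigenvalue.

```python
def kaiser_retain(eigenvalues, threshold=1.0):
    """Return the 1-based numbers of the components whose eigenvalue exceeds the threshold."""
    return [k for k, lam in enumerate(eigenvalues, start=1) if lam > threshold]

# First three eigenvalues quoted in the text; the remaining two are smaller still.
print(kaiser_retain([2.653, 1.98, 0.269]))   # [1, 2]

# Borderline case: 1.01 is kept and 0.99 is dropped, although the two
# components explain almost the same amount of variance.
# The leading 2.0 is a made-up value for illustration.
print(kaiser_retain([2.0, 1.01, 0.99]))      # [1, 2]
```

The second call shows why the rule should not be applied mechanically: a difference of 0.02 between two eigenvalues decides whether a component is kept.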

The Cattell Scree Test

The scree test is another device for determining the appropriate number of components to retain. First, it graphs the eigenvalues against the component number. As eigenvalues are constrained to decrease monotonically from the first principal component to the last, the scree plot shows the decreasing rate at which variance is explained by additional principal components. To choose the number of components, we stop at the point where the scree plot begins to level off (Cattell, 1966; Horn, 1965). The components that appear before the “break” are assumed to be important and are retained for further use; those appearing after the break are assumed to be unimportant and are not retained. The components after the break form the scree, by analogy with the rubble that gathers at the foot of a slope.

The scree plot of eigenvalues derived from Table 3 is displayed in Figure 1. The component numbers are listed on the horizontal axis, while eigenvalues are listed on the vertical axis. The figure shows a relatively large break between components 2 and 3, after which each successive component accounts for smaller and smaller amounts of the total variance. This agrees with the preceding conclusion that two principal components provide a very good summary of the data, accounting for about 93% of the total variance.

Sometimes a scree plot will display a pattern such that it is difficult to determine exactly where a break exists. When this happens, the scree plot must be supplemented with additional criteria, such as the Kaiser method or the cumulative percent of variance accounted for.

The Cumulative Percent of Total Variance Accounted For

When determining the number of meaningful components, remember that the subspace of components retained must account for a reasonable amount of variance in the data. It is common to express the eigenvalues as a percentage of the total. The fraction of an eigenvalue out of the sum of all eigenvalues represents the amount of variance accounted for by the corresponding principal component. The cumulative percent of variance explained by the first q components is calculated with the formula:

$$r_q = \frac{\sum_{j=1}^{q}\lambda_j}{\sum_{j=1}^{p}\lambda_j} \times 100$$

where $\lambda_j$ is the eigenvalue of component j and $p$ is the total number of components. The number of principal components we should use depends on how big an $r_q$ we need. This criterion involves retaining all components up to a total percent variance (Lebart, Morineau & Piron, 1995; Jolliffe, 2002). It is recommended that the components retained account for at least 60% of the variance. The principal components that offer little increase in the total variance explained are ignored; those components are considered to be noise. When PCA works well, the first two eigenvalues usually account for more than 60% of the total variation in the data.

In our current example, the percentage of variance accounted for by each component and the cumulative percent variance appear in Table 3. From this Table we can see that the first component alone accounts for 53.057% of the total variance and the second component alone accounts for 39.597% of the total variance. Adding these percentages together results in a sum of 92.65%. This means that the cumulative percent of variance accounted for by the first two components is about 93%. This provides a reasonable summary of the data. Thus we can keep the first two components and “throw away” the other components.
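
The arithmetic can be reproduced from the two eigenvalues quoted earlier and the fact that all five eigenvalues sum to 5. This is a small Python sketch assuming NumPy. Its results differ from Table 3 in the second decimal place, presumably because the eigenvalues are used here in rounded form.

```python
import numpy as np

p = 5                                   # number of variables = sum of all eigenvalues
eigenvalues = np.array([2.653, 1.98])   # first two eigenvalues quoted in the text

percent = 100.0 * eigenvalues / p       # percent of total variance per component
cumulative = np.cumsum(percent)         # cumulative percent of variance

print(np.round(percent, 2))             # about 53.06 and 39.6
print(np.round(cumulative, 2))          # about 53.06 and 92.66
print(cumulative[-1] >= 60.0)           # the 60% guideline is met with two components
```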

A number of other criteria have been proposed to select the number of components in PCA. Interested readers can consult Lawley (1956), Horn (1965), Humphreys and Montanelli (1975), Horn and Engstrom (1979), Zwick and Velicer (1986), Hubbard and Allen (1987) and Jackson (1993), among others.

The Interpretation of Principal Components

Running a PCA has become easy with statistical software. However, interpreting the results can be a difficult task. Here are a few guidelines that should help practitioners through the analysis.

The Conceptual Approach of Correlation with the Component

Once the analysis is complete, we wish to assign a name to each retained component that describes its content. To do this, we need to know which variables are most strongly correlated with the components. Correlations of the variables with the principal components are useful tools that can help interpret what the components represent. The correlations between each variable and each principal component are given in Table 4. These correlations are also known as component loadings. A coefficient greater than 0.4 in absolute value is considered important (but this is a rough rule of thumb). We can interpret PCA1 as being highly positively correlated with variables X1, X2 and X3, and weakly positively correlated with variables X4 and X5. So X1, X2 and X3 are the most important variables in the first principal component. PCA2, on the other hand, is highly positively correlated with X4 and X5, and weakly negatively related to X1 and X2. So X4 and X5 are most important in explaining the second principal component. In the conceptual approach, the name of the first component is based on the meaning of X1, X2 and X3 while that of the second component comes from X4 and X5.

It can be shown that the coordinate of a variable on a component (the projection of the variable on to the component) is the correlation between that variable and the component. This allows us to plot the reduced two-dimensional representation of variables in the Euclidean space constructed from the first two components. Because variables highly correlated with a component have a small angle with the component, their projection on the component is generally large (a long way down the component...but see below). Figure 2 represents this graph for our dataset. For each variable we have plotted on the horizontal dimension its loading on component 1, on the vertical dimension its loading on component 2.

The graph also presents a visual aspect of correlation patterns among variables. The cosine of the angle θ between two vectors x and y is the correlation between them. From Figure 2 the angles between the variables show us that the five variables form two distinct groups. Variables X1, X2 and X3 are positively correlated with each other, and form the first group. Variables X4 and X5 also correlate strongly with each other, and form the second group. Those two groups are weakly correlated. In this way, Figure 2 gives a reduced dimension representation of the correlation matrix given in Table 2. It is extremely important, however, to notice that the angle between variables is interpreted in terms of correlation only when variables are well-represented, that is, when they are close to the border of the circle of correlation. Remember that the goal of PCA is to explain multiple variables by a smaller number of components, and keep in mind that graphs obtained from that reduction method are projections that optimize a global criterion (i.e. the total variance). As such, some relationships between variables may be greatly altered.

Correlations between variables and components supply insights about variables that are not well-represented. In a subspace of components, the quality of representation of a variable is assessed by the sum of squared component loadings across components. This is called the communality of the variable. It measures the proportion of the variance of a variable accounted for by the components. For example, the communality of the variable X1 is $0.943^2 + 0.241^2 \approx 0.948$. This means that the first two components explain about 95% of the variance of the variable X1. This is substantial enough for us to fully interpret the variability in this variable as well as its relationship with the other variables. Communality can be used as a measure of goodness-of-fit of the projection. The communalities of the 5 variables of our data are displayed in Table 5. As shown by this table, the first two components explain more than 80% of the variance in each variable. This is enough to reveal the structure of correlation among the variables. Do not interpret as correlation the angle between two variables when at least one of them has a low communality. Using communality prevents potential biases that may arise from directly interpreting numerical and graphical results yielded by the PCA. These results show that outcomes from normalized PCA can be easily interpreted without additional complicated calculations. From a visual inspection of the graph, we can see the groups of variables that are correlated, interpret the principal components and name them.
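
A minimal Python sketch (assuming NumPy) shows how loadings and communalities relate to the eigenvalues and eigenvectors of the correlation matrix. The data are simulated under an assumed two-factor structure and are not those of Table 1, so the numbers will not match Tables 4 and 5.

```python
import numpy as np

# Hypothetical data with two latent factors (NOT Table 1).
rng = np.random.default_rng(42)
n = 200
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([f1, f1, f1, f2, f2]) + 0.3 * rng.normal(size=(n, 5))

Z = (X - X.mean(axis=0)) / X.std(axis=0)         # standardized variables
R = (Z.T @ Z) / n                                # correlation matrix
eigenvalues, eigenvectors = np.linalg.eigh(R)    # ascending order
order = np.argsort(eigenvalues)[::-1]            # reorder from largest to smallest
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Loadings: eigenvector weights rescaled by the square root of the eigenvalue.
loadings = eigenvectors * np.sqrt(eigenvalues)   # column j = loadings on component j
scores = Z @ eigenvectors                        # subject scores on every component

# A loading is the correlation between a variable and a component.
check = np.corrcoef(Z[:, 0], scores[:, 0])[0, 1]
print(round(check, 6), round(loadings[0, 0], 6))

# Communality: sum of squared loadings over the retained components (here two).
communality = (loadings[:, :2] ** 2).sum(axis=1)
print(np.round(communality, 3))
```

Summing the squared loadings over all five components instead of two gives exactly 1 for every variable, which is why communality can be read as a proportion of variance.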

Component/Factor Scores

A useful byproduct of PCA is component or factor scores. Factor scores are coordinates of subjects (individuals) on each component. They indicate where a subject stands on the retained component. Factor scores are computed as weighted values on the observed variables. Results for our dataset are reported in Table 6. Factor scores can be used to plot a reduced representation of subjects. This is displayed in Figure 3. How do we interpret the position of points on this diagram? Recall that this graph is a projection. As such, some distances could be spurious. To distinguish wrong projections from real ones and better interpret the plot, we need to use what is called “the quality of representation” of subjects. This is computed as the square of the cosine of the angle between a subject $s_i$ and a component $z_j$, following the formula:

$$\cos^2(s_i, z_j) = \frac{z_{ij}^2}{\lVert s_i \rVert^2}$$

where $z_{ij}$ is the coordinate (factor score) of subject $s_i$ on component $z_j$. Cos2 is interpreted as a measure of goodness-of-fit of the projection of a subject on a given component. Notice that in the equation, $\lVert s_i \rVert$ is the distance of subject $s_i$ from the origin. It measures how far the subject is from the center. So if cos2 is close to 1, the component extracted reproduces most of the original behavior of the subject. Since the components are orthogonal, the quality of representation of a subject in a given subspace of components is the sum of the associated cos2. This notion is similar to the concept of communality previously defined for variables.
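
The quality of representation can be computed directly from the factor scores. The sketch below (Python with NumPy) uses 20 simulated subjects under an assumed two-factor structure, not the data of Table 1, and the 80% cut-off is only an illustrative threshold.

```python
import numpy as np

# Hypothetical data: 20 subjects, 5 variables driven by two latent factors (NOT Table 1).
rng = np.random.default_rng(1)
n = 20
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([f1, f1, f1, f2, f2]) + 0.3 * rng.normal(size=(n, 5))

Z = (X - X.mean(axis=0)) / X.std(axis=0)           # standardized data
eigenvalues, eigenvectors = np.linalg.eigh(Z.T @ Z / n)
eigenvectors = eigenvectors[:, ::-1]               # largest eigenvalue first

scores = Z @ eigenvectors                          # coordinates of subjects on all components
dist2 = (Z ** 2).sum(axis=1)                       # squared distance of each subject from the origin
cos2 = scores ** 2 / dist2[:, None]                # cos2[i, j] for subject i and component j

quality = cos2[:, :2].sum(axis=1)                  # quality of representation on the first two components
print(np.round(quality, 3))

poorly = np.where(quality < 0.8)[0] + 1            # 1-based numbers of subjects below the cut-off
print(poorly)
```

Over all five components the cos2 values of a subject add up to 1, so the quality on the first two components is the share of that subject's distance from the center that survives the projection.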

In Table 6 we also report these statistics. As can be seen, the two components retained explain more than 80% of the behavior of subjects, except for subjects 6 and 7. Now that we are confident that almost all the subjects are well represented, we can interpret the graph. Thus, we can tell that subjects located on the right side and having larger coordinates on the first component, i.e. 1, 9, 6, 3 and 5, have values of X1, X2 and X3 greater than the average. Those located on the left side and having smaller coordinates on the first axis, i.e. 20, 19, 18, 16, 12, 11 and 10, record lower values for these variables. On the other hand, subjects 15 and 17 are characterized by the highest values for variables X4 and X5, while subjects 8 and 13 record the lowest values for these variables.

Very often a small number of subjects can determine the direction of principal components. This is because PCA uses the notions of mean, variance and correlation, and it is well known that these statistics are influenced by outliers or atypical observations in the data. To detect these atypical subjects, we define the notion of “contribution”, which measures how much a subject contributes to the variance of a component. Contributions (CTR) are computed as follows:

$$CTR_j(s_i) = \frac{z_{ij}^2}{n\,\lambda_j} \times 100$$

where $z_{ij}$ is the factor score of subject $s_i$ on component j, $n$ is the number of subjects and $\lambda_j$ is the eigenvalue of component j. Contributions are reported in the last two columns of Table 6. Subject 4 contributes greatly to the first component with a contribution of 16.97%. This indicates that subject 4 alone explains 16.97% of the variance of the first component. Therefore, this subject takes higher values for X1, X2 and X3. This can be easily verified from the original Table 1. Regarding the second component, over 25% of the variance of the data accounted for by this component is explained by subjects 15 and 17. These subjects exhibit high values for variables X4 and X5.
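
Contributions follow from the factor scores and the eigenvalues. The Python sketch below (assuming NumPy) takes the contribution of a subject to a component to be its squared score divided by n times the eigenvalue, expressed as a percentage, with variances computed by dividing by n. The 20 subjects are simulated, not those of Table 1.

```python
import numpy as np

# Hypothetical data: 20 subjects, 5 variables driven by two latent factors (NOT Table 1).
rng = np.random.default_rng(1)
n = 20
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([f1, f1, f1, f2, f2]) + 0.3 * rng.normal(size=(n, 5))

Z = (X - X.mean(axis=0)) / X.std(axis=0)               # standardized data
eigenvalues, eigenvectors = np.linalg.eigh(Z.T @ Z / n)
eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]
scores = Z @ eigenvectors                              # factor scores

# ctr[i, j]: percent of the variance of component j due to subject i.
ctr = 100.0 * scores ** 2 / (n * eigenvalues)

print(np.round(ctr[:, :2], 2))           # contributions to the first two components
print(ctr[:, :2].sum(axis=0))            # each column sums to 100
print(int(np.argmax(ctr[:, 0])) + 1)     # 1-based number of the subject weighing most on component 1
```

With 20 subjects, an evenly spread component would give each subject a contribution of 5%, so values far above that level flag subjects worth checking as possible outliers.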

The principal components obtained from PCA can be used in subsequent analyses (regressions, poverty analysis, classification…). For example, in linear regression models, the presence of correlated variables poses the well-known econometric problem of multicollinearity, which makes regression coefficients unstable. This problem is avoided when using the principal components, which are orthogonal to one another.
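
The orthogonality of the components is easy to verify numerically. This Python sketch (assuming NumPy) uses simulated data with an assumed two-factor structure, not the data of Table 1, and compares the correlation among the original variables with the correlation between the first two component scores.

```python
import numpy as np

# Hypothetical data with two latent factors (NOT Table 1).
rng = np.random.default_rng(42)
n = 200
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([f1, f1, f1, f2, f2]) + 0.3 * rng.normal(size=(n, 5))

Z = (X - X.mean(axis=0)) / X.std(axis=0)           # standardized data
eigenvalues, eigenvectors = np.linalg.eigh(Z.T @ Z / n)
eigenvectors = eigenvectors[:, ::-1]               # largest eigenvalue first
scores2 = Z @ eigenvectors[:, :2]                  # scores on the first two components

corr_X = np.corrcoef(X, rowvar=False)              # original variables: correlated within groups
corr_scores = np.corrcoef(scores2, rowvar=False)   # component scores: uncorrelated

print(np.round(corr_X, 2))
print(np.round(corr_scores, 6))                    # identity matrix, up to rounding
```

Using the two score columns as regressors in place of the five original variables therefore removes the collinearity among predictors, at the cost of coefficients that refer to components and not to the original variables.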

THE END......

All content copyright of Dr Jeremy Jackson - 2014. Douglas College, Vancouver, BC.