PCA and BSP
Principal Components Analysis and Bivariate Scatter Plots
PCA and BSP are two programs bundled in one workbook because they are often used together.
PCA (Principal Components Analysis) is a well-known technique in multivariate data analysis. PCA aims at reducing dimensionality. Statistical objects, originally described by many attributes (columns), can be reduced to a smaller set of derived attributes, so that the complexity is diminished and patterns will be more easily recognized.
BSP (Bivariate Scatter Plots) displays the objects of a statistical population in bivariate scatter plots. By examining these plots, the user may detect patterns or tendencies that exist in the population. BSP may be applied both to the original data and to data that have been reduced by means of PCA.
The tool is accompanied by a short manual and is published in the form of a normal Excel workbook with macros. In the current version, the code is protected by a password and thus not visible. Both the Excel file and the PDF of the manual can be downloaded at the bottom of this page.
PCA / BSP is completely free of charge for private use. It has been tested with Excel 2007 and Excel 2010, but the author does not provide any warranty or service. Comments are always welcome.
Several screenshots are shown below to illustrate how to work with the tool. For more details, see the manual.
With the PCA program, a Principal Components Analysis is carried out in two steps:
- Extract Factors: Based on the original data matrix, the program determines how the principal components (the factors) can be expressed in terms of the original attributes. As you may know, the factors can be described as linear composites of the original attributes.
- Factor Scores: Having examined the results of the first step, the user decides how many factors are important. Then he uses the program to calculate factor scores on the basis of the chosen factors.
Organization of the PCA user interface
The PCA form is made up of a general part and two subpages situated on the left (see picture below).
Subpage Explore may be used to explore input data and results of the program without having to close the form. There are buttons that let you leaf through the worksheets of the workbook, and there are also buttons available that let you scroll the selected worksheet.
Subpage PCA scores will be used to set the parameters for the second step of the PCA procedure.
The first step: extract
An input data matrix for PCA looks like the one in the picture below. There is a header with the names of the columns, the first of which should contain the ids of the data objects.
The next picture shows how the user establishes the connection with the data set. When the user has pressed the button ‘connect to pop & start PCA’, the results of the first step are written to a worksheet (see message at the bottom of the form).
The output of the first step consists of three parts.
First, there is a table showing how the components (factors) are derived from the original attributes (‘PCA table’). As you might know, every factor Pi can be expressed as a linear composite of the original attributes, which are listed in the first column of the table.
The numbers in the inner cells of the table are the weights of the original attributes within these linear composites. In our example, factor P1 is derived from the original attributes using the linear composite
P1 = 0.26267159 A1 + 0.35845527 A2 + … - 0.39949828 A11
Mathematically, the weights correspond to the elements of the eigenvectors of the population’s correlation matrix.
In the lower part of the table there is valuable information about the variance covered by each of the components. These variances correspond to the eigenvalues (latent roots) of the correlation matrix.
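The relationship between the weights, the variances and the correlation matrix can be reproduced outside Excel. The following is a minimal sketch in Python with NumPy, not the workbook's protected macro code: the function name `extract_factors` and the random demo data are assumptions made purely for illustration.

```python
import numpy as np

def extract_factors(data):
    """Eigen-decompose the correlation matrix of `data`.

    `data` has one row per object and one column per attribute.
    Returns the eigenvalues (variances) in descending order and the
    matching eigenvectors (weights) as columns.
    """
    corr = np.corrcoef(data, rowvar=False)
    # eigh is meant for symmetric matrices and returns ascending eigenvalues
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]

# Hypothetical demo data: 50 objects, 4 attributes, two of them correlated
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 4))
demo[:, 1] += demo[:, 0]

variances, weights = extract_factors(demo)
p1_weights = weights[:, 0]  # weights of the linear composite for factor P1
```

Because a correlation matrix has ones on its diagonal, the variances add up to the number of attributes. The sign of an eigenvector is arbitrary, so another program may report the weights of a factor with all signs flipped.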
The second part of the output is also a table, called the 'factor loadings table', or 'loadings table' for short (next picture). Its lower part shows the same information as the lower part of the PCA table.
The upper part contains the so-called factor loadings, i.e. the correlations of the original attributes with the components (factors) extracted by PCA. They are obtained by multiplying the factor weights, listed in the PCA table, by the square root of the corresponding variance.
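The rule "loading = weight times square root of the variance" can be checked numerically. The sketch below (Python with NumPy, using made-up random data rather than the example data set) computes the loadings this way and compares them with the directly computed correlations between attributes and components. All variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical data: 200 objects, 3 attributes, two of them correlated
rng = np.random.default_rng(1)
data = rng.normal(size=(200, 3))
data[:, 2] += 0.8 * data[:, 0]

# Weights and variances from the correlation matrix, largest variance first
corr = np.corrcoef(data, rowvar=False)
eigenvalues, weights = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, weights = eigenvalues[order], weights[:, order]

# Loadings: column k of the weights scaled by sqrt(variance of factor k)
loadings = weights * np.sqrt(eigenvalues)

# Cross-check: correlate each standardized attribute with each component
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
scores = z @ weights
direct = np.array([
    [np.corrcoef(z[:, j], scores[:, k])[0, 1] for k in range(3)]
    for j in range(3)
])
```

In this sketch `loadings` and `direct` agree up to rounding error, and the squared loadings of each factor sum to that factor's variance.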
The third part of the output is a scree plot (picture below). The scree plot shows the variances covered by the factors in a more illustrative way than a table does. Scree plots are often used as a basis for deciding how many factors to consider.
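The quantities behind a scree plot are easy to compute by hand: each variance divided by the total gives the share covered by a factor, and the running sum gives the share covered by the first k factors. The following Python sketch illustrates this with invented eigenvalues (they are not the values from the example). The helper names `variance_shares` and `components_needed` are assumptions, and picking factors by a cumulative target is only one of several possible decision rules.

```python
import numpy as np

def variance_shares(eigenvalues):
    """Return per-factor and cumulative shares of the total variance."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    share = eigenvalues / eigenvalues.sum()
    return share, np.cumsum(share)

def components_needed(eigenvalues, target):
    """Smallest number of leading factors whose cumulative share reaches `target`."""
    _, cumulative = variance_shares(eigenvalues)
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(cumulative))  # guard against rounding at target = 1.0

# Invented eigenvalues of a 4-attribute correlation matrix (they sum to 4)
example = [2.0, 1.0, 0.6, 0.4]
share, cumulative = variance_shares(example)
n_factors = components_needed(example, target=0.7)

# Crude text version of a scree plot
for i, s in enumerate(share, start=1):
    print(f"P{i} {'#' * int(round(s * 40))} {s:.1%}")
```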
The second step: factor scores
For the factor scores, the values of the original variables are z-standardized, i.e. rescaled to mean 0 and standard deviation 1. The scores are then obtained as linear composites of the standardized values, using the factor weights from the first step. Of the total set of components, only those chosen by the user (the principal components) will be calculated.
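The score calculation can be sketched in a few lines of Python with NumPy. This is an illustration under stated assumptions, not the tool's actual implementation: the function name `factor_scores` is hypothetical, the data are random, and the sample standard deviation (`ddof=1`) is assumed, since the text does not say which variant the tool uses.

```python
import numpy as np

def factor_scores(data, n_components):
    """Scores of the first `n_components` factors for every object (row)."""
    data = np.asarray(data, dtype=float)
    # z-standardize each attribute: mean 0, standard deviation 1
    # (ddof=1 is an assumption; the tool may use the population formula)
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(data, rowvar=False)
    eigenvalues, weights = np.linalg.eigh(corr)
    # keep only the factors with the largest variances
    order = np.argsort(eigenvalues)[::-1][:n_components]
    return z @ weights[:, order], eigenvalues[order]

# Hypothetical data: 100 objects, 5 attributes, some of them correlated
rng = np.random.default_rng(2)
demo = rng.normal(size=(100, 5))
demo[:, 1] += demo[:, 0]
demo[:, 3] -= 0.5 * demo[:, 0]

scores, variances = factor_scores(demo, n_components=3)
```

With this standardization, each column of `scores` has mean 0 and a variance equal to the corresponding eigenvalue, and the columns are mutually uncorrelated.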
Suppose that the user has decided on the 3-factor solution. He will then use the PCA scores subpage of the form to set the parameters for the scores. He will change the number of components from 11 (the total number of factors) to 3, which covers 77.6 % of the total variance. Then he will press the button “Show PCA scores” on the right side of the form. The message at the bottom will tell him that the results have been written to a worksheet (see picture below).
On the worksheet, the user will find the following table, containing the factor scores for the three most important factors:
Having pressed the Start BSP button, the user will see the form shown on the right of the picture below. He will then use the mouse to select the range where the data matrix resides.
When the button “load data matrix” has been pressed and the data set has been loaded from the worksheet, the two combo boxes in the middle of the form will be filled with the names of the attributes. The scatter plot will be displayed on the left of the form, showing the data of the first two columns of the data matrix (see next picture).