3.7.3 Choice of Weights With Principal Components
Principal component analysis is best performed on random variables whose standard deviations reflect their relative significance for an application. This is because principal component analysis depends upon both the correlations between random variables and the standard deviations of those random variables. If we were to change the standard deviations of a set of random variables but leave their correlations the same, this would change their principal components. In a sense, principal component analysis uses standard deviation as a metric of significance. If one random variable has a standard deviation that far exceeds the rest, that random variable will dominate the first eigenvector, and hence the first principal component.
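The dependence on standard deviations can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-variable example with correlation 0.5; the helper `first_eigenvector` and all numbers are illustrative and do not come from the text. It builds two covariance matrices that share one correlation matrix but differ in standard deviations, then compares their first eigenvectors.

```python
import numpy as np

def first_eigenvector(corr, std):
    """Unit eigenvector for the largest eigenvalue of the covariance matrix
    implied by a correlation matrix and a vector of standard deviations."""
    cov = np.outer(std, std) * corr          # cov_ij = std_i * std_j * corr_ij
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    v = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
    # Eigenvectors are defined up to sign; make the largest entry positive.
    return v if v[np.argmax(np.abs(v))] > 0 else -v

corr = np.array([[1.0, 0.5],
                 [0.5, 1.0]])                # hypothetical correlation matrix

v_equal = first_eigenvector(corr, np.array([1.0, 1.0]))    # equal standard deviations
v_skewed = first_eigenvector(corr, np.array([1.0, 10.0]))  # second variable's std is 10x

print("equal standard deviations: ", np.round(v_equal, 4))
print("skewed standard deviations:", np.round(v_skewed, 4))
```

With equal standard deviations the two variables share the first eigenvector evenly. Once the second variable's standard deviation is ten times larger, that variable carries almost all of the weight, even though the correlation is unchanged.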
Unfortunately, there may be no correspondence between a random variable’s standard deviation and its significance. Standard deviations depend upon the units in which a random variable is measured. Suppose a random variable reflects the time it takes for some event to occur. Measured in days, it has a standard deviation of 13.5. Measured in hours, its standard deviation is 324. Measured in minutes, it becomes 19,440. Certainly, the 19,440 standard deviation is no more significant than the 13.5 standard deviation, but principal component analysis will treat it as more significant!
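The effect of a change of units can be illustrated with simulated data. This is a sketch only: the second variable `other` (standard deviation 50, independent of the waiting time), the normal distributions, and the helper `time_loading` are assumptions made for illustration. The code measures how much weight the first principal component places on the time variable as its units change from days to hours to minutes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated waiting time in days with standard deviation 13.5 (as in the text),
# plus a hypothetical second variable with standard deviation 50.
time_days = rng.normal(loc=30.0, scale=13.5, size=n)
other = rng.normal(loc=0.0, scale=50.0, size=n)

def time_loading(time_values):
    """Absolute weight of the time variable in the first principal component."""
    cov = np.cov(np.vstack([time_values, other]))  # 2x2 sample covariance matrix
    _, eigvecs = np.linalg.eigh(cov)               # eigenvalues ascending
    return abs(eigvecs[0, -1])                     # row 0 is the time variable

for unit, factor in [("days", 1), ("hours", 24), ("minutes", 24 * 60)]:
    rescaled = time_days * factor
    print(unit,
          "std:", round(float(np.std(rescaled, ddof=1)), 1),
          "weight in first component:", round(float(time_loading(rescaled)), 3))
```

Under these assumptions, the time variable has little weight in the first component when measured in days and nearly all of the weight when measured in hours or minutes, although the underlying data are identical.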
If we use principal components only to orthogonalize a random vector, this will not be a problem. No information is lost. It will be a problem if principal components are discarded to form an approximation. In this case, information is lost. Before we discard principal components that appear “insignificant,” we should make sure that they truly are insignificant.
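The distinction between orthogonalizing and approximating can be sketched as follows. The covariance matrix, sample size, and variable names are hypothetical. The code projects centered data onto all principal components and reconstructs it exactly, then repeats the reconstruction after discarding the component with the smallest variance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical covariance matrix for a three-dimensional random vector.
cov = np.array([[4.0, 1.2, 0.4],
                [1.2, 1.0, 0.1],
                [0.4, 0.1, 0.25]])
x = rng.multivariate_normal(np.zeros(3), cov, size=2000)
x = x - x.mean(axis=0)                       # center the sample

eigvals, eigvecs = np.linalg.eigh(np.cov(x, rowvar=False))
order = np.argsort(eigvals)[::-1]            # sort components by descending variance
eigvecs = eigvecs[:, order]

scores = x @ eigvecs                         # all three principal components
full = scores @ eigvecs.T                    # reconstruction from every component
truncated = scores[:, :2] @ eigvecs[:, :2].T # reconstruction without the last component

full_error = np.max(np.abs(full - x))        # rounding error only
trunc_error = np.mean((truncated - x) ** 2)  # information lost by discarding

print("max error, all components kept:", full_error)
print("mean squared error, one discarded:", trunc_error)
```

Keeping every component reproduces the data up to floating-point rounding, while discarding one leaves a nonzero error. Whether that error matters depends on what the discarded component represents in the application, not only on its variance.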
There are various solutions to this problem. We might insist that all random variables be measured in the same units, but this is not always feasible. If one random variable represents temperature and another represents volume, these are fundamentally different quantities. Also, identical units do not necessarily correspond to identical significance. Suppose we are analyzing blood samples for lead, and we have a random variable for each component of the blood. All components are measured in parts per million (ppm). Measured in ppm, the standard deviation of lead will be trivial compared to standard deviations for other constituents of the blood. Yet, the lead component is the most important random variable!
Alternatively, we might apply principal component analysis to normalized random variables:
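$$X_i^* = \frac{X_i - \mu_i}{\sigma_i} \qquad [3.69]$$
where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of $X_i$.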
With this approach, we effectively apply principal component analysis to the random variables’ correlation matrix. This represents a different weighting from that obtained by measuring all random variables in identical units, but not necessarily a better one.
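The link to the correlation matrix can be verified directly. In this sketch, normalization is taken to mean subtracting each variable's sample mean and dividing by its sample standard deviation, which is an assumption about the intended definition; the data and the rescaling factor are hypothetical. The code confirms that the covariance matrix of the normalized variables equals the correlation matrix of the originals, and that this matrix is unaffected by a change of units.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: two variables with very different standard deviations.
x = rng.multivariate_normal([0.0, 0.0], [[1.0, 6.0], [6.0, 100.0]], size=5000)

# Normalize each column: subtract the sample mean, divide by the sample std.
z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

cov_z = np.cov(z, rowvar=False)              # covariance of normalized variables
corr_x = np.corrcoef(x, rowvar=False)        # correlation of original variables

# Change the units of the second variable (for example, days to minutes).
x_rescaled = x * np.array([1.0, 24.0 * 60.0])
corr_rescaled = np.corrcoef(x_rescaled, rowvar=False)

print(np.allclose(cov_z, corr_x))            # normalized covariance is the correlation
print(np.allclose(corr_rescaled, corr_x))    # unit change leaves it unchanged
```

Because the correlation matrix does not change with units, principal components computed from it give every variable unit variance and hence equal weight, which is itself a weighting choice.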
Any solution may be reasonable in certain contexts and unreasonable in others. Each one weights the random variables in some manner. There is no objective way to assign weights, just as there is no objective way to assign “significance.” Weights and “significance” can and should vary from one application to another. When we use principal components to reduce the dimensionality of a random vector, there is subjectivity in the process.