Data Analysis On-line: Correlation, Outlier Detection, Principal Component Analysis, and Cluster Analysis
(last updated 1/30/2009)
Welcome to Data Analysis On-line! The purpose of this software is to serve as
both a learning tool and a means to enhance your business in credit risk and
marketing. Although this tool was never intended to replace more robust PC-based
software such as SAS, RATS, LIMDEP, and SPSS, its use is free and it will provide sound
statistical analysis using correlation, outlier detection, principal component analysis, and
cluster analysis. What makes this tool different from the regression tool I also
provide is that there is no analysis of a target or "dependent" variable. Therefore, we
concentrate more on understanding the data from an outlier and distributional
standpoint. This is the recommended tool for preliminary analysis before you move on to
logistic regression - especially the principal component and clustering sections.
Error Messages...
If you experience error messages, the most likely cause is the limited memory provided
to me by my hosting service. These resources can change by the minute based upon
the load on the server. Please email me if you receive any error messages.
If you do receive one, try rerunning your analysis. To minimize
these problems, please keep your number of observations to 10,000 or fewer and the
number of variables to 50 or fewer. All comments and suggestions are appreciated.
Step 1 - Read in Your Data
Missing Data...
If some of your data are missing, you should use some method
beforehand to provide proxy values such as the mean, median, or mode. If you want
this software to perform this task, code missing values as -999. The routine will
automatically substitute the mean value of the nonmissing data for observations
that have missing values. This is applied to ALL variables coded using -999,
including categorical variables.
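The substitution described above can be sketched in a few lines of Python. This is only an illustration of mean imputation for one variable, not the tool's actual routine; the function name and the sample values are hypothetical.

```python
from statistics import mean

MISSING = -999  # sentinel value described in the text


def impute_with_mean(values, missing=MISSING):
    """Replace sentinel-coded entries with the mean of the non-missing entries."""
    observed = [v for v in values if v != missing]
    if not observed:
        raise ValueError("every value is missing; there is no mean to substitute")
    fill = mean(observed)
    return [fill if v == missing else v for v in values]


# Hypothetical variable with two missing observations.
income = [40.0, -999, 60.0, 50.0, -999]
print(impute_with_mean(income))
```

Note that for a categorical variable the substituted mean is generally not one of the category codes, which is a good reason to supply your own proxy values for such variables beforehand.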
Upload your file below in tab-delimited format. Make sure your
first record is a header record identifying the names of your attributes. Do not
use any spaces in the header record. See the test file cluster.txt for an example
of the correct format. Use the browse button to select your file. To begin, press
the button "Read in File" and follow the recommended STEPS in sequence to look more closely at your data.
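If you want to check a file before uploading it, a minimal Python sketch of the expected layout (tab-delimited, one header record, no spaces in the names) might look like the following. The function name and the sample contents are hypothetical and are not taken from cluster.txt.

```python
import csv
import io


def read_tab_delimited(handle):
    """Return (header, rows) from a tab-delimited file object with a header record."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    bad = [name for name in header if " " in name]
    if bad:
        raise ValueError(f"header names contain spaces: {bad}")
    # Every remaining record is assumed to be numeric.
    rows = [[float(cell) for cell in record] for record in reader if record]
    return header, rows


# Hypothetical two-variable file held in memory for the example.
sample = io.StringIO("age\tincome\n34\t52000\n51\t-999\n")
header, rows = read_tab_delimited(sample)
print(header, rows)
```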
The following summary shows how many missing observations were coded as -999
in your data, with their percentages in parentheses. These values will be recoded
to the mean values.
Step 2 - Basic Correlation, Histograms, and Outlier Analysis
Pick from the following list of variables for histograms (frequency counts) of your
variables.
Examine each variable to determine whether it was read in properly and whether there
are outliers (extremely high or low values). You may then want to resubmit
the input data with caps (set to the 99th percentile) to remove these potential outliers, as they
will affect the accuracy of many multivariate techniques
such as regression and clustering.
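Capping at the 99th percentile can be done before resubmitting the data. The sketch below is one way to do it in Python; the percentile definition (linear interpolation between order statistics) is an assumption and may differ slightly from the one this tool uses, and the sample data are hypothetical.

```python
def percentile(values, p):
    """Percentile by linear interpolation between order statistics (p in 0..100)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no data")
    position = (len(ordered) - 1) * p / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def cap_at_percentile(values, p=99):
    """Clip values above the p-th percentile down to that percentile."""
    ceiling = percentile(values, p)
    return [min(v, ceiling) for v in values]


# Hypothetical variable: 100 ordinary values plus one extreme value.
balances = list(range(1, 101)) + [10_000]
capped = cap_at_percentile(balances)
print(max(balances), "->", max(capped))
```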
MINIMUM VALUE
MAXIMUM VALUE
MEAN VALUE
STANDARD DEVIATION
KURTOSIS
99TH PERCENTILE
75TH PERCENTILE
50TH PERCENTILE
25TH PERCENTILE
5TH PERCENTILE
MEDIAN
MID MEAN (25th-75th Percentile)
HARMONIC MEAN
TRIMMED (5) MEAN
WINSORIZED (5) MEAN
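The less familiar measures in this list (mid mean, trimmed mean, winsorized mean) can be sketched as follows. The code assumes that "(5)" means 5% of the observations in each tail and that the mid mean is the mean of the middle half of the data; the tool's exact definitions are not stated here, so treat this as an illustration only.

```python
from statistics import fmean, harmonic_mean, median


def trimmed_mean(values, fraction=0.05):
    """Mean after dropping `fraction` of the observations from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return fmean(ordered[k:len(ordered) - k])


def winsorized_mean(values, fraction=0.05):
    """Mean after replacing each tail's observations with the nearest kept value."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    if k == 0:
        return fmean(ordered)
    core = ordered[k:len(ordered) - k]
    return fmean([core[0]] * k + core + [core[-1]] * k)


def midmean(values):
    """Mean of the middle half of the data (a 25% trimmed mean)."""
    return trimmed_mean(values, 0.25)


# Hypothetical data: the values 1..19 plus one extreme value.
data = [float(x) for x in range(1, 20)] + [1000.0]
summary = {
    "mean": fmean(data),
    "median": median(data),
    "midmean": midmean(data),
    "harmonic_mean": harmonic_mean(data),
    "trimmed_5_mean": trimmed_mean(data),
    "winsorized_5_mean": winsorized_mean(data),
}
for name, value in summary.items():
    print(f"{name:>18}: {value:.3f}")
```

In this example the plain mean is pulled far upward by the single extreme value, while the median, mid mean, trimmed mean, and winsorized mean all stay near the center of the ordinary values.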
OUTLIER DETECTION - As certain multivariate techniques are sensitive to outliers,
consider dropping records that have a high number of variables with values greater
than the 99th percentile. Rather than eliminating records, you may decide to cap
their values at the 99th percentile or at some other predetermined value. However, this
process should affect no more than 1%, or at most 2%, of your observations.
Note that if a variable is categorical or binary, its upper value may exceed the
99th percentile and be classified as an outlier in the list
below. This may be inappropriate, so outlier detection is typically applied only
to variables with continuous numerical ranges.
OBS (# Variables >99th Percentile)
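A list like the one above can be approximated with the following Python sketch, which counts, for each observation, how many variables exceed their own 99th percentile. The percentile definition (linear interpolation) and the variable names are assumptions made for the example.

```python
def percentile(values, p):
    """Percentile by linear interpolation between order statistics (p in 0..100)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def outlier_counts(columns, p=99):
    """For each observation, count the variables whose value exceeds that variable's p-th percentile."""
    thresholds = {name: percentile(col, p) for name, col in columns.items()}
    n_obs = len(next(iter(columns.values())))
    return [
        sum(columns[name][i] > thresholds[name] for name in columns)
        for i in range(n_obs)
    ]


# Hypothetical data: the last observation is extreme on both variables.
columns = {
    "balance": list(range(1, 101)) + [10_000],
    "utilization": list(range(1, 101)) + [500],
}
counts = outlier_counts(columns)
flagged = [(obs, n) for obs, n in enumerate(counts, start=1) if n > 0]
print(flagged)
```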
Examine pairwise correlations to get an idea of which variables are too highly correlated
with one another, i.e. providing duplicate or overlapping information. Using two variables
whose correlation exceeds .7 in absolute value in a regression or cluster analysis
tends to adversely affect your results.
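The screening rule above is easy to reproduce. The sketch below computes Pearson correlations for every pair of variables and lists the pairs beyond the .7 threshold; the variable names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def highly_correlated_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


columns = {
    "income": [20, 30, 40, 50, 60],
    "spend": [22, 29, 41, 52, 58],   # tracks income closely
    "tenure": [5, 1, 4, 2, 3],       # only weakly related to the others
}
flagged = highly_correlated_pairs(columns)
print(flagged)
```

Here only the income/spend pair is flagged, so one of those two variables would be a candidate to drop or to combine with the other.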
Examine the Variance Inflation Factors (VIFs) - a single measure, per variable, of too much
correlation with the other variables. VIFs greater than 5 or 10 indicate that the data are highly
correlated.
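For reference, the VIF of a variable is 1/(1 - R^2), where R^2 comes from regressing that variable on all the others. An equivalent shortcut is to read the VIFs off the diagonal of the inverse correlation matrix, which is what the following NumPy sketch does. It is an illustration with simulated data, not this tool's code.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column of X: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Simulated data: c is nearly a linear combination of a and b.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)

vifs_redundant = variance_inflation_factors(np.column_stack([a, b, c]))
vifs_independent = variance_inflation_factors(np.column_stack([a, b]))
print(vifs_redundant.round(1), vifs_independent.round(3))
```

With the redundant variable included, every VIF is far above 10; with only the two independent variables, the VIFs stay close to 1.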
Step 3 - Select Variables for Principal Components (PCA) and Cluster Analysis
Another way to look at your data is to use Principal Component Analysis. PCA is
a type of factor analysis in which the data are transformed by applying
weights such that the resulting components are "decorrelated" from each other using
orthogonal rotations. The number of principal components is equal to the number
of variables, with each principal component "explaining" a certain portion of the
variation in the data. If all principal components are used, then you are explaining all the
variation in your data. For regression analysis, you can drop the high-order
components that contribute little to the explained variance and use the remaining
principal components as regressors.
These components are perfectly uncorrelated with each other. However, explaining
the regression in terms of the original variables may sometimes be a problem - for
example, in credit scoring. Regardless, PCA is an approach that helps you
better understand your data.
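The transformation described above can be sketched with NumPy as an eigen-decomposition of the correlation matrix. This is a generic PCA on standardized data, offered as an assumption about the general approach rather than a description of this tool's internals; the simulated data are hypothetical.

```python
import numpy as np


def principal_components(X):
    """Standardize X, then return (eigenvalues, eigenvectors, scores), largest eigenvalue first."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(Z, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)   # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    scores = Z @ eigenvectors                          # the weights applied to each observation
    return eigenvalues, eigenvectors, scores


# Simulated data: the first two variables share a common source, the third is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=300)
X = np.column_stack([
    base + rng.normal(scale=0.3, size=300),
    base + rng.normal(scale=0.3, size=300),
    rng.normal(size=300),
])
eigenvalues, eigenvectors, scores = principal_components(X)
print(eigenvalues.round(3))
print(np.corrcoef(scores, rowvar=False).round(6))      # off-diagonal entries are zero
```

The eigenvalues sum to the number of variables, and the component scores are uncorrelated with one another, which is what makes them convenient as regressors.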
For Cluster Analysis, some people use the principal components rather than the original
data. Again, there may be difficulty in explaining the results in the profiling
part. However, you could use this website to calculate the principal components for
later use in a regression or cluster solution.
Select Variables for the Analysis
Eigenvalues (Cumulative Explained Variance of PCs) - Look for those eigenvalues
that make only marginal contributions to the cumulative explained variance. Sometimes it helps to see
these graphically as shown below. A common rule of thumb is that any
principal component with an eigenvalue <1 is a candidate for removal, as it contributes very little
to the overall explained variance.
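To make the arithmetic concrete, the sketch below turns a set of eigenvalues into explained-variance shares and cumulative shares, and marks the components that fall under the eigenvalue <1 rule of thumb. The eigenvalues are hypothetical (four standardized variables, so they sum to 4).

```python
from itertools import accumulate


def explained_variance(eigenvalues):
    """Return one (component, eigenvalue, share, cumulative share) row per component."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    shares = [value / total for value in ordered]
    return [
        (i + 1, value, share, cumulative)
        for i, (value, share, cumulative) in enumerate(zip(ordered, shares, accumulate(shares)))
    ]


# Hypothetical eigenvalues for four standardized variables.
table = explained_variance([2.5, 1.2, 0.2, 0.1])
for component, value, share, cumulative in table:
    note = "" if value >= 1 else "  <- eigenvalue below 1"
    print(f"PC{component}: eigenvalue={value:.2f} share={share:.3f} cumulative={cumulative:.3f}{note}")
```

In this example the first two components already account for about 92.5% of the variance, and the last two are the candidates for removal.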
This table shows the eigenvectors (loadings or correlations with the original variables).
Often loadings of .4 or above in absolute value are statistically relevant.
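As a point of reference, when PCA is run on a correlation matrix, the correlation between a variable and a component equals the eigenvector entry multiplied by the square root of that component's eigenvalue. Whether the table above is scaled this way is not stated, so the following NumPy sketch is only an illustration of that relationship, using a hypothetical correlation matrix.

```python
import numpy as np


def loadings_from_correlation(corr):
    """Loadings (variable-component correlations): eigenvectors scaled by sqrt(eigenvalue)."""
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    return eigenvectors * np.sqrt(np.clip(eigenvalues, 0.0, None))


# Hypothetical correlation matrix for three variables.
corr = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
loadings = loadings_from_correlation(corr)
relevant = np.abs(loadings) >= 0.4   # the .4 rule of thumb from the text
print(loadings.round(3))
print(relevant)
```

The sign of each column is arbitrary (an eigenvector and its negative are equally valid), which is why the rule of thumb is best applied to absolute values.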
To see the full results of the Principal Component Analysis in a downloadable text
file, press the following button.
One useful feature is that we can return a scrubbed version of the original variables
if we decide to drop those principal components that contribute little to the overall
explained variance. To do this, please select the fraction of explained variance
you wish to retain. For example, if you choose .90, then the program will drop
those components beyond a cumulative explained variance of .90 (maybe just
the last few components). As principal components are just a set of weights that
transform each observation, the program simply sets the weights of the high-order components
you wish to drop to zero and reverses the transformation to get you back to
the original variables. Of course, because you dropped some components, the values
will not be the same, but they will resemble your original variables. The maximum
threshold is 1, and the minimum is .50.
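The scrubbing procedure can be sketched with NumPy as follows. The rule used here for choosing how many components to keep (the smallest number whose cumulative explained variance reaches the requested fraction) is one reading of the description above and is an assumption, as are the function name and the simulated data.

```python
import numpy as np


def scrub_with_pca(X, retain=0.90):
    """Zero out the high-order component scores and reverse the transformation."""
    if not 0.5 <= retain <= 1.0:
        raise ValueError("retain must be between .50 and 1")
    mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mean) / std
    eigenvalues, V = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, V = eigenvalues[order], V[:, order]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Smallest number of components whose cumulative share reaches `retain`.
    keep = min(int(np.searchsorted(cumulative, retain)) + 1, len(eigenvalues))
    scores = Z @ V
    scores[:, keep:] = 0.0                      # drop the high-order components
    return (scores @ V.T) * std + mean, keep    # back to the original units


# Simulated data: the second variable nearly duplicates the first.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + rng.normal(scale=0.1, size=200), rng.normal(size=200)])

scrubbed, kept = scrub_with_pca(X, retain=0.90)
full, kept_all = scrub_with_pca(X, retain=1.0)
print(kept, np.abs(scrubbed - X).max().round(3))
```

Retaining all of the variance returns the original data unchanged, while retaining .90 keeps two of the three components here and returns values that are close to, but not identical to, the originals.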
Make your selection for this part here --->
Step 4 - Cluster Analysis
Although other multivariate techniques (regression, CHAID, CART) are part
"art" and part "science", cluster analysis might be more heavily weighted towards
the artist's paintbrush. If ten different analysts are asked to develop a clustering
scheme for a product or service, you may end up with ten different cluster solutions,
none of them necessarily wrong. The steps to building an effective cluster solution involve:
(1) Defining its use or context
(2) Data collection
(3) Data standardization - done automatically in this tool
(4) Outlier identification (shown earlier) and removal
(5) Correlation analysis - seen through pairwise correlations and PCA
(6) Variable reduction - above using PCA and judgement
(7) Defining the number of clusters - provided below
(8) Cluster estimation - also provided here
(9) Cluster profiling (see green table below)
Usually steps 6-9 take multiple iterations until some meaningful cluster schemes are arrived at,
as different solutions are developed and evaluated in terms of membership size and profiles
(mean values of the original variables). For example, do not include a cluster
that has a very small number of observations or that you may have difficulty
explaining. Try a variety of numbers of clusters below, depending on your objective.
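Steps (3) and (8) can be illustrated with a short NumPy sketch. The clustering algorithm this tool uses is not stated, so k-means is swapped in here as a common choice; the deterministic farthest-point starting centers, the function names, and the simulated data are all assumptions made for the example.

```python
import numpy as np


def standardize(X):
    """Put every variable on a mean-0, standard-deviation-1 scale."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def kmeans(Z, k, iterations=100):
    """Plain k-means (Lloyd's algorithm) with farthest-point starting centers."""
    centers = [Z[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(Z - c, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iterations):
        distances = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each center to the mean of its members; keep it in place if it has none.
        updated = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels, centers


# Simulated data: two groups measured on very different scales (income, age).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[30_000, 25], scale=[2_000, 2], size=(50, 2))
group_b = rng.normal(loc=[80_000, 60], scale=[2_000, 2], size=(50, 2))
X = np.vstack([group_a, group_b])

labels, centers = kmeans(standardize(X), k=2)
print(np.bincount(labels))   # cluster counts
```

Standardizing first keeps the variable with the largest units (income here) from dominating the distance calculation. Rerunning with different values of k and comparing the cluster counts is the iteration described in steps 6-9.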
<==Select number of clusters (usually 3 to 20)
Cluster Membership
Cluster Counts
Use this table to see how your clusters differ from one another in terms of the
average values of the original variables. This is part of the profiling process,
where you might wish to explain what these clusters "look like" and make up catchy
marketing labels such as "technologically savvy", if that indeed is your goal.
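A profile table of this kind is simply the membership count and the mean of each original variable within each cluster. The sketch below shows one way to build it in Python; the variable names, rows, and cluster labels are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def profile_clusters(rows, labels, names):
    """Return {cluster: {"count": n, variable: mean, ...}} using the original variables."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: {
            "count": len(members),
            **{name: fmean(column) for name, column in zip(names, zip(*members))},
        }
        for label, members in sorted(groups.items())
    }


# Hypothetical observations and cluster assignments.
names = ["age", "income"]
rows = [[25, 30_000], [27, 32_000], [60, 80_000], [62, 84_000]]
labels = [0, 0, 1, 1]

profile = profile_clusters(rows, labels, names)
for cluster, stats in profile.items():
    print(cluster, stats)
```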