Data Analysis On-line: Correlation, Outlier Detection,
Principal Component Analysis, and Cluster Analysis

(last updated 1/30/2009)


Welcome to Data Analysis On-line! The purpose of this software is to serve as both a learning tool and a means to enhance your business in credit risk and marketing. Although this tool was never intended to replace more robust PC-based software such as SAS, RATS, LIMDEP, and SPSS, its use is free and it will provide sound statistical analysis using correlation, outlier detection, principal component analysis, and cluster analysis. What makes this tool different from the regression tool I also provide is that there is no analysis of a target or "dependent" variable. Therefore, we concentrate more on understanding the data from an outlier and distributional standpoint. This is the recommended tool for preliminary analysis before you move on to logistic regression - especially the principal component and clustering sections.

Error Messages...

If you experience error messages, the most likely cause is the limited memory provided to me by my hosting service. These resources can change by the minute depending on the load on the server. Please email me if you receive any error messages. If you do receive one, try rerunning your analysis. To minimize these problems, please keep your number of observations to 10,000 or fewer and the number of variables to 50 or fewer. All comments and suggestions are appreciated.

Step 1 - Read in Your Data

Missing Data...

If some of your data is missing, you should use some method beforehand to provide proxy values such as means, medians, or modes. If you want this software to perform that task, code missing values as -999. The routine will automatically substitute the mean of the nonmissing values for every value coded as -999. This is applied to ALL variables coded using -999, including categorical variables.
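The substitution described above can be sketched as follows. This is a minimal illustration, not the tool's actual routine; the function name `impute_with_mean` and the sample values are hypothetical.

```python
from statistics import mean

MISSING = -999  # sentinel value described in the text


def impute_with_mean(column, missing=MISSING):
    """Replace sentinel-coded entries with the mean of the non-missing entries."""
    observed = [v for v in column if v != missing]
    if not observed:
        raise ValueError("column has no non-missing values")
    fill = mean(observed)
    return [fill if v == missing else v for v in column]


income = [40.0, -999, 60.0, 50.0, -999]  # hypothetical values
imputed = impute_with_mean(income)
print(imputed)
```

Note that for a categorical variable coded as 0/1, the substituted mean will generally be a fraction rather than a valid category, which is one reason to impute such variables yourself beforehand.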

Upload your file below in tab-delimited format. Make sure your first record is a header record identifying the names of your attributes. Do not use any spaces in the header record. See the test file cluster.txt as an example of the correct format. Use the browse button to select your file. To begin, press the button "Read in File" and sequentially follow the recommended STEPS in looking more closely at your data.
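If you want to check a file against these format rules before uploading, a sketch along the following lines may help. The function name and the sample records are hypothetical (they are not the contents of cluster.txt), and all columns are assumed to be numeric.

```python
import csv
import io


def read_tab_delimited(handle):
    """Return (header, rows) from a tab-delimited file whose first record is a header."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    bad = [name for name in header if " " in name]
    if bad:
        raise ValueError(f"header names contain spaces: {bad}")
    # Assumption: every data field is numeric.
    rows = [[float(cell) for cell in record] for record in reader if record]
    return header, rows


# Hypothetical sample in the expected layout: header record, then data records.
sample = "age\tincome\tbalance\n34\t52000\t1200\n51\t-999\t300\n"
header, rows = read_tab_delimited(io.StringIO(sample))
print(header)
print(rows)
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` object.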










The following summary shows how many observations were coded as missing (-999) for each variable in your data, with the percentages in parentheses. These values will be recoded to the variable's mean.
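A summary of this kind can be reproduced with a short sketch like the one below. The function name `missing_summary` and the sample data are hypothetical.

```python
MISSING = -999  # sentinel value described in the text


def missing_summary(header, rows, missing=MISSING):
    """Map each variable name to (count of missing values, percent of observations)."""
    n = len(rows)
    summary = {}
    for name, column in zip(header, zip(*rows)):
        count = sum(1 for value in column if value == missing)
        summary[name] = (count, 100.0 * count / n)
    return summary


header = ["age", "income"]  # hypothetical variables
rows = [[34, 52000], [51, -999], [45, 61000], [29, 48000]]
for name, (count, pct) in missing_summary(header, rows).items():
    print(f"{name}: {count} ({pct:.1f}%)")
```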



 
Step 2 - Basic Correlation, Histograms, and Outlier Analysis




Pick from the following list of variables for histograms (frequency counts) of your variables.








Examine each variable to determine whether it was read in properly and whether there are outliers (values that are extremely high or low). If there are, you may want to resubmit the input data with caps (set to the 99th percentile) to remove these potential outliers, as they will impact the accuracy of many multivariate techniques such as regression and clustering.
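Capping at the 99th percentile could be done before resubmitting the data, for example as sketched here. The percentile definition used (linear interpolation between order statistics) is an assumption; this page does not state which definition the tool uses, and other definitions give slightly different cutoffs.

```python
def percentile(values, pct):
    """Linear-interpolation percentile (one common definition; others exist)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    pos = (len(ordered) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def cap_at_percentile(values, pct=99):
    """Replace every value above the pct-th percentile with that percentile."""
    cap = percentile(values, pct)
    return [min(v, cap) for v in values]


balances = list(range(1, 100)) + [10_000]  # hypothetical: 1..99 plus one extreme value
capped = cap_at_percentile(balances)
print(max(balances), "->", max(capped))
```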

MINIMUM VALUE          MAXIMUM VALUE          MEAN VALUE

STANDARD DEVIATION     KURTOSIS               99TH PERCENTILE

75TH PERCENTILE        50TH PERCENTILE        25TH PERCENTILE

5TH PERCENTILE         MEDIAN                 MID MEAN (25th-75th Percentile)

HARMONIC MEAN          TRIMMED (5) MEAN       WINSORIZED (5) MEAN
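The less familiar statistics in this list can be illustrated with a short sketch. It assumes that "(5)" means 5% is trimmed or winsorized from each tail and that the mid mean is the average of the values between the 25th and 75th percentiles; the tool's exact definitions are not given on this page.

```python
from statistics import harmonic_mean, mean


def trimmed_mean(values, pct):
    """Mean after dropping the lowest and highest pct percent of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * pct / 100)
    return mean(ordered[k:len(ordered) - k])


def winsorized_mean(values, pct):
    """Mean after replacing each tail's pct percent with the nearest retained value."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(n * pct / 100)
    clipped = [ordered[k]] * k + ordered[k:n - k] + [ordered[n - k - 1]] * k
    return mean(clipped)


def midmean(values):
    """Mean of the middle half of the data (a 25% trimmed mean)."""
    return trimmed_mean(values, 25)


values = list(range(1, 20)) + [1000]  # hypothetical: 1..19 plus one extreme value
print("mean           ", mean(values))
print("trimmed (5)    ", trimmed_mean(values, 5))
print("winsorized (5) ", winsorized_mean(values, 5))
print("mid mean       ", midmean(values))
print("harmonic mean  ", harmonic_mean(values))
```

The ordinary mean is pulled far upward by the single extreme value, while the trimmed, winsorized, and mid means are not; a large gap between them is a quick sign of outliers.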
  


OUTLIER DETECTION - As certain multivariate techniques are sensitive to outliers, consider dropping records that have a high number of variables with values greater than the 99th percentile. Rather than eliminating records, you may decide to cap their values at the 99th percentile or at some other predetermined criterion. However, this process should affect no more than 1%, or at most 2%, of your observations. Note that if a variable is categorical or binary, its upper value may exceed the 99th percentile and be classified as an outlier in the list below. This may be inappropriate, so outlier detection is typically applied only to variables with continuous numerical ranges.

OBS  (# Variables >99th Percentile)
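A listing of this kind could be produced as sketched below: for each observation, count how many of its variables exceed that variable's 99th percentile. The percentile definition (linear interpolation) and the sample data are assumptions, and observations are numbered from 0 here.

```python
def percentile(values, pct):
    """Linear-interpolation percentile (one common definition; others exist)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def outlier_counts(rows, pct=99):
    """For each observation, count the variables whose value exceeds the cutoff."""
    cutoffs = [percentile(column, pct) for column in zip(*rows)]
    return [sum(v > c for v, c in zip(row, cutoffs)) for row in rows]


# Hypothetical data: 100 observations, 3 variables.
rows = [[i, i % 7, 2 * i] for i in range(100)]
counts = outlier_counts(rows)
flagged = [(obs, c) for obs, c in enumerate(counts) if c > 0]
print(flagged)
```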



Examine pairwise correlations to get an idea of which variables are too correlated with one another, i.e. providing duplicate or overlapping information. Using two variables whose correlation exceeds .7 in absolute value in a regression or cluster analysis tends to adversely impact your results.
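The screening rule above can be sketched as follows, using the Pearson correlation coefficient. The variable names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def high_correlations(data, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


data = {  # hypothetical variables
    "income": [1, 2, 3, 4, 5, 6],
    "spend": [2, 4, 5, 4, 5, 7],
    "age": [3, 1, 4, 1, 5, 2],
}
flagged = high_correlations(data)
print(flagged)
```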


Examine the Variance Inflation Factors (VIFs) - a single measure of excessive correlation among the variables. VIFs greater than 5 or 10 indicate that the data are highly correlated.
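One standard way to compute VIFs is to take the diagonal of the inverse of the correlation matrix, which equals 1/(1 - R^2) from regressing each variable on all the others. The sketch below uses that identity with a small Gauss-Jordan inversion; whether the tool computes VIFs this way is not stated here, and the data are hypothetical.

```python
from math import sqrt


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length columns."""
    n = len(columns[0])
    unit = []
    for col in columns:
        m = sum(col) / n
        dev = [v - m for v in col]
        norm = sqrt(sum(d * d for d in dev))
        unit.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in unit] for ci in unit]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(columns):
    """VIF of each variable = diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


income = [1, 2, 3, 4, 5, 6]  # hypothetical values
spend = [2, 4, 5, 4, 5, 7]
age = [3, 1, 4, 1, 5, 2]
print([round(v, 2) for v in vifs([income, spend, age])])
```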






 
Step 3 - Select Variables for Principal Components (PCA) and Cluster Analysis

Another way to look at your data is to use Principal Component Analysis. PCA is a type of factor analysis where a transformation is made to the data by applying weights such that the original variables are "decorrelated" from each other using orthogonal rotations. The number of principal components is equal to the number of variables, with each principal component "explaining" a certain portion of the variation in the data. If all principal components are used, then you are explaining all the variation in your data using all variables. For regression analysis, you can drop the higher-order components that do not greatly contribute to the explained variance and use the remaining principal components as regressors. These principal components would then be perfectly uncorrelated with each other. However, explaining the regression in terms of the original regression variables may sometimes be a problem - for example, in credit scoring. Regardless, PCA is an approach that helps you better understand your data.

For Cluster Analysis, some people use the principal components rather than the original data. Again, there may be difficulty in explaining the results in the profiling part. However, you could use this website to calculate the principal components for later use in a regression or cluster solution.

                     Select Variables for the Analysis
   
    
 

                                       


Eigenvalues (Cumulative Explained Variance of PCs) - Look for those eigenvalues that make only marginal contributions to the cumulative explained variance. Sometimes it helps to see these graphically, as shown below. Some rules of thumb suggest that any principal component with an eigenvalue <1 is a candidate for removal, because it contributes very little to the overall explained variance.
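To make the eigenvalue table concrete, the sketch below computes the eigenvalues of a small correlation matrix and their cumulative share of the variance. It assumes the PCA is run on the correlation matrix (which is what makes the "eigenvalue <1" rule meaningful) and uses a plain Jacobi rotation method so that it needs no external libraries; the matrix itself is hypothetical.

```python
import math


def jacobi_eigen(matrix, sweeps=100, tol=1e-18):
    """Eigenvalues and eigenvectors (as columns) of a symmetric matrix."""
    n = len(matrix)
    a = [list(row) for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of a and v
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(n):  # rotate rows p and q of a
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    return [a[i][i] for i in range(n)], v


def explained_variance(corr):
    """Rows of (component number, eigenvalue, cumulative fraction explained)."""
    values, _ = jacobi_eigen(corr)
    values = sorted(values, reverse=True)
    total = sum(values)
    table, running = [], 0.0
    for number, value in enumerate(values, start=1):
        running += value
        table.append((number, value, running / total))
    return table


corr = [[1.0, 0.8, 0.1],  # hypothetical correlation matrix
        [0.8, 1.0, 0.1],
        [0.1, 0.1, 1.0]]
table = explained_variance(corr)
for number, value, cumulative in table:
    print(f"PC{number}: eigenvalue={value:.4f} cumulative={cumulative:.4f}")
below_one = [number for number, value, _ in table if value < 1]
```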




This table shows the eigenvectors (loadings or correlations with the original variables). Often, loadings of .4 or above are statistically relevant.

To see the full results of the Principal Component Analysis in a downloadable text file, press the following button.

One useful feature is that we can return a scrubbed version of the original variables if we decide to drop those principal components that contribute little to the overall explained variance. To do this, please select the fraction of explained variance you wish to retain. For example, if you choose .90, then the program will drop those components that take the cumulative explained variance beyond .90 (maybe just the last few components). As principal components are just a set of weights that transform each observation, the program simply sets to zero the weights of the higher-order components you wish to drop and reverses the transformation to get you back to the original variables. Of course, because you dropped some components, the values will not be the same, but they will resemble your original variables. The maximum threshold is 1, and the minimum is .50. Make your selection for this part here.
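The scrubbing step can be sketched as below. This is an illustration under stated assumptions, not the tool's code: the data are standardized, the PCA is run on the correlation matrix, components are kept in order of eigenvalue until the cumulative fraction reaches the chosen threshold, and the function name `scrub` and the sample data are hypothetical.

```python
import math


def jacobi_eigen(matrix, sweeps=100, tol=1e-18):
    """Eigenvalues and eigenvectors (as columns) of a symmetric matrix."""
    n = len(matrix)
    a = [list(row) for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        if sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of a and v
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(n):  # rotate rows p and q of a
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    return [a[i][i] for i in range(n)], v


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scrub(rows, keep_fraction):
    """Rebuild the original variables from only the retained principal components."""
    n, p = len(rows), len(rows[0])
    cols = transpose(rows)
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1)) for c, m in zip(cols, means)]
    z = [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]
    corr = [[x / (n - 1) for x in row] for row in matmul(transpose(z), z)]
    values, vectors = jacobi_eigen(corr)
    order = sorted(range(p), key=lambda j: values[j], reverse=True)
    kept, running = [], 0.0
    for j in order:
        kept.append(j)
        running += values[j] / p  # eigenvalues of a correlation matrix sum to p
        if running >= keep_fraction - 1e-12:
            break
    basis = [[vectors[i][j] for j in kept] for i in range(p)]  # p x k weights
    scores = matmul(z, basis)                                  # n x k component scores
    z_hat = matmul(scores, transpose(basis))                   # back to p variables
    return [[zh * s + m for zh, s, m in zip(row, sds, means)] for row in z_hat]


rows = [[1, 2, 3], [2, 4, 1], [3, 5, 4], [4, 4, 1], [5, 5, 5], [6, 7, 2]]  # hypothetical
scrubbed = scrub(rows, 0.90)
for original, rebuilt in zip(rows, scrubbed):
    print(original, [round(x, 2) for x in rebuilt])
```

With a threshold of 1 every component is kept and the original values come back unchanged (up to rounding); with a lower threshold the rebuilt values differ from, but resemble, the originals.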





Step 4 - Cluster Analysis

Although other multivariate techniques (regression, CHAID, CART) consist of part "art" and part "science", cluster analysis might be more heavily weighted towards the artist's paintbrush. If ten different analysts are asked to develop a clustering scheme for a product or service, you may end up with ten different cluster solutions, with none of them necessarily being wrong. The steps to building an effective cluster solution are:

(1) Defining its use or context
(2) Data collection
(3) Data standardization - done automatically in this tool
(4) Outlier identification (shown earlier) and removal
(5) Correlation analysis - seen through pairwise correlations and PCA
(6) Variable reduction - above, using PCA and judgement
(7) Defining the number of clusters - provided below
(8) Cluster estimation - also provided here
(9) Cluster profiling (see green table below)

Usually steps 6-9 take multiple iterations, as different solutions are developed and evaluated in terms of membership size and profiles (mean values of the original variables), until some meaningful cluster schemes are arrived at. For example, do not include a cluster that has a very small number of observations or that you may have difficulty explaining. Try several different numbers of clusters below, depending on your objective.
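Steps 3 and 8 (standardization and cluster estimation) can be sketched as follows. This page does not name the clustering algorithm the tool uses, so k-means is assumed here purely for illustration, with a simple deterministic farthest-point start; the data are hypothetical.

```python
import math


def standardize(rows):
    """Convert each variable to z-scores (mean 0, sample standard deviation 1)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1)) for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100):
    """Basic k-means (assumed algorithm) with farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical data: (visits, spend) for six customers.
rows = [[1, 10], [2, 12], [1.5, 11], [8, 40], [9, 42], [8.5, 41]]
labels, centers = kmeans(standardize(rows), k=2)
counts = {j: labels.count(j) for j in sorted(set(labels))}
print(labels, counts)
```

Rerunning with different values of `k` and comparing the membership counts mirrors the iteration over steps 6-9 described above.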

Select number of clusters (usually 3 to 20)

Cluster Membership


Cluster Counts


Use this table to see how your clusters differ from one another in terms of the average values of the original variables. This is part of the profiling process, where you might wish to explain what these clusters "look like" and make up catchy marketing labels such as "technologically savvy", if that indeed is your goal.
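A profile table of this kind is simply the mean of each original (unstandardized) variable within each cluster, alongside the cluster's size. A minimal sketch, with hypothetical variable names, data, and cluster labels:

```python
from collections import defaultdict


def cluster_profiles(header, rows, labels):
    """Per-cluster observation count and mean of each original variable."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    profiles = {}
    for label, members in sorted(groups.items()):
        means = [sum(col) / len(members) for col in zip(*members)]
        profiles[label] = {"count": len(members), **dict(zip(header, means))}
    return profiles


header = ["visits", "spend"]  # hypothetical variables
rows = [[1, 10], [2, 12], [1.5, 11], [8, 40], [9, 42], [8.5, 41]]
labels = [0, 0, 0, 1, 1, 1]   # hypothetical cluster membership
profiles = cluster_profiles(header, rows, labels)
for label, profile in profiles.items():
    print(label, profile)
```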