Logistic Regression On-line:
Applications in Prospect Marketing & Risk Modeling
(last updated 3/6/2010)

Please send all comments / questions to me, Jeff Morrison, at my email address:  m_jeffer@bellsouth.net.

Welcome to Logistic Regression online! The purpose of this website is to serve as both a learning tool and a means to enhance your business in credit risk and marketing. Although this website was never intended to replace more robust PC-based software such as SAS, EVIEWS, RATS, LIMDEP, and SPSS, its use is free and it will produce statistically sound models for prospect marketing, credit risk, and other applications where the objective is to distinguish between two populations of interest. What makes this website unique is that it presents the entire model building process step by step from start to finish with helpful advice along the way, including many procedures that would have to be custom programmed even in advanced statistical software packages such as SAS or SPSS. A good reference for this material is 'Credit Risk Scorecards - Developing and Implementing Intelligent Credit Scoring', by Naeem Siddiqi, copyright 2006 by the SAS Institute.

Logistic Regression is a multivariate technique that attempts to quantify the relationship between a set of predictor variables and a target variable, sometimes called the dependent variable. The dependent variable is coded as either a "1" or "0" while the predictor variables can be continuous or binary in nature. The 0/1 coding of the dependent variable is very important. If you are doing a prospect (response or lookalike) model to add customers to your book of business, then you will want to code a response to a particular mailing campaign as a "1". A nonresponse would be coded as a "0". A logistic regression model coded this way will produce a score that reflects the probability of response. The higher the score, the greater the chance of a response to a mailing campaign.

On the other hand, credit scoring traditionally codes good payment behavior as a "1", and bad payment behavior as a "0". A logistic regression model coded this way will produce a score that reflects the probability of good payment behavior. The higher the score, the greater the chance for good payment behavior. Usually, payment performance is measured over an 18 to 24 month period - often called the "performance period". Values for the explanatory variables are typically collected at the start of the performance window, sometimes called the "observation point".

Higher Performance & Reliability

This website has been my pet project for a number of years, but finding a knowledgeable Windows (.asp) web hosting provider has been very frustrating. The good news is that "Lunarpages" is now hosting my website and my analytic tools now work like they should - fast and reliable. However, I only have so much memory available, so please keep your number of observations between 500 & 5,000 and the number of variables between 10 & 50. Some hints: First, eliminate unpredictive variables from your input file. This is done by constructing your initial input file and running Steps 1-4 below. At the end of Step 4, request the WOE report, which will show you which variables are essentially useless - variables with Information Value (IV) <.05. Eliminate those variables from your input file and load your smaller data set, beginning again with Step 1. In addition, it is much more important to have an adequate or, better yet, near equal representation of both populations (say 500 to 2,000 of each outcome) than to have 50,000 observations with one population representing only 0.5% of the total. Designing your data this way will ensure optimal use of the software and help you build a better model more quickly. In order to conserve web resources, this application will time out if left idle for more than 30 minutes. If this happens, you may get an error message saying something about an invalid path, and you will need to start over again.

 
Step 1  - Read in Your Data

As this program does not create explanatory variables for your regression, you must create them on your own. Typically, if your variable is categorical, each value should be listed as a separate variable with a value of "0" or "1". These are called dummy variables and all but one should be included in your regression model. The modeling procedures will handle numeric variables only. It is recommended that you try to have an equal representation of event and non-event observations (1's and 0's for the dependent variable). For example, in a response model derived from a mailing campaign, response counts are often very small. Therefore, you would want to sample responders at a higher rate than nonresponders. If you do, STEP 6 will require information about the original population to make an adjustment for uneven sampling. All graphs on this website can be easily copied and pasted into Excel or Word. In addition, all tables can be copied into other documents.
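As an illustration of the dummy variable coding described above, here is a minimal Python sketch. The function name `make_dummies`, the `prefix` argument, and the sample `region` values are hypothetical and are not part of this website; the idea is simply one 0/1 column per category, with one category left out as the reference.

```python
def make_dummies(values, prefix, reference=None):
    """Turn one categorical column into 0/1 dummy columns, omitting one level."""
    levels = sorted(set(values))
    # One level (the reference category) is left out so the remaining
    # dummies are not perfectly collinear with each other.
    if reference is None:
        reference = levels[0]
    kept = [level for level in levels if level != reference]
    return {
        f"{prefix}_{level}": [1 if v == level else 0 for v in values]
        for level in kept
    }


# Hypothetical example: a three-level region variable becomes two dummies.
region = ["north", "south", "east", "south", "north"]
dummies = make_dummies(region, prefix="region", reference="east")
for name, column in dummies.items():
    print(name, column)
```

The resulting columns would then be written to the tab delimited input file as ordinary numeric predictors.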

Notes: The minimum data requirement to run this software is 100 observations. If you have fewer than that, you may get an unspecified error during the model estimation phase. Also be sure not to include spaces at the end of your data (which often happens if you create your data in Excel). Extra spaces or erroneous characters which are not tab delimited will cause errors in Step 1. To make sure this does not occur, it is recommended that you open your tab delimited file in WordPad or Notepad and go to the end, deleting anything, including extra spaces, that appears after your last legitimate value.

Missing Data...

If data is missing on some of your explanatory variables, you should use some method beforehand to provide proxy values such as means, medians, or mode. If you want this software to perform such a task, code missing values as -999. The routine will automatically substitute the mean value of the nonmissing data for observations that have missing values. This would be applied to ALL variables coded using -999, including categorical variables.  However, more advanced approaches to credit scoring use the Weight of Evidence method to handle missing data directly in the scorecard development. This program will not treat missings as a separate group - it simply performs a mean replacement procedure.
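The mean replacement described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the site's own code; the names `impute_mean` and `MISSING` are assumptions made for the example.

```python
from statistics import mean

MISSING = -999  # the missing-value code this website expects


def impute_mean(column, missing=MISSING):
    """Replace missing codes with the mean of the nonmissing values.

    Returns the filled column and the number of values replaced.
    """
    observed = [v for v in column if v != missing]
    if not observed:
        raise ValueError("column has no nonmissing values")
    fill = mean(observed)
    filled = [fill if v == missing else v for v in column]
    return filled, len(column) - len(observed)


# Hypothetical income column with two missing observations.
income = [40, -999, 60, 50, -999]
filled, n_missing = impute_mean(income)
print(filled, n_missing)
```

Note that applying this to a 0/1 dummy variable fills in a fractional value, which is one reason to prepare proxy values for categorical variables yourself.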

Upload your file below in a tab delimited format, specifying your binary dependent variable first, followed by columns of potential predictor variables. Make sure your FIRST record is a header record identifying the names of your attributes. Do NOT use any spaces in the header record. See the test file survey.txt as an example of the correct format. Please do not load a file with an .XLS extension, as this is not a tab delimited file. Use the browse button to select your file. To begin, press the button "Read in File" and sequentially follow the recommended STEPS to produce your first logistic regression "on-line".


- Before you use your own data, I highly recommend that you experiment with the data provided in Survey.txt. This data represents the hypothetical results of a marketing survey where 750 consumers were asked if they would buy a new kind of golf ball based on three pricing points, income, age, family size, gender, and whether the consumer had ever used any other product of a certain brand. The results of the survey were coded as a "1" if the respondent answered 'yes' to the purchase decision and as a "0" if the answer was 'no'. Logistic regression was then applied to the data and factors such as price, income, and age were found to be statistically significant, along with other variables from the survey. Given this information from an external list provider, the golf retailer could then compute each prospect's propensity to purchase the new golf ball, given their demographic profile and the product's final price offering. A marketing campaign might send special offers to those with the highest probability of purchase. To use this example file, simply click on the "Survey.txt" button above and save it somewhere on your hard drive. Then examine the file using Notepad to get an idea of the correct input format. Next click on the "browse" button below and select that file name from your hard drive's location, then click "Read-in File".

 
For all input files, carefully check the read-in results before continuing.

 

Read In Results:









Missing Data Results (coded as -999)
Variable Name (missing obs) [percent missing]



Step 2  - Examine Data for Outliers

The next step in modeling your data is to perform a preliminary analysis on the correlations among your potential predictor variables. In addition, you may want to see basic statistics such as minimum, mean, maximum, etc. Press the button called "Preliminary Analysis" to continue. This feature will produce a wealth of information on your data which will be discussed below.



Statistics reported for each variable: minimum value, maximum value, mean value, standard deviation, kurtosis, 99th percentile, 75th percentile, 50th percentile, 25th percentile, 5th percentile, and median.


OUTLIER DETECTION - The results box below counts, for each record, the number of variables whose value exceeds the 99th percentile, treating them as possible outliers. As regression analysis is sensitive to outliers, consider dropping records that have a high number of variables with values greater than the 99th percentile. Rather than eliminating records, you may decide to cap their values at the 99th percentile or at some other predetermined limit. However, this process should affect no more than 1% or 2% of your observations - i.e., true outliers. Note that if your variable is categorical or binary, its upper value may show up as the 99th percentile and be classified as an outlier in the list below. This may be inappropriate. Therefore, outlier detection is typically applied only to variables with continuous numerical ranges. Highlighted below are the record numbers that have a value greater than the 99th percentile.
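The outlier count and the capping alternative can be sketched as follows. This is a rough illustration under stated assumptions: the helper names are hypothetical, and the percentile uses the nearest-rank definition, which is only one of several common definitions and may not match the one used by this website.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def outlier_counts(columns, pct=99):
    """For each record, count the variables whose value exceeds that variable's percentile."""
    cutoffs = {name: percentile(col, pct) for name, col in columns.items()}
    n_records = len(next(iter(columns.values())))
    return [
        sum(1 for name, col in columns.items() if col[i] > cutoffs[name])
        for i in range(n_records)
    ]


def cap(values, pct=99):
    """Cap values at the chosen percentile instead of dropping the record."""
    limit = percentile(values, pct)
    return [min(v, limit) for v in values]


# Hypothetical data: 200 records, one with an extreme balance.
balance = list(range(1, 200)) + [10000]
counts = outlier_counts({"balance": balance})
print(max(cap(balance)), sum(counts))
```

Records with a high count across many continuous variables are the natural candidates for review.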

Record (count)


     


    

Step 3  - Examine Pairwise Correlations

The general rule is not to include variables in your model that are too highly correlated with other predictors. For example, including two variables that are correlated at .85 in your model may prevent the statistical algorithm from identifying the true contribution of each variable. The econometric literature refers to this problem as 'multicollinearity'. For some in the credit scoring world, a good rule is to exclude any explanatory variable that is correlated above .50 with another predictor variable. Further below, when you are ready to select your regression variables, this software has a filtering option that helps you eliminate variables with correlation problems. Highlighted below are those pairwise correlations which are >.55 in absolute value.
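To make the pairwise check concrete, the sketch below computes Pearson correlations for every pair of predictors and lists the pairs above the .55 highlight threshold. The function names and the sample columns are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient (assumes neither column is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_correlated(columns, threshold=0.55):
    """Return (name_a, name_b, r) for pairs whose |r| exceeds the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical predictors: income and spend move together, age does not.
predictors = {
    "income": [1, 2, 3, 4, 5],
    "spend": [2, 4, 6, 8, 11],
    "age": [3, 1, 4, 1, 5],
}
print(flag_correlated(predictors))
```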

Results of Pairwise Correlation




An even better way to see if your predictor variables are too correlated with one another is to look at the Variance Inflation Factors. The VIFs shown below examine the correlation among all your potential predictors collectively. If you see VIFs over 10, then there exists too much duplication in your data to include all variables in the model. Some argue that you should limit VIFs to 2.5 or less for logistic regression. For each regression, the software will recalculate the VIFs in your model so you can better understand if collinearity will impact your results. Drop variables one at a time that have high VIFs and rerun the regression. Highlighted below are those VIFs which have a value of 10 or more.
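For readers who want to reproduce VIFs outside this website, one standard identity is that the VIF of each predictor equals the corresponding diagonal element of the inverse of the predictors' correlation matrix. The sketch below relies on that identity with a small hand-written matrix inverse; the function names are hypothetical and the site's own computation may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (assumes neither column is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Gauss-Jordan matrix inverse with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular correlation matrix (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(columns):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Hypothetical predictors; values near 1 indicate little collinearity.
print(vifs({"income": [1, 2, 3, 4, 5], "age": [3, 1, 4, 1, 5]}))
```

With only two predictors the VIF reduces to 1 / (1 - r^2), which gives a quick way to sanity check the result.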

Results of V.I.F.



Step 4  - Examine Bivariate Correlations & Means

Further still, it is often beneficial to look at how each predictor variable is correlated with the dependent variable. These are sometimes referred to as "bivariate correlations". The results box below collects this information.  If the Pvalue is >.05, we would typically say that the correlation (and hence the variable) is not important. Highlighted below are those bivariate correlations which are statistically insignificant.

Results of Bivariate Correlations [Pvalues]


The means test below determines if there is a significant difference in the average value of the attribute between the two outcome populations. The first number is the t-statistic, and to its right is the probability value (Pvalue) associated with it. If the t-stat is over 2 in absolute value (i.e. the Pvalue <.05), then that attribute might be a good predictor candidate in explaining the outcome variable. Highlighted below are those variables whose means were not found to be statistically different between the two outcome populations. It is doubtful that these variables would prove useful in your modeling exercise.
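A means test of this kind can be sketched as below. Two assumptions are made that the text above does not confirm: the unequal-variance (Welch) form of the t-statistic is used, and the Pvalue is approximated with the normal distribution, which is reasonable only for samples of the size this website requires. The function name `means_test` is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def means_test(values, outcome):
    """Compare the mean of a predictor between the 1's and 0's populations.

    Returns (t_statistic, approximate_two_sided_p_value).
    """
    ones = [v for v, y in zip(values, outcome) if y == 1]
    zeros = [v for v, y in zip(values, outcome) if y == 0]
    # Standard error of the difference in means (unequal variances assumed).
    se = math.sqrt(variance(ones) / len(ones) + variance(zeros) / len(zeros))
    t_stat = (mean(ones) - mean(zeros)) / se
    # Normal approximation to the t distribution (large-sample assumption).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Hypothetical predictor that is clearly higher among the 1's.
t_stat, p_value = means_test([10, 12, 14, 1, 3, 5], [1, 1, 1, 0, 0, 0])
print(round(t_stat, 2), p_value < 0.05)
```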

Results of Means Test Across Predictors



Graphs
The following information shows the minimums, maximums, and means of the two outcome populations. The first graph is a frequency count of the variable. The second graph is a picture of the variable describing the distribution between the two populations. Good predictor variables will tend to show differences between the red bars and the blue bars. 

Weight of Evidence (WOE) and Information Value (IV)
The third graph shows two important measures of potential predictiveness - Weight of Evidence and Information Value. The WOE is derived by dividing each variable into five equal-sized groups (bins) and calculating, for each bin, the ratio of its share of the total 1's to its share of the total 0's, and then taking the log of that ratio. In practice, modeling experts might start off by breaking each variable into 50 equal-sized bins and calculating the WOE for each bin. Next, they might fine tune the bins by collapsing some in order to get a monotonic trend, or simply to better distinguish differences across bins from a WOE perspective. In this program, we present five bins to give the user a rough idea of the nonlinear nature of each variable. Although you can build a regression model more quickly without worrying about the WOE procedure (i.e. use the variables in their continuous 'raw' form), scoring models can sometimes be significantly enhanced by using the weight of evidence approach.

The WOE Report, which can be downloaded below, shows the code for the transformation as well as a list of the information values for all the variables. The Information Value (IV) is related to the WOE, but describes the overall potential predictiveness of the variable. Variables with low IV (less than .05) tend not to be predictive. In fact, this program allows the user to employ IV measures as a filtering alternative.
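The WOE and IV calculations can be sketched as follows. Several details are assumptions rather than this program's documented behavior: bins are formed by rank (so tied values may be split across bins), 0.5 is added to each count so that an empty cell does not produce a log of zero, and IV is taken as the usual sum over bins of (share of 1's minus share of 0's) times WOE.

```python
import math


def quantile_bins(values, n_bins=5):
    """Assign each record to one of n_bins roughly equal-sized groups by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def woe_iv(values, outcome, n_bins=5):
    """Return (list of WOE per bin, Information Value) for one predictor."""
    bins = quantile_bins(values, n_bins)
    total_1 = sum(outcome)
    total_0 = len(outcome) - total_1
    woe, iv = [], 0.0
    for b in range(n_bins):
        # 0.5 is added to each count (an assumed smoothing choice).
        n1 = sum(1 for k, y in zip(bins, outcome) if k == b and y == 1) + 0.5
        n0 = sum(1 for k, y in zip(bins, outcome) if k == b and y == 0) + 0.5
        share_1, share_0 = n1 / total_1, n0 / total_0
        w = math.log(share_1 / share_0)
        woe.append(w)
        iv += (share_1 - share_0) * w
    return woe, iv


# Hypothetical predictor where high values go with the 1's.
values = list(range(100))
outcome = [1 if v >= 50 else 0 for v in values]
woe, iv = woe_iv(values, outcome)
print([round(w, 2) for w in woe], round(iv, 2))
```

A variable whose bins all have a WOE near zero contributes almost nothing to IV, which is why the .05 cutoff works as a quick filter.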


[Output panels for the selected graph variable: min, max, and means of the outcome population (1's) and of the outcome population (0's); frequency counts for the overall sample; distribution by population (blue = 0's population, red = 1's population); Weight of Evidence (WOE) graph with the variable's Information Value (IV).]
  

 

When you are finished examining data and graphs above, proceed to Step 5 and select variables for your initial model.

Step 5  - Select Modeling Variables

Now it is time to select predictor variables for the model. Try to select variables that are highly correlated with the dependent variable but not too highly correlated with other predictors. Use the information from the correlation analysis and means analysis above to make this initial selection. Although PC-based statistical software such as SAS may offer an automatic variable selection routine such as stepwise regression, server resources prevent such a feature from being implemented here. However, if the user selects a minimum IV criterion such as .05, nonpredictive variables (those with an IV below that value) can be eliminated immediately. Furthermore, use of this feature automatically eliminates (filters) explanatory variables that are too correlated with one another (>.55), a process that traditional stepwise procedures do not address.
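One way such a filter could work is sketched below. The text above does not say which member of a correlated pair is dropped, so this example assumes a greedy rule: variables are considered in descending IV order and a variable is kept only if its correlation with every variable already kept is within +/-.55. All names here are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (assumes neither column is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def select_predictors(columns, iv, min_iv=0.05, max_corr=0.55):
    """Filter on minimum IV, then drop variables too correlated with a kept one."""
    candidates = sorted(
        (name for name in columns if iv[name] >= min_iv),
        key=lambda name: iv[name],
        reverse=True,
    )
    kept = []
    for name in candidates:
        # Assumed tie-break: the higher-IV variable of a correlated pair wins.
        if all(abs(pearson(columns[name], columns[k])) <= max_corr for k in kept):
            kept.append(name)
    return kept


# Hypothetical predictors and IV values.
columns = {
    "income": [1, 2, 3, 4, 5],
    "spend": [2, 4, 6, 8, 11],
    "age": [3, 1, 4, 1, 5],
}
print(select_predictors(columns, {"income": 0.30, "spend": 0.20, "age": 0.10}))
```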
 
[Input form: select predictors for logistic regression; minimum Information Value (IV); maximum allowed pairwise correlation = +/-.55]

   

 


Step 6 - Build Model

Now you are ready to build (estimate) your initial model. Simply press the button below and examine the statistical information that follows. Model building is an iterative process, so you will want to revisit your variable selection to see if you can build a model that is even more predictive and makes sense from a business or statistical perspective. If your sample is not purely random because you included a higher proportion of 1's for your dependent variable than is seen in the original population, please input the original population percentage of 1's so your score will reflect the correctly adjusted probabilities. This adjustment changes only the intercept component of your model and not your estimated coefficients.
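The exact formula this website uses for the intercept adjustment is not stated here. A common choice for this situation is the "prior correction", in which the log of [(1 - population rate) / population rate] x [sample rate / (1 - sample rate)] is subtracted from the estimated intercept. The sketch below assumes that formula; the function names are hypothetical.

```python
import math


def intercept_adjustment(pop_pct_ones, sample_pct_ones):
    """Amount to add to the estimated intercept (assumed prior-correction formula).

    Both arguments are percentages of 1's, e.g. 2 for 2%.
    """
    tau = pop_pct_ones / 100      # share of 1's in the original population
    ybar = sample_pct_ones / 100  # share of 1's in the modeling sample
    return -math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))


def adjusted_probability(linear_score, adjustment):
    """Probability after shifting the linear score by the intercept adjustment."""
    return 1 / (1 + math.exp(-(linear_score + adjustment)))


# Hypothetical case: 2% responders in the population, 50% in the sample.
adjustment = intercept_adjustment(2, 50)
print(round(adjustment, 4), round(adjusted_probability(0.0, adjustment), 4))
```

In this example a record that scores 50% on the balanced sample is pulled back to the 2% population rate, while the coefficients, and therefore the rank ordering of records, are unchanged.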

Below, you are also given the choice to build a logistic regression using the WOE transformations, which were derived using quintile breaks. These results present a scorecard highlighting possible nonlinearity in your model.

[Input form: Modeling Choice; for an uneven sampling scheme (leave blank otherwise): Percent Population (1's) % and Percent Sample (1's) %; Intercept Adjustment: 0.00]


Step 7  - Examine Modeling Results

The regression results below show the weights (coefficients) for each predictor along with a test of their statistical significance. Also shown are the probability values for these statistics. P-values less than .05, along with t-values greater than 2 in absolute value, imply statistical significance. Also provided are the exponentiated coefficients, exp(coefficient), which are very useful in interpreting the regression results. Strictly speaking, these values are odds ratios rather than log odds (the coefficients themselves are on the log odds scale). For example, if you have a binary explanatory variable where a value of "1" represents gender = female and "0" otherwise, then an odds ratio of 2.3 implies that the odds of response for females are 2.3 times the odds for males, all other things remaining equal. Finally, VIFs are also shown, indicating potential multicollinearity in your data. In logistic regression, you want to keep VIF values below 2.5.
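To show how the coefficients turn into a score, the sketch below applies the standard logistic prediction equation, probability = 1 / (1 + exp(-(intercept + sum of coefficient x value))), and exponentiates each coefficient to obtain its odds ratio. The coefficient values and variable names are made up for illustration and do not come from any model on this website.

```python
import math


def predict_probability(intercept, coefficients, record):
    """Logistic prediction equation for one record."""
    z = intercept + sum(coefficients[name] * record[name] for name in coefficients)
    return 1 / (1 + math.exp(-z))


def odds_ratios(coefficients):
    """exp(coefficient) for each predictor."""
    return {name: math.exp(b) for name, b in coefficients.items()}


# Hypothetical model: a female dummy and a price variable.
coefficients = {"female": math.log(2.3), "price": -0.4}
intercept = 0.5

p_female = predict_probability(intercept, coefficients, {"female": 1, "price": 3})
p_male = predict_probability(intercept, coefficients, {"female": 0, "price": 3})

# The ratio of the odds matches exp(coefficient) for the dummy variable.
odds_ratio = (p_female / (1 - p_female)) / (p_male / (1 - p_male))
print(round(odds_ratio, 4), round(odds_ratios(coefficients)["female"], 4))
```

Note that the ratio of the two probabilities themselves is not 2.3; only the ratio of the odds is.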


Logistic Regression Prediction Equation:




Also shown below are general statistics associated with logistic regression. These results show from an overall perspective if there is any predictive power in your model. You want the Pvalue to be less than .05, which is usually pretty easy to do in most models.

Other Regression Results
Loglikelihood Ratio

Loglikelihood Ratio P-Value


Model Sensitivity
The graph below shows how the probability value varies across the range of a predictor's values, assuming all other variables in the model are equal to their sample means. This has the most meaning if your variables are continuous in nature rather than constructed as dummy variables. You might want to examine these graphs because they show where your model is most sensitive to small changes in certain explanatory variables. (Note: you will notice that some of the modeling results disappear when you change the graph selection. This is to conserve memory. Simply hit the ESTIMATE MODEL button again in Step 6 to see your results.)
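The sensitivity calculation can be sketched as follows: hold every predictor at its sample mean, sweep one predictor from its minimum to its maximum, and record the predicted probability at each point. The function name, the number of steps, and the example model are assumptions made for illustration.

```python
import math
from statistics import mean


def sensitivity_curve(intercept, coefficients, columns, variable, steps=10):
    """Predicted probability as one variable sweeps its observed range.

    All other variables in the model are held at their sample means.
    """
    base = {name: mean(columns[name]) for name in coefficients}
    low, high = min(columns[variable]), max(columns[variable])
    curve = []
    for k in range(steps + 1):
        x = low + (high - low) * k / steps
        record = dict(base, **{variable: x})
        z = intercept + sum(coefficients[n] * record[n] for n in coefficients)
        curve.append((x, 1 / (1 + math.exp(-z))))
    return curve


# Hypothetical data and coefficients.
columns = {"price": [1, 2, 3], "income": [10, 20, 30]}
coefficients = {"price": -1.0, "income": 0.05}
for x, p in sensitivity_curve(0.5, coefficients, columns, "price", steps=4):
    print(x, round(p, 3))
```

The steepest part of the resulting curve is where small changes in the variable move the score the most.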


  

       
   



 

Logistic Regression (WOE) Prediction Equation:



 
Step 8  - Model Validation - Development Data

So, how good is your model? Shown below is what is traditionally called a LIFT CHART, describing how well your model rank orders your data. This table is calculated by first sorting the data by the model's probability value from high to low and then counting how many '1s' and '0s' of your dependent variable are found in each 10% bin or interval. Next, the difference between the cumulative percentage of '1s' and the cumulative percentage of '0s' is calculated for each of the 10% intervals. The maximum difference across the 10 buckets is called the KS value. The higher the KS (with a maximum equal to 100), the more predictive your model. Look to increase this in subsequent models. The earlier this maximum occurs in terms of deciles, the better. Best practices dictate that a validation analysis be done on the data used for model development. It is also best practice to "hold out" a portion of the data from model development to see how the score performs on an independent sample. The "hold out" validation is optional here and can be performed in Step 9.
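The decile table and KS value described above can be sketched as follows. The function name `ks_table` is hypothetical, and the code assumes that both 1's and 0's are present in the data.

```python
def ks_table(scores, outcome, n_bins=10):
    """Build the decile table and return (rows, KS).

    Each row is (bin number, cumulative % of 1's, cumulative % of 0's, difference).
    """
    # Sort records by score from high to low.
    ranked = sorted(zip(scores, outcome), key=lambda pair: pair[0], reverse=True)
    total_1 = sum(outcome)
    total_0 = len(outcome) - total_1
    rows, cum_1, cum_0 = [], 0, 0
    for b in range(n_bins):
        start = b * len(ranked) // n_bins
        stop = (b + 1) * len(ranked) // n_bins
        ones = sum(y for _, y in ranked[start:stop])
        cum_1 += ones
        cum_0 += (stop - start) - ones
        pct_1 = 100 * cum_1 / total_1
        pct_0 = 100 * cum_0 / total_0
        rows.append((b + 1, pct_1, pct_0, pct_1 - pct_0))
    ks = max(row[3] for row in rows)
    return rows, ks


# Hypothetical scores for 20 records; the 1's mostly receive the higher scores.
scores = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5,
          0.45, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05, 0.01]
outcome = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
rows, ks = ks_table(scores, outcome)
for row in rows:
    print(row)
print("KS =", ks)
```

A model that ranks every 1 above every 0 reaches a KS of 100, while a model that mixes them evenly in every decile scores 0.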

  
Shown below is a graph of the cumulative % of 1's population (top) and the cumulative % of 0's in your validation table. The greater the spread between these lines, the better your model is able to distinguish between the two populations of interest.



Model building is seldom a one time event. Typically as many as 5 to 10 different models are developed before a final one is selected. The list below collects the results of each model along with the validation results for your review until you select your final model. To clear the list, simply hit the button below and begin your modeling efforts again.

Model History 






The results below show the model's probability (score) for all records in the data along with the outcome (dependent variable) listed in the same sequence as the original input file. Highlighted are those records that fall in the first decile - i.e. the group with the highest (top 10%) score.

 Probability Score and Outcome for Development Sample






Step 9  - Model Validation - Hold Out Sample (Optional)

After you come up with your final model above, it is recommended that you perform an independent validation using hold out data. The data should be in exactly the same form as your original model development data, with a target outcome variable coded as a 0 or 1 in the first field and potential predictor variables following afterwards. If your development data had 20 potential predictor variables in it, then your holdout sample should have the same 20 variables in it and in the same order, regardless of the number of variables in your final model. Your maximum KS value should be very close to what you obtained for your development dataset. If you coded any missing values as -999, then a mean substitution procedure will be applied. Please use the following buttons to load your hold out sample for validation analysis.




Read In Results:






Missing Data Results (coded as -999)
Variable Name (missing obs) [percent missing]



 
If you are satisfied that your data has been read in correctly, then press the button below to create your validation results for the hold out sample. The results will be shown in the table and graph below.






Probability Score and Outcome for Holdout Sample






Step 10 - Profile the Results (Development Sample)

The next step is to describe in simple terms what the potential responders and nonresponders look like, or what those individuals predicted to have good or bad payment behavior look like. This can be accomplished by simply taking the means of the attributes for those accounts that fall in the top and bottom deciles, as shown below.
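A profile of this kind can be sketched as follows: rank the records by score, take the top 10% and the bottom 10%, and average each attribute within those two groups. The function name and the example data are hypothetical.

```python
from statistics import mean


def decile_profiles(scores, columns):
    """Attribute means for the highest-scoring and lowest-scoring 10% of records."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    size = max(1, len(order) // 10)
    top, bottom = order[:size], order[-size:]
    top_means = {name: mean([col[i] for i in top]) for name, col in columns.items()}
    bottom_means = {name: mean([col[i] for i in bottom]) for name, col in columns.items()}
    return top_means, bottom_means


# Hypothetical example: 20 records where income rises with the score.
scores = list(range(20))
columns = {"income": [i * 10 for i in range(20)]}
top_means, bottom_means = decile_profiles(scores, columns)
print(top_means, bottom_means)
```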

Profile Means 1st Decile (high scores)  Profile Means Last Decile (low scores)




Step 11  - Apply the Model to New Data

If your modeling efforts were successful, the next step might be to gather new data for those variables that were found to be statistically important and score them with the prediction algorithm. This can easily be done in Excel if you download and use the logistic equation scoring code in Step 7. In applying the model to production data, it is recommended that you have some code in place to cap attribute values (say, at a maximum value such as the 99th percentile) to guard against errors in data preparation for production. Also, some code should be added to handle any missing data. Finally, make sure that the population you are scoring "looks the same" from a distributional standpoint as the data from which your model was developed, or your results may be poor.
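The production safeguards mentioned above can be sketched as a small preparation step that runs before scoring. The caps and fill values would come from your development data (for example the 99th percentiles and means); all of the names and numbers below are hypothetical.

```python
import math

MISSING = -999  # same missing-value code used for the development data


def prepare_record(record, caps, fill_values, missing=MISSING):
    """Replace missing codes, then cap each attribute at its maximum allowed value."""
    cleaned = {}
    for name, value in record.items():
        if value == missing:
            value = fill_values[name]
        cleaned[name] = min(value, caps[name])
    return cleaned


def score(record, intercept, coefficients):
    """Logistic prediction equation applied to a prepared record."""
    z = intercept + sum(coefficients[name] * record[name] for name in coefficients)
    return 1 / (1 + math.exp(-z))


# Hypothetical caps, fill values, and model taken from a development sample.
caps = {"income": 200, "age": 90}
fill_values = {"income": 55, "age": 40}
coefficients = {"income": 0.01, "age": -0.02}

new_record = {"income": 500, "age": -999}
cleaned = prepare_record(new_record, caps, fill_values)
print(cleaned, round(score(cleaned, -0.5, coefficients), 4))
```

Capping after the missing-value replacement keeps an unexpected fill value from slipping past the limit.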



      Thanks for visiting Jeff Morrison's "Logistic Regression ONLINE"!