Logistic Regression On-line:
Applications in Prospect Marketing & Risk Modeling
(last updated 3/6/2010)
Please send all comments / questions to me, Jeff Morrison,
at my email address: m_jeffer@bellsouth.net.
Welcome to Logistic Regression online! The purpose of this website is to serve as
both a learning tool and a means to enhance your business in credit risk and
marketing. Although this website was never intended to replace more robust PC-based
software such as SAS, EVIEWS, RATS, LIMDEP, and SPSS, its use is free and it will produce
statistically sound models for prospect marketing, credit risk, and other applications
where the objective is to distinguish between two populations of interest.
What makes this website unique is that it presents the entire model building process
step by step from start to finish with helpful advice along the way, including many
procedures that would have to be custom programmed even in advanced statistical
software packages such as SAS or SPSS.
A good reference for this material can be found in 'Credit
Risk Scorecards - Developing and Implementing Intelligent Credit Scoring', by Naeem
Siddiqi, copyright 2006 by the SAS Institute.
Logistic Regression is a multivariate technique that attempts to quantify the relationship
between a set of predictor variables and a target variable, sometimes called the
dependent variable. The dependent variable is coded as either a "1" or "0" while
the predictor variables can be continuous or binary in nature. The 0/1 coding of
the dependent variable is very important. If you are doing a prospect (response or
lookalike) model to add customers to your book of business, then you will want
to code a response to a particular mailing campaign as a "1". A nonresponse would
be coded as a "0". A logistic regression model coded this way will produce a score
that reflects the probability of response. The higher the score, the greater the
chance for a response to a mailing campaign.
On the other hand, credit scoring traditionally
codes good payment behavior as a "1", and bad payment behavior as a "0". A logistic
regression model coded this way will produce a score that reflects the probability
of good payment behavior. The higher the score, the greater the chance for good
payment behavior. Usually, payment
performance is measured over an 18 to 24 month period, often called the "performance
period". Values for the explanatory variables are typically collected at the start of the performance
period, sometimes called the "observation point".
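To make the idea of a "score" concrete, here is a minimal Python sketch of how a fitted logistic regression turns predictor values into a probability. The function name and the example weights are hypothetical and are not output from this website.

```python
import math

def logistic_probability(intercept, coefficients, values):
    """Return the modeled probability that the dependent variable equals 1."""
    # Linear predictor: intercept plus the weighted sum of the predictor values.
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    # The logistic function maps any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for two predictors (illustrative numbers only).
p = logistic_probability(-1.0, [0.8, -0.05], [1.0, 20.0])
print(round(p, 4))
```

Whether a high value means "likely responder" or "likely good payer" depends only on which outcome you coded as "1".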
Higher Performance & Reliability
This website has been my pet project for a number of years, but finding a knowledgeable
Windows (.asp) web hosting provider has been very frustrating. The good news is
that "Lunarpages" is now hosting my website and my analytic tools now work like
they should: quickly and reliably. However, I only have so much memory available,
so please keep your number of observations between 500 and 5,000 and the
number of variables between 10 and 50.
Some hints: First, eliminate unpredictive
variables from your input file. To do this, construct your initial input
file and run Steps 1-4 below. At the end of Step 4, request the WOE report, which
will show you which variables are essentially useless, i.e. variables with an Information
Value (IV) below .05. Eliminate those variables from your input file and load your
smaller data set, beginning again with Step 1. In addition, it is much more
important to have an adequate or, better yet, near equal representation of both populations
(say 500 to 2,000 of each outcome) than to have 50,000 observations with one population
representing only 0.5% of the total. Designing your data this way will ensure optimal
use of the software and help you build a better model more quickly.
In order to conserve web resources, this application
will time out if left idle for more than 30 minutes. If this happens, you may get
an error message saying something about an invalid path, and you will need to start over again.
Step 1 - Read in Your Data
This program does not create explanatory variables for
your regression, so you must create them on your own. Typically, if your variable is
categorical, each value should be listed as a separate variable with a value of
"0" or "1". These are called dummy variables, and all but one
should be included in your regression model. The modeling procedures will handle
numeric variables only.
It is recommended that you try to have an equal representation
of event and non-event observations (1's and 0's for the dependent variable). For
example, in a response model derived from a mailing campaign, response counts are
often very small. Therefore, you would want your sample to include a larger share
of the responders than of the nonresponders. If you do, Step 6 will require
information about the original population
to make an adjustment for uneven sampling.
All graphs and tables on this website can be easily
copied and pasted into Excel, Word, or other documents.
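Since the dummy variables have to be built before upload, here is one way to sketch that step in Python. The helper name, the "region" column, and the choice of the alphabetically first level as the omitted category are all assumptions made for illustration.

```python
def make_dummies(values, prefix):
    """Turn one categorical column into 0/1 columns, leaving one level out."""
    levels = sorted(set(values))
    kept = levels[1:]  # the first level becomes the omitted reference category
    return {
        f"{prefix}_{level}": [1 if v == level else 0 for v in values]
        for level in kept
    }

# Hypothetical categorical column with three levels.
region = ["east", "west", "south", "east"]
dummies = make_dummies(region, "region")
print(dummies)
```

Three levels yield two dummy columns; a record with 0 in both belongs to the omitted level. Note that the generated names contain no spaces, as the header rules in this step require.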
Notes:
The minimum data requirement to run this software is
100 observations. If you have fewer than that, you may get an
unspecified error during the model estimation phase. Also be sure
not to include spaces at the end of your data (which often happens if
you create your data in Excel). Extra spaces or erroneous characters which are not
tab delimited will cause errors in Step 1. To make sure this does not occur, it
is recommended that you open your tab delimited file in WordPad or Notepad, go
to the end, and delete anything, including extra spaces, that appears after your
last legitimate value.
Missing Data...
If data is missing on some of your explanatory variables, you should use some method beforehand to provide
proxy values such as the mean, median, or mode. If you want
this software to perform such a task, code missing values as -999. The routine will
automatically substitute the mean value of the nonmissing data for observations
that have missing values. This is applied to ALL variables coded using -999,
including categorical variables. More advanced approaches to credit
scoring use the Weight of Evidence method to handle missing data directly in
the scorecard development. This program will not treat missings as a separate group;
it simply performs a mean replacement procedure.
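The mean replacement described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code this website runs; the function name is hypothetical.

```python
MISSING = -999  # the missing-value code described above

def impute_mean(column, missing=MISSING):
    """Replace missing codes with the mean of the non-missing values."""
    present = [v for v in column if v != missing]
    if not present:
        raise ValueError("column has no non-missing values")
    mean = sum(present) / len(present)
    return [mean if v == missing else v for v in column]

filled = impute_mean([10, -999, 30])
print(filled)
```

For a 0/1 dummy variable the substituted value is a fraction between 0 and 1, which is one reason you may prefer to treat missing categorical data yourself before upload.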
Upload your file below in a tab delimited format, specifying your binary dependent variable
first, followed by columns of potential predictor variables. Make sure your
FIRST record is a header record identifying the names of your attributes. Do NOT
use any spaces in the header record. See the test
file survey.txt as an example of the correct format. Please do not load a file with an .XLS extension, as this
is not a tab delimited file. Use the browse button to select
your file. To begin, press
the button "Read in File" and sequentially follow the recommended STEPS
to produce your first logistic regression "on-line".
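If you would like to check a file against these format rules before uploading it, a small Python sketch such as the following may help. The function name and the column names in the example are hypothetical (they are not the actual columns of survey.txt), and the checks cover only the rules stated above.

```python
import csv
import io

def check_input(text):
    """Return a list of format problems found in tab delimited text."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, data = rows[0], rows[1:]
    problems = []
    if any(" " in name for name in header):
        problems.append("header contains spaces")
    for line_no, row in enumerate(data, start=2):
        if len(row) != len(header):
            # Stray spaces or blank lines at the end of the file land here.
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
            continue
        if row[0] not in ("0", "1"):
            problems.append(f"line {line_no}: dependent variable must be 0 or 1")
        for value in row[1:]:
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: non-numeric value {value!r}")
    return problems

# Hypothetical three-column file: dependent variable first, then predictors.
good = "buy\tprice\tincome\n1\t9.99\t50\n0\t12.5\t40\n"
bad = good + "   \n"  # trailing spaces after the last legitimate value
print(check_input(good))
print(check_input(bad))
```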
Before you use your own data, I highly recommend that you experiment with the data provided in Survey.txt.
This data represents the hypothetical results of a marketing survey where 750 consumers
were asked if they would buy a new kind of golf ball based on three pricing points,
income, age, family size, gender, and whether the consumer had ever used any other product
of a certain brand. The results of the survey were coded as a "1" if the respondent
answered 'yes' to the purchase decision, otherwise as a "0" for 'no'. Logistic
regression was then applied to the data and factors such as price, income, and age
were found to be statistically significant, along with other variables from the
survey. Given this information from an external list provider, the golf retailer
could then compute each prospect's propensity to purchase the new golf ball, given
their demographic profile and the product's final price offering. A marketing campaign
might send special offers to those with the highest probability of purchase.
To use this example file, simply click on the "Survey.txt" button above
and save it somewhere on your hard drive. Then examine the file using Notepad
to get an idea of the correct input format. Next click on the "browse" button below,
select that file from its location on your hard drive, and click "Read in File".
For all input files, carefully check the read-in results before continuing.
Read In Results:
Missing Data Results (coded as -999)
Variable Name (missing obs) [percent missing]
Step 2 - Examine Data for Outliers
The next step in modeling your data is to perform a preliminary analysis on the
correlations among your potential predictor variables. In addition, you may want
to see basic statistics such as minimum, mean, maximum, etc. Press the button called
"Preliminary Analysis" to continue. This feature will produce a wealth of information
on your data which will be discussed below.
MINIMUM VALUE
MAXIMUM VALUE
MEAN VALUE
STANDARD DEVIATION
KURTOSIS
99TH PERCENTILE
75TH PERCENTILE
50TH PERCENTILE
25TH PERCENTILE
5TH PERCENTILE
MEDIAN
OUTLIER DETECTION - The results box below counts, for each record, the number of
variables whose value exceeds the 99th percentile and treats
them as possible outliers. As regression analysis is sensitive to outliers,
consider dropping records that have a high number of variables with values greater
than the 99th percentile. Rather than eliminating records, you may decide to cap
their values at the 99th percentile or at some other predetermined value. However, this
process should affect no more than 1% or 2% of your observations, i.e. true outliers.
As a note, if your variable is categorical or binary, the upper value may
show up as the 99th percentile and will be classified as an outlier in the list
below. This may be inappropriate. Therefore, outlier detection is typically applied
to variables with continuous numerical ranges.
Highlighted below are the record numbers that have a value greater than
the 99th percentile.
Record (count)
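The per-record outlier count and the capping alternative can be sketched in Python as follows. The nearest-rank percentile used here is an assumption (there are several common percentile definitions, and I am not claiming it matches the one used by this website), and the data and function names are made up for illustration.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def outlier_counts(columns, pct=99):
    """For each record, count the variables whose value exceeds the cutoff."""
    cutoffs = {name: percentile(col, pct) for name, col in columns.items()}
    n_records = len(next(iter(columns.values())))
    return [
        sum(1 for name, col in columns.items() if col[i] > cutoffs[name])
        for i in range(n_records)
    ]

def cap_at_percentile(values, pct=99):
    """Cap values at the cutoff instead of dropping records."""
    limit = percentile(values, pct)
    return [min(v, limit) for v in values]

# Hypothetical data: 200 records, two continuous predictors.
data = {
    "income": list(range(1, 201)),
    "balance": [30] * 199 + [95],
}
counts = outlier_counts(data)
print(max(counts), max(cap_at_percentile(data["income"])))
```

Records with the highest counts are the first candidates for review; capping keeps the record but limits the influence of its extreme values.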
Step 3 - Examine Pairwise Correlations
The general rule is not to include variables in your model that are too highly correlated with
other predictors. For example, including two variables with a correlation of .85
in your model may prevent the true contribution of each variable from being identified
by the statistical algorithm. The econometric literature refers to this
problem as 'multicollinearity'. For some in the credit scoring world, a good
rule is to exclude any explanatory variable that is correlated above .50
with another predictor variable. Further below, when you are ready to select your
regression variables, this software has a filtering option which helps you eliminate
variables with correlation problems. Highlighted below are those pairwise
correlations which are >.55 in absolute value.
Results of Pairwise Correlation
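For readers who want to reproduce this kind of screen offline, here is a rough Python sketch that flags predictor pairs whose Pearson correlation exceeds .55 in absolute value. The function names and the small data set are hypothetical, and a constant column (zero variance) would need to be removed first because its correlation is undefined.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def flag_correlated(columns, threshold=0.55):
    """List every predictor pair whose |correlation| exceeds the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged

# Hypothetical predictors: x and y move together, z does not.
predictors = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 11],
    "z": [3, 1, 4, 1, 5],
}
flagged = flag_correlated(predictors)
print(flagged)
```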
An even better way to see if your predictor variables are too correlated with one another
is to look at the Variance Inflation Factors (VIFs). The VIFs shown below examine the correlation
among all your potential predictors collectively. If you see VIFs over 10, then there is
too much duplication in your data to include all variables in the model. Some argue
that you should limit VIFs to 2.5 or less for logistic regression.
For each regression, the software will recalculate the VIFs in your model so
you can better understand if collinearity will impact your results. Drop
variables that have high VIFs one at a time and rerun the regression. Highlighted below are those VIFs which
have a value of 10 or more.
Results of V.I.F.
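One standard way to compute VIFs is to take the diagonal of the inverse of the predictors' correlation matrix. The Python sketch below uses that identity with a small hand-written matrix inversion; it is an illustration under that assumption and I am not claiming it is how this website computes its VIFs. All names are hypothetical.

```python
import math

def correlation_matrix(cols):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(cols[0])
    means = [sum(c) / n for c in cols]
    devs = [[v - m for v in c] for c, m in zip(cols, means)]
    norms = [math.sqrt(sum(d * d for d in dv)) for dv in devs]
    k = len(cols)
    return [
        [sum(a * b for a, b in zip(devs[i], devs[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    k = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def vifs(cols):
    """VIF of each predictor = matching diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(cols))
    return [inverse[j][j] for j in range(len(cols))]

# Hypothetical pair of weakly correlated predictors.
print(vifs([[1, 2, 3, 4, 5], [3, 1, 4, 1, 5]]))
```

With only two predictors the VIF reduces to 1 / (1 - r^2), which is a handy sanity check on the calculation.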
Step 4 - Examine Bivariate Correlations & Means
Further still, it is often beneficial to look at how each predictor variable is
correlated with the dependent variable. These are sometimes referred to as "bivariate
correlations". The results box below collects this information. If the Pvalue is >.05, we would typically say that the correlation (and hence the variable)
is not important. Highlighted below are those bivariate correlations which are statistically
insignificant.
Results of Bivariate Correlations [Pvalues]
The means test below determines if there is a significant difference in the average
value of the attribute between the two outcome populations. The first number is
the t-statistic, and to its right is the probability value (Pvalue) associated with it.
If the t-statistic is over 2 in absolute value (roughly, Pvalue <.05),
then that attribute might be a good predictor
candidate in explaining the outcome variable. Highlighted below are those variables whose
means do not differ significantly between the two outcome populations. It is doubtful that these variables
will prove useful in your modeling exercise.
Results of Means Test Across Predictors
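As a rough illustration of the means test, the Python sketch below computes a pooled-variance two-sample t-statistic for one predictor. Whether this website uses the pooled or the unequal-variance form is not stated, so treat the pooled form as an assumption; the Pvalue is left out because it needs a t-distribution routine that the standard library does not provide.

```python
import math
from statistics import mean, variance

def two_sample_t(values, outcomes):
    """Pooled-variance t-statistic comparing the 1's group with the 0's group."""
    ones = [v for v, y in zip(values, outcomes) if y == 1]
    zeros = [v for v, y in zip(values, outcomes) if y == 0]
    n1, n0 = len(ones), len(zeros)
    pooled = ((n1 - 1) * variance(ones) + (n0 - 1) * variance(zeros)) / (n1 + n0 - 2)
    return (mean(ones) - mean(zeros)) / math.sqrt(pooled * (1 / n1 + 1 / n0))

# Hypothetical predictor values with the matching 0/1 outcomes.
t_stat = two_sample_t([5, 6, 7, 1, 2, 3], [1, 1, 1, 0, 0, 0])
print(round(t_stat, 3), abs(t_stat) > 2)
```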
Graphs
The following information shows the minimums, maximums, and means of the two outcome
populations. The first graph is a frequency
count of the variable. The second graph is a picture of the variable
describing the distribution between the two populations. Good predictor variables will tend to show differences
between the red bars and the blue bars.
Weight of Evidence (WOE) and Information Value (IV)
The third graph shows two important measures of potential
predictiveness - Weight of Evidence and Information Value. The WOE is derived by
dividing each variable
into five equally sized groups (bins), calculating each bin's share of the total 1's
and its share of the total 0's, and taking the log of the ratio of the two shares. In practice, modeling
experts might start off by breaking each variable into
50 equally sized bins and calculating the WOE for each bin. Next, they
might fine tune the bins by collapsing some in order to get a monotonic trend, or
simply to better distinguish differences across bins from a WOE perspective. In
this program, we present five bins to give the user a rough idea of the nonlinear
nature of each variable. Although you can build a regression model more quickly
without worrying about the WOE procedure (i.e. use the variables in their continuous
'raw' form), scoring models can sometimes be significantly enhanced by using the
weight of evidence approach.
The WOE Report, which can be downloaded below, shows the code for the
transformation as well as a list of the information values for all the variables.
The Information Value
(IV) is related to WOE, but describes the overall potential predictiveness of
the variable. Variables with low IV (less than
.05) will tend not to be predictive. In fact, this program allows the user to employ IV measures
as a filtering alternative.
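The WOE and IV calculations can be sketched in Python as shown here. This follows the common scorecard convention, WOE = ln(share of 1's / share of 0's) per bin and IV = sum over bins of (share of 1's - share of 0's) x WOE. The 0.5 substitution for empty cells, the rank-based binning (which splits tied values arbitrarily), and the function name are assumptions of this sketch and may differ from what the WOE report on this website does.

```python
import math

def woe_iv(values, outcomes, bins=5):
    """Return ([(bin_min, bin_max, woe), ...], information_value)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    total_ones = sum(outcomes)
    total_zeros = n - total_ones
    rows, iv = [], 0.0
    for b in range(bins):
        # Equal-sized bins by rank.
        idx = order[b * n // bins:(b + 1) * n // bins]
        ones = sum(outcomes[i] for i in idx)
        zeros = len(idx) - ones
        # Assumption: replace an empty cell with 0.5 so the log is defined.
        share_ones = max(ones, 0.5) / total_ones
        share_zeros = max(zeros, 0.5) / total_zeros
        woe = math.log(share_ones / share_zeros)
        iv += (share_ones - share_zeros) * woe
        rows.append((values[idx[0]], values[idx[-1]], woe))
    return rows, iv

# Hypothetical predictor where high values go with the "1" outcome.
x = list(range(1, 11))
y = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
rows, iv = woe_iv(x, y, bins=2)
print(rows, round(iv, 3))
```

A variable whose bins all hold the same mix of 1's and 0's gets a WOE of zero in every bin and an IV of zero, which is why a low IV marks a variable as a weak candidate.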
Min, Max, and Means of Outcome Population (1's)
Min, Max, and Means of Outcome Population (0's)
When you are finished examining data and graphs above, proceed to Step 5 and select
variables for your initial model.
Step 5 - Select Modeling Variables
Now it is time to select predictor variables for the model.
Try to select variables that are highly correlated with the dependent variable but
not too highly correlated with other predictors. Use the information from the correlation
analysis and means analysis above to make this initial selection. Although PC-based statistical
software such as SAS may offer an automatic variable selection routine such as stepwise regression,
server resources prevent such a feature from being implemented here. However, if the user selects
a minimum IV criterion such as .05, variables with an IV below that value can be
eliminated immediately.
Furthermore, use of this feature automatically eliminates (filters) explanatory
variables that are too correlated with one another (>.55),
a process that traditional stepwise
procedures do not address.
Select Predictors for Logistic Regression
Minimum Information Value (IV)
Maximum Allowed Pairwise Correlation =+/-.55
Step 6 - Build Model
Now you are ready to build (estimate) your initial model. Simply press
the button below and examine the statistical information that follows. Model building
is an iterative process, so you will want to revisit your variable selection to see
if you can build a model that is even more predictive and makes sense from a business
or statistical perspective. If your sample is not purely random because
you included a higher share of 1's for your dependent variable than is seen in the original population, please input the original population percentage of 1's so your score will reflect the correct adjusted probabilities. This adjustment
changes only the intercept of your model, not your estimated coefficients.
Below, you are also given the choice to build a logistic regression using
the WOE transformations which were derived by using quintile breaks. These results present
a scorecard highlighting possible nonlinearity in your model.
Modeling Choice
For uneven sampling scheme (leave blank otherwise)
Percent Population (1's) %
Percent Sample (1's) %
Intercept Adjustment 0.00
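A common way to make this kind of correction is to shift the intercept by the difference between the population log odds and the sample log odds of a "1". The Python sketch below shows that standard formula; I am presenting it as an assumption about what an intercept adjustment typically looks like, not as the exact calculation performed by this website. The function name is hypothetical.

```python
import math

def intercept_adjustment(pop_pct_ones, sample_pct_ones):
    """Amount to add to the estimated intercept after oversampling 1's."""
    p = pop_pct_ones / 100      # share of 1's in the original population
    s = sample_pct_ones / 100   # share of 1's in the modeling sample
    return math.log(p / (1 - p)) - math.log(s / (1 - s))

# Hypothetical case: 2% responders in the population, 50% in the sample.
adjustment = intercept_adjustment(2, 50)
print(round(adjustment, 4))

# A record scored at exactly 50% on the balanced sample moves to the
# population rate once the adjustment is added to the linear predictor.
adjusted_probability = 1 / (1 + math.exp(-(0.0 + adjustment)))
print(round(adjusted_probability, 4))
```

When the sample and population percentages are equal the adjustment is zero, which corresponds to leaving the fields blank.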
Step 7 - Examine Modeling Results
The regression results below show the weights (coefficients)
for each predictor along with a test of their statistical significance. Also
shown are the probability values for these statistics. P-values less than .05, which correspond roughly to
t-values greater than 2 in absolute value, imply statistical significance. Also
provided are the odds ratios for the coefficients, which are very useful in interpreting
the regression results. These values are simply exp(coefficient). For example,
if you have a binary explanatory variable where a value of "1" represents gender
= female and "0" otherwise, then an odds ratio of 2.3 implies that the odds of response
for females are 2.3 times the odds for males, all other things remaining equal.
Finally, VIFs are also shown, indicating potential multicollinearity in your data. In
logistic regression, you want to keep VIF values below 2.5.
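Exponentiating a coefficient gives the multiplicative change in the odds (not in the probability itself) for a one unit increase in the predictor. The short Python sketch below illustrates the difference, using a hypothetical coefficient chosen so that exp(coefficient) is 2.3 and a hypothetical baseline response probability of 10%.

```python
import math

def odds_ratio(coefficient):
    """exp(coefficient): the factor applied to the odds per unit increase."""
    return math.exp(coefficient)

def apply_odds_ratio(probability, ratio):
    """Convert a probability to odds, scale the odds, and convert back."""
    odds = probability / (1 - probability)
    new_odds = odds * ratio
    return new_odds / (1 + new_odds)

coefficient = math.log(2.3)  # hypothetical coefficient on a 0/1 dummy
ratio = odds_ratio(coefficient)
new_probability = apply_odds_ratio(0.10, ratio)
print(round(ratio, 2), round(new_probability, 4))
```

In this example the odds are multiplied by 2.3, but the probability roughly doubles rather than rising to 23%, so the two should not be read interchangeably.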
Logistic Regression Prediction Equation:
Also shown below are general statistics associated with logistic regression. These
results show from an overall perspective if there is any predictive power in your
model. You want the Pvalue to be less than .05, which is usually pretty easy to
do in most models.
Other Regression Results
Loglikelihood Ratio
Loglikelihood Ratio P-Value
Model Sensitivity
The graph below shows how the probability value varies across the range of predictor
values assuming all other variables in the model are equal to their sample means.
This has the most meaning if your variables are continuous in nature rather than
constructed as dummy variables. You might want to examine these graphs
because they show you where your model is most sensitive to small changes in certain
explanatory variables. (Note: you will notice that some of the modeling results disappear
when you change the graph selection. This is to conserve memory. Simply hit the
ESTIMATE MODEL button again in Step 6 to see your results.)
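The points behind such a sensitivity graph can be generated with a few lines of Python: hold every predictor at its sample mean, sweep one predictor across its range, and record the predicted probability. The coefficients, means, and function name below are hypothetical.

```python
import math

def sensitivity_curve(intercept, coefs, means, index, low, high, steps=5):
    """Predicted probability as one predictor varies and the rest stay at their means."""
    points = []
    for s in range(steps):
        x = low + (high - low) * s / (steps - 1)
        values = list(means)
        values[index] = x  # only the chosen predictor moves
        z = intercept + sum(b * v for b, v in zip(coefs, values))
        points.append((x, 1 / (1 + math.exp(-z))))
    return points

# Hypothetical two-predictor model; sweep the first predictor from 20 to 60.
curve = sensitivity_curve(-3.0, [0.05, 0.4], [40.0, 2.0], 0, 20.0, 60.0)
for x, prob in curve:
    print(x, round(prob, 3))
```

The curve is steepest where the probability is near 50%, which is where small changes in the predictor move the score the most.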
Logistic Regression (WOE) Prediction Equation:
Step 8 - Model Validation - Development Data
So, how good is your model? Shown below
is what is traditionally called a LIFT CHART describing how well your model rank
orders your data. This table is calculated by first sorting the data based on the
model's probability value from high to low and then counting how many '1s' and '0s'
of your dependent variable are found in each 10% bin or interval. For each of the 10% intervals, the table shows
the difference between the cumulative percentage of '1s' and the cumulative percentage of '0s'.
The maximum difference
across the 10 buckets is called the KS value. The higher the KS (with a maximum
equal to 100), the more predictive your model. Look to increase this value
in subsequent models. The earlier
this maximum occurs in terms of deciles, the better. Best practices dictate that
a validation analysis be done on the data used for model development. It is also
best practice to "hold out" a portion of the data from model development to see
how the score performs on an independent sample. The "hold out" validation is optional
here and can be performed in Step 9.
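The decile table and KS value described above can be sketched in Python like this. It assumes the KS is the largest gap between the cumulative percentage of 1's and the cumulative percentage of 0's across the ten score-ordered bins; the function name and the toy data are hypothetical.

```python
def ks_table(scores, outcomes, bins=10):
    """Return (rows, ks); each row is (bin, cum % of 1's, cum % of 0's, difference)."""
    # Sort record positions from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    total_ones = sum(outcomes)
    total_zeros = n - total_ones
    cum_ones = cum_zeros = 0
    rows = []
    for b in range(bins):
        idx = order[b * n // bins:(b + 1) * n // bins]
        ones = sum(outcomes[i] for i in idx)
        cum_ones += ones
        cum_zeros += len(idx) - ones
        pct_ones = 100 * cum_ones / total_ones
        pct_zeros = 100 * cum_zeros / total_zeros
        rows.append((b + 1, pct_ones, pct_zeros, pct_ones - pct_zeros))
    ks = max(abs(row[3]) for row in rows)
    return rows, ks

# Hypothetical perfect model: the ten highest scores are all 1's.
scores = [i / 20 for i in range(20)]
perfect_rows, perfect_ks = ks_table(scores, [0] * 10 + [1] * 10)
# Hypothetical useless model: 1's and 0's alternate down the score ranking.
random_rows, random_ks = ks_table(scores, [0, 1] * 10)
print(perfect_ks, random_ks)
```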
Shown below is a graph of the cumulative % of 1's (top) and the cumulative
% of 0's in your validation table. The greater the spread between these lines, the
better your model is able to distinguish between the two populations of interest.
Model building is seldom a one time event. Typically as many as 5 to 10 different
models are developed before a final
one is selected. The list below collects the results of each model, along with its
validation results, for your review until you
select your final model. To clear the list, simply hit the button below and begin
your modeling efforts again.
Model History
The results below show the model's probability (score) for all records in the data
along with the outcome (dependent variable) listed in the same sequence as the original
input file. Highlighted are those records that fall in the first decile - i.e. the
group with the highest (top 10%) score.
Probability Score and Outcome for Development Sample
Step 9 - Model Validation - Hold Out Sample (Optional)
After you come up with your final model above, it is
recommended that you perform an independent validation using hold out data. The
data should be in the
exact same form as your original model development data, with a
target outcome variable coded as a 0 or 1 in the first field, followed by the potential predictor
variables. If your development
data had 20 potential predictor variables in it, then your holdout sample should
have the same 20 variables in it and in the same order, regardless of the number
of variables in your final model. Your maximum KS value should be very close to
what you obtained for your development dataset. If you coded any missing values
as -999, then a mean substitution procedure will be applied. Please use the following buttons
to load your hold out sample for validation analysis.
Read In Results:
Missing Data Results (coded as -999)
Variable Name (missing obs) [percent missing]
If you are satisfied that your data has been read in correctly, then press the button
below to create your validation results for the hold out sample. The results will
be shown in the table and graph below.
Probability Score and Outcome for Holdout Sample
Step 10 - Profile the Results (Development Sample)
The next step is to describe in simple terms what the potential
responders and nonresponders look like, or what those individuals predicted to have good
or bad payment behavior look like. This can be accomplished by simply taking the means of
the predictors for those accounts that fall in the top and bottom deciles, as shown below.
Profile Means 1st Decile (high scores) Profile Means Last Decile (low scores)
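A profile of this kind can be sketched in Python by sorting records on their score and averaging each predictor within the top and bottom 10%. The function and column names are hypothetical.

```python
def decile_profile(scores, columns):
    """Return (top decile means, bottom decile means) for each predictor."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    size = max(1, len(order) // 10)
    top, bottom = order[:size], order[-size:]

    def means(idx):
        return {name: sum(col[i] for i in idx) / len(idx)
                for name, col in columns.items()}

    return means(top), means(bottom)

# Hypothetical data: 100 records whose score rises with age.
scores = [i / 100 for i in range(100)]
top_means, bottom_means = decile_profile(scores, {"age": [float(i) for i in range(100)]})
print(top_means, bottom_means)
```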
Step 11 - Apply the Model to New Data
If your modeling efforts were successful, the next step might be to gather new data for
those variables that were found statistically important and score the new records with the
prediction algorithm. This can easily be done in Excel if you download and use the
logistic equation scoring code in Step 7. In applying the model to production data, it is recommended you have some
code in place to cap attribute values (say at some maximum value such as the 99th percentile) to guard against errors
in data preparation for production.
Also, some code should be added to handle any missing data. Finally, make sure that
the population you are scoring "looks the same" from a distributional standpoint
as the data from which your model was developed, or your results may be poor.
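For those scoring outside Excel, here is a minimal Python sketch of production scoring with the two safeguards mentioned above: missing values are replaced by the development-sample mean, and values are capped at a maximum such as the development-sample 99th percentile. The model dictionary layout, the predictor name, and every number in it are hypothetical.

```python
import math

MISSING = -999  # same missing-value code used during development

def score_record(record, model):
    """Score one record, imputing missing values and capping extreme ones."""
    z = model["intercept"]
    for name, spec in model["predictors"].items():
        x = record.get(name, MISSING)
        if x == MISSING:
            x = spec["mean"]      # mean from the development sample
        x = min(x, spec["cap"])   # e.g. the development 99th percentile
        z += spec["coef"] * x
    return 1 / (1 + math.exp(-z))

# Hypothetical one-predictor model.
model = {
    "intercept": -2.0,
    "predictors": {"income": {"coef": 0.03, "mean": 50.0, "cap": 100.0}},
}
print(round(score_record({"income": 500.0}, model), 4))
print(round(score_record({"income": MISSING}, model), 4))
```

Storing the means and caps from the development data alongside the coefficients keeps production scoring consistent with how the model was built.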
Thanks for visiting Jeff Morrison's "Logistic Regression ONLINE"!