Saturday, May 7, 2016

Machine Learning Data Analysis: Lasso Regression Analysis

The analysis was generated using SAS Studio, and the data were sourced from the UCI Machine Learning Repository. The objective was to complete a LASSO regression analysis with the GLMSELECT procedure as a data mining technique to predict the probability of a person defaulting on his or her credit card payments.
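
For context, the lasso adds an L1 penalty to the ordinary least-squares criterion, which shrinks some coefficients exactly to zero and is what lets the procedure select a subset of the predictors. A generic sketch of the objective (with response y, predictors x, coefficients β, and penalty weight λ, none of which are tied to the variable names below):

\[
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0,\ \beta}\ \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2} + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert
\]

Larger values of λ push more coefficients to zero, trading a little bias for lower variance.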

ATTRIBUTE INFORMATION:
  • There are 30,000 observations.
  • The binary variable default_payment_next_month (Yes = 2, No = 1) is the response variable.
  • There are 23 explanatory variables: 
  • LIMIT_BAL: Amount of the given credit (dollars); it includes both the individual consumer's credit and his/her family (supplementary) credit.
  • SEX: (1 = male; 2 = female). 
  • EDUCATION: (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
  • MARRIAGE:  (1 = married; 2 = single; 3 = others). 
  • AGE: (years).
  • History of past payment: REPAY_SEP, REPAY_AUG, REPAY_JUL, REPAY_JUN, REPAY_MAY, REPAY_APR. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 
  • Amount of bill statement: BILL_SEP, BILL_AUG, BILL_JUL, BILL_JUN, BILL_MAY, BILL_APR.
  • Amount of previous payment: PAY_AMT_SEP, PAY_AMT_AUG, PAY_AMT_JUL, PAY_AMT_JUN, PAY_AMT_MAY, PAY_AMT_APR.

SAS CODE

***********************************************************************
READ IN THE DATA
***********************************************************************;
PROC IMPORT DATAFILE="/home/mst07221/default of credit card clients.xls"
DBMS=XLS
OUT=WORK.creditcard;
GETNAMES=YES;
RUN;

**********************************************************************
DATA MANAGEMENT
*********************************************************************;
DATA NEW;
set work.creditcard;
 
* delete observations with missing data;
 if cmiss(of _all_) then delete;
 RUN;
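
**********************************************************************
OPTIONAL CHECK (not part of the original program) - a quick sketch to
confirm what the import produced and how many missing values remain
after the DATA step above
*********************************************************************;
PROC CONTENTS DATA=NEW;
RUN;

PROC MEANS DATA=NEW N NMISS;
RUN;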

ods graphics on;

***********************************************************************
Split data randomly into test and training data
SRS = simple random sampling: units are selected with equal probability and without replacement.
OUTALL = keeps all observations and flags those selected as training data (70%). Non-flagged observations (30%) form the testing data.
***********************************************************************;
PROC SURVEYSELECT DATA=NEW OUT=traintest SEED=123
  SAMPRATE=0.7 
  METHOD=SRS 
  OUTALL;
RUN;   
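
**********************************************************************
OPTIONAL CHECK (not part of the original program) - a quick tally of
the Selected flag created by OUTALL to confirm the 70/30 split
*********************************************************************;
PROC FREQ DATA=traintest;
  TABLES selected;
RUN;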

**********************************************************************
PROC GLMSELECT automatically standardizes the predictor variables (mean=0, std=1).
LASSO multiple regression estimated with the LAR algorithm and k=10 fold cross validation.
The LAR (Least Angle Regression) algorithm produces a sequence of regression models where one parameter is 
added at each step, terminating at the full least-squares solution when all parameters have entered the model.
CHOOSE=CV selects the model that minimizes the k-fold cross validation predicted residual sum of squares (CV PRESS).
STOP=NONE means selection proceeds until all the specified effects are in the model.
CVMETHOD=RANDOM(10) requests 10-fold cross validation, with the training data randomly partitioned into 10 subsets.
 *********************************************************************;
PROC GLMSELECT DATA=traintest 
  PLOTS=all
  SEED=123;
  PARTITION ROLE=selected(train='1' test='0');

MODEL default_payment_next_month = LIMIT_BAL SEX EDUCATION MARRIAGE AGE
      REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR
      BILL_SEP BILL_AUG BILL_JUL BILL_JUN BILL_APR BILL_MAY
      PAY_AMT_SEP PAY_AMT_AUG PAY_AMT_JUL PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR
      / SELECTION=LAR(CHOOSE=CV STOP=NONE) CVMETHOD=RANDOM(10);

RUN;
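
**********************************************************************
OPTIONAL SKETCH (not part of the original program) - one way to recover
the observation-level predictions behind the ASE values reported below.
Re-running the same model with an OUTPUT statement writes predicted
values and residuals for every observation. The data set and variable
names used here (work.scored, pred_default, resid, sq_error) are
illustrative only
*********************************************************************;
PROC GLMSELECT DATA=traintest SEED=123;
  PARTITION ROLE=selected(train='1' test='0');
  MODEL default_payment_next_month = LIMIT_BAL SEX EDUCATION MARRIAGE AGE
        REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR
        BILL_SEP BILL_AUG BILL_JUL BILL_JUN BILL_APR BILL_MAY
        PAY_AMT_SEP PAY_AMT_AUG PAY_AMT_JUL PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR
        / SELECTION=LAR(CHOOSE=CV STOP=NONE) CVMETHOD=RANDOM(10);
  OUTPUT OUT=work.scored PREDICTED=pred_default RESIDUAL=resid;
RUN;

* square the residuals and average them by the Selected flag - the mean
  for Selected=0 should be close to the ASE (Test) value reported below;
DATA work.scored;
  set work.scored;
  sq_error = resid**2;
RUN;

PROC MEANS DATA=work.scored MEAN;
  CLASS selected;
  VAR sq_error;
RUN;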

RESULTS

The SURVEYSELECT Procedure
Selection Method       Simple Random Sampling
Input Data Set         NEW
Random Number Seed     123
Sampling Rate          0.7
Sample Size            21000
Selection Probability  0.7
Sampling Weight        0
Output Data Set        TRAINTEST

The GLMSELECT Procedure
Data Set                                  WORK.TRAINTEST
Dependent Variable                        default_payment_next_month
Selection Method                          LAR
Stop Criterion                            None
Choose Criterion                          Cross Validation
Cross Validation Method                   Random
Cross Validation Fold                     10
Effect Hierarchy Enforced                 None
Random Number Seed                        123
Number of Observations Read               30000
Number of Observations Used               30000
Number of Observations Used for Training  21000
Number of Observations Used for Testing   9000

Dimensions
Number of Effects                         24
Number of Parameters                      24

The GLMSELECT Procedure
LAR Selection Summary
Step  Effect Entered   Number Effects In  ASE     Test ASE  CV PRESS
 0    Intercept         1                 0.1726  0.1715    3625.1576
 1    REPAY_SEP         2                 0.1568  0.1582    3218.0842
 2    REPAY_AUG         3                 0.1545  0.1565    3203.9536
 3    REPAY_JUL         4                 0.1536  0.1559    3200.9871
 4    LIMIT_BAL         5                 0.1534  0.1558    3192.1540
 5    BILL_SEP          6                 0.1527  0.1552    3169.6074
 6    REPAY_JUN         7                 0.1519  0.1546    3168.9155
 7    PAY_AMT_SEP       8                 0.1517  0.1545    3165.0708
 8    PAY_AMT_AUG       9                 0.1512  0.1541    3162.7693
 9    MARRIAGE         10                 0.1509  0.1540    3158.8279
10    PAY_AMT_JUN      11                 0.1509  0.1539    3158.1246
11    AGE              12                 0.1506  0.1538    3156.8583
12    SEX              13                 0.1505  0.1537    3155.7965
13    PAY_AMT_APR      14                 0.1505  0.1537    3155.7168
14    PAY_AMT_MAY      15                 0.1504  0.1537    3155.8478
15    EDUCATION        16                 0.1501  0.1536    3153.4683*
16    REPAY_MAY        17                 0.1501  0.1536    3153.9803
17    PAY_AMT_JUL      18                 0.1499  0.1537    3154.1841
18    BILL_APR         19                 0.1499  0.1537    3153.8668
19    BILL_JUL         20                 0.1499  0.1538    3153.7013
20    REPAY_APR        21                 0.1499  0.1539    3154.3260
21    BILL_MAY         22                 0.1499  0.1539    3154.9558
22    BILL_JUN         23                 0.1499  0.1539    3155.3386
23    BILL_AUG         24                 0.1499  0.1539    3155.5273

* Optimal Value of Criterion
Selection stopped because all effects are in the final model.
[Coefficient Progression plot: standardized coefficients and the CHOOSE= criterion by effect sequence]
[Fit Criteria plot: selection criteria by selection step]
[Average Squared Error plot: training, validation, and test ASE by effect sequence]

The GLMSELECT Procedure
Selected Model
The selected model, based on Cross Validation, is the model at Step 15.
Effects: Intercept LIMIT_BAL SEX EDUCATION MARRIAGE AGE REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN BILL_SEP PAY_AMT_SEP PAY_AMT_AUG PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR
Analysis of Variance
Source              DF  Sum of Squares  Mean Square  F Value
Model               15       472.27194     31.48480   209.57
Error            20984      3152.53930      0.15024
Corrected Total  20999      3624.81124
Root MSE        0.38760
Dependent Mean  1.22181
R-Square        0.1303
Adj R-Sq        0.1297
AIC             -18789
AICC            -18789
SBC             -39663
ASE (Train)     0.15012
ASE (Test)      0.15360
CV PRESS        3153.46826
Parameter Estimates
Parameter    DF  Estimate
Intercept     1   1.284516
LIMIT_BAL     1  -3.13948E-8
SEX           1  -0.006750
EDUCATION     1  -0.005595
MARRIAGE      1  -0.015051
AGE           1   0.000651
REPAY_SEP     1   0.096354
REPAY_AUG     1   0.018638
REPAY_JUL     1   0.011130
REPAY_JUN     1   0.010095
BILL_SEP      1  -0.000000395
PAY_AMT_SEP   1  -0.000000547
PAY_AMT_AUG   1  -0.000000467
PAY_AMT_JUN   1  -0.000000261
PAY_AMT_MAY   1  -9.216282E-8
PAY_AMT_APR   1  -8.440351E-8
SUMMARY OF FINDINGS
  • A lasso regression analysis was conducted to identify a subset of variables, from a pool of 23 predictor variables, that best predicted the possibility of a credit card payment default. The binary response variable (coded 1/2) is treated as quantitative by PROC GLMSELECT.
  • Categorical predictors included: SEX EDUCATION MARRIAGE REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR.
  • Quantitative predictor variables included: LIMIT_BAL AGE BILL_SEP BILL_AUG BILL_JUL BILL_JUN BILL_APR BILL_MAY PAY_AMT_SEP PAY_AMT_AUG PAY_AMT_JUL PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR.
  • All predictor variables were standardized to have a mean of zero and a standard deviation of one in PROC GLMSELECT.
  • Data were randomly split into a training set that included 70% of the observations (N=21,000) and a test set that included 30% of the observations (N=9,000).
  • The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
  • The LAR Selection Summary table shows the training average squared error (ASE) declines as variables are added to the model, indicating that in-sample prediction accuracy improves with each added variable, while the test ASE and CV PRESS stop improving after step 15. This pattern reflects the bias-variance trade-off (both criteria are written out after this list).
  • The model at step 15, the step at which EDUCATION entered, has the lowest CV PRESS value (the smallest cross-validated predicted residual sum of squares), indicating that it is the best model selected by the procedure. At this point the estimated prediction error for new data is lowest. This is evident in the Coefficient Progression and Fit Criteria plots.
  • The Coefficient Progression plot shows REPAY_SEP (the September repayment status) had the largest regression coefficient and therefore the greatest influence on the predicted probability of defaulting on next month's credit card payment. Its relative importance is substantially higher than that of the other predictor variables. Ten of the fifteen retained predictor variables are negatively associated with the possibility of a credit card payment default.
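
For reference, the two criteria quoted above can be written out. This is a generic sketch: \(\hat{y}_i\) denotes the selected model's prediction for observation i, and \(\hat{y}_i^{(-k)}\) the prediction for observation i from the model fit with fold k held out (K = 10 here).

\[
\text{ASE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2,
\qquad
\text{CV PRESS} = \sum_{k=1}^{K}\ \sum_{i \in \text{fold } k}\bigl(y_i - \hat{y}_i^{(-k)}\bigr)^2
\]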



Reference
Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.
