Saturday, May 7, 2016

Machine Learning Data Analysis: Lasso Regression Analysis

The analysis was generated using SAS Studio, and the data were sourced from the UCI Machine Learning Repository. The objective was to complete a LASSO regression analysis with the GLMSELECT procedure as a data mining technique to predict the probability of a person defaulting on his or her credit card payments.
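
For context, the lasso adds an L1 penalty to the ordinary least-squares criterion, which shrinks some coefficients exactly to zero and is what lets the procedure select a subset of the predictors. A generic sketch of the objective (with response y, predictors x, coefficients β, and penalty weight λ, none of which are tied to the variable names below):

\[
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0,\ \beta}\ \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2} + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert
\]

Larger values of λ push more coefficients to zero, trading a little bias for lower variance.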

ATTRIBUTE INFORMATION:
  • There are 30,000 observations.
  • The binary variable default_payment_next_month (Yes = 2, No = 1) is the response variable.
  • There are 23 explanatory variables: 
  • LIMIT_BAL: Amount of the given credit (dollars); it includes both the individual consumer's credit and his/her family (supplementary) credit.
  • SEX: (1 = male; 2 = female). 
  • EDUCATION: (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
  • MARRIAGE:  (1 = married; 2 = single; 3 = others). 
  • AGE: (years).
  • History of past payment: REPAY_SEP, REPAY_AUG, REPAY_JUL, REPAY_JUN, REPAY_MAY, REPAY_APR. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 
  • Amount of bill statement: BILL_SEP, BILL_AUG, BILL_JUL, BILL_JUN, BILL_MAY, BILL_APR.
  • Amount of previous payment: PAY_AMT_SEP, PAY_AMT_AUG, PAY_AMT_JUL, PAY_AMT_JUN, PAY_AMT_MAY, PAY_AMT_APR.

SAS CODE

***********************************************************************
READ IN THE DATA
***********************************************************************;
PROC IMPORT DATAFILE="/home/mst07221/default of credit card clients.xls"
DBMS=XLS
OUT=WORK.creditcard;
GETNAMES=YES;
RUN;

**********************************************************************
DATA MANAGEMENT
*********************************************************************;
DATA NEW;
set work.creditcard;
 
* delete observations with missing data;
 if cmiss(of _all_) then delete;
 RUN;
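
**********************************************************************
OPTIONAL CHECK (not part of the original program) - a quick sketch to
confirm what the import produced and how many missing values remain
after the DATA step above
*********************************************************************;
PROC CONTENTS DATA=NEW;
RUN;

PROC MEANS DATA=NEW N NMISS;
RUN;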

ods graphics on;

***********************************************************************
Split data randomly into test and training data
SRS = simple random sampling: units are selected with equal probability and without replacement.
OUTALL = keeps all observations and flags those selected as training data (70%). Non-flagged observations (30%) form the testing data.
***********************************************************************;
PROC SURVEYSELECT DATA=NEW OUT=traintest SEED=123
  SAMPRATE=0.7 
  METHOD=SRS 
  OUTALL;
RUN;   
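
**********************************************************************
OPTIONAL CHECK (not part of the original program) - a quick tally of
the Selected flag created by OUTALL to confirm the 70/30 split
*********************************************************************;
PROC FREQ DATA=traintest;
  TABLES selected;
RUN;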

**********************************************************************
PROC GLMSELECT automatically standardizes the predictor variables (mean=0, std=1).
LASSO multiple regression estimated with the LAR algorithm and k=10 fold cross validation.
The LAR (Least Angle Regression) algorithm produces a sequence of regression models where one parameter is 
added at each step, terminating at the full least-squares solution when all parameters have entered the model.
CHOOSE=CV selects the model that minimizes the k-fold cross validation predicted residual sum of squares (CV PRESS).
STOP=NONE means selection proceeds until all the specified effects are in the model.
CVMETHOD=RANDOM(10) requests 10-fold cross validation, with the training data randomly partitioned into 10 subsets.
 *********************************************************************;
PROC GLMSELECT DATA=traintest 
  PLOTS=all
  SEED=123;
  PARTITION ROLE=selected(train='1' test='0');

MODEL default_payment_next_month = LIMIT_BAL SEX EDUCATION MARRIAGE AGE
      REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR
      BILL_SEP BILL_AUG BILL_JUL BILL_JUN BILL_APR BILL_MAY
      PAY_AMT_SEP PAY_AMT_AUG PAY_AMT_JUL PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR
      / SELECTION=LAR(CHOOSE=CV STOP=NONE) CVMETHOD=RANDOM(10);

RUN;
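
**********************************************************************
OPTIONAL SKETCH (not part of the original program) - one way to recover
the observation-level predictions behind the ASE values reported below.
Re-running the same model with an OUTPUT statement writes predicted
values and residuals for every observation. The data set and variable
names used here (work.scored, pred_default, resid, sq_error) are
illustrative only
*********************************************************************;
PROC GLMSELECT DATA=traintest SEED=123;
  PARTITION ROLE=selected(train='1' test='0');
  MODEL default_payment_next_month = LIMIT_BAL SEX EDUCATION MARRIAGE AGE
        REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR
        BILL_SEP BILL_AUG BILL_JUL BILL_JUN BILL_APR BILL_MAY
        PAY_AMT_SEP PAY_AMT_AUG PAY_AMT_JUL PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR
        / SELECTION=LAR(CHOOSE=CV STOP=NONE) CVMETHOD=RANDOM(10);
  OUTPUT OUT=work.scored PREDICTED=pred_default RESIDUAL=resid;
RUN;

* square the residuals and average them by the Selected flag - the mean
  for Selected=0 should be close to the ASE (Test) value reported below;
DATA work.scored;
  set work.scored;
  sq_error = resid**2;
RUN;

PROC MEANS DATA=work.scored MEAN;
  CLASS selected;
  VAR sq_error;
RUN;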

RESULTS

The SURVEYSELECT Procedure
Selection Method       Simple Random Sampling
Input Data Set         NEW
Random Number Seed     123
Sampling Rate          0.7
Sample Size            21000
Selection Probability  0.7
Sampling Weight        0
Output Data Set        TRAINTEST

The GLMSELECT Procedure
Data Set                                  WORK.TRAINTEST
Dependent Variable                        default_payment_next_month
Selection Method                          LAR
Stop Criterion                            None
Choose Criterion                          Cross Validation
Cross Validation Method                   Random
Cross Validation Fold                     10
Effect Hierarchy Enforced                 None
Random Number Seed                        123
Number of Observations Read               30000
Number of Observations Used               30000
Number of Observations Used for Training  21000
Number of Observations Used for Testing   9000

Dimensions
Number of Effects                         24
Number of Parameters                      24

The GLMSELECT Procedure
LAR Selection Summary
Step  Effect Entered   Number Effects In  ASE     Test ASE  CV PRESS
 0    Intercept         1                 0.1726  0.1715    3625.1576
 1    REPAY_SEP         2                 0.1568  0.1582    3218.0842
 2    REPAY_AUG         3                 0.1545  0.1565    3203.9536
 3    REPAY_JUL         4                 0.1536  0.1559    3200.9871
 4    LIMIT_BAL         5                 0.1534  0.1558    3192.1540
 5    BILL_SEP          6                 0.1527  0.1552    3169.6074
 6    REPAY_JUN         7                 0.1519  0.1546    3168.9155
 7    PAY_AMT_SEP       8                 0.1517  0.1545    3165.0708
 8    PAY_AMT_AUG       9                 0.1512  0.1541    3162.7693
 9    MARRIAGE         10                 0.1509  0.1540    3158.8279
10    PAY_AMT_JUN      11                 0.1509  0.1539    3158.1246
11    AGE              12                 0.1506  0.1538    3156.8583
12    SEX              13                 0.1505  0.1537    3155.7965
13    PAY_AMT_APR      14                 0.1505  0.1537    3155.7168
14    PAY_AMT_MAY      15                 0.1504  0.1537    3155.8478
15    EDUCATION        16                 0.1501  0.1536    3153.4683*
16    REPAY_MAY        17                 0.1501  0.1536    3153.9803
17    PAY_AMT_JUL      18                 0.1499  0.1537    3154.1841
18    BILL_APR         19                 0.1499  0.1537    3153.8668
19    BILL_JUL         20                 0.1499  0.1538    3153.7013
20    REPAY_APR        21                 0.1499  0.1539    3154.3260
21    BILL_MAY         22                 0.1499  0.1539    3154.9558
22    BILL_JUN         23                 0.1499  0.1539    3155.3386
23    BILL_AUG         24                 0.1499  0.1539    3155.5273

* Optimal Value of Criterion
Selection stopped because all effects are in the final model.
[Coefficient Progression plot: standardized coefficients and the CHOOSE= criterion by effect sequence]
[Fit Criteria plot: selection criteria by selection step]
[Average Squared Error plot: training, validation, and test ASE by effect sequence]

The GLMSELECT Procedure
Selected Model
The selected model, based on Cross Validation, is the model at Step 15.
Effects: Intercept LIMIT_BAL SEX EDUCATION MARRIAGE AGE REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN BILL_SEP PAY_AMT_SEP PAY_AMT_AUG PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR
Analysis of Variance
Source              DF  Sum of Squares  Mean Square  F Value
Model               15       472.27194     31.48480   209.57
Error            20984      3152.53930      0.15024
Corrected Total  20999      3624.81124
Root MSE        0.38760
Dependent Mean  1.22181
R-Square        0.1303
Adj R-Sq        0.1297
AIC             -18789
AICC            -18789
SBC             -39663
ASE (Train)     0.15012
ASE (Test)      0.15360
CV PRESS        3153.46826
Parameter Estimates
Parameter    DF  Estimate
Intercept     1   1.284516
LIMIT_BAL     1  -3.13948E-8
SEX           1  -0.006750
EDUCATION     1  -0.005595
MARRIAGE      1  -0.015051
AGE           1   0.000651
REPAY_SEP     1   0.096354
REPAY_AUG     1   0.018638
REPAY_JUL     1   0.011130
REPAY_JUN     1   0.010095
BILL_SEP      1  -0.000000395
PAY_AMT_SEP   1  -0.000000547
PAY_AMT_AUG   1  -0.000000467
PAY_AMT_JUN   1  -0.000000261
PAY_AMT_MAY   1  -9.216282E-8
PAY_AMT_APR   1  -8.440351E-8
SUMMARY OF FINDINGS
  • A lasso regression analysis was conducted to identify a subset of variables, from a pool of 23 predictor variables, that best predicted the possibility of a credit card payment default. The binary response variable (coded 1/2) is treated as quantitative by PROC GLMSELECT.
  • Categorical predictors included: SEX EDUCATION MARRIAGE REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR.
  • Quantitative predictor variables included: LIMIT_BAL AGE BILL_SEP BILL_AUG BILL_JUL BILL_JUN BILL_APR BILL_MAY PAY_AMT_SEP PAY_AMT_AUG PAY_AMT_JUL PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR.
  • All predictor variables were standardized to have a mean of zero and a standard deviation of one in PROC GLMSELECT.
  • Data were randomly split into a training set that included 70% of the observations (N=21,000) and a test set that included 30% of the observations (N=9,000).
  • The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
  • The LAR Selection Summary table shows the training average squared error (ASE) declines as variables are added to the model, indicating that in-sample prediction accuracy improves with each added variable, while the test ASE and CV PRESS stop improving after step 15. This pattern reflects the bias-variance trade-off (both criteria are written out after this list).
  • The model at step 15, the step at which EDUCATION entered, has the lowest CV PRESS value (the smallest cross-validated predicted residual sum of squares), indicating that it is the best model selected by the procedure. At this point the estimated prediction error for new data is lowest. This is evident in the Coefficient Progression and Fit Criteria plots.
  • The Coefficient Progression plot shows REPAY_SEP (the September repayment status) had the largest regression coefficient and therefore the greatest influence on the predicted probability of defaulting on next month's credit card payment. Its relative importance is substantially higher than that of the other predictor variables. Ten of the fifteen retained predictor variables are negatively associated with the possibility of a credit card payment default.
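
For reference, the two criteria quoted above can be written out. This is a generic sketch: \(\hat{y}_i\) denotes the selected model's prediction for observation i, and \(\hat{y}_i^{(-k)}\) the prediction for observation i from the model fit with fold k held out (K = 10 here).

\[
\text{ASE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2,
\qquad
\text{CV PRESS} = \sum_{k=1}^{K}\ \sum_{i \in \text{fold } k}\bigl(y_i - \hat{y}_i^{(-k)}\bigr)^2
\]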



Reference
Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.
