ATTRIBUTE INFORMATION:
- There are 30,000 observations.
- The binary variable, default_payment_next_month (Yes=2, No=1) is the response variable.
- There are 23 explanatory variables:
- LIMIT_BAL: Amount of the given credit (dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
- SEX: (1 = male; 2 = female).
- EDUCATION: (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- MARRIAGE: (1 = married; 2 = single; 3 = others).
- AGE: (year).
- History of past payment: REPAY_SEP, REPAY_AUG, REPAY_JUL, REPAY_JUN, REPAY_MAY, REPAY_APR. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
- Amount of bill statement: BILL_SEP, BILL_AUG, BILL_JUL, BILL_JUN, BILL_MAY, BILL_APR.
- Amount of previous payment: PAY_AMT_SEP, PAY_AMT_AUG, PAY_AMT_JUL, PAY_AMT_JUN, PAY_AMT_MAY, PAY_AMT_APR.
SAS CODE
***********************************************************************
READ IN THE DATA
***********************************************************************;
PROC IMPORT DATAFILE="/home/mst07221/default of credit card clients.xls"
DBMS=XLS
OUT=WORK.creditcard;
GETNAMES=YES;
RUN;
**********************************************************************
DATA MANAGEMENT
*********************************************************************;
DATA NEW;
set work.creditcard;
**if bio_sex=1 then male=1;
**if bio_sex=2 then male=0;
* delete observations with missing data;
if cmiss(of _all_) then delete;
RUN;
ods graphics on;
***********************************************************************
Split data randomly into test and training data
SRS = simple random sampling, selects units with equal probability and without replacement.
OUTALL = keeps all observations and flags the 70% selected as training data (Selected=1), the non-flagged 30% (Selected=0) are the testing data.
***********************************************************************;
PROC SURVEYSELECT DATA=NEW OUT=traintest SEED=123
SAMPRATE=0.7
METHOD=SRS
OUTALL;
RUN;
**********************************************************************
PROC GLMSELECT automatically standardizes the predictor variables (mean=0, std=1).
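LASSO-type multiple regression via the LAR algorithm with k=10 fold cross validation (note that the code below requests SELECTION=LAR, in GLMSELECT the lasso itself is requested with SELECTION=LASSO)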
The LAR (Least Angle Regression) algorithm produces a sequence of regression models where one parameter is
added at each step, terminating at the full least-squares solution when all parameters have entered the model.
CHOOSE=CV, chooses the model with the smallest k-fold cross validation predicted residual sum of squares (CV PRESS)
STOP=NONE, the selection proceeds until all the specified effects are in the model.
CVMETHOD=RANDOM(10), requests 10-fold cross validation where the training data is randomly partitioned into 10 subsets
*********************************************************************;
PROC GLMSELECT DATA=traintest
PLOTS=all
SEED=123;
PARTITION ROLE=selected(train='1' test='0');
MODEL default_payment_next_month = LIMIT_BAL SEX EDUCATION MARRIAGE AGE
REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR
BILL_SEP BILL_AUG BILL_JUL BILL_JUN BILL_APR BILL_MAY
PAY_AMT_SEP PAY_AMT_AUG PAY_AMT_JUL PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR
/ SELECTION=LAR(CHOOSE=CV STOP=NONE) CVMETHOD=RANDOM(10);
RUN;
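The call above requests SELECTION=LAR. As a hedged sketch, assuming the same traintest data set and predictor list, the lasso proper would be requested by changing only the selection keyword. This variant was not run for this post, so no results are shown for it.

```sas
* Sketch only: the same model as above with the lasso requested explicitly;
* SELECTION=LASSO replaces SELECTION=LAR and everything else is unchanged;
PROC GLMSELECT DATA=traintest PLOTS=all SEED=123;
PARTITION ROLE=selected(train='1' test='0');
MODEL default_payment_next_month = LIMIT_BAL SEX EDUCATION MARRIAGE AGE
REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR
BILL_SEP BILL_AUG BILL_JUL BILL_JUN BILL_APR BILL_MAY
PAY_AMT_SEP PAY_AMT_AUG PAY_AMT_JUL PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR
/ SELECTION=LASSO(CHOOSE=CV STOP=NONE) CVMETHOD=RANDOM(10);
RUN;
```

With the lasso, a variable can leave the model at a later step as well as enter it, so the selection summary may differ from the LAR summary reported below.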
RESULTS
The SURVEYSELECT Procedure
Selection Method | Simple Random Sampling |
---|---|
Input Data Set | NEW |
Random Number Seed | 123 |
Sampling Rate | 0.7 |
Sample Size | 21000 |
Selection Probability | 0.7 |
Sampling Weight | 0 |
Output Data Set | TRAINTEST |
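A quick way to confirm the split, sketched here on the assumption that the flag variable added by OUTALL is named Selected (the same variable the PARTITION statement refers to), is a one-way frequency table:

```sas
* Count rows by the Selected flag that OUTALL adds (1 = training, 0 = test);
PROC FREQ DATA=traintest;
TABLES Selected;
RUN;
```

The two counts should match the training and testing observation counts that PROC GLMSELECT reports.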
The GLMSELECT Procedure
Data Set | WORK.TRAINTEST |
---|---|
Dependent Variable | default_payment_next_month |
Selection Method | LAR |
Stop Criterion | None |
Choose Criterion | Cross Validation |
Cross Validation Method | Random |
Cross Validation Fold | 10 |
Effect Hierarchy Enforced | None |
Random Number Seed | 123 |
Number of Observations Read | 30000 |
---|---|
Number of Observations Used | 30000 |
Number of Observations Used for Training | 21000 |
Number of Observations Used for Testing | 9000 |
Dimensions | |
---|---|
Number of Effects | 24 |
Number of Parameters | 24 |
The GLMSELECT Procedure
LAR Selection Summary (* = Optimal Value of Criterion)
Step | Effect Entered | Number Effects In | ASE | Test ASE | CV PRESS |
---|---|---|---|---|---|
0 | Intercept | 1 | 0.1726 | 0.1715 | 3625.1576 |
1 | REPAY_SEP | 2 | 0.1568 | 0.1582 | 3218.0842 |
2 | REPAY_AUG | 3 | 0.1545 | 0.1565 | 3203.9536 |
3 | REPAY_JUL | 4 | 0.1536 | 0.1559 | 3200.9871 |
4 | LIMIT_BAL | 5 | 0.1534 | 0.1558 | 3192.1540 |
5 | BILL_SEP | 6 | 0.1527 | 0.1552 | 3169.6074 |
6 | REPAY_JUN | 7 | 0.1519 | 0.1546 | 3168.9155 |
7 | PAY_AMT_SEP | 8 | 0.1517 | 0.1545 | 3165.0708 |
8 | PAY_AMT_AUG | 9 | 0.1512 | 0.1541 | 3162.7693 |
9 | MARRIAGE | 10 | 0.1509 | 0.1540 | 3158.8279 |
10 | PAY_AMT_JUN | 11 | 0.1509 | 0.1539 | 3158.1246 |
11 | AGE | 12 | 0.1506 | 0.1538 | 3156.8583 |
12 | SEX | 13 | 0.1505 | 0.1537 | 3155.7965 |
13 | PAY_AMT_APR | 14 | 0.1505 | 0.1537 | 3155.7168 |
14 | PAY_AMT_MAY | 15 | 0.1504 | 0.1537 | 3155.8478 |
15 | EDUCATION | 16 | 0.1501 | 0.1536 | 3153.4683* |
16 | REPAY_MAY | 17 | 0.1501 | 0.1536 | 3153.9803 |
17 | PAY_AMT_JUL | 18 | 0.1499 | 0.1537 | 3154.1841 |
18 | BILL_APR | 19 | 0.1499 | 0.1537 | 3153.8668 |
19 | BILL_JUL | 20 | 0.1499 | 0.1538 | 3153.7013 |
20 | REPAY_APR | 21 | 0.1499 | 0.1539 | 3154.3260 |
21 | BILL_MAY | 22 | 0.1499 | 0.1539 | 3154.9558 |
22 | BILL_JUN | 23 | 0.1499 | 0.1539 | 3155.3386 |
23 | BILL_AUG | 24 | 0.1499 | 0.1539 | 3155.5273 |
Selection stopped because all effects are in the final model. |
The GLMSELECT Procedure
Selected Model
The selected model, based on Cross Validation, is the model at Step 15.
Effects: Intercept LIMIT_BAL SEX EDUCATION MARRIAGE AGE REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN BILL_SEP PAY_AMT_SEP PAY_AMT_AUG PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value |
---|---|---|---|---|
Model | 15 | 472.27194 | 31.48480 | 209.57 |
Error | 20984 | 3152.53930 | 0.15024 | |
Corrected Total | 20999 | 3624.81124 | | |
Root MSE | 0.38760 |
---|---|
Dependent Mean | 1.22181 |
R-Square | 0.1303 |
Adj R-Sq | 0.1297 |
AIC | -18789 |
AICC | -18789 |
SBC | -39663 |
ASE (Train) | 0.15012 |
ASE (Test) | 0.15360 |
CV PRESS | 3153.46826 |
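As a worked check, the fit statistics above can be reproduced from the Analysis of Variance sums of squares and the 21,000 training observations. The symbol names below are shorthand introduced here, not SAS output labels.

```latex
\begin{align*}
R^2 &= \frac{SS_{\text{Model}}}{SS_{\text{Total}}}
     = \frac{472.27194}{3624.81124} \approx 0.1303 \\
\text{Root MSE} &= \sqrt{\frac{SS_{\text{Error}}}{DF_{\text{Error}}}}
     = \sqrt{\frac{3152.53930}{20984}} \approx 0.38760 \\
\text{ASE (Train)} &= \frac{SS_{\text{Error}}}{n_{\text{train}}}
     = \frac{3152.53930}{21000} \approx 0.15012
\end{align*}
```

Root MSE divides by the error degrees of freedom, while ASE divides by the number of training observations, which is why the two are not simply a square and its root.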
Parameter Estimates
Parameter | DF | Estimate |
---|---|---|
Intercept | 1 | 1.284516 |
LIMIT_BAL | 1 | -3.13948E-8 |
SEX | 1 | -0.006750 |
EDUCATION | 1 | -0.005595 |
MARRIAGE | 1 | -0.015051 |
AGE | 1 | 0.000651 |
REPAY_SEP | 1 | 0.096354 |
REPAY_AUG | 1 | 0.018638 |
REPAY_JUL | 1 | 0.011130 |
REPAY_JUN | 1 | 0.010095 |
BILL_SEP | 1 | -0.000000395 |
PAY_AMT_SEP | 1 | -0.000000547 |
PAY_AMT_AUG | 1 | -0.000000467 |
PAY_AMT_JUN | 1 | -0.000000261 |
PAY_AMT_MAY | 1 | -9.216282E-8 |
PAY_AMT_APR | 1 | -8.440351E-8 |
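To illustrate how these estimates combine into a prediction, the following DATA step sketch scores a single client. Every input value and the data set name score_example are hypothetical and chosen only for illustration. The estimates are copied from the table above.

```sas
* Sketch: score one hypothetical client with the Step 15 estimates;
* Every input value below is made up for illustration;
DATA score_example;
LIMIT_BAL=50000; SEX=2; EDUCATION=2; MARRIAGE=1; AGE=35;
REPAY_SEP=2; REPAY_AUG=2; REPAY_JUL=-1; REPAY_JUN=-1;
BILL_SEP=40000;
PAY_AMT_SEP=0; PAY_AMT_AUG=1500; PAY_AMT_JUN=1000; PAY_AMT_MAY=1000; PAY_AMT_APR=1000;
* Linear predictor = intercept plus the sum of estimate times value;
predicted = 1.284516
- 3.13948E-8*LIMIT_BAL - 0.006750*SEX - 0.005595*EDUCATION
- 0.015051*MARRIAGE + 0.000651*AGE
+ 0.096354*REPAY_SEP + 0.018638*REPAY_AUG
+ 0.011130*REPAY_JUL + 0.010095*REPAY_JUN
- 0.000000395*BILL_SEP - 0.000000547*PAY_AMT_SEP
- 0.000000467*PAY_AMT_AUG - 0.000000261*PAY_AMT_JUN
- 9.216282E-8*PAY_AMT_MAY - 8.440351E-8*PAY_AMT_APR;
RUN;
PROC PRINT DATA=score_example;
VAR predicted;
RUN;
```

Because the response is coded 1 = no default and 2 = default, a predicted value closer to 2 points toward default. The model is linear, so the prediction is not constrained to lie between 1 and 2.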
- A lasso regression analysis was conducted to identify the subset of variables, from a pool of 23 predictor variables, that best predicted a binary response variable (1 = no, 2 = yes) indicating a credit card payment default in the next month. The response was treated as quantitative for this analysis.
- Categorical predictors included: SEX EDUCATION MARRIAGE REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR.
- Quantitative predictor variables included: LIMIT_BAL AGE BILL_SEP BILL_AUG BILL_JUL BILL_JUN BILL_APR BILL_MAY PAY_AMT_SEP PAY_AMT_AUG PAY_AMT_JUL PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR.
- All predictor variables were standardized to have a mean of zero and a standard deviation of one in PROC GLMSELECT.
- Data were randomly split into a training set that included 70% of the observations (N=21,000) and a test set that included 30% of the observations (N=9,000).
- The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
- The LAR Selection Summary table shows that the training average squared error (ASE) declines from 0.1726 to 0.1499 as variables are added to the model. The test ASE falls to a minimum of 0.1536 at steps 15 and 16 and then rises slightly to 0.1539, so the variables added later increase variance without improving prediction. This is the bias-variance trade-off.
- The model at step 15, where EDUCATION enters, has the lowest CV PRESS value (3153.4683), which is the predicted residual sum of squares summed over the 10 cross validation folds of the training data, so it is the model selected by the procedure. The test ASE is also at its lowest at this step. This is evident in the Coefficient Progression and Fit Criteria plots.
- The Coefficient Progression plot shows that REPAY_SEP (the repayment status in September) had the largest regression coefficient and was therefore most strongly associated with defaulting on next month's credit card payment. Its relative importance is substantially higher than that of the other predictor variables. In the Parameter Estimates table, 10 of the 15 selected predictors (LIMIT_BAL, SEX, EDUCATION, MARRIAGE, BILL_SEP and the five selected PAY_AMT variables) have negative coefficients, meaning they are negatively associated with a credit card payment default.