ATTRIBUTE INFORMATION:
- There are 30,000 observations.
- The binary variable default_payment_next_month (Yes = 2, No = 1) is the response variable.
- There are 23 explanatory variables:
- LIMIT_BAL: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
- SEX: (1 = male; 2 = female).
- EDUCATION: (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- MARRIAGE: (1 = married; 2 = single; 3 = others).
- AGE: Age in years.
- History of past payment: REPAY_SEP, REPAY_AUG, REPAY_JUL, REPAY_JUN, REPAY_MAY, REPAY_APR. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
- Amount of bill statement: BILL_SEP, BILL_AUG, BILL_JUL, BILL_JUN, BILL_MAY, BILL_APR.
- Amount of previous payment: PAY_AMT_SEP, PAY_AMT_AUG, PAY_AMT_JUL, PAY_AMT_JUN, PAY_AMT_MAY, PAY_AMT_APR.
SAS CODE
/* Source File: default of credit card clients.xls */
/* Source Path: /home/mst07221 */
%web_drop_table(WORK.IMPORT1);
FILENAME REFFILE "/home/mst07221/default of credit card clients.xls" TERMSTR=CR;
PROC IMPORT DATAFILE=REFFILE
DBMS=XLS
OUT=WORK.IMPORT1;
GETNAMES=YES;
RUN;
ods graphics on;
PROC HPSPLIT DATA=WORK.IMPORT1 SEED=15531;
CLASS default_payment_next_month SEX EDUCATION MARRIAGE AGE
REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR;
MODEL default_payment_next_month = LIMIT_BAL SEX EDUCATION MARRIAGE
AGE REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR
BILL_SEP BILL_AUG BILL_JUL BILL_JUN BILL_APR BILL_MAY PAY_AMT_SEP
PAY_AMT_AUG PAY_AMT_JUL PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR;
GROW ENTROPY;
PRUNE COSTCOMPLEXITY;
RUN;
%web_open_table(WORK.IMPORT1);
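Before reading the tree output, it can help to confirm that the imported data match the codings listed under ATTRIBUTE INFORMATION. The following is a minimal sketch, not part of the original program: the format names (sexfmt, edufmt, marfmt, deffmt) are hypothetical, and it assumes PROC IMPORT read these columns as numeric.

```sas
/* Hypothetical formats that mirror the documented codings */
proc format;
  value sexfmt 1 = 'Male'
               2 = 'Female';
  value edufmt 1 = 'Graduate school'
               2 = 'University'
               3 = 'High school'
               4 = 'Others';
  value marfmt 1 = 'Married'
               2 = 'Single'
               3 = 'Others';
  value deffmt 1 = 'No'
               2 = 'Yes';
run;

/* Frequency tables for the coded variables; MISSING counts missing values as a level */
proc freq data=WORK.IMPORT1;
  tables SEX EDUCATION MARRIAGE default_payment_next_month / missing;
  format SEX sexfmt.
         EDUCATION edufmt.
         MARRIAGE marfmt.
         default_payment_next_month deffmt.;
run;
```

Any value outside the documented codes is printed unformatted, so unexpected levels stand out in the frequency tables.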
RESULTS
The HPSPLIT Procedure
Performance Information | |
---|---|
Execution Mode | Single-Machine |
Number of Threads | 2 |
Data Access Information | |||
---|---|---|---|
Data | Engine | Role | Path |
WORK.IMPORT1 | V9 | Input | On Client |
Model Information | |
---|---|
Split Criterion Used | Entropy |
Pruning Method | Cost-Complexity |
Subtree Evaluation Criterion | Cost-Complexity |
Number of Branches | 2 |
Maximum Tree Depth Requested | 10 |
Maximum Tree Depth Achieved | 10 |
Tree Depth | 7 |
Number of Leaves Before Pruning | 370 |
Number of Leaves After Pruning | 18 |
Model Event Level | 1 |
Number of Observations Read | 30000 |
---|---|
Number of Observations Used | 30000 |
Model-Based Confusion Matrix | |||
---|---|---|---|
Actual | Predicted 1 | Predicted 2 | Error Rate |
1 | 22281 | 1083 | 0.0464 |
2 | 4142 | 2494 | 0.6242 |
Model-Based Fit Statistics for Selected Tree | ||||||||
---|---|---|---|---|---|---|---|---|
N Leaves | ASE | Misclassification | Sensitivity | Specificity | Entropy | Gini | RSS | AUC |
18 | 0.1368 | 0.1742 | 0.9536 | 0.3758 | 0.6358 | 0.2736 | 8207.6 | 0.7302 |
Variable Importance | ||||
---|---|---|---|---|
Variable | Variable Label | Training Relative | Training Importance | Count |
REPAY_SEP | REPAY_SEP | 1.0000 | 39.9093 | 2 |
REPAY_AUG | REPAY_AUG | 0.4417 | 17.6297 | 1 |
AGE | AGE | 0.2023 | 8.0733 | 6 |
BILL_SEP | BILL_SEP | 0.1953 | 7.7932 | 1 |
REPAY_MAY | REPAY_MAY | 0.1458 | 5.8186 | 3 |
REPAY_APR | REPAY_APR | 0.1291 | 5.1537 | 2 |
REPAY_JUN | REPAY_JUN | 0.1199 | 4.7833 | 1 |
REPAY_JUL | REPAY_JUL | 0.0998 | 3.9824 | 1 |
- All 30,000 observations were used in the model because there were no missing values in the predictor variables.
- The Cost-Complexity Analysis table shows that the average squared error (ASE) is minimized when the number of leaves is 31.
- The Confusion Matrix shows a prediction accuracy of 95.4% (1 - 0.0464) for people who will not default on the next month's payment. Because the model event level is 1 (no default), this is the sensitivity of the model.
- The prediction accuracy for people who do default on the next month's payment is much lower, at 37.6% (1 - 0.6242). This is the specificity of the model.
- The Model-Based Fit Statistics table shows a misclassification rate of 0.1742 ((1,083 + 4,142) / 30,000). The numerator is the number of people who were misclassified in the Confusion Matrix.
- The ROC curve has an AUC of 0.73 on the training data, which is well short of a perfect fit (AUC = 1) although better than random guessing (AUC = 0.5).
- The Variable Importance table shows that the predictors REPAY_SEP and REPAY_AUG have the largest importance values. The two most recent repayment statuses are therefore the most important predictors of whether a person will default on the next credit card payment.
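The sensitivity, specificity, and misclassification figures quoted above can be checked by hand from the four cells of the confusion matrix. The DATA step below is a small sketch of that arithmetic; the data set name confusion_check and the variable names are hypothetical, and the counts are copied from the Model-Based Confusion Matrix with level 1 (no default) treated as the event.

```sas
data confusion_check;
  /* Cell counts from the Model-Based Confusion Matrix */
  n11 = 22281;  /* actual 1, predicted 1 */
  n12 = 1083;   /* actual 1, predicted 2 */
  n21 = 4142;   /* actual 2, predicted 1 */
  n22 = 2494;   /* actual 2, predicted 2 */

  total = n11 + n12 + n21 + n22;

  /* Event level is 1, so sensitivity is the share of actual 1s predicted as 1 */
  sensitivity = n11 / (n11 + n12);

  /* Specificity is the share of actual 2s predicted as 2 */
  specificity = n22 / (n21 + n22);

  /* Misclassification rate is the off-diagonal share of all observations */
  misclass = (n12 + n21) / total;

  format sensitivity specificity misclass 6.4;
run;

proc print data=confusion_check noobs;
  var total sensitivity specificity misclass;
run;
```

The printed values can be compared with the Sensitivity, Specificity, and misclassification columns of the Model-Based Fit Statistics table.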