Analytics: Machine Learning Data Analysis: Decision Trees

The analysis was generated using SAS Studio and the data was sourced from the UCI Machine Learning Repository. The objective was to use decision trees as a data mining technique to predict the probability of a person defaulting on their credit card payments. This evaluation is useful in credit scoring models.

ATTRIBUTE INFORMATION:

There are 30,000 observations.
The binary variable default_payment_next_month (Yes = 2, No = 1) is the response variable.
There are 23 explanatory variables:
LIMIT_BAL: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
SEX: (1 = male; 2 = female).
EDUCATION: (1 = graduate school; 2 = university; 3 = high school; 4 = others).
MARRIAGE: (1 = married; 2 = single; 3 = others).
AGE: (year).
History of past payment: REPAY_SEP, REPAY_AUG, REPAY_JUL, REPAY_JUN, REPAY_MAY, REPAY_APR. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
Amount of bill statement: BILL_SEP, BILL_AUG, BILL_JUL, BILL_JUN, BILL_APR, BILL_MAY.
Amount of previous payment: PAY_AMT_SEP, PAY_AMT_AUG, PAY_AMT_JUL, PAY_AMT_JUN, PAY_AMT_MAY, PAY_AMT_APR.

SAS CODE

/* Source File: default of credit card clients.xls */
/* Source Path: /home/mst07221 */

%web_drop_table(WORK.IMPORT1);

FILENAME REFFILE "/home/mst07221/default of credit card clients.xls" TERMSTR=CR;

PROC IMPORT DATAFILE=REFFILE
DBMS=XLS
OUT=WORK.IMPORT1;
GETNAMES=YES;
RUN;

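/* Turn on ODS Graphics so that HPSPLIT produces its plots */
ods graphics on;

/* Fit a classification tree to the imported data; SEED sets the random number seed used for validation */
PROC HPSPLIT DATA=WORK.IMPORT1 SEED=15531;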
CLASS default_payment_next_month SEX EDUCATION MARRIAGE AGE
REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR;

MODEL default_payment_next_month = LIMIT_BAL SEX EDUCATION MARRIAGE
AGE REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR
BILL_SEP BILL_AUG BILL_JUL BILL_JUN BILL_APR BILL_MAY PAY_AMT_SEP
PAY_AMT_AUG PAY_AMT_JUL PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR;

GROW ENTROPY;
PRUNE COSTCOMPLEXITY;

RUN;

%web_open_table(WORK.IMPORT1);

RESULTS

The HPSPLIT Procedure

Performance Information
Execution Mode	Single-Machine
Number of Threads	2

Data Access Information
Data	Engine	Role	Path
WORK.IMPORT1	V9	Input	On Client

Model Information
Split Criterion Used	Entropy
Pruning Method	Cost-Complexity
Subtree Evaluation Criterion	Cost-Complexity
Number of Branches	2
Maximum Tree Depth Requested	10
Maximum Tree Depth Achieved	10
Tree Depth	7
Number of Leaves Before Pruning	370
Number of Leaves After Pruning	18
Model Event Level	1

Number of Observations Read	30000
Number of Observations Used	30000
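
The Model Information table reports entropy as the split criterion. As a rough illustration of what that criterion measures, the Python sketch below (not part of the original SAS analysis) computes the entropy of the unsplit root node from the class totals, taken here as the row sums of the Model-Based Confusion Matrix. The helper name `entropy` and the base-2 logarithm are assumptions made for illustration; HPSPLIT's internal calculation may differ in detail.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2, an assumption) of a node given its class counts."""
    total = sum(counts)
    # Empty classes contribute nothing, so skip them to avoid log2(0).
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


# Class totals at the root: row sums of the confusion matrix.
root_counts = [22281 + 1083,  # actual class 1 (no default)
               4142 + 2494]   # actual class 2 (default)

root_entropy = entropy(root_counts)
print(f"Observations at root: {sum(root_counts)}")
print(f"Root entropy: {root_entropy:.4f}")
```

A split is attractive under this criterion when the weighted entropy of the child nodes is lower than the entropy of the parent node.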

Plot of Cross Validation Average ASE, including error estimates, while varying pruning parameters for default_payment_next_month

Tree Overview Plot for default_payment_next_month

Subtree Detail Plot for default_payment_next_month starting at node 0 down to depth 3

Model-Based Confusion Matrix
Actual	Predicted 1	Predicted 2	Error Rate
1	22281	1083	0.0464
2	4142	2494	0.6242

Model-Based Fit Statistics for Selected Tree
N Leaves	ASE	Misclassification Rate	Sensitivity	Specificity	Entropy	Gini	RSS	AUC
18	0.1368	0.1742	0.9536	0.3758	0.6358	0.2736	8207.6	0.7302

Receiver Operating Characteristic (ROC) Curve for default_payment_next_month

Variable Importance
Variable	Variable Label	Training Relative Importance	Training Importance	Count
REPAY_SEP	REPAY_SEP	1.0000	39.9093	2
REPAY_AUG	REPAY_AUG	0.4417	17.6297	1
AGE	AGE	0.2023	8.0733	6
BILL_SEP	BILL_SEP	0.1953	7.7932	1
REPAY_MAY	REPAY_MAY	0.1458	5.8186	3
REPAY_APR	REPAY_APR	0.1291	5.1537	2
REPAY_JUN	REPAY_JUN	0.1199	4.7833	1
REPAY_JUL	REPAY_JUL	0.0998	3.9824	1
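
The two importance columns appear to be related by a simple rescaling: each relative value matches the variable's importance divided by the largest importance in the table. The Python sketch below checks that reading against the figures above. It is inferred from the table itself, not taken from the SAS documentation, and the variable names `importance` and `relative` are illustrative.

```python
# Training importance values copied from the Variable Importance table.
importance = {
    "REPAY_SEP": 39.9093,
    "REPAY_AUG": 17.6297,
    "AGE": 8.0733,
    "BILL_SEP": 7.7932,
    "REPAY_MAY": 5.8186,
    "REPAY_APR": 5.1537,
    "REPAY_JUN": 4.7833,
    "REPAY_JUL": 3.9824,
}

# Assumed relationship: relative importance = importance / largest importance.
largest = max(importance.values())
relative = {name: value / largest for name, value in importance.items()}

for name, value in relative.items():
    print(f"{name:<10} {value:.4f}")
```

Under this reading, REPAY_AUG carries a little under half the importance of REPAY_SEP, and every other variable carries about a fifth or less.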

SUMMARY OF FINDINGS

All 30,000 observations were used in the model because none had missing values in the predictor variables.
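The cost-complexity analysis (cross-validation average ASE against the pruning parameter) shows that the average squared error (ASE) is minimized when the number of leaves is 31. The Model Information table reports 18 leaves for the selected tree after pruning.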
The Confusion Matrix shows a prediction accuracy of 95.4% (1 - 0.0464) for people who did not default on the next month's payment. Because the model event level is 1 (no default), this is the sensitivity of the model (0.9536).
The prediction accuracy for people who defaulted on the next month's payment is much lower, at 37.6% (1 - 0.6242). This is the specificity of the model (0.3758).
The Model-Based Fit Statistics table shows a misclassification rate of 0.1742 ((1,083 + 4,142) / 30,000), where 1,083 and 4,142 are the numbers of people misclassified in the Confusion Matrix.
The ROC curve has an AUC of 0.73, which lies between 0.5 (no better than random guessing) and 1 (perfect separation), so the model separates the two classes only moderately well on the training data.
The Variable Importance table shows that REPAY_SEP and REPAY_AUG have the largest importance values. The two most recent repayment statuses are therefore the most important predictors of whether a person will default on the next credit card payment.
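
The error rates, sensitivity, specificity, and misclassification rate quoted above can all be recomputed from the four counts in the Model-Based Confusion Matrix. The Python sketch below does so. It is a cross-check written for this summary, not part of the SAS run, and the function and variable names are illustrative. It treats level 1 (no default) as the event, as reported under Model Event Level.

```python
# Counts copied from the Model-Based Confusion Matrix: confusion[actual][predicted].
confusion = {
    1: {1: 22281, 2: 1083},  # actual 1 (no default)
    2: {1: 4142, 2: 2494},   # actual 2 (default)
}
EVENT_LEVEL = 1      # "Model Event Level" in the Model Information table
NON_EVENT_LEVEL = 2


def row_error_rate(matrix, actual):
    """Share of one actual class that was predicted as the other class."""
    row = matrix[actual]
    wrong = sum(n for predicted, n in row.items() if predicted != actual)
    return wrong / sum(row.values())


def misclassification_rate(matrix):
    """Share of all observations whose predicted class differs from the actual class."""
    total = sum(sum(row.values()) for row in matrix.values())
    wrong = sum(n for actual, row in matrix.items()
                for predicted, n in row.items() if predicted != actual)
    return wrong / total


sensitivity = 1 - row_error_rate(confusion, EVENT_LEVEL)
specificity = 1 - row_error_rate(confusion, NON_EVENT_LEVEL)
misclass = misclassification_rate(confusion)

print(f"Sensitivity:            {sensitivity:.4f}")
print(f"Specificity:            {specificity:.4f}")
print(f"Misclassification rate: {misclass:.4f}")
```

Rounded to four decimals, the three values agree with the Sensitivity, Specificity, and misclassification columns of the Model-Based Fit Statistics table.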

Reference

Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.

Sunday, April 24, 2016