Sunday, April 24, 2016

Machine Learning Data Analysis: Decision Trees

The analysis was generated using SAS Studio and the data was sourced from the UCI Machine Learning Repository.  The objective was to use decision trees as a data mining technique to predict the probability of a person defaulting on their credit card payments. This evaluation is useful in credit scoring models.

ATTRIBUTE INFORMATION:
  • There are 30,000 observations.
  • The binary variable, default_payment_next_month (Yes=2, No=1) is the response variable. 
  • There are 23 explanatory variables: 
  • LIMIT_BAL: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.  
  • SEX: (1 = male; 2 = female). 
  • EDUCATION : (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
  • MARRIAGE:  (1 = married; 2 = single; 3 = others). 
  • AGE:  (year).
  • History of past payment: REPAY_SEP, REPAY_AUG, REPAY_JUL, REPAY_JUN, REPAY_MAY, REPAY_APR. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 
  • Amount of bill statement: BILL_SEP, BILL_AUG, BILL_JUL, BILL_JUN, BILL_MAY, BILL_APR.
  • Amount of previous payment: PAY_AMT_SEP, PAY_AMT_AUG, PAY_AMT_JUL, PAY_AMT_JUN, PAY_AMT_MAY, PAY_AMT_APR.



SAS CODE
/* Source File: default of credit card clients.xls */
/* Source Path: /home/mst07221 */

%web_drop_table(WORK.IMPORT1);


FILENAME REFFILE "/home/mst07221/default of credit card clients.xls" TERMSTR=CR;

PROC IMPORT DATAFILE=REFFILE
DBMS=XLS
OUT=WORK.IMPORT1;
GETNAMES=YES;
RUN;


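/* PROC SORT needs a BY statement; ID is assumed here to be the name of the imported client ID column */
PROC SORT DATA=WORK.IMPORT1;
BY ID;
RUN;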


ods graphics on;

PROC HPSPLIT DATA=WORK.IMPORT1 SEED=15531;
CLASS default_payment_next_month SEX EDUCATION MARRIAGE AGE
REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR;

MODEL  default_payment_next_month = LIMIT_BAL SEX EDUCATION MARRIAGE 
AGE REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR 
BILL_SEP BILL_AUG BILL_JUL BILL_JUN BILL_APR BILL_MAY PAY_AMT_SEP 
PAY_AMT_AUG PAY_AMT_JUL PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR;

GROW ENTROPY;
PRUNE COSTCOMPLEXITY;

RUN;

%web_open_table(WORK.IMPORT1);




RESULTS


The HPSPLIT Procedure

Performance Information
Execution Mode       Single-Machine
Number of Threads    2

Data Access Information
Data            Engine    Role     Path
WORK.IMPORT1    V9        Input    On Client

Model Information
Split Criterion Used               Entropy
Pruning Method                     Cost-Complexity
Subtree Evaluation Criterion       Cost-Complexity
Number of Branches                 2
Maximum Tree Depth Requested       10
Maximum Tree Depth Achieved        10
Tree Depth                         7
Number of Leaves Before Pruning    370
Number of Leaves After Pruning     18
Model Event Level                  1
Number of Observations Read        30000
Number of Observations Used        30000


[Figure: Plot of Cross Validation Average ASE, including error estimates, while varying pruning parameters for default_payment_next_month]

[Figure: Tree Overview Plot for default_payment_next_month]

[Figure: Subtree Detail Plot for default_payment_next_month starting at node 0 down to depth 3]

Model-Based Confusion Matrix
Actual    Predicted 1    Predicted 2    Error Rate
1               22281           1083        0.0464
2                4142           2494        0.6242

Model-Based Fit Statistics for Selected Tree
N Leaves    ASE       Misclass    Sensitivity    Specificity    Entropy    Gini      RSS       AUC
18          0.1368    0.1742      0.9536         0.3758         0.6358     0.2736    8207.6    0.7302

[Figure: Receiver Operating Characteristic (ROC) Curve for default_payment_next_month]

Variable Importance
Variable     Variable Label    Training Relative    Training Importance    Count
REPAY_SEP    REPAY_SEP         1.0000               39.9093                2
REPAY_AUG    REPAY_AUG         0.4417               17.6297                1
AGE          AGE               0.2023                8.0733                6
BILL_SEP     BILL_SEP          0.1953                7.7932                1
REPAY_MAY    REPAY_MAY         0.1458                5.8186                3
REPAY_APR    REPAY_APR         0.1291                5.1537                2
REPAY_JUN    REPAY_JUN         0.1199                4.7833                1
REPAY_JUL    REPAY_JUL         0.0998                3.9824                1

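The Relative column in the Variable Importance output looks like each Importance value divided by the largest one (REPAY_SEP). A minimal sketch to check that reading, with the Importance values typed in by hand from the output above; the dataset name work.var_importance is hypothetical and not part of the original analysis:

```sas
/* Importance values copied by hand from the Variable Importance output */
data work.var_importance;
    length variable $ 12;
    input variable $ importance;
    datalines;
REPAY_SEP 39.9093
REPAY_AUG 17.6297
AGE 8.0733
BILL_SEP 7.7932
REPAY_MAY 5.8186
REPAY_APR 5.1537
REPAY_JUN 4.7833
REPAY_JUL 3.9824
;
run;

/* Divide each importance by the maximum; PROC SQL remerges max() onto every row */
proc sql;
    select variable,
           importance,
           importance / max(importance) as relative format=6.4
    from work.var_importance;
quit;
```

If the reading is right, the computed values should agree with the Relative column to four decimal places.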
SUMMARY OF FINDINGS
  • All 30,000 observations were used in the model because there were no missing values in the predictor variables.
  • The cross-validation cost-complexity plot shows that the average squared error (ASE) is minimized when the number of leaves is 31; the selected tree has 18 leaves after pruning.
  • The Confusion Matrix shows a prediction accuracy of 95.4% (1 - 0.0464) for people who will not default on the next month's payment. Because the model event level is 1 (no default), this is also the sensitivity of the model.
  • The prediction accuracy for people who do default on the next month's payment is much lower, at 37.6% (1 - 0.6242). This is the specificity of the model.
  • The Model-Based Fit Statistics table shows a misclassification rate of 0.1742 ((1,083 + 4,142) / 30,000). The numerator is the number of people who were misclassified in the Confusion Matrix.
  • The ROC curve has an AUC of 0.73 on the training data, better than random guessing (AUC = 0.5) but well short of a perfect fit (AUC = 1).
  • The Variable Importance table shows that the predictors REPAY_SEP and REPAY_AUG have the largest importance values. The two most recent repayment statuses are therefore the most important consideration in determining whether a person will default on the next credit card payment.
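
The rates quoted in the findings above can be recomputed from the four Confusion Matrix counts. The DATA step below is a minimal sketch of that arithmetic, not part of the original analysis; the dataset name work.cm_metrics and the variable names are hypothetical, and level 1 (no default) is treated as the event, as reported under Model Event Level.

```sas
/* Recompute the summary rates from the confusion-matrix counts */
data work.cm_metrics;
    n11 = 22281;  /* actual 1 (no default), predicted 1 */
    n12 = 1083;   /* actual 1 (no default), predicted 2 */
    n21 = 4142;   /* actual 2 (default),    predicted 1 */
    n22 = 2494;   /* actual 2 (default),    predicted 2 */

    total       = n11 + n12 + n21 + n22;   /* all observations */
    sensitivity = n11 / (n11 + n12);       /* correct rate for the event level (1) */
    specificity = n22 / (n21 + n22);       /* correct rate for the non-event level (2) */
    misclass    = (n12 + n21) / total;     /* overall misclassification rate */

    format sensitivity specificity misclass 6.4;
run;

proc print data=work.cm_metrics noobs;
run;
```

The printed sensitivity, specificity, and misclassification rate should agree with the Model-Based Fit Statistics table to four decimal places.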



Reference
Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.
