Saturday, April 30, 2016

Machine Learning Data Analysis: Random Forests

The analysis was generated using SAS Studio, and the data were sourced from the UCI Machine Learning Repository. The objective was to use random forests as a data mining technique to predict the probability that a person will default on their credit card payments. This type of evaluation is useful in credit scoring models.

ATTRIBUTE INFORMATION:
  • There are 30,000 observations.
  • The binary variable default_payment_next_month (Yes = 2, No = 1) is the response variable. 
  • There are 23 explanatory variables: 
  • LIMIT_BAL: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.  
  • SEX: (1 = male; 2 = female). 
  • EDUCATION: (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
  • MARRIAGE:  (1 = married; 2 = single; 3 = others). 
  • AGE:  (year).
  • History of past payment: REPAY_SEP, REPAY_AUG, REPAY_JUL, REPAY_JUN, REPAY_MAY, REPAY_APR. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 
  • Amount of bill statement: BILL_SEP, BILL_AUG, BILL_JUL, BILL_JUN, BILL_MAY, BILL_APR.
  • Amount of previous payment: PAY_AMT_SEP, PAY_AMT_AUG, PAY_AMT_JUL, PAY_AMT_JUN, PAY_AMT_MAY, PAY_AMT_APR. (A quick SAS check of these codings is sketched after this list.)
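
The following is a minimal sketch, not part of the original analysis, for verifying the codings listed above once the data are imported. It assumes the table is WORK.IMPORT2, as created by the PROC IMPORT step in the SAS CODE section below.

/* Hypothetical sanity check (assumes WORK.IMPORT2 from the import step below) */
PROC FREQ DATA=WORK.IMPORT2;
	/* Frequency tables for the coded categorical variables and the response */
	TABLES SEX EDUCATION MARRIAGE default_payment_next_month;
RUN;

PROC MEANS DATA=WORK.IMPORT2 N NMISS MIN MAX MEAN;
	/* Ranges and missing-value counts for the interval-scaled predictors */
	VAR LIMIT_BAL AGE BILL_SEP BILL_AUG BILL_JUL BILL_JUN BILL_MAY BILL_APR
	    PAY_AMT_SEP PAY_AMT_AUG PAY_AMT_JUL PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR;
RUN;

The NMISS column from PROC MEANS is also a quick way to confirm that the predictors contain no missing values, as noted later in the summary of findings.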



SAS CODE


/* Source File: default of credit card clients.xls */
/* Source Path: /home/mst07221 */

%web_drop_table(WORK.IMPORT2);

FILENAME REFFILE "/home/mst07221/default of credit card clients.xls" TERMSTR=CR;

PROC IMPORT DATAFILE=REFFILE
DBMS=XLS
OUT=WORK.IMPORT2;
GETNAMES=YES;
RUN;

PROC SORT DATA=WORK.IMPORT2;
BY LIMIT_BAL;
RUN;

ods graphics on;

PROC HPFOREST DATA=WORK.IMPORT2;
target  default_payment_next_month/level=nominal;
input   
SEX EDUCATION MARRIAGE REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR /level=nominal;
input   
LIMIT_BAL AGE BILL_SEP BILL_AUG BILL_JUL BILL_JUN BILL_APR BILL_MAY PAY_AMT_SEP PAY_AMT_AUG PAY_AMT_JUL PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR /level=interval;

RUN;
%web_open_table(WORK.IMPORT2);
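
If the fit statistics and variable importance tables are wanted as SAS data sets (for example, to plot the OOB error against the number of trees), they can be captured with ODS OUTPUT, and the fitted forest can be saved for scoring new data with PROC HP4SCORE. This is a hedged sketch rather than part of the original program; the ODS table names, the SEED value, the model file path, and the WORK.NEW_APPLICANTS data set are assumptions.

/* Hypothetical extension (not part of the original run): capture output
   tables and save the forest for later scoring. ODS table names and file
   paths are assumptions. */
ods output FitStatistics=WORK.FOREST_FIT VariableImportance=WORK.FOREST_VIMP;

PROC HPFOREST DATA=WORK.IMPORT2 MAXTREES=100 SEED=12345;
target default_payment_next_month/level=nominal;
input SEX EDUCATION MARRIAGE REPAY_SEP REPAY_AUG REPAY_JUL REPAY_JUN REPAY_MAY REPAY_APR /level=nominal;
input LIMIT_BAL AGE BILL_SEP BILL_AUG BILL_JUL BILL_JUN BILL_MAY BILL_APR PAY_AMT_SEP PAY_AMT_AUG PAY_AMT_JUL PAY_AMT_JUN PAY_AMT_MAY PAY_AMT_APR /level=interval;
save file="/home/mst07221/credit_forest.bin"; /* binary model file for later scoring */
RUN;

/* Score a new table of applicants with the saved forest */
PROC HP4SCORE DATA=WORK.NEW_APPLICANTS; /* hypothetical new data set */
score file="/home/mst07221/credit_forest.bin" out=WORK.SCORED;
RUN;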

RESULTS
The HPFOREST Procedure


Performance Information
Execution Mode:     Single-Machine
Number of Threads:  2

Data Access Information
Data: WORK.IMPORT2   Engine: V9   Role: Input   Path: On Client

Model Information
Parameter                   Value
Variables to Try            5 (Default)
Maximum Trees               100 (Default)
Inbag Fraction              0.6 (Default)
Prune Fraction              0 (Default)
Prune Threshold             0.1 (Default)
Leaf Fraction               0.00001 (Default)
Leaf Size Setting           1 (Default)
Leaf Size Used              1
Category Bins               30 (Default)
Interval Bins               100
Minimum Category Size       5 (Default)
Node Size                   100000 (Default)
Maximum Depth               20 (Default)
Alpha                       1 (Default)
Exhaustive                  5000 (Default)
Rows of Sequence to Skip    5 (Default)
Split Criterion             Gini
Preselection Method         Loh
Missing Value Handling      Valid value

Number of Observations
Number of Observations Read  30000
Number of Observations Used  30000

Baseline Fit Statistics
Statistic                 Value
Average Square Error      0.172
Misclassification Rate    0.221
Log Loss                  0.528
Fit Statistics
(ASE = average square error; Misclass = misclassification rate; OOB = out of bag)

Trees  Leaves  ASE(Train)  ASE(OOB)  Misclass(Train)  Misclass(OOB)  LogLoss(Train)  LogLoss(OOB)
1      3294    0.1139      0.237     0.1269           0.249          1.966           4.767
2      6520    0.0798      0.225     0.1135           0.244          0.513           4.096
3      9603    0.0687      0.212     0.0879           0.239          0.280           3.443
4      12768   0.0632      0.202     0.0822           0.234          0.227           2.954
5      15775   0.0603      0.193     0.0758           0.229          0.213           2.520
6      18789   0.0584      0.186     0.0736           0.223          0.208           2.191
7      21727   0.0570      0.179     0.0713           0.219          0.206           1.893
8      24756   0.0559      0.174     0.0692           0.215          0.205           1.644
9      27565   0.0553      0.169     0.0672           0.210          0.203           1.445
10     30525   0.0548      0.165     0.0660           0.208          0.203           1.275
11     33342   0.0545      0.162     0.0652           0.206          0.203           1.142
12     36112   0.0544      0.160     0.0654           0.205          0.203           1.037
13     39169   0.0540      0.157     0.0657           0.202          0.203           0.942
14     42015   0.0538      0.155     0.0648           0.201          0.203           0.862
15     44959   0.0536      0.154     0.0642           0.200          0.203           0.811
16     47860   0.0534      0.152     0.0641           0.199          0.203           0.754
17     50907   0.0531      0.151     0.0636           0.196          0.202           0.714
18     53923   0.0528      0.150     0.0636           0.196          0.202           0.674
19     56981   0.0526      0.149     0.0628           0.194          0.202           0.645
20     60094   0.0523      0.148     0.0630           0.195          0.201           0.618
21     63098   0.0521      0.148     0.0624           0.194          0.201           0.596
22     66296   0.0517      0.147     0.0618           0.194          0.200           0.580
23     69211   0.0516      0.146     0.0621           0.193          0.200           0.559
24     72437   0.0512      0.146     0.0614           0.192          0.199           0.550
25     75734   0.0509      0.145     0.0606           0.192          0.198           0.540
26     78841   0.0507      0.145     0.0604           0.192          0.198           0.530
27     81858   0.0505      0.145     0.0601           0.191          0.197           0.521
28     84911   0.0505      0.144     0.0599           0.192          0.197           0.515
29     87702   0.0505      0.144     0.0604           0.191          0.198           0.509
30     90681   0.0505      0.144     0.0603           0.191          0.198           0.505
31     93747   0.0504      0.143     0.0600           0.191          0.197           0.497
32     96588   0.0504      0.143     0.0602           0.189          0.198           0.492
33     99778   0.0502      0.143     0.0592           0.190          0.197           0.489
34     102728  0.0502      0.143     0.0595           0.189          0.197           0.486
35     105948  0.0500      0.142     0.0595           0.188          0.197           0.482
36     108908  0.0500      0.142     0.0590           0.189          0.197           0.478
37     111980  0.0499      0.142     0.0589           0.189          0.196           0.475
38     114907  0.0499      0.142     0.0592           0.188          0.196           0.473
39     117971  0.0498      0.142     0.0588           0.188          0.196           0.471
40     120864  0.0498      0.141     0.0590           0.188          0.197           0.470
41     123726  0.0499      0.141     0.0584           0.187          0.197           0.468
42     126662  0.0498      0.141     0.0586           0.187          0.197           0.466
43     129557  0.0498      0.141     0.0591           0.187          0.197           0.466
44     132304  0.0499      0.141     0.0593           0.188          0.197           0.461
45     135209  0.0499      0.141     0.0589           0.187          0.197           0.460
46     138021  0.0499      0.140     0.0592           0.188          0.197           0.458
47     141315  0.0498      0.140     0.0591           0.187          0.197           0.457
48     144402  0.0497      0.140     0.0588           0.187          0.197           0.455
49     147425  0.0497      0.140     0.0587           0.187          0.196           0.453
50     150491  0.0496      0.140     0.0585           0.187          0.196           0.452
51     153679  0.0495      0.140     0.0585           0.187          0.196           0.451
52     156685  0.0495      0.140     0.0586           0.186          0.196           0.450
53     159472  0.0496      0.140     0.0588           0.187          0.197           0.450
54     162672  0.0495      0.140     0.0584           0.187          0.196           0.450
55     165562  0.0495      0.139     0.0579           0.186          0.196           0.449
56     168637  0.0495      0.139     0.0580           0.187          0.196           0.449
57     171394  0.0495      0.139     0.0583           0.186          0.196           0.448
58     174267  0.0495      0.139     0.0584           0.186          0.196           0.448
59     177350  0.0495      0.139     0.0583           0.186          0.196           0.447
60     180405  0.0494      0.139     0.0583           0.186          0.196           0.447
61     183345  0.0494      0.139     0.0580           0.186          0.196           0.447
62     186120  0.0494      0.139     0.0585           0.186          0.196           0.446
63     188734  0.0495      0.139     0.0585           0.186          0.196           0.446
64     191563  0.0495      0.139     0.0587           0.186          0.196           0.446
65     194619  0.0495      0.139     0.0586           0.186          0.196           0.446
66     197498  0.0495      0.139     0.0588           0.185          0.196           0.445
67     200432  0.0495      0.138     0.0587           0.185          0.196           0.444
68     203318  0.0495      0.138     0.0592           0.186          0.197           0.444
69     206392  0.0495      0.138     0.0588           0.186          0.196           0.444
70     209681  0.0494      0.138     0.0585           0.186          0.196           0.444
71     212692  0.0494      0.138     0.0583           0.186          0.196           0.443
72     215727  0.0494      0.138     0.0580           0.185          0.196           0.442
73     218663  0.0494      0.138     0.0579           0.185          0.196           0.442
74     221696  0.0493      0.138     0.0580           0.185          0.196           0.442
75     224667  0.0493      0.138     0.0581           0.185          0.196           0.442
76     227649  0.0493      0.138     0.0582           0.184          0.196           0.442
77     230608  0.0493      0.138     0.0583           0.184          0.196           0.441
78     233755  0.0493      0.138     0.0579           0.184          0.196           0.441
79     236390  0.0493      0.138     0.0580           0.184          0.196           0.441
80     239439  0.0493      0.138     0.0577           0.184          0.196           0.441
81     242146  0.0493      0.138     0.0580           0.184          0.196           0.441
82     245104  0.0493      0.138     0.0577           0.184          0.196           0.440
83     248185  0.0493      0.138     0.0580           0.184          0.196           0.440
84     251073  0.0493      0.138     0.0580           0.184          0.196           0.440
85     253984  0.0493      0.138     0.0577           0.184          0.196           0.440
86     256912  0.0493      0.138     0.0578           0.184          0.196           0.440
87     260062  0.0492      0.138     0.0576           0.183          0.196           0.440
88     263180  0.0492      0.138     0.0577           0.184          0.196           0.440
89     266373  0.0491      0.138     0.0576           0.184          0.196           0.440
90     269247  0.0491      0.138     0.0576           0.184          0.196           0.440
91     272239  0.0491      0.138     0.0576           0.184          0.196           0.440
92     275375  0.0490      0.138     0.0574           0.184          0.195           0.439
93     278332  0.0491      0.138     0.0571           0.185          0.196           0.439
94     281349  0.0491      0.137     0.0573           0.184          0.196           0.439
95     284273  0.0490      0.137     0.0574           0.185          0.196           0.439
96     287059  0.0491      0.137     0.0578           0.184          0.196           0.439
97     289819  0.0491      0.137     0.0577           0.184          0.196           0.439
98     292997  0.0491      0.137     0.0577           0.184          0.196           0.439
99     295907  0.0491      0.137     0.0576           0.184          0.196           0.439
100    298931  0.0491      0.137     0.0577           0.184          0.196           0.439
Loss Reduction Variable Importance

Variable      Number of Rules  Gini      OOB Gini   Margin     OOB Margin
REPAY_SEP     4520             0.033755  0.02978    0.067509   0.06421
REPAY_AUG     3243             0.012217  0.01002    0.024434   0.02243
REPAY_JUL     3164             0.008982  0.00674    0.017965   0.01581
REPAY_JUN     3369             0.005476  0.00311    0.010953   0.00893
REPAY_MAY     3577             0.004900  0.00244    0.009800   0.00777
REPAY_APR     3738             0.004577  0.00182    0.009154   0.00685
SEX           6255             0.002262  -0.00184   0.004524   0.00045
MARRIAGE      6520             0.002317  -0.00221   0.004635   0.00038
EDUCATION     6383             0.002914  -0.00266   0.005829   0.00053
PAY_AMT_SEP   16773            0.013924  -0.00935   0.027848   0.00424
BILL_APR      14566            0.010728  -0.00981   0.021456   0.00069
BILL_JUN      15011            0.011076  -0.01002   0.022153   0.00076
LIMIT_BAL     19721            0.014944  -0.01024   0.029889   0.00483
PAY_AMT_AUG   18266            0.014662  -0.01062   0.029323   0.00383
BILL_JUL      15948            0.011740  -0.01064   0.023479   0.00094
BILL_SEP      16241            0.012512  -0.01081   0.025024   0.00137
BILL_AUG      16691            0.012304  -0.01088   0.024608   0.00121
BILL_MAY      15693            0.011328  -0.01111   0.022657   0.00011
PAY_AMT_JUL   19324            0.013843  -0.01221   0.027686   0.00152
PAY_AMT_JUN   21109            0.014101  -0.01352   0.028202   0.00043
AGE           19789            0.014175  -0.01394   0.028349   0.00049
PAY_AMT_MAY   22755            0.014860  -0.01474   0.029721   -0.00008
PAY_AMT_APR   26175            0.016937  -0.01669   0.033874   0.00018


SUMMARY OF FINDINGS
  • All 30,000 observations were used in the model because there were no missing values in the predictor variables.
  • The ‘Variables to Try’ parameter indicates that, at each node, 5 of the 23 explanatory variables are randomly selected as candidates for the splitting rule.
  • PROC HPFOREST first computes baseline statistics without using a model. The Baseline Fit Statistics table shows a baseline misclassification rate of 0.221 because that is the proportion of observations for which default_payment_next_month = 2 (Yes); a quick PROC FREQ check of this proportion is sketched after this list.
  • The Fit Statistics table shows that as the number of trees increases, the fit statistics improve (decrease) at first and then level off, fluctuating within a small range. The training average square error, for example, decreases from 0.1139 with a single tree to 0.0491 with 100 trees.
  • The table also provides an alternative estimate of the average square error (ASE) and misclassification rate: the out-of-bag (OOB) estimate. This is a convenient substitute for an estimate based on test data and is a less biased estimate of how the model will perform on future data. The OOB ASE is worse (larger) than the estimate that evaluates all observations on all trees. The OOB misclassification rate decreases to values below the baseline misclassification rate, indicating a useful model.
  • The Loss Reduction Variable Importance table shows that each measure is computed twice: once on training data and once on out-of-bag data. As with fit statistics, the out-of-bag estimates are less biased. The rows are sorted by the OOB Gini measure, which is a more stringent measure than the OOB margin measure. The OOB Gini column is negative for 17 of the 23 variables, and the OOB margin column is negative for one of the 23 variables.
  • We can conclude that REPAY_SEP, the September repayment status, is the most important predictor of whether a client will default on next month's credit card payment.
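
As a quick check of the baseline figure above, the proportion of observations with default_payment_next_month = 2 can be tabulated directly. This one-step sketch is not part of the original output; it should report roughly 22.1%, matching the baseline misclassification rate of 0.221.

/* Hypothetical check: the baseline misclassification rate (0.221) should
   match the share of observations with default_payment_next_month = 2 */
PROC FREQ DATA=WORK.IMPORT2;
tables default_payment_next_month / nocum;
RUN;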






Reference


Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.