Sunday, May 15, 2016

Machine Learning Data Analysis: A K-Means Cluster Analysis

The analysis was generated using SAS Studio, and the data was sourced from the UCI Machine Learning Repository. The objective was to complete a K-Means cluster analysis, an unsupervised machine learning method that partitions a set of data points into a small number of clusters so as to minimize the total squared distance between the data points and their assigned cluster centroids.
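
In symbols, the objective can be sketched as follows. This is the standard textbook formulation rather than anything taken from the SAS output, and the notation (x_i for an observation, C_j for the set of observations in cluster j, mu_j for its centroid, K for the number of clusters) is introduced here only for illustration:

```latex
\min_{C_1,\dots,C_K} \; \sum_{j=1}^{K} \sum_{i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i
```

Each centroid is the mean of the observations assigned to it, which is where the name k-means comes from.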

ATTRIBUTE INFORMATION
A dataset representing students' knowledge status for goal subjects and related subjects. Data types: real, integer.
403 observations (283 of them in the 70% training sample used for clustering).
STG: The degree of study time for the goal subject (input value).
SCG: The degree of repetition for the goal subject (input value).
STR: The degree of study time for related subjects (input value).
LPR: The exam performance for related subjects (input value).
PEG: The exam performance for the goal subject (input value).
UNS: The knowledge level (target value: 1=high, 2=medium, 3=low, 4=very low).

SAS CODE
***********************************************************************
Cluster analysis is an unsupervised learning method (there is no specific
response variable included in the analysis). 
The goal of cluster analysis is to group or cluster observations into subsets 
based on the similarity of responses on multiple variables. Observations that 
have similar response patterns are grouped together to form clusters. 
The goal is to partition the observations in a data set.
**********************************************************************;

%web_drop_table(WORK.knowledge);

PROC IMPORT DATAFILE="/home/mst07221/sasuser.v94/user_knowledge.csv"
DBMS=CSV
OUT=WORK.knowledge;
GETNAMES=YES;
RUN;

*********************************************************************
DATA MANAGEMENT
*********************************************************************;
data clust;
set work.knowledge;

* create a unique identifier to merge cluster assignment variable with 
the main data set;
idnum=_n_;

* delete observations with missing data;
 if cmiss(of _all_) then delete;
 run;

ods graphics on;

***********************************************************************
Use PROC SURVEYSELECT for probability-based random sampling. Split the data randomly into training and test sets. Random selection avoids selection bias and enables the use of statistical theory to make valid inferences from the sample to the survey population
***********************************************************************;
proc surveyselect 
 data=clust 
 out=traintest 
 seed = 123
 samprate=0.7 
 method=srs 
 outall;
run;   

* 70% training sample;
data clus_train;
set traintest;
if selected=1;
run;

* 30% test sample;
data clus_test;
set traintest;
if selected=0;
run;
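
/* Optional check, not part of the original program: a minimal sketch that
   tabulates the SELECTED flag written by PROC SURVEYSELECT (OUTALL option)
   to confirm that the split is close to 70% training and 30% test. */
proc freq data=traintest;
tables selected;
run;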

* standardize the clustering variables to have a mean of 0 and standard deviation of 1;
proc standard data=clus_train out=clustvar mean=0 std=1; 
var STG SCG STR LPR PEG; 
run; 
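
/* Optional check, not part of the original program: a minimal sketch that
   confirms each standardized clustering variable in CLUSTVAR has a mean
   near 0 and a standard deviation near 1. */
proc means data=clustvar n mean std maxdec=4;
var STG SCG STR LPR PEG;
run;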


* k-means clustering is a method of cluster analysis which aims to partition 
* n observations into k clusters in which each observation belongs to the cluster with the nearest mean;
%macro kmean(K);

/**********************************************************************
The FASTCLUS procedure performs a disjoint cluster analysis on the basis of distances computed from one or more quantitative variables. It uses Euclidean distances, so the cluster centers are based on least squares estimation. This kind of clustering method is often called a k-means model, since the cluster centers are the means of the observations assigned to each cluster when the algorithm is run to complete convergence. Each iteration reduces the least squares criterion until convergence is achieved.

out= Specifies the output SAS data set containing the original data and cluster assignments.
outstat= Specifies the output SAS data set containing statistics.
maxclusters= Specifies the maximum number of clusters.
maxiter= Specifies the maximum number of iterations.
**********************************************************************/
proc fastclus data=clustvar 
out=outdata&K. 
outstat=cluststat&K. 
maxclusters= &K. 
maxiter=300;
var STG SCG STR LPR PEG;
run;

* End macro;
%mend;

* Print the output and create the output data sets for K equals 1 to 12 clusters;
%kmean(1);
%kmean(2);
%kmean(3);
%kmean(4);
%kmean(5);
%kmean(6);
%kmean(7);
%kmean(8);
%kmean(9);
%kmean(10);
%kmean(11);
%kmean(12);

* extract r-square values from each cluster solution and then merge them to plot elbow curve;
data clus1;
set cluststat1;
nclust=1;
if _type_='RSQ';
keep nclust over_all;
run;

data clus2;
set cluststat2;
nclust=2;
if _type_='RSQ';
keep nclust over_all;
run;

data clus3;
set cluststat3;
nclust=3;
if _type_='RSQ';
keep nclust over_all;
run;

data clus4;
set cluststat4;
nclust=4;
if _type_='RSQ';
keep nclust over_all;
run;

data clus5;
set cluststat5;
nclust=5;
if _type_='RSQ';
keep nclust over_all;
run;

data clus6;
set cluststat6;
nclust=6;
if _type_='RSQ';
keep nclust over_all;
run;

data clus7;
set cluststat7;
nclust=7;
if _type_='RSQ';
keep nclust over_all;
run;

data clus8;
set cluststat8;
nclust=8;
if _type_='RSQ';
keep nclust over_all;
run;

data clus9;
set cluststat9;
nclust=9;
if _type_='RSQ';
keep nclust over_all;
run;

data clus10;
set cluststat10;
nclust=10;
if _type_='RSQ';
keep nclust over_all;
run;

data clus11;
set cluststat11;
nclust=11;
if _type_='RSQ';
keep nclust over_all;
run;

data clus12;
set cluststat12;
nclust=12;
if _type_='RSQ';
keep nclust over_all;
run;

data clusrsquare;
set clus1 clus2 clus3 clus4 clus5 clus6 clus7 clus8 clus9 clus10 clus11 clus12;
run;
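
/* Alternative sketch, not part of the original program: the twelve DATA
   steps above could be generated with a macro loop instead. The macro name
   COLLECT_RSQ and the data set names RSQ1-RSQ12 and CLUSRSQUARE_ALT are
   hypothetical, chosen here so that nothing created above is overwritten. */
%macro collect_rsq(maxk);
%local k;
%do k = 1 %to &maxk.;
  /* keep the overall R-square row from the OUTSTAT= data set for K clusters */
  data rsq&k.;
  set cluststat&k.;
  nclust = &k.;
  if _type_ = 'RSQ';
  keep nclust over_all;
  run;
%end;

/* stack the per-K rows into a single data set */
data clusrsquare_alt;
set
%do k = 1 %to &maxk.;
  rsq&k.
%end;
;
run;
%mend collect_rsq;

%collect_rsq(12);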

** R-square values from each cluster solution;
proc print data=clusrsquare;
run;

* plot elbow curve using r-square values;
symbol1 color=blue interpol=join;
proc gplot data=clusrsquare;
 plot over_all*nclust;
 run;
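
/* Optional sketch, not part of the original program: list the gain in
   overall R-square from each additional cluster, which can help when
   judging where the elbow curve bends. The names RSQ_GAIN and GAIN are
   hypothetical. */
data rsq_gain;
set clusrsquare;
gain = dif(over_all);   /* missing for the first row (no previous value) */
run;

proc print data=rsq_gain noobs;
var nclust over_all gain;
run;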

**********************************************************************
Further examine the cluster solution for the number of clusters suggested by the elbow curve
***********************************************************************
The CANDISC procedure performs a canonical discriminant analysis, which finds linear combinations of the quantitative variables that provide maximal separation between classes or groups.

Use canonical discriminant analysis (a data reduction technique) to create a smaller number of variables that are linear combinations of the 5 clustering variables. Plot the clusters for the 4-cluster solution. The new variables, called canonical variables, are ordered in terms of the proportion of variance in the clustering variables that each one accounts for. So the first canonical variable accounts for the largest proportion of the variance, the second canonical variable accounts for the next largest proportion, and so on.

clustcan = output data set that includes the canonical variables estimated by the canonical
discriminant analysis.
cluster = the cluster assignment variable, treated as categorical (it has four
categories).
**********************************************************************;
proc candisc data=outdata4  out=clustcan;
class cluster;
var STG SCG STR LPR PEG;
run;

proc sgplot data=clustcan;
scatter y=can2 x=can1 / group=cluster;
run;
proc sgplot data=clustcan;
scatter y=can3 x=can1 / group=cluster;
run;

* Validate clusters on UNS (the knowledge level of the user);
* See how the clusters differ on UNS;
* First merge clustering variable and assignment data with UNS data;
data UNS_data;
set clus_train;
keep idnum UNS;
run;

proc sort data=outdata4;
by idnum;
run;

proc sort data=UNS_data;
by idnum;
run;

data merged;
merge outdata4 UNS_data;
by idnum;
run;

proc sort data=merged;
by cluster;
run;

proc means data=merged;
var UNS;
by cluster;
run;

**********************************************************************
Use the ANOVA procedure to test whether mean UNS differs significantly between clusters. The CLASS statement indicates that the cluster membership variable is categorical. The MODEL statement specifies UNS as the response variable and cluster as the explanatory variable. The MEANS statement with the TUKEY option requests pairwise comparisons of the cluster means. The box plot shows the distribution of UNS by cluster.
**********************************************************************;
proc anova data=merged;
class cluster;
model UNS = cluster;
means cluster/tukey;
run;

%web_open_table(WORK.knowledge);



RESULTS

The SURVEYSELECT Procedure
Selection Method         Simple Random Sampling
Input Data Set           CLUST
Random Number Seed       123
Sampling Rate            0.7
Sample Size              283
Selection Probability    0.702233
Sampling Weight          0
Output Data Set          TRAINTEST

** Output for the 1- to 11-cluster solutions is omitted here to save space; only the 12-cluster solution is shown **

The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=12 Maxiter=300 Converge=0.02
Initial Seeds
Cluster           STG           SCG           STR           LPR           PEG
      1  -1.675358549  -1.675404264  -1.842762469  -1.666000968  -1.699570022
      2  -0.741818206   1.959276200   1.842333188   1.552521893   1.505695120
      3   1.498678617   1.167102253  -0.465913542  -0.449242325  -0.832262984
      4  -1.301942412   0.934109915   0.991926498  -1.116497065  -1.322480006
      5  -1.301942412  -1.302616524  -1.518798016   2.180526354  -0.794553982
      6   2.945666148   0.607920643  -1.559293573   1.081518548   0.902351093
      7   1.918771771  -0.324048707   1.153908724   1.552521893   0.864642091
      8  -1.021880309   1.353496123  -1.073346893  -0.645493719   0.789224088
      9   0.425107223   1.353496123  -1.073346893   2.219776632   0.374425070
     10  -1.161911360  -0.277450239   1.356386508   1.474021335  -0.492881969
     11  -1.208588377  -1.209419589   0.991926498  -1.077246786   1.694240128
     12   0.635153800   2.145670070   0.870439828  -1.626750689   1.807367133
Minimum Distance Between Initial Seeds = 2.952894
Iteration History
                    Relative Change in Cluster Seeds
Iteration  Criterion       1       2       3       4       5       6       7       8       9      10      11      12
        1     0.8515  0.5605  0.3259  0.4481  0.3893  0.5117  0.4188  0.4059  0.4071  0.3673  0.3558  0.4390  0.3902
        2     0.6196  0.1122  0.1445  0.0739  0.0789  0.0976  0.0851  0.0839  0.0711  0.1607  0.0691  0.0659  0.1154
        3     0.6031  0.0247  0.0532  0.0860  0.0666  0.0305       0  0.0419  0.0586  0.0700  0.0449  0.0325  0.0644
        4     0.5973  0.0288       0  0.0446  0.0478       0  0.0367  0.0678  0.0508  0.0370  0.0271       0  0.0721
        5     0.5931  0.0718       0  0.0420  0.0494  0.0205  0.0327  0.0584  0.0922       0  0.0340  0.0420  0.0497
        6     0.5873  0.0811       0  0.0242  0.0599  0.0343       0  0.0545  0.0544  0.0302  0.0142  0.0482       0
        7     0.5828  0.0358       0  0.0232  0.0404  0.0463       0  0.0489  0.0230  0.0374  0.0282  0.0344       0
        8     0.5810  0.0164  0.0405  0.0266       0       0       0  0.0300       0       0  0.0208       0       0
        9     0.5806       0       0       0       0  0.0220       0       0       0       0  0.0158       0       0
       10     0.5804  0.0204       0       0       0  0.0456       0       0       0       0  0.0142       0       0
       11     0.5799  0.0173       0       0       0  0.0331       0       0       0       0       0       0       0
       12     0.5797       0       0       0       0  0.0338       0       0       0  0.0465       0       0       0
       13     0.5794       0       0       0       0       0       0       0       0       0       0       0       0
Convergence criterion is satisfied.
Criterion Based on Final Seeds = 0.5794
Cluster Summary
                               Maximum Distance
                      RMS Std         from Seed    Radius  Nearest   Distance Between
Cluster  Frequency  Deviation    to Observation  Exceeded  Cluster  Cluster Centroids
      1         38     0.5559            2.2218                  5             1.8793
      2         12     0.7410            2.3100                 12             2.0017
      3         20     0.6158            1.7140                  1             1.9677
      4         25     0.5522            2.1841                  1             1.8845
      5         22     0.5647            2.0978                 10             1.8659
      6         17     0.6883            2.2116                  3             2.2651
      7         22     0.6093            2.0670                 11             1.9511
      8         30     0.5335            1.8616                 11             1.7155
      9         16     0.6120            1.9485                  5             2.0087
     10         33     0.6238            2.2158                  5             1.8659
     11         28     0.5199            2.0924                  8             1.7155
     12         20     0.6204            1.8267                  2             2.0017
Statistics for Variables
Variable  Total STD  Within STD  R-Square  RSQ/(1-RSQ)
STG         1.00000     0.64363  0.601895     1.511900
SCG         1.00000     0.65743  0.584641     1.407553
STR         1.00000     0.52940  0.730663     2.712825
LPR         1.00000     0.55764  0.701170     2.346382
PEG         1.00000     0.56138  0.697146     2.301921
OVER-ALL    1.00000     0.59209  0.663103     1.968266
Pseudo F Statistic = 48.49
Approximate Expected Over-All R-Squared = 0.65781
Cubic Clustering Criterion = 0.685
WARNING: The two values above are invalid for correlated variables.
Cluster Means
Cluster           STG           SCG           STR           LPR           PEG
      1  -0.616281597  -0.440790131  -1.094660343  -0.640225919  -0.875926038
      2   0.023295900   1.967042611   0.765826306   1.261415659   1.222877607
      3   1.227951917  -0.127403174  -0.698762993  -0.655306289  -0.411807615
      4  -0.650517960  -0.547721351   0.779729781  -0.799354812  -0.863938545
      5  -0.516919851  -0.764827847  -0.644462133   1.147530380  -0.741418571
      6   1.927008892  -0.202618583  -0.954242314   0.963767711   0.984423626
      7   1.086011352  -0.253515572   0.979041548  -0.304729935   0.557827941
      8  -0.228526607  -0.198232845  -0.703487475  -0.693902396   0.950115828
      9   0.002096755   1.167102253  -0.726603689   1.091331117  -0.589511286
     10   0.073898030  -0.003084111   0.952658079   1.140988667  -0.778556224
     11  -0.777992894  -0.430559490   0.892133876  -0.520733904   1.054533850
     12  -0.213434372   1.733273633   0.603169154  -0.618018524   0.642158981
Cluster Standard Deviations
Cluster           STG           SCG           STR           LPR           PEG
      1   0.513816102   0.719573147   0.504889199   0.561834919   0.438842474
      2   1.090544367   0.604257468   0.790622458   0.553682781   0.509103263
      3   0.465687975   0.786406190   0.575531254   0.440744289   0.731744380
      4   0.636825010   0.727611606   0.491160633   0.456598455   0.373725188
      5   0.568103662   0.541078832   0.592617584   0.486840055   0.625133678
      6   0.673839630   0.654875253   0.412843856   0.861273117   0.757499991
      7   0.628816500   0.603394755   0.466161707   0.770230509   0.534844726
      8   0.499549345   0.694506402   0.484166892   0.475904805   0.480052576
      9   0.901639011   0.299586231   0.440631542   0.582969052   0.660495348
     10   0.812557275   0.822554326   0.412787208   0.485709435   0.450148029
     11   0.397600084   0.591010750   0.508123428   0.567088035   0.514024308
     12   0.664433768   0.357538270   0.752618965   0.418129170   0.783470812

Obs  OVER_ALL  nclust
  1   0.00000       1
  2   0.18138       2
  3   0.29716       3
  4   0.38743       4
  5   0.46995       5
  6   0.51156       6
  7   0.53786       7
  8   0.57301       8
  9   0.59462       9
 10   0.62268      10
 11   0.63364      11
 12   0.66310      12

Plot of OVER_ALL by nclust

The CANDISC Procedure
Total Sample Size   283    DF Total             282
Variables             5    DF Within Classes    279
Classes               4    DF Between Classes     3
Number of Observations Read  283
Number of Observations Used  283
Class Level Information
CLUSTER  Variable Name  Frequency   Weight  Proportion
      1              1         87  87.0000    0.307420
      2              2         85  85.0000    0.300353
      3              3         54  54.0000    0.190813
      4              4         57  57.0000    0.201413

The CANDISC Procedure
Multivariate Statistics and F Approximations
S=3 M=0.5 N=136.5
Statistic                    Value  F Value  Num DF  Den DF  Pr > F
Wilks' Lambda           0.07412299    79.33      15  759.56  <.0001
Pillai's Trace          1.72523079    74.98      15     831  <.0001
Hotelling-Lawley Trace  4.22038912    77.11      15  514.23  <.0001
Roy's Greatest Root     1.76975157    98.04       5     277  <.0001
NOTE: F Statistic for Roy's Greatest Root is an upper bound.

The CANDISC Procedure
     Canonical  Adjusted Canonical     Approximate  Squared Canonical
   Correlation         Correlation  Standard Error        Correlation
1     0.799348            0.783917        0.021500           0.638957
2     0.775649                   .        0.023723           0.601631
3     0.696163                   .        0.030689           0.484643

Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)
   Eigenvalue  Difference  Proportion  Cumulative
1      1.7698      0.2595      0.4193      0.4193
2      1.5102      0.5698      0.3578      0.7772
3      0.9404                  0.2228      1.0000

Test of H0: The canonical correlations in the current row and all that follow are zero
   Likelihood Ratio  Approximate F Value  Num DF  Den DF  Pr > F
1        0.07412299                79.33      15  759.56  <.0001
2        0.20530227                83.28       8     552  <.0001
3        0.51535684                86.83       3     277  <.0001

The CANDISC Procedure
Total Canonical Structure
Variable       Can1       Can2       Can3
STG        0.083359   0.843849   0.477962
SCG        0.854954   0.199347  -0.465622
STR        0.583251  -0.469715   0.604287
LPR        0.133626   0.355706   0.003222
PEG        0.440575   0.161348   0.298804

Between Canonical Structure
Variable       Can1       Can2       Can3
STG        0.090378   0.887777   0.451314
SCG        0.885211   0.200282  -0.419868
STR        0.642178  -0.501838   0.579453
LPR        0.361022   0.932527   0.007582
PEG        0.823339   0.292585   0.486319

Pooled Within Canonical Structure
Variable       Can1       Can2       Can3
STG        0.074138   0.788348   0.507876
SCG        0.808248   0.197959  -0.525909
STR        0.509611  -0.431103   0.630813
LPR        0.084055   0.235032   0.002422
PEG        0.292871   0.112664   0.237311

The CANDISC Procedure
Total-Sample Standardized Canonical Coefficients
Variable           Can1           Can2           Can3
STG         0.028593157    1.211446595    0.669693195
SCG         1.228738665    0.307650672   -0.867348590
STR         0.735707307   -0.758876669    0.944214104
LPR         0.042390255    0.349032608    0.006043056
PEG         0.380674325    0.072550140    0.304594602

Pooled Within-Class Standardized Canonical Coefficients
Variable           Can1           Can2           Can3
STG        0.0194211037   0.8228412875   0.4548704112
SCG        0.7851625906   0.1965884250   -.5542347488
STR        0.5086556530   -.5246745601   0.6528137448
LPR        0.0407095387   0.3351939353   0.0058034569
PEG        0.3459378732   0.0659299550   0.2768004086

Raw Canonical Coefficients
Variable           Can1           Can2           Can3
STG         0.028593157    1.211446595    0.669693195
SCG         1.228738665    0.307650672   -0.867348590
STR         0.735707307   -0.758876669    0.944214104
LPR         0.042390255    0.349032608    0.006043056
PEG         0.380674325    0.072550140    0.304594602

Class Means on Canonical Variables
CLUSTER            Can1           Can2           Can3
1          -1.520363604   -0.361951851   -0.882501324
2           0.141975888   -1.075513864    1.195253975
3           2.319402827   -0.184733649   -1.025564852
4          -0.088492449    2.331300115    0.536167180

The SGPlot Procedure
Scatter plot of Can2 by Can1, grouped by cluster

The SGPlot Procedure
Scatter plot of Can3 by Can1, grouped by cluster

The MEANS Procedure
Analysis Variable : UNS (by cluster)
Cluster   N       Mean    Std Dev    Minimum    Maximum
      1  87  3.0689655  0.8182952  1.0000000  4.0000000
      2  85  2.1764706  0.8476364  1.0000000  4.0000000
      3  54  1.9074074  0.8302879  1.0000000  4.0000000
      4  57  1.8771930  0.8674712  1.0000000  4.0000000

The ANOVA Procedure
Class Level Information
Class    Levels  Values
CLUSTER       4  1 2 3 4
Number of Observations Read  283
Number of Observations Used  283

The ANOVA Procedure
Dependent Variable: UNS
Source            DF  Sum of Squares  Mean Square  F Value  Pr > F
Model              3      70.8180930   23.6060310    33.50  <.0001
Error            279     196.6165360    0.7047188
Corrected Total  282     267.4346290

R-Square  Coeff Var  Root MSE  UNS Mean
0.264805   35.88693  0.839475  2.339223

Source    DF     Anova SS  Mean Square  F Value  Pr > F
CLUSTER    3  70.81809299  23.60603100    33.50  <.0001

Distribution of UNS by CLUSTER (box plot)

The ANOVA Procedure
Tukey's Studentized Range (HSD) Test for UNS
Note: This test controls the Type I experimentwise error rate.
Alpha                                   0.05
Error Degrees of Freedom                279
Error Mean Square                       0.704719
Critical Value of Studentized Range     3.65515
Comparisons significant at the 0.05 level are indicated by ***.
CLUSTER        Difference    Simultaneous 95%
Comparison  Between Means   Confidence Limits
1 - 2              0.8925    0.5616    1.2234  ***
1 - 3              1.1616    0.7857    1.5374  ***
1 - 4              1.1918    0.8220    1.5615  ***
2 - 1             -0.8925   -1.2234   -0.5616  ***
2 - 3              0.2691   -0.1085    0.6466
2 - 4              0.2993   -0.0722    0.6707
3 - 1             -1.1616   -1.5374   -0.7857  ***
3 - 2             -0.2691   -0.6466    0.1085
3 - 4              0.0302   -0.3818    0.4422
4 - 1             -1.1918   -1.5615   -0.8220  ***
4 - 2             -0.2993   -0.6707    0.0722
4 - 3             -0.0302   -0.4422    0.3818

SUMMARY OF FINDINGS
  • The GPLOT output shows that the R-square value increases as more clusters are specified.
  • There is a bend in the elbow curve at about 4 clusters, after which the R-square value might be leveling off.
  • Canonical discriminant analysis reduces the 5 clustering variables to 3 canonical variables, which are used in the scatter plots to visualize the 4-cluster solution.
  • The box plots show that cluster 1 has the highest mean UNS (3.07), followed by clusters 2 (2.18), 3 (1.91) and 4 (1.88). Because UNS is coded from 1=high to 4=very low, cluster 1 has the lowest average knowledge level. There is little difference in mean UNS between clusters 3&4, 2&3 and 2&4.
  • The Tukey test shows that mean UNS differed significantly only for clusters 1&2, 1&3 and 1&4.
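
As a quick arithmetic check on the ANOVA output above, the R-square and F values can be recovered from the reported sums of squares and mean squares. This is a worked calculation added for illustration, using only numbers from the ANOVA table (rounded here):

```latex
R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = \frac{70.818}{267.435} \approx 0.265,
\qquad
F = \frac{MS_{\text{Model}}}{MS_{\text{Error}}} = \frac{23.606}{0.7047} \approx 33.50
```

So cluster membership accounts for roughly 26% of the variance in UNS: the differences between clusters are statistically significant, but most of the variation in knowledge level lies within clusters.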





Reference
Kahraman, H. T., Sagiroglu, S., Colak, I., Developing intuitive knowledge classifier and modeling of users' domain dependent data in web, Knowledge-Based Systems, vol. 37, pp. 283-295, 2013.