Analytics: May 2016

The analysis was run in SAS Studio on data from the UCI Machine Learning Repository. The objective was to complete a k-means cluster analysis, an unsupervised machine learning method that partitions a set of data points into a small number of clusters so as to minimize the distance between each data point and its assigned cluster centroid.
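
Stated as a formula, the quantity k-means tries to minimize is usually written as the within-cluster sum of squared Euclidean distances. The notation below (x_i for an observation, C_j for the set of observations in cluster j, mu_j for that cluster's centroid, k for the number of clusters) is introduced here only for illustration and does not appear in the SAS code.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```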

ATTRIBUTE INFORMATION

The dataset describes students' knowledge status for goal subjects and related subjects. Data types: real, integer.

283 observations in the 70% training sample used for clustering.

STG: The degree of study time for goal subject (input value).

SCG: The degree of repetition for goal subject (input value).

STR: The degree of study time for related subjects (input value).

LPR: The exam performance for related subjects (input value).

PEG: The exam performance for goal subject (input value).

UNS: The knowledge level (target value: 1=high, 2=medium, 3=low, 4=very low).

SAS CODE

***********************************************************************
Cluster analysis is an unsupervised learning method (there is no specific
response variable included in the analysis).
The goal of cluster analysis is to group or cluster observations into subsets
based on the similarity of responses on multiple variables. Observations that
have similar response patterns are grouped together to form clusters.
The goal is to partition the observations in a data set.
**********************************************************************;

%web_drop_table(WORK.knowledge);

PROC IMPORT DATAFILE="/home/mst07221/sasuser.v94/user_knowledge.csv"
DBMS=CSV
OUT=WORK.knowledge;
GETNAMES=YES;
RUN;

*********************************************************************
DATA MANAGEMENT
*********************************************************************;
data clust;
set work.knowledge;

* create a unique identifier to merge cluster assignment variable with
the main data set;
idnum=_n_;

* delete observations with missing data;
if cmiss(of _all_) then delete;
run;

ods graphics on;

***********************************************************************
Use PROC SURVEYSELECT for probability-based random sampling. Split the data randomly into training and test sets. Random selection avoids selection bias and enables the use of statistical theory to make valid inferences from the sample to the survey population
***********************************************************************;
proc surveyselect
data=clust
out=traintest
seed = 123
samprate=0.7
method=srs
outall;
run;

* 70% training sample;
data clus_train;
set traintest;
if selected=1;
run;

* 30% test sample;
data clus_test;
set traintest;
if selected=0;
run;
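
/* Optional check, not part of the original analysis: tabulate the SELECTED
   flag that PROC SURVEYSELECT adds with OUTALL to confirm the sizes of the
   training (selected=1) and test (selected=0) samples. */
proc freq data=traintest;
tables selected;
run;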

* standardize the clustering variables to have a mean of 0 and standard deviation of 1;
proc standard data=clus_train out=clustvar mean=0 std=1;
var STG SCG STR LPR PEG;
run;
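
/* Optional check, not part of the original analysis: after PROC STANDARD
   each clustering variable in CLUSTVAR should have a mean of about 0 and a
   standard deviation of about 1. */
proc means data=clustvar mean std maxdec=3;
var STG SCG STR LPR PEG;
run;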

* k-means clustering is a method of cluster analysis which aims to partition
* n observations into k clusters in which each observation belongs to the cluster with the nearest mean;
%macro kmean(K);

***********************************************************************
The FASTCLUS procedure performs a disjoint cluster analysis on the basis of distances computed from one or more quantitative variables. It uses Euclidean distances, so the cluster centers are based on least squares estimation. This kind of clustering method is often called a k-means model, since the cluster centers are the means of the observations assigned to each cluster when the algorithm is run to complete convergence. Each iteration reduces the least squares criterion until convergence is achieved.

out= Specifies output SAS data set containing original data and cluster assignments.
outstat= Specifies output SAS data set containing statistics.
maxclusters= Specifies maximum number of clusters.
maxiter= Specifies maximum number of iterations.
***********************************************************************;
proc fastclus data=clustvar
out=outdata&K.
outstat=cluststat&K.
maxclusters= &K.
maxiter=300;
var STG SCG STR LPR PEG;
run;

* End macro;
%mend;

* Print the output and create the output data sets for K equals 1 to 12 clusters;
%kmean(1);
%kmean(2);
%kmean(3);
%kmean(4);
%kmean(5);
%kmean(6);
%kmean(7);
%kmean(8);
%kmean(9);
%kmean(10);
%kmean(11);
%kmean(12);
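
/* Alternative sketch, not part of the original code: the twelve calls above
   could be generated by a macro loop. The macro name RUN_KMEANS and its
   parameter MAXK are hypothetical. The call is left commented out so the
   cluster solutions are not computed twice. */
%macro run_kmeans(maxk);
%local k;
%do k=1 %to &maxk.;
%kmean(&k.);
%end;
%mend run_kmeans;
/* %run_kmeans(12); */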

* extract r-square values from each cluster solution and then merge them to plot elbow curve;
data clus1;
set cluststat1;
nclust=1;
if _type_='RSQ';
keep nclust over_all;
run;

data clus2;
set cluststat2;
nclust=2;
if _type_='RSQ';
keep nclust over_all;
run;

data clus3;
set cluststat3;
nclust=3;
if _type_='RSQ';
keep nclust over_all;
run;

data clus4;
set cluststat4;
nclust=4;
if _type_='RSQ';
keep nclust over_all;
run;

data clus5;
set cluststat5;
nclust=5;
if _type_='RSQ';
keep nclust over_all;
run;

data clus6;
set cluststat6;
nclust=6;
if _type_='RSQ';
keep nclust over_all;
run;

data clus7;
set cluststat7;
nclust=7;
if _type_='RSQ';
keep nclust over_all;
run;

data clus8;
set cluststat8;
nclust=8;
if _type_='RSQ';
keep nclust over_all;
run;

data clus9;
set cluststat9;
nclust=9;
if _type_='RSQ';
keep nclust over_all;
run;

data clus10;
set cluststat10;
nclust=10;
if _type_='RSQ';
keep nclust over_all;
run;

data clus11;
set cluststat11;
nclust=11;
if _type_='RSQ';
keep nclust over_all;
run;

data clus12;
set cluststat12;
nclust=12;
if _type_='RSQ';
keep nclust over_all;
run;

data clusrsquare;
set clus1 clus2 clus3 clus4 clus5 clus6 clus7 clus8 clus9 clus10 clus11 clus12;
run;

** R-square values from each cluster solution;
proc print data=clusrsquare;
run;

* plot elbow curve using r-square values;
symbol1 color=blue interpol=join;
proc gplot data=clusrsquare;
plot over_all*nclust;
run;
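
/* Alternative sketch, not part of the original code: PROC SGPLOT can draw the
   same elbow curve, and the DIF function gives the gain in R-square from each
   additional cluster. The data set RSQ_GAIN and the variable GAIN are
   hypothetical names. */
data rsq_gain;
set clusrsquare;
gain=dif(over_all);
run;

proc print data=rsq_gain;
run;

proc sgplot data=clusrsquare;
series x=nclust y=over_all / markers;
xaxis values=(1 to 12 by 1) label="Number of clusters";
yaxis label="Overall R-square";
run;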

**********************************************************************
Further examine the cluster solution for the number of clusters suggested by the elbow curve
***********************************************************************
The CANDISC procedure performs a canonical discriminant analysis, which finds linear combinations of the quantitative variables that provide maximal separation between classes or groups.

Use canonical discriminant analysis (a data reduction technique) to create a smaller number of variables that are linear combinations of the 5 clustering variables. Plot the clusters for the 4-cluster solution. The new variables, called canonical variables, are ordered by how much of the between-cluster variation in the clustering variables each one accounts for. So the first canonical variable accounts for the largest proportion, the second canonical variable for the next largest proportion, and so on.

clustcan = output data set that includes the canonical variables estimated by the canonical
discriminant analysis.
cluster = the cluster assignment variable, which is categorical with four
categories.
**********************************************************************;
proc candisc data=outdata4 out=clustcan;
class cluster;
var STG SCG STR LPR PEG;
run;

proc sgplot data=clustcan;
scatter y=can2 x=can1 / group=cluster;
run;
proc sgplot data=clustcan;
scatter y=can3 x=can1 / group=cluster;
run;

* Validate clusters on UNS (The knowledge level of user)
* See how the clusters differ on UNS;
* First merge clustering variable and assignment data with UNS data;
data UNS_data;
set clus_train;
keep idnum UNS;
run;

proc sort data=outdata4;
by idnum;
run;

proc sort data=UNS_data;
by idnum;
run;

data merged;
merge outdata4 UNS_data;
by idnum;
run;

proc sort data=merged;
by cluster;
run;

proc means data=merged;
var UNS;
by cluster;
run;

**********************************************************************
Use the ANOVA procedure here to test whether mean UNS differs significantly between clusters. Use the class statement to indicate that the cluster membership variable is categorical. The model statement specifies the model with UNS as the response variable and cluster as the explanatory variable. The means statement with the tukey option requests Tukey post hoc comparisons of the cluster means. The box plot shows the distribution of UNS by cluster.
**********************************************************************;
proc anova data=merged;
class cluster;
model UNS = cluster;
means cluster/tukey;
run;
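
/* Supplementary sketch, not part of the original analysis: UNS is a coded
   ordinal category (1=high to 4=very low), so a cross-tabulation of cluster
   by UNS with a chi-square test is one way to look at the same relationship
   without treating the codes as interval-scaled. */
proc freq data=merged;
tables cluster*UNS / chisq norow nocol nopercent;
run;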

%web_open_table(WORK.knowledge);

RESULTS

The SURVEYSELECT Procedure

Selection Method	Simple Random Sampling

Input Data Set	CLUST
Random Number Seed	123
Sampling Rate	0.7
Sample Size	283
Selection Probability	0.702233
Sampling Weight	0
Output Data Set	TRAINTEST

** Output for the 1- to 11-cluster solutions is omitted here to save space. The 12-cluster solution is shown below. **

The FASTCLUS Procedure

Replace=FULL Radius=0 Maxclusters=12 Maxiter=300 Converge=0.02

Initial Seeds
Cluster	STG	SCG	STR	LPR	PEG
1	-1.675358549	-1.675404264	-1.842762469	-1.666000968	-1.699570022
2	-0.741818206	1.959276200	1.842333188	1.552521893	1.505695120
3	1.498678617	1.167102253	-0.465913542	-0.449242325	-0.832262984
4	-1.301942412	0.934109915	0.991926498	-1.116497065	-1.322480006
5	-1.301942412	-1.302616524	-1.518798016	2.180526354	-0.794553982
6	2.945666148	0.607920643	-1.559293573	1.081518548	0.902351093
7	1.918771771	-0.324048707	1.153908724	1.552521893	0.864642091
8	-1.021880309	1.353496123	-1.073346893	-0.645493719	0.789224088
9	0.425107223	1.353496123	-1.073346893	2.219776632	0.374425070
10	-1.161911360	-0.277450239	1.356386508	1.474021335	-0.492881969
11	-1.208588377	-1.209419589	0.991926498	-1.077246786	1.694240128
12	0.635153800	2.145670070	0.870439828	-1.626750689	1.807367133

Minimum Distance Between Initial Seeds =	2.952894

Iteration History
Iteration	Criterion	Relative Change in Cluster Seeds
Iteration	Criterion	1	2	3	4	5	6	7	8	9	10	11	12
1	0.8515	0.5605	0.3259	0.4481	0.3893	0.5117	0.4188	0.4059	0.4071	0.3673	0.3558	0.4390	0.3902
2	0.6196	0.1122	0.1445	0.0739	0.0789	0.0976	0.0851	0.0839	0.0711	0.1607	0.0691	0.0659	0.1154
3	0.6031	0.0247	0.0532	0.0860	0.0666	0.0305	0	0.0419	0.0586	0.0700	0.0449	0.0325	0.0644
4	0.5973	0.0288	0	0.0446	0.0478	0	0.0367	0.0678	0.0508	0.0370	0.0271	0	0.0721
5	0.5931	0.0718	0	0.0420	0.0494	0.0205	0.0327	0.0584	0.0922	0	0.0340	0.0420	0.0497
6	0.5873	0.0811	0	0.0242	0.0599	0.0343	0	0.0545	0.0544	0.0302	0.0142	0.0482	0
7	0.5828	0.0358	0	0.0232	0.0404	0.0463	0	0.0489	0.0230	0.0374	0.0282	0.0344	0
8	0.5810	0.0164	0.0405	0.0266	0	0	0	0.0300	0	0	0.0208	0	0
9	0.5806	0	0	0	0	0.0220	0	0	0	0	0.0158	0	0
10	0.5804	0.0204	0	0	0	0.0456	0	0	0	0	0.0142	0	0
11	0.5799	0.0173	0	0	0	0.0331	0	0	0	0	0	0	0
12	0.5797	0	0	0	0	0.0338	0	0	0	0.0465	0	0	0
13	0.5794	0	0	0	0	0	0	0	0	0	0	0	0

Convergence criterion is satisfied.

Criterion Based on Final Seeds =	0.5794

Cluster Summary
Cluster	Frequency	RMS Std Deviation	Maximum Distance from Seed to Observation	Radius Exceeded	Nearest Cluster	Distance Between Cluster Centroids
1	38	0.5559	2.2218		5	1.8793
2	12	0.7410	2.3100		12	2.0017
3	20	0.6158	1.7140		1	1.9677
4	25	0.5522	2.1841		1	1.8845
5	22	0.5647	2.0978		10	1.8659
6	17	0.6883	2.2116		3	2.2651
7	22	0.6093	2.0670		11	1.9511
8	30	0.5335	1.8616		11	1.7155
9	16	0.6120	1.9485		5	2.0087
10	33	0.6238	2.2158		5	1.8659
11	28	0.5199	2.0924		8	1.7155
12	20	0.6204	1.8267		2	2.0017

Statistics for Variables
Variable	Total STD	Within STD	R-Square	RSQ/(1-RSQ)
STG	1.00000	0.64363	0.601895	1.511900
SCG	1.00000	0.65743	0.584641	1.407553
STR	1.00000	0.52940	0.730663	2.712825
LPR	1.00000	0.55764	0.701170	2.346382
PEG	1.00000	0.56138	0.697146	2.301921
OVER-ALL	1.00000	0.59209	0.663103	1.968266

Pseudo F Statistic =	48.49

Approximate Expected Over-All R-Squared =	0.65781

Cubic Clustering Criterion =	0.685

WARNING: The two values above are invalid for correlated variables.

Cluster Means
Cluster	STG	SCG	STR	LPR	PEG
1	-0.616281597	-0.440790131	-1.094660343	-0.640225919	-0.875926038
2	0.023295900	1.967042611	0.765826306	1.261415659	1.222877607
3	1.227951917	-0.127403174	-0.698762993	-0.655306289	-0.411807615
4	-0.650517960	-0.547721351	0.779729781	-0.799354812	-0.863938545
5	-0.516919851	-0.764827847	-0.644462133	1.147530380	-0.741418571
6	1.927008892	-0.202618583	-0.954242314	0.963767711	0.984423626
7	1.086011352	-0.253515572	0.979041548	-0.304729935	0.557827941
8	-0.228526607	-0.198232845	-0.703487475	-0.693902396	0.950115828
9	0.002096755	1.167102253	-0.726603689	1.091331117	-0.589511286
10	0.073898030	-0.003084111	0.952658079	1.140988667	-0.778556224
11	-0.777992894	-0.430559490	0.892133876	-0.520733904	1.054533850
12	-0.213434372	1.733273633	0.603169154	-0.618018524	0.642158981

Cluster Standard Deviations
Cluster	STG	SCG	STR	LPR	PEG
1	0.513816102	0.719573147	0.504889199	0.561834919	0.438842474
2	1.090544367	0.604257468	0.790622458	0.553682781	0.509103263
3	0.465687975	0.786406190	0.575531254	0.440744289	0.731744380
4	0.636825010	0.727611606	0.491160633	0.456598455	0.373725188
5	0.568103662	0.541078832	0.592617584	0.486840055	0.625133678
6	0.673839630	0.654875253	0.412843856	0.861273117	0.757499991
7	0.628816500	0.603394755	0.466161707	0.770230509	0.534844726
8	0.499549345	0.694506402	0.484166892	0.475904805	0.480052576
9	0.901639011	0.299586231	0.440631542	0.582969052	0.660495348
10	0.812557275	0.822554326	0.412787208	0.485709435	0.450148029
11	0.397600084	0.591010750	0.508123428	0.567088035	0.514024308
12	0.664433768	0.357538270	0.752618965	0.418129170	0.783470812

Obs	OVER_ALL	nclust
1	0.00000	1
2	0.18138	2
3	0.29716	3
4	0.38743	4
5	0.46995	5
6	0.51156	6
7	0.53786	7
8	0.57301	8
9	0.59462	9
10	0.62268	10
11	0.63364	11
12	0.66310	12
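
The overall R-square of 0.66310 for the 12-cluster solution can be reproduced by hand from the "Statistics for Variables" table. The sketch below assumes the usual relationship R-square = 1 - SS(within)/SS(total), with the within standard deviation pooled over n - k degrees of freedom and the total standard deviation computed over n - 1. The variable names are illustrative only.

```sas
/* Recompute the OVER-ALL R-square for the 12-cluster solution from the
   printed Total STD and Within STD. Assumes n=283 training observations
   and k=12 clusters, as reported in the output above. */
data _null_;
n=283;
k=12;
total_std=1.00000;
within_std=0.59209;
rsq=1-(within_std**2*(n-k))/(total_std**2*(n-1));
ratio=rsq/(1-rsq);
put rsq= 8.5 ratio= 8.5;
run;
```

With these inputs the log should show values close to 0.6631 and 1.968, in line with the R-Square and RSQ/(1-RSQ) columns of the OVER-ALL row.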

The CANDISC Procedure

Total Sample Size	283	DF Total	282
Variables	5	DF Within Classes	279
Classes	4	DF Between Classes	3

Number of Observations Read	283
Number of Observations Used	283

Class Level Information
CLUSTER	Variable Name	Frequency	Weight	Proportion
1	1	87	87.0000	0.307420
2	2	85	85.0000	0.300353
3	3	54	54.0000	0.190813
4	4	57	57.0000	0.201413

The CANDISC Procedure

Multivariate Statistics and F Approximations
S=3 M=0.5 N=136.5
Statistic	Value	F Value	Num DF	Den DF	Pr > F
NOTE: F Statistic for Roy's Greatest Root is an upper bound.
Wilks' Lambda	0.07412299	79.33	15	759.56	<.0001
Pillai's Trace	1.72523079	74.98	15	831	<.0001
Hotelling-Lawley Trace	4.22038912	77.11	15	514.23	<.0001
Roy's Greatest Root	1.76975157	98.04	5	277	<.0001

The CANDISC Procedure

	Canonical Correlation	Adjusted Canonical Correlation	Approximate Standard Error	Squared Canonical Correlation	Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)				Test of H0: The canonical correlations in the current row and all that follow are zero
	Canonical Correlation	Adjusted Canonical Correlation	Approximate Standard Error	Squared Canonical Correlation	Eigenvalue	Difference	Proportion	Cumulative	Likelihood Ratio	Approximate F Value	Num DF	Den DF	Pr > F
1	0.799348	0.783917	0.021500	0.638957	1.7698	0.2595	0.4193	0.4193	0.07412299	79.33	15	759.56	<.0001
2	0.775649	.	0.023723	0.601631	1.5102	0.5698	0.3578	0.7772	0.20530227	83.28	8	552	<.0001
3	0.696163	.	0.030689	0.484643	0.9404		0.2228	1.0000	0.51535684	86.83	3	277	<.0001

The CANDISC Procedure

Total Canonical Structure
Variable	Can1	Can2	Can3
STG	0.083359	0.843849	0.477962
SCG	0.854954	0.199347	-0.465622
STR	0.583251	-0.469715	0.604287
LPR	0.133626	0.355706	0.003222
PEG	0.440575	0.161348	0.298804

Between Canonical Structure
Variable	Can1	Can2	Can3
STG	0.090378	0.887777	0.451314
SCG	0.885211	0.200282	-0.419868
STR	0.642178	-0.501838	0.579453
LPR	0.361022	0.932527	0.007582
PEG	0.823339	0.292585	0.486319

Pooled Within Canonical Structure
Variable	Can1	Can2	Can3
STG	0.074138	0.788348	0.507876
SCG	0.808248	0.197959	-0.525909
STR	0.509611	-0.431103	0.630813
LPR	0.084055	0.235032	0.002422
PEG	0.292871	0.112664	0.237311

The CANDISC Procedure

Total-Sample Standardized Canonical Coefficients
Variable	Can1	Can2	Can3
STG	0.028593157	1.211446595	0.669693195
SCG	1.228738665	0.307650672	-0.867348590
STR	0.735707307	-0.758876669	0.944214104
LPR	0.042390255	0.349032608	0.006043056
PEG	0.380674325	0.072550140	0.304594602

Pooled Within-Class Standardized Canonical Coefficients
Variable	Can1	Can2	Can3
STG	0.0194211037	0.8228412875	0.4548704112
SCG	0.7851625906	0.1965884250	-.5542347488
STR	0.5086556530	-.5246745601	0.6528137448
LPR	0.0407095387	0.3351939353	0.0058034569
PEG	0.3459378732	0.0659299550	0.2768004086

Raw Canonical Coefficients
Variable	Can1	Can2	Can3
STG	0.028593157	1.211446595	0.669693195
SCG	1.228738665	0.307650672	-0.867348590
STR	0.735707307	-0.758876669	0.944214104
LPR	0.042390255	0.349032608	0.006043056
PEG	0.380674325	0.072550140	0.304594602

Class Means on Canonical Variables
CLUSTER	Can1	Can2	Can3
1	-1.520363604	-0.361951851	-0.882501324
2	0.141975888	-1.075513864	1.195253975
3	2.319402827	-0.184733649	-1.025564852
4	-0.088492449	2.331300115	0.536167180

The MEANS Procedure

Cluster=1

Analysis Variable : UNS
N	Mean	Std Dev	Minimum	Maximum
87	3.0689655	0.8182952	1.0000000	4.0000000

Cluster=2

Analysis Variable : UNS
N	Mean	Std Dev	Minimum	Maximum
85	2.1764706	0.8476364	1.0000000	4.0000000

Cluster=3

Analysis Variable : UNS
N	Mean	Std Dev	Minimum	Maximum
54	1.9074074	0.8302879	1.0000000	4.0000000

Cluster=4

Analysis Variable : UNS
N	Mean	Std Dev	Minimum	Maximum
57	1.8771930	0.8674712	1.0000000	4.0000000

The ANOVA Procedure

Class Level Information
Class	Levels	Values
CLUSTER	4	1 2 3 4

Number of Observations Read	283
Number of Observations Used	283

The ANOVA Procedure

Dependent Variable: UNS

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	3	70.8180930	23.6060310	33.50	<.0001
Error	279	196.6165360	0.7047188
Corrected Total	282	267.4346290

R-Square	Coeff Var	Root MSE	UNS Mean
0.264805	35.88693	0.839475	2.339223

Source	DF	Anova SS	Mean Square	F Value	Pr > F
CLUSTER	3	70.81809299	23.60603100	33.50	<.0001

The ANOVA Procedure

Tukey's Studentized Range (HSD) Test for UNS

Note:This test controls the Type I experimentwise error rate.

Alpha	0.05
Error Degrees of Freedom	279
Error Mean Square	0.704719
Critical Value of Studentized Range	3.65515

Comparisons significant at the 0.05 level are indicated by ***.
CLUSTER Comparison	Difference Between Means	Simultaneous 95% Confidence Limits
1 - 2	0.8925	0.5616	1.2234	***
1 - 3	1.1616	0.7857	1.5374	***
1 - 4	1.1918	0.8220	1.5615	***
2 - 1	-0.8925	-1.2234	-0.5616	***
2 - 3	0.2691	-0.1085	0.6466
2 - 4	0.2993	-0.0722	0.6707
3 - 1	-1.1616	-1.5374	-0.7857	***
3 - 2	-0.2691	-0.6466	0.1085
3 - 4	0.0302	-0.3818	0.4422
4 - 1	-1.1918	-1.5615	-0.8220	***
4 - 2	-0.2993	-0.6707	0.0722
4 - 3	-0.0302	-0.4422	0.3818

SUMMARY OF FINDINGS

The GPLOT elbow curve shows that the R-square value increases as more clusters are specified.
There is a bend in the elbow curve after the 4-cluster solution, where the R-square value might be leveling off.
Canonical discriminant analysis reduces the 5 clustering variables to 3 canonical variables, which are used to visualize the 4-cluster solution in the scatter plots.
The box plots show that cluster 1 has the highest mean UNS (3.07), followed by clusters 2 (2.18), 3 (1.91) and 4 (1.88). Because UNS is coded 1=high to 4=very low, a higher mean UNS corresponds to a lower knowledge level, so cluster 1 has the lowest knowledge level on average. There is little difference in mean UNS between clusters 3&4, 2&3 and 2&4.
The Tukey test shows that mean UNS differed significantly at the 0.05 level for clusters 1&2, 1&3 and 1&4. The other pairwise differences were not significant.

Reference
Kahraman, H. T., Sagiroglu, S., Colak, I., Developing intuitive knowledge classifier and modeling of users' domain dependent data in web, Knowledge Based Systems, vol. 37, pp. 283-295, 2013.

Sunday, May 15, 2016

Machine Learning Data Analysis: A K-Means Cluster Analysis