ATTRIBUTE INFORMATION
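A dataset describing students' knowledge status for goal subjects and related subjects. Data types: real, integer.
283 observations in the 70% training sample analyzed below (the SURVEYSELECT selection probability of 0.702233 implies 403 complete observations before the split).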
STG: The degree of study time for goal subject (input value).
SCG: The degree of repetition for goal subject (input value).
STR: The degree of study time for related subjects (input value).
LPR: The exam performance for related subjects (input value).
PEG: The exam performance for goal subject (input value).
UNS: The knowledge level (target value: 1=high, 2=medium, 3=low, 4=very low).
SAS CODE
***********************************************************************
Cluster analysis is an unsupervised learning method (there is no specific
response variable included in the analysis).
The goal of cluster analysis is to group or cluster observations into subsets
based on the similarity of responses on multiple variables. Observations that
have similar response patterns are grouped together to form clusters.
The goal is to partition the observations in a data set.
**********************************************************************;
%web_drop_table(WORK.knowledge);
PROC IMPORT DATAFILE="/home/mst07221/sasuser.v94/user_knowledge.csv"
DBMS=CSV
OUT=WORK.knowledge;
GETNAMES=YES;
RUN;
*********************************************************************
DATA MANAGEMENT
*********************************************************************;
data clust;
set work.knowledge;
* create a unique identifier to merge cluster assignment variable with
the main data set;
idnum=_n_;
* delete observations with missing data;
if cmiss(of _all_) then delete;
run;
ods graphics on;
***********************************************************************
Use PROC SURVEYSELECT for probability-based random sampling. Split the data randomly into training and test data. Random selection avoids selection bias and enables the use of statistical theory to make valid inferences from the sample to the survey population.
***********************************************************************;
proc surveyselect
data=clust
out=traintest
seed = 123
samprate=0.7
method=srs
outall;
run;
* 70% training sample;
data clus_train;
set traintest;
if selected=1;
run;
* 30% test sample;
data clus_test;
set traintest;
if selected=0;
run;
* standardize the clustering variables to have a mean of 0 and standard deviation of 1;
proc standard data=clus_train out=clustvar mean=0 std=1;
var STG SCG STR LPR PEG;
run;
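/* Optional check (a minimal sketch, not part of the original analysis):
   confirm that each standardized clustering variable in CLUSTVAR has a
   mean of about 0 and a standard deviation of about 1. */
proc means data=clustvar n mean std maxdec=4;
var STG SCG STR LPR PEG;
run;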
* k-means clustering is a method of cluster analysis which aims to partition
* n observations into k clusters in which each observation belongs to the cluster with the nearest mean;
%macro kmean(K);
***********************************************************************
The FASTCLUS procedure performs a disjoint cluster analysis on the basis of distances computed from one or more quantitative variables. It uses Euclidean distances, so the cluster centers are based on least squares estimation. This kind of clustering method is often called a k-means model, since the cluster centers are the means of the observations assigned to each cluster when the algorithm is run to complete convergence. Each iteration reduces the least squares criterion until convergence is achieved.
out= Specifies the output SAS data set containing the original data and cluster assignments.
outstat= Specifies the output SAS data set containing statistics.
maxclusters= Specifies the maximum number of clusters.
maxiter= Specifies the maximum number of iterations.
***********************************************************************;
proc fastclus data=clustvar
out=outdata&K.
outstat=cluststat&K.
maxclusters= &K.
maxiter=300;
var STG SCG STR LPR PEG;
run;
* End macro;
%mend;
* Print the output and create the output data sets for K equals 1 to 12 clusters;
%kmean(1);
%kmean(2);
%kmean(3);
%kmean(4);
%kmean(5);
%kmean(6);
%kmean(7);
%kmean(8);
%kmean(9);
%kmean(10);
%kmean(11);
%kmean(12);
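/* Optional alternative to the twelve explicit calls above (a sketch):
   the macro name run_kmeans is hypothetical and simply calls %kmean
   for K = 1 to maxk. Use either the explicit calls or this loop. */
%macro run_kmeans(maxk);
%local k;
%do k = 1 %to &maxk.;
%kmean(&k.)
%end;
%mend run_kmeans;
%run_kmeans(12);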
* extract r-square values from each cluster solution and then merge them to plot elbow curve;
data clus1;
set cluststat1;
nclust=1;
if _type_='RSQ';
keep nclust over_all;
run;
data clus2;
set cluststat2;
nclust=2;
if _type_='RSQ';
keep nclust over_all;
run;
data clus3;
set cluststat3;
nclust=3;
if _type_='RSQ';
keep nclust over_all;
run;
data clus4;
set cluststat4;
nclust=4;
if _type_='RSQ';
keep nclust over_all;
run;
data clus5;
set cluststat5;
nclust=5;
if _type_='RSQ';
keep nclust over_all;
run;
data clus6;
set cluststat6;
nclust=6;
if _type_='RSQ';
keep nclust over_all;
run;
data clus7;
set cluststat7;
nclust=7;
if _type_='RSQ';
keep nclust over_all;
run;
data clus8;
set cluststat8;
nclust=8;
if _type_='RSQ';
keep nclust over_all;
run;
data clus9;
set cluststat9;
nclust=9;
if _type_='RSQ';
keep nclust over_all;
run;
data clus10;
set cluststat10;
nclust=10;
if _type_='RSQ';
keep nclust over_all;
run;
data clus11;
set cluststat11;
nclust=11;
if _type_='RSQ';
keep nclust over_all;
run;
data clus12;
set cluststat12;
nclust=12;
if _type_='RSQ';
keep nclust over_all;
run;
data clusrsquare;
set clus1 clus2 clus3 clus4 clus5 clus6 clus7 clus8 clus9 clus10 clus11 clus12;
run;
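/* Optional alternative to the twelve data steps above (a sketch): the
   macro name collect_rsq is hypothetical. It keeps the overall R-square
   row from each CLUSTSTATk data set and stacks the results into
   CLUSRSQUARE, producing the same output as the explicit steps. */
%macro collect_rsq(maxk);
%local k;
%do k = 1 %to &maxk.;
data clus&k.;
set cluststat&k.;
nclust=&k.;
if _type_='RSQ';
keep nclust over_all;
run;
%end;
data clusrsquare;
set %do k = 1 %to &maxk.; clus&k. %end; ;
run;
%mend collect_rsq;
%collect_rsq(12);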
** R-square values from each cluster solution;
proc print data=clusrsquare;
run;
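/* Optional sketch: compute the gain in overall R-square from each added
   cluster to help judge where the elbow curve flattens. The data set
   name rsq_gain and the variable gain are hypothetical. DIF() returns
   the current value minus the previous one, so gain is missing for the
   first row. */
data rsq_gain;
set clusrsquare;
gain = dif(over_all);
run;
proc print data=rsq_gain;
run;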
* plot elbow curve using r-square values;
symbol1 color=blue interpol=join;
proc gplot data=clusrsquare;
plot over_all*nclust;
run;
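/* Optional sketch: the same elbow curve drawn with PROC SGPLOT, which
   may be convenient where SAS/GRAPH (PROC GPLOT) is not available. */
proc sgplot data=clusrsquare;
series x=nclust y=over_all / markers;
xaxis label="Number of clusters" integer;
yaxis label="Overall R-square";
run;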
**********************************************************************
Further examine the cluster solution for the number of clusters suggested by the elbow curve.
The CANDISC procedure performs a canonical discriminant analysis, which finds linear combinations of the quantitative variables that provide maximal separation between classes or groups.
Use canonical discriminant analysis (a data reduction technique) to create a smaller number of variables that are linear combinations of the 5 clustering variables, then plot the clusters for the 4-cluster solution. The new variables, called canonical variables, are ordered by the proportion of variance in the clustering variables that each one accounts for. The first canonical variable accounts for the largest proportion of the variance, the second for the next largest, and so on.
clustcan = output data set that includes the canonical variables estimated by the canonical discriminant analysis.
cluster = the cluster assignment variable, which is categorical with four categories.
**********************************************************************;
proc candisc data=outdata4 out=clustcan;
class cluster;
var STG SCG STR LPR PEG;
run;
proc sgplot data=clustcan;
scatter y=can2 x=can1 / group=cluster;
run;
proc sgplot data=clustcan;
scatter y=can3 x=can1 / group=cluster;
run;
* Validate clusters on UNS (the knowledge level of the user);
* See how the clusters differ on UNS;
* First merge the cluster assignment data with the UNS data;
data UNS_data;
set clus_train;
keep idnum UNS;
run;
proc sort data=outdata4;
by idnum;
run;
proc sort data=UNS_data;
by idnum;
run;
data merged;
merge outdata4 UNS_data;
by idnum;
run;
proc sort data=merged;
by cluster;
run;
proc means data=merged;
var UNS;
by cluster;
run;
**********************************************************************
Use the ANOVA procedure to test whether mean UNS differs significantly between clusters. Use the class statement to indicate that the cluster membership variable is categorical. The model statement specifies the model with UNS as the response variable and cluster as the explanatory variable. The box plot shows the distribution of UNS by cluster.
**********************************************************************;
proc anova data=merged;
class cluster;
model UNS = cluster;
means cluster/tukey;
run;
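/* Optional complementary check (a sketch, not part of the original
   analysis): UNS is an ordered category coded 1 to 4, so a
   cross-tabulation of cluster by UNS with a chi-square test shows how
   the knowledge levels are distributed within each cluster. */
proc freq data=merged;
tables cluster*UNS / chisq nocol nopercent;
run;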
%web_open_table(WORK.knowledge);
RESULTS
The SURVEYSELECT Procedure
Selection Method | Simple Random Sampling |
---|---|
Input Data Set | CLUST |
Random Number Seed | 123 |
Sampling Rate | 0.7 |
Sample Size | 283 |
Selection Probability | 0.702233 |
Sampling Weight | 0 |
Output Data Set | TRAINTEST |
** FASTCLUS output for the 1- to 11-cluster solutions is omitted to save space; only the 12-cluster solution is shown **
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=12 Maxiter=300 Converge=0.02
Initial Seeds | |||||
---|---|---|---|---|---|
Cluster | STG | SCG | STR | LPR | PEG |
1 | -1.675358549 | -1.675404264 | -1.842762469 | -1.666000968 | -1.699570022 |
2 | -0.741818206 | 1.959276200 | 1.842333188 | 1.552521893 | 1.505695120 |
3 | 1.498678617 | 1.167102253 | -0.465913542 | -0.449242325 | -0.832262984 |
4 | -1.301942412 | 0.934109915 | 0.991926498 | -1.116497065 | -1.322480006 |
5 | -1.301942412 | -1.302616524 | -1.518798016 | 2.180526354 | -0.794553982 |
6 | 2.945666148 | 0.607920643 | -1.559293573 | 1.081518548 | 0.902351093 |
7 | 1.918771771 | -0.324048707 | 1.153908724 | 1.552521893 | 0.864642091 |
8 | -1.021880309 | 1.353496123 | -1.073346893 | -0.645493719 | 0.789224088 |
9 | 0.425107223 | 1.353496123 | -1.073346893 | 2.219776632 | 0.374425070 |
10 | -1.161911360 | -0.277450239 | 1.356386508 | 1.474021335 | -0.492881969 |
11 | -1.208588377 | -1.209419589 | 0.991926498 | -1.077246786 | 1.694240128 |
12 | 0.635153800 | 2.145670070 | 0.870439828 | -1.626750689 | 1.807367133 |
Minimum Distance Between Initial Seeds = 2.952894
Iteration History: Relative Change in Cluster Seeds (columns 1-12 are clusters)
Iteration | Criterion | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.8515 | 0.5605 | 0.3259 | 0.4481 | 0.3893 | 0.5117 | 0.4188 | 0.4059 | 0.4071 | 0.3673 | 0.3558 | 0.4390 | 0.3902 |
2 | 0.6196 | 0.1122 | 0.1445 | 0.0739 | 0.0789 | 0.0976 | 0.0851 | 0.0839 | 0.0711 | 0.1607 | 0.0691 | 0.0659 | 0.1154 |
3 | 0.6031 | 0.0247 | 0.0532 | 0.0860 | 0.0666 | 0.0305 | 0 | 0.0419 | 0.0586 | 0.0700 | 0.0449 | 0.0325 | 0.0644 |
4 | 0.5973 | 0.0288 | 0 | 0.0446 | 0.0478 | 0 | 0.0367 | 0.0678 | 0.0508 | 0.0370 | 0.0271 | 0 | 0.0721 |
5 | 0.5931 | 0.0718 | 0 | 0.0420 | 0.0494 | 0.0205 | 0.0327 | 0.0584 | 0.0922 | 0 | 0.0340 | 0.0420 | 0.0497 |
6 | 0.5873 | 0.0811 | 0 | 0.0242 | 0.0599 | 0.0343 | 0 | 0.0545 | 0.0544 | 0.0302 | 0.0142 | 0.0482 | 0 |
7 | 0.5828 | 0.0358 | 0 | 0.0232 | 0.0404 | 0.0463 | 0 | 0.0489 | 0.0230 | 0.0374 | 0.0282 | 0.0344 | 0 |
8 | 0.5810 | 0.0164 | 0.0405 | 0.0266 | 0 | 0 | 0 | 0.0300 | 0 | 0 | 0.0208 | 0 | 0 |
9 | 0.5806 | 0 | 0 | 0 | 0 | 0.0220 | 0 | 0 | 0 | 0 | 0.0158 | 0 | 0 |
10 | 0.5804 | 0.0204 | 0 | 0 | 0 | 0.0456 | 0 | 0 | 0 | 0 | 0.0142 | 0 | 0 |
11 | 0.5799 | 0.0173 | 0 | 0 | 0 | 0.0331 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
12 | 0.5797 | 0 | 0 | 0 | 0 | 0.0338 | 0 | 0 | 0 | 0.0465 | 0 | 0 | 0 |
13 | 0.5794 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Convergence criterion is satisfied.
Criterion Based on Final Seeds = 0.5794
Cluster Summary | ||||||
---|---|---|---|---|---|---|
Cluster | Frequency | RMS Std Deviation | Maximum Distance from Seed to Observation | Radius Exceeded | Nearest Cluster | Distance Between Cluster Centroids |
1 | 38 | 0.5559 | 2.2218 | | 5 | 1.8793 |
2 | 12 | 0.7410 | 2.3100 | | 12 | 2.0017 |
3 | 20 | 0.6158 | 1.7140 | | 1 | 1.9677 |
4 | 25 | 0.5522 | 2.1841 | | 1 | 1.8845 |
5 | 22 | 0.5647 | 2.0978 | | 10 | 1.8659 |
6 | 17 | 0.6883 | 2.2116 | | 3 | 2.2651 |
7 | 22 | 0.6093 | 2.0670 | | 11 | 1.9511 |
8 | 30 | 0.5335 | 1.8616 | | 11 | 1.7155 |
9 | 16 | 0.6120 | 1.9485 | | 5 | 2.0087 |
10 | 33 | 0.6238 | 2.2158 | | 5 | 1.8659 |
11 | 28 | 0.5199 | 2.0924 | | 8 | 1.7155 |
12 | 20 | 0.6204 | 1.8267 | | 2 | 2.0017 |
Statistics for Variables | ||||
---|---|---|---|---|
Variable | Total STD | Within STD | R-Square | RSQ/(1-RSQ) |
STG | 1.00000 | 0.64363 | 0.601895 | 1.511900 |
SCG | 1.00000 | 0.65743 | 0.584641 | 1.407553 |
STR | 1.00000 | 0.52940 | 0.730663 | 2.712825 |
LPR | 1.00000 | 0.55764 | 0.701170 | 2.346382 |
PEG | 1.00000 | 0.56138 | 0.697146 | 2.301921 |
OVER-ALL | 1.00000 | 0.59209 | 0.663103 | 1.968266 |
Pseudo F Statistic = 48.49
Approximate Expected Over-All R-Squared = 0.65781
Cubic Clustering Criterion = 0.685
WARNING: The two values above are invalid for correlated variables.
Cluster Means | |||||
---|---|---|---|---|---|
Cluster | STG | SCG | STR | LPR | PEG |
1 | -0.616281597 | -0.440790131 | -1.094660343 | -0.640225919 | -0.875926038 |
2 | 0.023295900 | 1.967042611 | 0.765826306 | 1.261415659 | 1.222877607 |
3 | 1.227951917 | -0.127403174 | -0.698762993 | -0.655306289 | -0.411807615 |
4 | -0.650517960 | -0.547721351 | 0.779729781 | -0.799354812 | -0.863938545 |
5 | -0.516919851 | -0.764827847 | -0.644462133 | 1.147530380 | -0.741418571 |
6 | 1.927008892 | -0.202618583 | -0.954242314 | 0.963767711 | 0.984423626 |
7 | 1.086011352 | -0.253515572 | 0.979041548 | -0.304729935 | 0.557827941 |
8 | -0.228526607 | -0.198232845 | -0.703487475 | -0.693902396 | 0.950115828 |
9 | 0.002096755 | 1.167102253 | -0.726603689 | 1.091331117 | -0.589511286 |
10 | 0.073898030 | -0.003084111 | 0.952658079 | 1.140988667 | -0.778556224 |
11 | -0.777992894 | -0.430559490 | 0.892133876 | -0.520733904 | 1.054533850 |
12 | -0.213434372 | 1.733273633 | 0.603169154 | -0.618018524 | 0.642158981 |
Cluster Standard Deviations | |||||
---|---|---|---|---|---|
Cluster | STG | SCG | STR | LPR | PEG |
1 | 0.513816102 | 0.719573147 | 0.504889199 | 0.561834919 | 0.438842474 |
2 | 1.090544367 | 0.604257468 | 0.790622458 | 0.553682781 | 0.509103263 |
3 | 0.465687975 | 0.786406190 | 0.575531254 | 0.440744289 | 0.731744380 |
4 | 0.636825010 | 0.727611606 | 0.491160633 | 0.456598455 | 0.373725188 |
5 | 0.568103662 | 0.541078832 | 0.592617584 | 0.486840055 | 0.625133678 |
6 | 0.673839630 | 0.654875253 | 0.412843856 | 0.861273117 | 0.757499991 |
7 | 0.628816500 | 0.603394755 | 0.466161707 | 0.770230509 | 0.534844726 |
8 | 0.499549345 | 0.694506402 | 0.484166892 | 0.475904805 | 0.480052576 |
9 | 0.901639011 | 0.299586231 | 0.440631542 | 0.582969052 | 0.660495348 |
10 | 0.812557275 | 0.822554326 | 0.412787208 | 0.485709435 | 0.450148029 |
11 | 0.397600084 | 0.591010750 | 0.508123428 | 0.567088035 | 0.514024308 |
12 | 0.664433768 | 0.357538270 | 0.752618965 | 0.418129170 | 0.783470812 |
Obs | OVER_ALL | nclust |
---|---|---|
1 | 0.00000 | 1 |
2 | 0.18138 | 2 |
3 | 0.29716 | 3 |
4 | 0.38743 | 4 |
5 | 0.46995 | 5 |
6 | 0.51156 | 6 |
7 | 0.53786 | 7 |
8 | 0.57301 | 8 |
9 | 0.59462 | 9 |
10 | 0.62268 | 10 |
11 | 0.63364 | 11 |
12 | 0.66310 | 12 |
The CANDISC Procedure
Total Sample Size | 283 | DF Total | 282 |
---|---|---|---|
Variables | 5 | DF Within Classes | 279 |
Classes | 4 | DF Between Classes | 3 |
Number of Observations Read | 283 |
---|---|
Number of Observations Used | 283 |
Class Level Information | ||||
---|---|---|---|---|
CLUSTER | Variable Name | Frequency | Weight | Proportion |
1 | 1 | 87 | 87.0000 | 0.307420 |
2 | 2 | 85 | 85.0000 | 0.300353 |
3 | 3 | 54 | 54.0000 | 0.190813 |
4 | 4 | 57 | 57.0000 | 0.201413 |
Multivariate Statistics and F Approximations (S=3, M=0.5, N=136.5)
Statistic | Value | F Value | Num DF | Den DF | Pr > F |
---|---|---|---|---|---|
Wilks' Lambda | 0.07412299 | 79.33 | 15 | 759.56 | <.0001 |
Pillai's Trace | 1.72523079 | 74.98 | 15 | 831 | <.0001 |
Hotelling-Lawley Trace | 4.22038912 | 77.11 | 15 | 514.23 | <.0001 |
Roy's Greatest Root | 1.76975157 | 98.04 | 5 | 277 | <.0001 |
NOTE: F Statistic for Roy's Greatest Root is an upper bound.
Canonical Correlations and Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)
The likelihood ratio columns test H0: the canonical correlations in the current row and all that follow are zero.
 | Canonical Correlation | Adjusted Canonical Correlation | Approximate Standard Error | Squared Canonical Correlation | Eigenvalue | Difference | Proportion | Cumulative | Likelihood Ratio | Approximate F Value | Num DF | Den DF | Pr > F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.799348 | 0.783917 | 0.021500 | 0.638957 | 1.7698 | 0.2595 | 0.4193 | 0.4193 | 0.07412299 | 79.33 | 15 | 759.56 | <.0001 |
2 | 0.775649 | . | 0.023723 | 0.601631 | 1.5102 | 0.5698 | 0.3578 | 0.7772 | 0.20530227 | 83.28 | 8 | 552 | <.0001 |
3 | 0.696163 | . | 0.030689 | 0.484643 | 0.9404 | | 0.2228 | 1.0000 | 0.51535684 | 86.83 | 3 | 277 | <.0001 |
Total Canonical Structure | |||
---|---|---|---|
Variable | Can1 | Can2 | Can3 |
STG | 0.083359 | 0.843849 | 0.477962 |
SCG | 0.854954 | 0.199347 | -0.465622 |
STR | 0.583251 | -0.469715 | 0.604287 |
LPR | 0.133626 | 0.355706 | 0.003222 |
PEG | 0.440575 | 0.161348 | 0.298804 |
Between Canonical Structure | |||
---|---|---|---|
Variable | Can1 | Can2 | Can3 |
STG | 0.090378 | 0.887777 | 0.451314 |
SCG | 0.885211 | 0.200282 | -0.419868 |
STR | 0.642178 | -0.501838 | 0.579453 |
LPR | 0.361022 | 0.932527 | 0.007582 |
PEG | 0.823339 | 0.292585 | 0.486319 |
Pooled Within Canonical Structure | |||
---|---|---|---|
Variable | Can1 | Can2 | Can3 |
STG | 0.074138 | 0.788348 | 0.507876 |
SCG | 0.808248 | 0.197959 | -0.525909 |
STR | 0.509611 | -0.431103 | 0.630813 |
LPR | 0.084055 | 0.235032 | 0.002422 |
PEG | 0.292871 | 0.112664 | 0.237311 |
Total-Sample Standardized Canonical Coefficients | |||
---|---|---|---|
Variable | Can1 | Can2 | Can3 |
STG | 0.028593157 | 1.211446595 | 0.669693195 |
SCG | 1.228738665 | 0.307650672 | -0.867348590 |
STR | 0.735707307 | -0.758876669 | 0.944214104 |
LPR | 0.042390255 | 0.349032608 | 0.006043056 |
PEG | 0.380674325 | 0.072550140 | 0.304594602 |
Pooled Within-Class Standardized Canonical Coefficients | |||
---|---|---|---|
Variable | Can1 | Can2 | Can3 |
STG | 0.0194211037 | 0.8228412875 | 0.4548704112 |
SCG | 0.7851625906 | 0.1965884250 | -.5542347488 |
STR | 0.5086556530 | -.5246745601 | 0.6528137448 |
LPR | 0.0407095387 | 0.3351939353 | 0.0058034569 |
PEG | 0.3459378732 | 0.0659299550 | 0.2768004086 |
Raw Canonical Coefficients | |||
---|---|---|---|
Variable | Can1 | Can2 | Can3 |
STG | 0.028593157 | 1.211446595 | 0.669693195 |
SCG | 1.228738665 | 0.307650672 | -0.867348590 |
STR | 0.735707307 | -0.758876669 | 0.944214104 |
LPR | 0.042390255 | 0.349032608 | 0.006043056 |
PEG | 0.380674325 | 0.072550140 | 0.304594602 |
Class Means on Canonical Variables | |||
---|---|---|---|
CLUSTER | Can1 | Can2 | Can3 |
1 | -1.520363604 | -0.361951851 | -0.882501324 |
2 | 0.141975888 | -1.075513864 | 1.195253975 |
3 | 2.319402827 | -0.184733649 | -1.025564852 |
4 | -0.088492449 | 2.331300115 | 0.536167180 |
The MEANS Procedure
Analysis Variable : UNS, by cluster
Cluster | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|
1 | 87 | 3.0689655 | 0.8182952 | 1.0000000 | 4.0000000 |
2 | 85 | 2.1764706 | 0.8476364 | 1.0000000 | 4.0000000 |
3 | 54 | 1.9074074 | 0.8302879 | 1.0000000 | 4.0000000 |
4 | 57 | 1.8771930 | 0.8674712 | 1.0000000 | 4.0000000 |
The ANOVA Procedure
Class Level Information | ||
---|---|---|
Class | Levels | Values |
CLUSTER | 4 | 1 2 3 4 |
Number of Observations Read | 283 |
---|---|
Number of Observations Used | 283 |
Dependent Variable: UNS
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 3 | 70.8180930 | 23.6060310 | 33.50 | <.0001 |
Error | 279 | 196.6165360 | 0.7047188 | ||
Corrected Total | 282 | 267.4346290 |
R-Square | Coeff Var | Root MSE | UNS Mean |
---|---|---|---|
0.264805 | 35.88693 | 0.839475 | 2.339223 |
Source | DF | Anova SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
CLUSTER | 3 | 70.81809299 | 23.60603100 | 33.50 | <.0001 |
Tukey's Studentized Range (HSD) Test for UNS
This test controls the Type I experimentwise error rate.
Alpha | 0.05 |
---|---|
Error Degrees of Freedom | 279 |
Error Mean Square | 0.704719 |
Critical Value of Studentized Range | 3.65515 |
Comparisons significant at the 0.05 level are indicated by ***.
CLUSTER Comparison | Difference Between Means | Lower Simultaneous 95% Confidence Limit | Upper Simultaneous 95% Confidence Limit | Significance |
---|---|---|---|---|
1 - 2 | 0.8925 | 0.5616 | 1.2234 | *** |
1 - 3 | 1.1616 | 0.7857 | 1.5374 | *** |
1 - 4 | 1.1918 | 0.8220 | 1.5615 | *** |
2 - 1 | -0.8925 | -1.2234 | -0.5616 | *** |
2 - 3 | 0.2691 | -0.1085 | 0.6466 | |
2 - 4 | 0.2993 | -0.0722 | 0.6707 | |
3 - 1 | -1.1616 | -1.5374 | -0.7857 | *** |
3 - 2 | -0.2691 | -0.6466 | 0.1085 | |
3 - 4 | 0.0302 | -0.3818 | 0.4422 | |
4 - 1 | -1.1918 | -1.5615 | -0.8220 | *** |
4 - 2 | -0.2993 | -0.6707 | 0.0722 | |
4 - 3 | -0.0302 | -0.4422 | 0.3818 |
SUMMARY OF FINDINGS
- The GPLOT elbow curve shows that the overall R-square value increases as more clusters are specified, from 0 with 1 cluster to 0.66 with 12 clusters.
- There is a bend in the elbow curve after the 4th cluster, where the R-square value might be leveling off.
- Canonical discriminant analysis reduces the 5 clustering variables to 3 canonical variables, which are used to visualize the 4-cluster solution in the scatter plots.
- The box plot shows that cluster 1 has the highest mean UNS (3.07), followed by clusters 2 (2.18), 3 (1.91) and 4 (1.88). Because UNS is coded from 1=high to 4=very low, the highest mean corresponds to the lowest knowledge level. Mean UNS differs little between clusters 3 and 4, 2 and 3, and 2 and 4.
- The Tukey test shows that mean UNS differed significantly between cluster 1 and each of clusters 2, 3 and 4. None of the other pairwise differences was significant at the 0.05 level.
Reference
H. T. Kahraman, S. Sagiroglu, I. Colak, Developing intuitive knowledge classifier and modeling of users' domain dependent data in web, Knowledge-Based Systems, vol. 37, pp. 283-295, 2013.