7.15 KMeans Example Wine Normalised Cluster
The variables of the wine dataset have quite different scales, and so normalisation is appropriate to better suit the distance calculations used by k-means, as discussed in Section 7.13.
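For instance, min-max scaling maps each value x to (x - min)/(max - min) so that every variable ranges over [0, 1] and no single variable dominates the distance calculations. A minimal awk sketch for a single column follows, noting that the scheme actually implemented by ml normalise kmeans may differ (z-score standardisation, for example):
# Min-max scale the second column of wine.csv (illustrative only;
# assumes the column is numeric and the file has a header row).
cut -d"," -f 2 wine.csv |
tail -n +2 |
awk '{x[NR] = $1
      if (NR == 1 || $1 < min) min = $1
      if (NR == 1 || $1 > max) max = $1}
     END {for (i = 1; i <= NR; i++)
            printf "%.4f\n", (x[i] - min) / (max - min)}'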
As a pipeline, we normalise the data, tee a copy to norm.csv, train 3 clusters, predict, and save the cluster memberships as the column label in the file wine.pr using mlr:
cat wine.csv |
ml normalise kmeans |
tee norm.csv |
ml train kmeans 3 |
ml predict kmeans norm.csv |
mlr --csv cut -f label > wine.pr
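Since norm.csv captures the normalised data, a quick sanity check confirms the rescaling. A sketch using mlr's stats1 verb, where the column name alcohol is an assumption about the header of wine.csv:
# Report min, max, and mean of one normalised column
# (alcohol is an assumed column name).
mlr --icsv --opprint stats1 -a min,max,mean -f alcohol norm.csv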
These predictions of cluster membership are then compared to the class recorded in the original wine.data file, noting that a cluster analysis is not a supervised classification task as such. We cat the original data file, cut the first field, use awk to add the column name class as a header, paste the classes to the predictions from wine.pr, remove the header by taking the tail from line 2, sort the resulting rows, and count the number of unique rows:
cat wine.data |
cut -d"," -f 1 |
awk 'NR == 1 {print "class"} {print}' |
paste -d"," - wine.pr |
tail -n +2 |
sort |
uniq -c
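As an aside, the last three steps (stripping the header, sorting, and counting) could arguably be folded into a single mlr call using its count-distinct verb; a sketch producing the same counts as a labelled table:
cat wine.data |
cut -d"," -f 1 |
awk 'NR == 1 {print "class"} {print}' |
paste -d"," - wine.pr |
mlr --icsv --opprint count-distinct -f class,label
Either way, the cross-tabulation below results.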
Cluster 0 predominantly matches class 1, cluster 2 matches class 2, and cluster 1 matches class 3:
 59 1,0
 10 2,0
  9 2,1
 52 2,2
  2 3,0
 46 3,1
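The agreement can be quantified as cluster purity: for each cluster take the count of its most common class, sum these maxima, and divide by the total number of rows. Here that is (59 + 52 + 46)/178, or about 0.88. A small awk sketch computing this from the uniq -c output of the pipeline above:
cat wine.data |
cut -d"," -f 1 |
awk 'NR == 1 {print "class"} {print}' |
paste -d"," - wine.pr |
tail -n +2 |
sort |
uniq -c |
awk '{split($2, a, ","); cluster = a[2]
      # Track the largest class count within each cluster.
      if ($1 > max[cluster]) max[cluster] = $1
      total += $1}
     END {for (c in max) agree += max[c]
          printf "purity: %.3f\n", agree / total}'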
The plot is interestingly more dispersed (cf. Section 7.14). See Section 7.16 for an explanation.
cat wine.pr | paste -d"," wine.csv - | ml visualise kmeans
