7.15 kmeans example wine normalised cluster

20211231 The variables of the wine dataset are measured on quite different scales, so we normalise them before clustering. Otherwise the distance calculations used by k-means (see Section 7.13) would be dominated by the variables with the largest ranges.
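To make the idea concrete, here is a minimal sketch of one common normalisation, min-max rescaling of every column to the range 0 to 1. This is an assumption for illustration only: this section does not say which method ml normalise kmeans applies. The function name minmax and the three sample rows are hypothetical.

```shell
#!/bin/sh
# Min-max rescale every numeric column of a headered CSV to [0, 1].
# Illustration only: this is NOT the code behind `ml normalise kmeans`.
minmax() {
  awk -F, -v OFS=, '
    NR == 1 { print; next }          # pass the header through unchanged
    {
      n = NR - 1                     # data row index, starting at 1
      nf = NF
      for (i = 1; i <= NF; i++) {
        x = $i + 0                   # force a numeric value
        v[n, i] = x
        if (n == 1 || x < lo[i]) lo[i] = x
        if (n == 1 || x > hi[i]) hi[i] = x
      }
    }
    END {
      for (r = 1; r <= n; r++) {
        for (i = 1; i <= nf; i++) {
          range = hi[i] - lo[i]
          # a constant column has no spread, so map it to 0
          x = (range == 0) ? 0 : (v[r, i] - lo[i]) / range
          printf "%s%.2f", (i > 1 ? OFS : ""), x
        }
        print ""
      }
    }'
}

# Hypothetical rows: two columns on very different scales.
printf 'alcohol,proline\n12.0,500\n13.0,1000\n14.0,1500\n' | minmax
```

On the sample rows both columns come out as 0.00, 0.50 and 1.00, so neither one outweighs the other in a Euclidean distance.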

In a single pipeline we normalise the data, tee a copy to norm.csv, train a model with 3 clusters, predict the cluster for each row of norm.csv, and use mlr to save just the label column of the predictions to the file wine.pr:

cat wine.csv | 
  ml normalise kmeans |
  tee norm.csv |
  ml train kmeans 3 |
  ml predict kmeans norm.csv |
  mlr --csv cut -f label > wine.pr

These predictions of cluster membership are then compared to the class recorded in the original wine.data file, noting that a cluster analysis is not a supervised classification task as such. We cat the original data file, cut the first field, use awk to add class as a header line, paste the class alongside the predictions from wine.pr, drop the header by taking the tail from line 2, sort the resulting rows, and count the occurrences of each distinct row:

cat wine.data |
  cut -d"," -f 1 |
  awk 'NR == 1 {print "class"} {print}' |
  paste -d"," - wine.pr |
  tail -n +2 |
  sort |
  uniq -c

Each output row is a count followed by a class,cluster pair. Cluster 0 now predominantly matches class 1, cluster 2 matches class 2, and cluster 1 matches class 3:

 59 1,0
 10 2,0
  9 2,1
 52 2,2
  2 3,0
 46 3,1
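One way to summarise these counts is the share of observations that fall in the majority class of their cluster, sometimes called purity. The sketch below computes it from the counts listed above. The function name purity is hypothetical, and this summary is an addition for illustration rather than part of the ml tooling.

```shell
#!/bin/sh
# Majority-class agreement ("purity") from `uniq -c` style input,
# where each line is: count class,cluster
purity() {
  awk '
    {
      n = $1 + 0                     # the count, forced to a number
      split($2, pair, ",")           # pair[1] = class, pair[2] = cluster
      total += n
      if (n > best[pair[2]]) best[pair[2]] = n   # largest count per cluster
    }
    END {
      for (c in best) agree += best[c]
      if (total > 0)
        printf "%d of %d (%.1f%%)\n", agree, total, 100 * agree / total
    }'
}

# The counts reported above.
purity <<'EOF'
 59 1,0
 10 2,0
  9 2,1
 52 2,2
  2 3,0
 46 3,1
EOF
# prints: 157 of 178 (88.2%)
```

The 157 is the sum of the three dominant counts (59, 52 and 46), so roughly 88% of the wines sit in the cluster that matches their class.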

Interestingly, the plot produced by the command below is more dispersed (cf. Section 7.14). See Section 7.16 for an explanation.

cat wine.pr | paste -d"," wine.csv - | ml visualise kmeans

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0