7.2 kmeans quick start

20220104 The simple examples below demonstrate the various commands supported by the package.

The sample training dataset iris.csv can be downloaded directly from GitHub:

wget https://raw.githubusercontent.com/acwkayon/kmeans/master/iris.csv

We may first need to normalise the data on which the model is to be trained. The normalised dataset is saved for use in training by redirecting (>) standard output (stdout) to a file:

ml normalise kmeans iris.csv > norm.csv
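To illustrate why normalisation matters for a distance-based method like k-means, here is a minimal Python sketch that rescales each column to the range 0 to 1. It assumes min-max scaling, which may not be the method the normalise command actually applies. The function name min_max_normalise and the sample values are hypothetical.

```python
def min_max_normalise(rows):
    """Rescale each column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalised = []
    for row in rows:
        scaled = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # A constant column has no spread, so map it to 0.0.
            scaled.append((value - low) / span if span else 0.0)
        normalised.append(scaled)
    return normalised


# Hypothetical measurements: two columns on very different scales.
rows = [[5.0, 200.0], [6.0, 400.0], [7.0, 300.0]]
normalised = min_max_normalise(rows)
print(normalised)  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```

After rescaling, both columns contribute comparably to the distances used in clustering, rather than the column with the larger values dominating.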

A cluster analysis of the named numeric columns of the normalised csv file is then undertaken, with the resulting model saved to file:

ml train kmeans 3 norm.csv > model.csv

Note that each time we train the model we get a slightly different model. This is expected, and is common in model building where some random choices must be made. For the k-means algorithm, the starting points for the k means are selected at random.
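The effect of the random starting points can be sketched in plain Python. This is an illustrative implementation of the standard k-means iteration (Lloyd's algorithm), not the package's own code. The seed parameter and the sample points are hypothetical and are not options or data of the ml command.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=None, iterations=100):
    """Lloyd's algorithm with k randomly chosen observations as starting means."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:
            break  # Converged: no centroid moved.
        centroids = updated
    return centroids


# Hypothetical normalised observations.
points = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9], [0.5, 0.5]]
first = kmeans(points, 2, seed=1)
again = kmeans(points, 2, seed=1)
print(first == again)  # Same seed, so same starting points and same model.
```

With seed=None the starting points change from run to run, so the resulting means can differ slightly between runs, as described above.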

A video can be constructed and displayed to show the iterations of the training algorithm over the data:

ml train kmeans 3 norm.csv --view

To save the video to a file, use the -m or --movie option, naming the file into which to save the mp4 video:

ml train kmeans 3 norm.csv -m iris.mp4

The model can then be used by predict to assign each observation in a csv file to a cluster (effectively, to its nearest centroid):

ml predict kmeans norm.csv model.csv > clusters.csv
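Assigning an observation to its nearest centroid can be sketched as follows. This is a minimal illustration assuming Euclidean distance, which the text does not confirm. The centroids and observations are hypothetical values, not the contents of model.csv or norm.csv.

```python
def nearest_centroid(observation, centroids):
    """Return the index of the centroid closest to the observation."""

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        range(len(centroids)),
        key=lambda i: squared_distance(observation, centroids[i]),
    )


# Hypothetical model with two centroids, and three observations to label.
centroids = [[0.1, 0.1], [0.9, 0.9]]
observations = [[0.0, 0.2], [1.0, 0.8], [0.4, 0.3]]
labels = [nearest_centroid(o, centroids) for o in observations]
print(labels)  # [0, 1, 0]
```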

The clustering can then be visualised:

ml visualise kmeans clusters.csv

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0