7.2 kmeans quick start


The simple examples below demonstrate the commands supported by the package.

The sample training dataset iris.csv can be downloaded directly from GitHub:

wget https://raw.githubusercontent.com/acwkayon/kmeans/master/iris.csv

We may first need to normalise the data on which the model is to be trained. The normalised dataset is saved for training by redirecting (>) standard output (stdout) to a file:

ml normalise kmeans iris.csv > norm.csv

The iris data consists of columns recording sepal and petal length and width, all essentially on the same scale, so it does not really need to be normalised.
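To illustrate why normalisation matters when columns are on different scales, here is a minimal sketch of min-max scaling. The source does not state which scheme the normalise command applies, so the choice of min-max scaling is an assumption, and the function name and the sample columns are hypothetical.

```python
def min_max(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread, so map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Two hypothetical columns on very different scales (not from iris.csv).
income = [20000.0, 50000.0, 80000.0]
age = [20.0, 35.0, 50.0]

print(min_max(income))  # [0.0, 0.5, 1.0]
print(min_max(age))     # [0.0, 0.5, 1.0]
```

After scaling, both columns contribute comparably to a distance calculation. Without it, the column with the larger range would dominate.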

A cluster analysis is then performed on the named numeric columns of the normalised csv file, saving the resulting model to file:

ml train kmeans 3 norm.csv > model.csv

Each time we train the model we get a slightly different model. This is expected and is common in model building where some random choices must be made. For the k-means algorithm, the starting points for the k means are selected at random.
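The effect of the random starting points can be sketched as follows. This is an illustration only: it assumes the starting means are drawn from the observations themselves, which the source does not confirm, and the `seed` parameter is hypothetical rather than an option of the package.

```python
import random


def initial_means(points, k, seed=None):
    """Pick k distinct observations at random as the starting means."""
    rng = random.Random(seed)
    return rng.sample(points, k)


# A handful of made-up two-dimensional observations.
points = [(5.1, 3.5), (4.9, 3.0), (6.2, 2.9), (5.9, 3.0), (6.7, 3.1), (5.0, 3.4)]

# Different seeds will generally select different starting means,
# which is why repeated training runs can produce different models.
print(initial_means(points, 3, seed=1))
print(initial_means(points, 3, seed=2))

# Fixing the seed makes the selection repeatable.
print(initial_means(points, 3, seed=1) == initial_means(points, 3, seed=1))  # True
```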

A video can be constructed to display the iterations of the training algorithm over the data:

ml train kmeans 3 norm.csv --view

To save the video to file, use the -m or --movie option, naming the mp4 file into which the video is to be saved:

ml train kmeans 3 norm.csv -m iris.mp4

The model can then be used by predict to assign each observation in a csv file to a cluster (effectively, to its nearest centroid):

ml predict kmeans norm.csv model.csv > clustering.csv
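Assignment to the nearest centroid can be sketched as below. The sketch assumes Euclidean distance and uses made-up centroids and observations. It does not read the model.csv format, which the source does not describe.

```python
import math


def nearest_centroid(observation, centroids):
    """Return the index of the centroid closest to the observation."""
    distances = [math.dist(observation, c) for c in centroids]
    return distances.index(min(distances))


# Hypothetical centroids and observations in a normalised 2D space.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
observations = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.9)]

labels = [nearest_centroid(o, centroids) for o in observations]
print(labels)  # [0, 1, 2]
```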

The clustering can then be visualised:

ml visualise kmeans clustering.csv

As a pipeline:

wget -qO- https://raw.githubusercontent.com/acwkayon/kmeans/master/iris.csv |
    ml normalise kmeans |
    tee norm.csv |
    ml train kmeans 3 --view --movie iris.mp4 |
    ml predict kmeans norm.csv |
    ml visualise kmeans

Note the use of tee to save the normalised data to norm.csv within the pipeline so that the file can be used later in the pipeline. However, tee does not close the file in time for the predict command to read it, which would result in an empty file being read, so the predict command checks for this situation and waits for the file to be ready.
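One way such a wait could be implemented is to poll the file until it is non-empty. The sketch below is an assumption about the general technique, not the package's actual code. The function name, timeout, and polling interval are all hypothetical.

```python
import os
import time


def wait_for_file(path, timeout=5.0, interval=0.1):
    """Block until path exists and is non-empty, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if os.path.getsize(path) > 0:
                return
        except OSError:
            pass  # The file does not exist yet.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{path} was not ready after {timeout} seconds")
        time.sleep(interval)
```

A non-empty file is not necessarily a completely written one, so a more careful check might also wait until the file size stops changing.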

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0