7.2 kmeans quick start
Below we share simple examples, demonstrating the various commands supported by the package.
The sample training dataset iris.csv can be downloaded directly from GitHub:
wget https://raw.githubusercontent.com/acwkayon/kmeans/master/iris.csv
We may first need to normalise the data on which the model is to be trained. The normalised dataset is saved for training by redirecting (>) standard output (stdout) to a file:
ml normalise kmeans iris.csv > norm.csv
The iris data, consisting of columns recording sepal and petal length and width, which are essentially on the same scale, does not really need to be normalised.
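To make the idea of normalisation concrete, here is a minimal Python sketch that rescales named numeric columns to a common range. It assumes min-max scaling to [0, 1]; the method actually used by ml normalise kmeans is not stated here and may differ. The function name normalise, the column names, and the sample values are all made up for illustration.

```python
import csv
import io


def normalise(rows, columns):
    """Rescale each named numeric column to the range [0, 1] (min-max).

    Assumption: min-max scaling; the package may use another method.
    """
    out = [dict(r) for r in rows]
    for col in columns:
        values = [float(r[col]) for r in rows]
        lo, hi = min(values), max(values)
        span = hi - lo
        for r, v in zip(out, values):
            # A constant column has zero span; map it to 0.0.
            r[col] = (v - lo) / span if span else 0.0
    return out


# Tiny made-up sample with iris-like column names (illustrative values only).
sample = io.StringIO(
    "sepal_length,petal_length\n"
    "5.0,1.0\n"
    "6.0,3.0\n"
    "7.0,5.0\n"
)
rows = list(csv.DictReader(sample))
scaled = normalise(rows, ["sepal_length", "petal_length"])
for r in scaled:
    print(r)
```

After scaling, every column spans the same range, so no single column dominates the distance calculations that k-means relies on.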
A cluster analysis of the named numeric columns of the normalised csv file is then undertaken, here with k = 3 clusters, saving the resulting model to file:
ml train kmeans 3 norm.csv > model.csv
Each time we train the model we get a slightly different model. This is expected and is common in model building where some random choices must be made. For the k-means algorithm, the starting points for the k means are selected at random.
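The effect of the random starting points can be sketched in a few lines of Python. This is a plain implementation of the standard k-means (Lloyd's) iteration, not the package's own code. The function kmeans, its seed parameter, and the sample points are assumptions made for illustration.

```python
import random


def kmeans(points, k, seed, iterations=20):
    """Standard k-means with k starting centroids drawn at random from the data."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            groups[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its group.
        # An empty group keeps its previous centroid.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return sorted(centroids)


# Made-up 2-D points (not the iris data).
points = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8), (0.5, 0.5), (0.4, 0.6)]
for seed in (1, 2, 3):
    print(seed, kmeans(points, 3, seed))
```

Fixing the seed makes a run repeatable, while different seeds may settle on different centroids. That is the same variation seen between successive training runs.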
A video can be constructed to display the iterations of the training algorithm over the data:
ml train kmeans 3 norm.csv --view
To save the video to file, use the --movie option (abbreviated -m), naming the file into which to save the mp4 video:
ml train kmeans 3 norm.csv -m iris.mp4
The model can then be used by predict to assign each observation in a csv file to a cluster (effectively, to its nearest centroid):
ml predict kmeans norm.csv model.csv > clustering.csv
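Assigning an observation to its nearest centroid amounts to a distance comparison, as the following Python sketch shows. It assumes Euclidean distance and uses made-up centroids and observations. The function predict is illustrative only and is not the package's implementation.

```python
import math


def predict(observations, centroids):
    """Return, for each observation, the index of its nearest centroid."""
    labels = []
    for obs in observations:
        # Euclidean distance from this observation to every centroid.
        dists = [math.dist(obs, c) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


# Made-up centroids (the "model") and observations to be clustered.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
observations = [(0.1, 0.1), (0.9, 0.8), (0.2, 0.9)]
labels = predict(observations, centroids)
print(labels)  # [0, 1, 2]
```

Each label is the position of the closest centroid in the model, which plays the role of the cluster assignment saved to clustering.csv.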
The clustering can then be visualised:
ml visualise kmeans clustering.csv
As a pipeline:
wget -qO- https://raw.githubusercontent.com/acwkayon/kmeans/master/iris.csv | ml normalise kmeans | tee norm.csv | ml train kmeans 3 --view --movie iris.mp4 | ml predict kmeans norm.csv | ml visualise kmeans
Note the use of tee to save the normalised data to norm.csv within the pipeline so that the file can be used later in the same pipeline. However, tee does not close the file in time for the predict command to read it, which would result in an empty file being read, so the predict command checks for this situation and waits for the file to be ready.
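One way such a wait could be implemented is sketched below in Python. How the predict command actually detects readiness is not described here, so this is only an assumption: the hypothetical helper wait_for_file polls until the file exists and is non-empty, and a background thread stands in for tee writing norm.csv late.

```python
import os
import tempfile
import threading
import time


def wait_for_file(path, timeout=5.0, interval=0.05):
    """Block until path exists and is non-empty, or raise after timeout seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path) and os.path.getsize(path) > 0:
            return
        time.sleep(interval)
    raise TimeoutError(f"{path} was still empty after {timeout} seconds")


# Simulate tee writing norm.csv a little after the reader has started.
path = os.path.join(tempfile.mkdtemp(), "norm.csv")


def late_writer():
    time.sleep(0.2)
    with open(path, "w") as f:
        f.write("sepal_length,petal_length\n0.0,0.0\n")


writer = threading.Thread(target=late_writer)
writer.start()
wait_for_file(path)
with open(path) as f:
    header = f.readline().strip()
writer.join()
print(header)
```

A non-empty file is not necessarily a complete one, so a more careful check might also wait until the file size stops changing.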
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0