7.8 kmeans predict


Having performed a cluster analysis, we have effectively fit a model to the data or, as others may describe it, trained a model from the data. The model can now be used to “predict”, or in our case assign, each point to a cluster. The predict command labels each point in a supplied dataset (a csv file) based on the “model”, which is also saved as a csv file.

Common usage:

ml predict kmeans iris.csv model.csv

The output will be something like:


General usage:

ml predict kmeans DATAFILE [MODELFILE]

The DATAFILE argument is required. It is the csv file of observations to be labelled (“predicting” the label here actually means finding the closest centroid).

If no model file is supplied then the model is read from standard input. The model consists of a header row followed by the centres, each with its label. Reading from standard input allows the command to be part of a pipeline of commands, whereby the model could be piped from the train command. The cluster label is assumed to be in a column named label (generally the last column) and the remaining columns are the centres.
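The model layout described above can be illustrated with a minimal Python sketch. This is not the code used by the ml command. The function name read_model and its source parameter are hypothetical, and the sketch assumes every column other than label is numeric.

```python
import csv
import sys


def read_model(source=None):
    """Read centres from a csv path, or from standard input when source is None.

    Hypothetical helper for illustration only; not part of mlhub.
    """
    stream = open(source, newline="") if source else sys.stdin
    try:
        reader = csv.DictReader(stream)
        # Every column except "label" is treated as a coordinate of the centre.
        features = [name for name in reader.fieldnames if name != "label"]
        centres = []
        for row in reader:
            coords = [float(row[name]) for name in features]
            centres.append((row["label"], coords))
    finally:
        if source:
            stream.close()
    return features, centres
```

Calling read_model("model.csv") reads the named file, whilst read_model() with no argument consumes whatever has been piped in on standard input.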

The output is in csv format with a header row. The last column, named label, identifies the nearest centre to each point.
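The labelling step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation behind ml predict: the function name label_points is hypothetical, the data columns are assumed to be all numeric and in the same order as the centre coordinates, and Euclidean distance is assumed as the measure of closeness.

```python
import csv
import io
import math


def label_points(data_text, centres):
    """Return csv text with a trailing "label" column naming the nearest centre.

    centres is a list of (label, coordinates) pairs.
    Hypothetical helper for illustration only; not part of mlhub.
    """
    reader = csv.reader(io.StringIO(data_text))
    header = next(reader)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header + ["label"])
    for row in reader:
        point = [float(value) for value in row]
        # Assumption: closeness is measured by Euclidean distance.
        label, _ = min(centres, key=lambda c: math.dist(point, c[1]))
        writer.writerow(row + [label])
    return out.getvalue()
```

For example, with centres [("0", [0.0, 0.0]), ("1", [5.0, 5.0])] the point 1,1 is assigned label 0 and the point 4,6 is assigned label 1.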

To save the output to file:

ml predict kmeans iris.csv model.csv > predict.csv

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0