7.8 kmeans predict

20220104

Having performed a cluster analysis, we have effectively fit a model to the data or, as others may describe it, trained a model from the data. The model can now be used to “predict”, or in our case assign, the cluster for each point. The predict command labels each point in a supplied dataset (a csv file) using the “model” saved as a csv file.

Common usage:

ml predict kmeans iris.csv model.csv

The output will be something like:

sepal_length,sepal_width,petal_length,petal_width,label
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
...
6.9,3.1,4.9,1.5,2
5.5,2.3,4.0,1.3,2
...
7.3,2.9,6.3,1.8,1
6.7,2.5,5.8,1.8,1
...

General usage:

ml predict kmeans DATAFILE [MODELFILE]

The DATAFILE argument is required. It is a csv file containing the observations to be labelled. “Predicting” the label here actually means finding the closest centroid for each observation.
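As a rough illustration of what finding the closest centroid involves, the following Python sketch assigns a point to the nearest of a set of centres. It is not the code behind the predict command: the function name `nearest_centre`, the use of Euclidean distance, and the centre values are all assumptions made for the example.

```python
import math

def nearest_centre(point, centres):
    """Return the label of the centre closest to point.

    Euclidean distance is assumed here; the distance used by the
    actual command is not stated in the text.
    """
    best_label, best_dist = None, math.inf
    for label, centre in centres.items():
        dist = math.dist(point, centre)  # requires Python 3.8 or later
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical centres for illustration only, not values from a real model.csv.
centres = {
    0: [5.0, 3.4, 1.5, 0.2],
    1: [6.9, 3.1, 5.7, 2.1],
    2: [5.9, 2.7, 4.4, 1.4],
}

print(nearest_centre([5.1, 3.5, 1.4, 0.2], centres))
```

With these made-up centres, the first iris observation shown earlier lies closest to the centre labelled 0.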

If no model file is supplied, the model is read from standard input. This allows the command to be part of a pipeline, whereby the model could be piped from the train command. The model is a csv file with a header row, containing the centres and a label for each. The cluster label is assumed to be in a column named label (generally the last column) and the remaining columns are the coordinates of the centres.
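The model layout described above can be parsed with a few lines of Python. This is a minimal sketch under stated assumptions, not the command's own implementation: `read_model` and `open_model` are hypothetical helper names, and the sample model text is invented for the example.

```python
import csv
import io
import sys

def read_model(stream):
    """Parse a model csv: a header row, a column named 'label',
    and every other column treated as a centre coordinate."""
    reader = csv.DictReader(stream)
    features = [name for name in reader.fieldnames if name != "label"]
    centres = {}
    for row in reader:
        centres[row["label"]] = [float(row[name]) for name in features]
    return features, centres

def open_model(path=None):
    """Return the model file if a path is given, otherwise standard input."""
    return open(path, newline="") if path else sys.stdin

# Invented model text for illustration; a real model.csv will differ.
sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,label\n"
    "5.0,3.4,1.5,0.2,0\n"
    "6.9,3.1,5.7,2.1,1\n"
)
features, centres = read_model(sample)
print(features)
print(centres)
```

Labels are kept as strings here, so the sketch makes no assumption about whether they are numeric.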

The output is written to standard output in csv format with a header row. The last column, named label, identifies the nearest centre to each point.
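The output format can be mimicked in Python as follows. This is only a sketch: the helper name `write_labelled` is hypothetical, every data column is assumed to be numeric and in the same order as the centre coordinates, Euclidean distance is assumed, and the data and centres are invented.

```python
import csv
import io
import math

def write_labelled(data, centres, out):
    """Copy each data row to out, appending a final 'label' column
    holding the label of the nearest centre."""
    reader = csv.reader(data)
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader)
    writer.writerow(header + ["label"])
    for row in reader:
        point = [float(value) for value in row]
        # Pick the label whose centre is closest to this point.
        label = min(centres, key=lambda k: math.dist(point, centres[k]))
        writer.writerow(row + [label])

# Invented inputs for illustration only.
data = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width\n"
    "5.1,3.5,1.4,0.2\n"
    "7.3,2.9,6.3,1.8\n"
)
centres = {0: [5.0, 3.4, 1.5, 0.2], 1: [6.9, 3.1, 5.7, 2.1]}
out = io.StringIO()
write_labelled(data, centres, out)
print(out.getvalue(), end="")
```

Passing `sys.stdout` as `out` would print the labelled rows directly, which is what makes redirecting to a file possible.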

To save the output to file:

ml predict kmeans iris.csv model.csv > predict.csv

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0