7.8 kmeans predict
Having performed a cluster analysis, we have effectively fit a model to the data or, as others may describe it, trained a model from the data. The model can now be used to “predict” or, in our case, assign each point to a cluster. The predict command labels each point in a supplied dataset (a csv file) based on the “model”, which is itself saved as a csv file.
ml predict kmeans iris.csv model.csv
The output will be something like:
sepal_length,sepal_width,petal_length,petal_width,label
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
...
6.9,3.1,4.9,1.5,2
5.5,2.3,4.0,1.3,2
...
7.3,2.9,6.3,1.8,1
6.7,2.5,5.8,1.8,1
...
ml predict kmeans DATAFILE [MODELFILE]
The data csv file (DATAFILE) is required. It holds the observations to be labelled (“predicting” the label, which is actually finding the closest centre).
The model file contains a header row followed by the centres that represent the model, each with a label. If no model file is supplied then the model is read from standard input. This allows the command to be part of a pipeline of commands, whereby the model could be piped from the train command. The cluster label is assumed to be in a column named label (generally the last column) and the remaining columns are the coordinates of the centres.
The output is in csv format with a header row. Its last column, named label, identifies the nearest centre to each point.
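To make the nearest-centre assignment concrete, here is a minimal Python sketch of the idea. It is not the code behind ml predict kmeans: the function names, the choice of Euclidean distance, and the centre values in MODEL are all assumptions made for illustration.

```python
import csv
import io
import math


def read_rows(text):
    """Parse csv text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def predict(data_text, model_text, label_col="label"):
    """Give each observation the label of its nearest centre.

    Assumes Euclidean distance and a model whose non-label columns
    match the feature columns of the data.
    """
    centres = read_rows(model_text)
    features = [name for name in centres[0] if name != label_col]
    labelled = []
    for row in read_rows(data_text):
        point = [float(row[f]) for f in features]
        nearest = min(
            centres,
            key=lambda c: math.dist(point, [float(c[f]) for f in features]),
        )
        out = {f: row[f] for f in features}
        out[label_col] = nearest[label_col]  # label goes in the last column
        labelled.append(out)
    return labelled


# Hypothetical centres, made up for this example (not a real trained model).
MODEL = """sepal_length,sepal_width,petal_length,petal_width,label
5.0,3.4,1.5,0.2,0
6.8,3.0,5.7,2.0,1
5.9,2.7,4.4,1.4,2
"""

# Three observations taken from the sample output shown earlier.
DATA = """sepal_length,sepal_width,petal_length,petal_width
5.1,3.5,1.4,0.2
7.3,2.9,6.3,1.8
5.5,2.3,4.0,1.3
"""

labelled = predict(DATA, MODEL)
print(",".join(labelled[0].keys()))
for row in labelled:
    print(",".join(row.values()))
```

The sketch mirrors the file layout described above: the model has one row per centre plus a label column, and the output repeats the observation with the label appended as the last column.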
To save the output to file:
ml predict kmeans iris.csv model.csv > predict.csv
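Once the labelled data is saved, a quick check is to count how many observations fall in each cluster. The following Python sketch assumes only the output layout described above (a header row and a column named label). The helper name label_counts is hypothetical, and the sample rows are the ones shown in the example output rather than a full predict.csv.

```python
import csv
import io
from collections import Counter


def label_counts(lines, label_col="label"):
    """Count rows per cluster label in csv lines that start with a header."""
    return Counter(row[label_col] for row in csv.DictReader(lines))


# With a real file this would be:
#   with open("predict.csv", newline="") as f:
#       counts = label_counts(f)
# Here an in-memory sample keeps the sketch self-contained.
SAMPLE = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,label\n"
    "5.1,3.5,1.4,0.2,0\n"
    "4.9,3.0,1.4,0.2,0\n"
    "6.9,3.1,4.9,1.5,2\n"
    "5.5,2.3,4.0,1.3,2\n"
    "7.3,2.9,6.3,1.8,1\n"
    "6.7,2.5,5.8,1.8,1\n"
)

counts = label_counts(SAMPLE)
for label, n in sorted(counts.items()):
    print(label, n)
```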
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0