7.8 kmeans predict
20220104
Having performed a cluster analysis, we have effectively fitted a model to the data or, as others may describe it, trained a model from the data. The model can now be used to "predict", or in our case assign, each point to a cluster. The predict command labels each point in a supplied dataset (a csv file) based on the "model", which is also saved as a csv file.
Common usage:
The output will be something like:
sepal_length,sepal_width,petal_length,petal_width,label
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
...
6.9,3.1,4.9,1.5,2
5.5,2.3,4.0,1.3,2
...
7.3,2.9,6.3,1.8,1
6.7,2.5,5.8,1.8,1
...
General usage:
The input data.csv file is required and supplies the observations to be labelled ("predicting" the label here means finding the closest centroid).
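As a minimal sketch of what finding the closest centroid means, the Python below assigns a point to the centre with the smallest Euclidean distance. It is not the code behind the predict command: the function name nearest_centre and the centre values are hypothetical, chosen only for illustration.

```python
import math


def nearest_centre(point, centres):
    """Return the label of the centre closest to point (Euclidean distance)."""
    best_label, best_dist = None, math.inf
    for label, centre in centres.items():
        dist = math.dist(point, centre)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical centres for illustration only, not the output of a real run.
centres = {
    "0": (5.0, 3.4, 1.5, 0.2),
    "1": (6.9, 3.1, 5.7, 2.1),
    "2": (5.9, 2.7, 4.4, 1.4),
}

print(nearest_centre((5.1, 3.5, 1.4, 0.2), centres))  # prints 0
```

Each observation is labelled independently, so labelling a whole dataset is a matter of repeating this for every row.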
If no input model file is supplied then the model is read from standard input. The model file contains a header row followed by the centres representing the model, each with its label. Reading from standard input allows the command to be part of a pipeline of commands, whereby the model data could be piped from the train command. The cluster label is assumed to be in a column named label (generally the last column) and the remaining columns hold the coordinates of the centres.
The output is a csv file with a header row. Its last column, named label, identifies the nearest centre to each point.
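The input and output formats described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation of the predict command: the helper names read_model and label_rows are hypothetical, the centre values are made up, and columns are matched by name so that the label column may sit anywhere in the model file.

```python
import csv
import io
import math


def read_model(stream):
    """Read centres from a csv stream with a header row and a 'label' column."""
    centres = {}
    for row in csv.DictReader(stream):
        label = row.pop("label")
        centres[label] = {name: float(value) for name, value in row.items()}
    return centres


def label_rows(data_stream, centres, out_stream):
    """Copy each observation to out_stream, appending the nearest centre's label."""
    reader = csv.DictReader(data_stream)
    fieldnames = list(reader.fieldnames) + ["label"]
    writer = csv.DictWriter(out_stream, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        distances = {
            label: math.dist(
                [float(row[name]) for name in centre],
                [centre[name] for name in centre],
            )
            for label, centre in centres.items()
        }
        row["label"] = min(distances, key=distances.get)
        writer.writerow(row)


# In-memory stand-ins for the model and data files (hypothetical values).
model_csv = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,label\n"
    "5.0,3.4,1.5,0.2,0\n"
    "6.9,3.1,5.7,2.1,1\n"
    "5.9,2.7,4.4,1.4,2\n"
)
data_csv = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width\n"
    "5.1,3.5,1.4,0.2\n"
    "7.3,2.9,6.3,1.8\n"
)
output = io.StringIO()
label_rows(data_csv, read_model(model_csv), output)
print(output.getvalue(), end="")
```

Because read_model accepts any text stream, passing sys.stdin in place of model_csv would mimic the case where the model is piped in rather than named as a file.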
To save the output to file:
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0