7.8 kmeans predict


Having performed a cluster analysis, we have effectively fit a model to the data or, as others may describe it, trained a model from the data. The model can now be used to “predict”, or in our case assign, each point to a cluster. The predict command labels each point in a supplied dataset (a csv file) based on the “model” saved as a csv file.

Common usage:

ml predict kmeans iris.csv model.csv

The output will be something like:


General usage:

ml predict kmeans DATAFILE [MODELFILE]

The DATAFILE argument is required. It is a csv file containing the observations to be labelled (“predicting” the label here actually means finding the closest centroid).

The model file contains a header row, the centres representing the model, and a label for each centre. If no MODELFILE is supplied then the model is read from standard input. This allows the command to be part of a pipeline of commands, whereby the model data could be piped from the train command. The cluster label is assumed to be in a column named label (generally the last column) and the remaining columns are the coordinates of the centres.

The output is in csv format with a header row. The last column, named label, identifies the nearest centre to each point.
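
The assignment step described above can be illustrated with a minimal Python sketch. It is not the implementation behind the ml command: the function name predict_labels and the tiny inline datasets are hypothetical, and Euclidean distance is assumed as the measure of "closest". It reads a model with a label column, treats the remaining model columns as the centre coordinates, and appends the label of the nearest centre as the last column of each observation.

```python
import csv
import io
import math


def predict_labels(data_csv: str, model_csv: str) -> str:
    """Label each observation with its nearest centre and return csv text.

    Hypothetical helper for illustration only; Euclidean distance is assumed.
    """
    model = list(csv.DictReader(io.StringIO(model_csv)))
    data_reader = csv.DictReader(io.StringIO(data_csv))
    columns = list(data_reader.fieldnames)

    # Every model column other than "label" is a coordinate of a centre.
    features = [name for name in model[0] if name != "label"]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns + ["label"])  # label becomes the last column

    for row in data_reader:
        point = [float(row[f]) for f in features]
        # Pick the centre with the smallest distance to this observation.
        nearest = min(
            model,
            key=lambda c: math.dist(point, [float(c[f]) for f in features]),
        )
        writer.writerow([row[name] for name in columns] + [nearest["label"]])

    return out.getvalue()


# Made-up inputs for illustration; not iris data and not real ml output.
data = "x,y\n0.1,0.2\n4.9,5.1\n"
model = "x,y,label\n0,0,0\n5,5,1\n"
print(predict_labels(data, model), end="")
```

With these made-up inputs the first observation is labelled 0 and the second is labelled 1, since those are the labels of their nearest centres.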

To save the output to file:

ml predict kmeans iris.csv model.csv > predict.csv

