7.6 kmeans train

20220104 You can supply your own dataset as a csv file with a column for each variable (at least one column), a row for each instance, and at least as many rows as your chosen K. The data is assumed to be numeric and to have been normalised (e.g., converted to numbers in the range -1 to 1) so that no variable dominates any other when calculating distances. Without normalisation, a difference of 10,000 between incomes of $40,000 and $50,000 swamps a difference of 10 between ages of 40 and 50.
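As a rough illustration of the normalisation described above, the following Python sketch linearly rescales each variable to the range -1 to 1. The function name `rescale` and the sample values are hypothetical, and min-max rescaling is only one of several ways to normalise data; it is not necessarily what any particular tool does.

```python
def rescale(values, lo=-1.0, hi=1.0):
    """Linearly rescale a list of numbers to the range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant column carries no distance information: map it to the midpoint.
        return [(lo + hi) / 2.0 for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


# Hypothetical raw data: two variables on very different scales.
incomes = [40000, 50000, 60000]
ages = [40, 50, 60]

scaled_incomes = rescale(incomes)
scaled_ages = rescale(ages)

print(scaled_incomes)  # [-1.0, 0.0, 1.0]
print(scaled_ages)     # [-1.0, 0.0, 1.0]
```

After rescaling, a step of $10,000 in income and a step of 10 years in age contribute equally to a distance calculation.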

Common usage:

$ ml train kmeans 3 iris.csv > model.csv

General usage:

$ ml train kmeans --help

Usage: train [OPTIONS] K [FILENAME]

  Train a K-means cluster model from observations in the DATAFILE.

  K must be specified as the number of clusters to train.

  FILENAME is optional and if not supplied the data is obtained from STDIN.
  It is a CSV file of named numeric columns, generally expected to be
  NORMALISEd.

  The output to STDOUT is a k-means cluster model represented as a CSV file
  with each of the k-means (the centers or centroids of each cluster) on a
  single line, together with a unique label (0..k-1) to identify the cluster.

Options:
  -o, --output FILENAME  Filename of the CSV file to save model, or to STDOUT.
  -m, --movie PATH       Filename of the movie file to save if desired.
  -v, --view             Popup a movie viewer to visualise the algorithm.
  --help                 Show this message and exit.

The output from train is also a csv file, with a header row followed by k rows. Each of the k rows corresponds to a “discovered” or “fit” cluster. This is the trained model. Each cluster is represented by its central point (its centroid), calculated as the mean (average) of each variable in the dataset across all the points (people) in that cluster.
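The centroid calculation can be sketched in a few lines of Python. The function name `centroid` and the three sample points are hypothetical and serve only to show the per-variable mean.

```python
def centroid(points):
    """Return the per-variable mean of a list of equal-length numeric rows."""
    n = len(points)
    # zip(*points) regroups the rows into columns, one per variable.
    return [sum(column) / n for column in zip(*points)]


# Hypothetical cluster of three normalised observations with two variables each.
cluster = [
    [-0.5, 0.25],
    [0.0, 0.50],
    [0.5, 0.75],
]

centre = centroid(cluster)
print(centre)  # [0.0, 0.5]
```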

We can use these centroids to place (predict) new observations (people) into one of the k clusters.
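A minimal sketch of this placement step, assuming Euclidean distance (the usual choice for k-means): a new observation is given the label of its nearest centroid. The centroid values are copied from the sample output in this section, while the function name `predict` and the new observation are hypothetical.

```python
import math

# Centroids copied from the sample output in this section (label -> centre).
centroids = {
    0: [6.85, 3.07, 5.71, 2.05],
    1: [5.00, 3.42, 1.46, 0.24],
    2: [5.88, 2.74, 4.38, 1.43],
}


def predict(observation, centroids):
    """Return the label of the centroid closest to the observation."""
    return min(centroids, key=lambda label: math.dist(observation, centroids[label]))


new_flower = [5.1, 3.5, 1.4, 0.2]  # hypothetical new observation
print(predict(new_flower, centroids))  # prints 1
```

`math.dist` requires Python 3.8 or later.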

The parameter k needs to be provided up front, i.e., we need to guess a suitable value for the number of clusters. There are other tools that can assist in deciding on a good value for k.

With no csv specified on the command line, the csv data (with a header row) is read from standard input. This allows the command to be part of a pipeline, with the training data piped in from another command.

The default output written to standard output is a csv of the centres, with a cluster label appended to each row, and a header row with the cluster label column named label. Standard output can be redirected to a file or consumed through a pipeline.

If no csv output file is specified then the model is always written to the terminal (standard output), irrespective of whether an mp4 is also saved or whether --view is requested.

The output might look something like:

$ ml train kmeans 3 iris.csv
sepal_length,sepal_width,petal_length,petal_width,label
6.85,3.07,5.71,2.05,0
5.00,3.42,1.46,0.24,1
5.88,2.74,4.38,1.43,2
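Since the model is plain csv, it can be read back with standard tools. The following Python sketch parses output of the shape shown above into a dictionary of centroids keyed by cluster label. The function name `load_model` is hypothetical, and the csv text is embedded in a string only to keep the example self-contained; in practice it might be read from a saved file such as model.csv.

```python
import csv
import io

# The sample model output from above, embedded as text for illustration.
model_csv = """sepal_length,sepal_width,petal_length,petal_width,label
6.85,3.07,5.71,2.05,0
5.00,3.42,1.46,0.24,1
5.88,2.74,4.38,1.43,2
"""


def load_model(stream):
    """Read a centroid csv and return (variable_names, {label: centre})."""
    reader = csv.DictReader(stream)
    # Every column other than the label column is a variable.
    variables = [name for name in reader.fieldnames if name != "label"]
    centres = {}
    for row in reader:
        centres[int(row["label"])] = [float(row[name]) for name in variables]
    return variables, centres


variables, centres = load_model(io.StringIO(model_csv))
print(variables)
print(centres)
```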

Note that the algorithm initialises its starting point (the first k centres) randomly, and so the model that is built each time may be quite different.
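The effect of random initialisation can be explored with a bare-bones k-means written from scratch. This is a sketch of the general algorithm under stated assumptions (initial centres sampled from the observations, Euclidean distance, a fixed number of iterations), not the implementation used by the ml command. The function name `kmeans`, its `seed` parameter, and the toy data are all hypothetical.

```python
import math
import random


def kmeans(points, k, seed, iterations=20):
    """Bare-bones k-means: random initial centres, then repeated assign/update."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # k distinct observations as starting centres
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: each centre becomes the mean of its members.
        # An empty cluster keeps its previous centre.
        centres = [
            [sum(col) / len(members) for col in zip(*members)] if members else centres[i]
            for i, members in enumerate(clusters)
        ]
    return centres


# Hypothetical toy data: two groups of four points each.
points = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0],
]

# Different seeds choose different starting centres, so the resulting
# centres, or the order in which they are labelled, may differ between runs.
for seed in (1, 2, 3):
    print(seed, kmeans(points, 2, seed))
```

Fixing the seed makes a run repeatable, which is one common way to compare models built from different starting points.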



Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0