7.17 kmeans pipelines

20220316

As with all mlhub commands, a goal is to provide powerful combinations of commands through pipelines. We have already seen this in the chapter, where a csv file was processed through a number of steps: the columns were normalised, the csv data was piped into the train command and then the predict command, and the output was a csv file with each observation labelled with a cluster number. The sample pipelines collected below illustrate different data flows.

cat iris.csv |
    ml train kmeans 3 |
    ml predict kmeans iris.csv

The output will be similar to the following:

sepal_length,sepal_width,petal_length,petal_width,label
5.0,3.6,1.4,0.2,1
7.7,3.8,6.7,2.2,0
6.1,3.0,4.9,1.8,2
5.4,3.7,1.5,0.2,2
...
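A quick sanity check on output of this shape is to count how many observations fall into each cluster. The following is a minimal sketch using only awk and sort, assuming the label is the last column as in the sample above. The file name labelled.csv is hypothetical, and it holds just the four rows shown rather than the full dataset.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)

# Hypothetical file holding only the four sample rows shown above.
cat > "$tmp/labelled.csv" <<'EOF'
sepal_length,sepal_width,petal_length,petal_width,label
5.0,3.6,1.4,0.2,1
7.7,3.8,6.7,2.2,0
6.1,3.0,4.9,1.8,2
5.4,3.7,1.5,0.2,2
EOF

# Skip the header, tally the last column, and print label,count pairs.
counts=$(awk -F, 'NR > 1 { n[$NF]++ } END { for (k in n) print k "," n[k] }' \
    "$tmp/labelled.csv" | sort)

echo "$counts"
```

In a real pipeline the same awk command could presumably read from the output of the predict command instead of a file.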

To pop up a display of the final clustering, append the visualise command to the pipeline:

cat iris.csv |
    ml train kmeans 3 |
    ml predict kmeans iris.csv |
    ml visualise kmeans

We can include within the pipeline a normalise operation:

cat wine.csv | 
  ml normalise kmeans |
  tee norm.csv |
  ml train kmeans 4 |
  ml predict kmeans norm.csv |
  mlr --csv cut -f label |
  paste -d"," wine.csv -

After normalising the input dataset, tee saves the result to the file norm.csv whilst piping the same data on to the next command to train a clustering. We save to file because we want to predict the cluster for each of the normalised observations and then map the clusters back to the original observations. This is accomplished by using mlr to cut the label column from the csv output of the predict command and then using paste to append that label column to the original wine.csv. Since paste joins its inputs line by line, this relies on the predict output keeping the observations in their original order.
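The mapping step can be tried in isolation without mlhub installed. The sketch below is an illustration under stated assumptions, not the mlhub implementation: orig.csv and pred.csv are hypothetical stand-ins for wine.csv and for the output of the predict command, their column names and values are made up, and awk replaces mlr on the assumption that the label is the last column. It also adds a row count check, since paste silently misaligns rows when the two inputs differ in length.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)

# Made-up stand-ins: orig.csv plays the role of wine.csv and pred.csv the
# role of the csv output of the predict command.
printf 'alcohol,ash\n14.2,2.4\n13.2,2.1\n12.4,2.3\n' > "$tmp/orig.csv"
printf 'alcohol,ash,label\n1.0,1.0,0\n0.4,0.0,0\n0.0,0.7,1\n' > "$tmp/pred.csv"

# paste joins line by line, so refuse to continue if the row counts differ.
n_orig=$(awk 'END { print NR }' "$tmp/orig.csv")
n_pred=$(awk 'END { print NR }' "$tmp/pred.csv")
if [ "$n_orig" -ne "$n_pred" ]; then
    echo "row count mismatch: $n_orig vs $n_pred" >&2
    exit 1
fi

# Keep only the last column (assumed to be the label) and attach it to
# the original rows; '-' tells paste to read the labels from stdin.
awk -F, '{ print $NF }' "$tmp/pred.csv" |
    paste -d"," "$tmp/orig.csv" - > "$tmp/clustering.csv"

cat "$tmp/clustering.csv"
```

The result has the original values in each row with the cluster label appended as a final column.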

Once the cluster labels have been attached to the original data, we can extend the pipeline to visualise the result, again using tee to save the clustering to a file along the way:

cat wine.csv | 
  ml normalise kmeans |
  tee norm.csv |
  ml train kmeans 4 |
  ml predict kmeans norm.csv |
  mlr --csv cut -f label |
  paste -d"," wine.csv - |
  tee clustering.csv |
  ml visualise kmeans
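The role of tee in this pipeline can be seen in a small standalone sketch. The data and the file name saved.csv are made up for illustration, and a simple line count stands in for the downstream visualise command.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)

# tee writes its input to saved.csv and also passes it on unchanged,
# so the downstream command (here a line count) sees the same data.
lines=$(printf 'x,label\n1.5,0\n2.5,1\n' |
    tee "$tmp/saved.csv" |
    awk 'END { print NR }')

echo "downstream saw $lines lines"
cat "$tmp/saved.csv"
```

The saved file and the downstream command both receive the header and the two data rows, which is why clustering.csv above contains exactly what is visualised.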

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0