7.9 kmeans normalise

UNDER DEVELOPMENT 20220104

For effective K-Means modelling the numeric data is best normalised. Otherwise a variable like income will swamp a variable like age when calculating the Euclidean distance between two points. For example, the difference between an income of 50,000 and 40,000 is 10,000, while the difference between 50 years of age and 40 is just 10. Combining the two in the Euclidean distance gives the square root of 10,000² + 10², which is roughly 10,000.005, so age contributes almost nothing. After normalising, an income of 50,000 might become 0.6 and 40,000 might become 0.45, a difference of 0.15. Similarly, an age of 50 might become 0.6 and 40 might become 0.5, a difference of 0.1. The two differences are now of comparable size, so both variables influence the distance.
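The arithmetic in this example can be checked with a short Python sketch. The normalised values (0.6, 0.45, 0.5) are the illustrative figures from the paragraph above, not the output of any particular normalisation method, and the helper name `euclidean` is introduced here only for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Two people described as (income, age), before normalising.
raw_a = (50_000, 50)
raw_b = (40_000, 40)

# The same two people using the illustrative normalised values from the text.
norm_a = (0.6, 0.6)
norm_b = (0.45, 0.5)

raw_distance = euclidean(raw_a, raw_b)
norm_distance = euclidean(norm_a, norm_b)

# Income dominates the raw distance; after normalising both variables count.
print(f"raw distance:        {raw_distance:.3f}")
print(f"normalised distance: {norm_distance:.3f}")
```

On the raw data the distance is almost exactly the income difference alone, whereas on the normalised data the age difference makes up a visible share of the result.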

Common usage to normalise a numeric data file, column by column:

$ ml normalise kmeans iris.csv > norm.csv

General usage:

$ ml normalise kmeans --help

Usage: normalise [OPTIONS] [CSVFILE]

  The DATAFILE is a csv format of named columns of numeric data
  to be normalised.

Options:
  --help  Show this message and exit.

Running the normalise command on iris.csv writes the following output to stdout, which can be redirected to a file or piped on to the train command to train a clustering.

sepal_length,sepal_width,petal_length,petal_width
-0.898,1.016,-1.336,-1.311
-1.139,-0.132,-1.336,-1.311
-1.381,0.327,-1.392,-1.311
-1.501,0.098,-1.279,-1.311
-1.018,1.245,-1.336,-1.311
-0.535,1.933,-1.166,-1.049
-1.501,0.786,-1.336,-1.18
...
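The sample values, centred near zero with both signs, look like z-scores (each column shifted to mean 0 and scaled to unit standard deviation), although the source does not state which method the command uses. The following is a minimal sketch of column-by-column normalisation under that assumption, using the sample standard deviation. It is not the MLHub implementation, and the function name `normalise_csv` and the small income/age dataset are hypothetical.

```python
import csv
import io
import statistics


def normalise_csv(text):
    """Return CSV text with every numeric column rescaled to z-scores."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0]
    data = [[float(v) for v in row] for row in rows[1:]]

    # Work column by column: one mean and one standard deviation per column.
    columns = list(zip(*data))
    means = [statistics.mean(c) for c in columns]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sds = [statistics.stdev(c) for c in columns]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for row in data:
        # A constant column has zero spread, so map it to 0.0 rather than divide by zero.
        writer.writerow(
            [round((v - m) / s, 3) if s else 0.0 for v, m, s in zip(row, means, sds)]
        )
    return out.getvalue()


# Hypothetical three-row dataset with named numeric columns.
sample = "income,age\n50000,50\n40000,40\n30000,45\n"
result = normalise_csv(sample)
print(result)
```

After this rescaling income and age are on the same scale, so neither dominates the distance calculation.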


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0