For effective K-Means modelling the numeric data is best normalised. Otherwise a variable like income will swamp a variable like age when calculating the Euclidean distance between two points. For example, the difference between an income of 50,000 and an income of 40,000 is 10,000. If we add in the distance between 50 years of age and 40 years, which is 10, we get 10,010: the age contributes almost nothing to the total. After normalising, an income of 50,000 might become 0.6 and 40,000 might become 0.45, so the distance is 0.15. Similarly, an age of 50 might become 0.6 and an age of 40 might become 0.5, so the distance is 0.1. The two are then comparable numbers, and when combined to give 0.25 we can see that each makes its fair contribution.
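The effect can be sketched in a few lines of Python. This is a minimal illustration, not the ml command's actual implementation: it assumes z-score normalisation (subtract the column mean, divide by the column standard deviation), and the sample income/age records are invented for the example.

```python
import statistics

# Hypothetical (income, age) records, invented for illustration.
records = [(50000, 50), (40000, 40), (45000, 30), (60000, 45)]

def zscore_column(values):
    """Normalise one column: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

incomes, ages = zip(*records)
norm_incomes = zscore_column(incomes)
norm_ages = zscore_column(ages)

# Before normalising, income differences dominate any distance calculation;
# afterwards both columns vary on a comparable scale around zero.
for inc, age in zip(norm_incomes, norm_ages):
    print(f"{inc:6.2f} {age:6.2f}")
```

After normalising, each column has mean zero and unit standard deviation, so a difference of one "unit" means the same thing for income as for age.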
Common usage of our command to normalise a numeric data file, column by column:
ml normalise kmeans iris.csv
This will produce the following output to stdout, which can be redirected to a file or else piped on to ml train to train a cluster model.
sepal_length,sepal_width,petal_length,petal_width
-0.898,1.016,-1.336,-1.311
-1.139,-0.132,-1.336,-1.311
-1.381,0.327,-1.392,-1.311
-1.501,0.098,-1.279,-1.311
-1.018,1.245,-1.336,-1.311
-0.535,1.933,-1.166,-1.049
-1.501,0.786,-1.336,-1.18
...
The general form of the command is:

ml normalise kmeans [DATAFILE]
Often we will save the resulting data to a file:
ml normalise kmeans iris.csv > norm.csv
Or it might form part of a pipeline:
ml normalise kmeans iris.csv | ml train kmeans
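A roughly equivalent pipeline can be sketched in Python with scikit-learn. This is an assumption-laden sketch, not what ml does internally: it assumes z-score scaling (consistent with the style of output shown above) and picks 3 clusters for iris, a choice made here purely for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Rough equivalent of: ml normalise kmeans iris.csv | ml train kmeans
X = load_iris().data                        # four numeric columns
X_norm = StandardScaler().fit_transform(X)  # normalise column by column (z-score)

# n_clusters=3 is an illustrative choice, not an ml default.
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_norm)
print(model.cluster_centers_)
```

Because the scaling happens before clustering, no single wide-ranging column dominates the distance calculations that K-Means performs.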
See Section 7.13 for an example of a dataset that actually does require normalisation.
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0