10.29 Re-Scale Data in Rattle

20240802

Different model builders require different characteristics of the data from which the models will be built. For example, when building a cluster model using any kind of distance measure, we may need to normalise the data. Otherwise a variable like income will overwhelm a variable like age when calculating distances. A difference of 10 years may be more significant than a difference of $10,000, yet 10,000 swamps 10 when the two are combined, as they are when calculating distances.
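The effect can be seen with a small base R sketch. The data frame `people` and its values are made up for illustration, and standardising with `scale()` is just one possible normalisation.

```r
# Hypothetical data: a and b differ by 40 years of age,
# a and c differ by $1,000 of income.
people <- data.frame(age    = c(25, 65, 25, 45),
                     income = c(50000, 50000, 51000, 80000),
                     row.names = c("a", "b", "c", "d"))

# Euclidean distances on the raw values: income dominates.
raw <- as.matrix(dist(people))
raw["a", "b"]   # 40
raw["a", "c"]   # 1000

# Distances after centring each column and dividing by its
# standard deviation: the 40 year gap now counts for more.
std <- as.matrix(dist(scale(people)))
std["a", "b"] > std["a", "c"]   # TRUE
```

On the raw values person c appears 25 times further from a than b does, purely because income is measured in larger numbers.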

In these situations we will want to normalise our data. The normalisations available through the Rescale feature of the Transform tab are: re-centering and rescaling our data to be around zero (Recenter), rescaling our data to be in the range from 0 to 1 (Scale [0,1]), converting the numbers into a rank ordering (Rank), and a robust rescaling around zero using the median (-Median/MAD).
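As a rough guide to what each option computes, the following base R sketch applies the four rescalings to a made-up vector `x`. It is an illustration under the usual textbook definitions and not necessarily the code Rattle itself runs. In particular, `mad()` by default multiplies the median absolute deviation by a constant of 1.4826, which Rattle may or may not do.

```r
# Hypothetical values with one large outlier.
x <- c(2, 4, 4, 10, 30)

# Recenter: subtract the mean and divide by the standard deviation.
recentered <- (x - mean(x)) / sd(x)

# Scale [0,1]: map the minimum to 0 and the maximum to 1.
scaled01 <- (x - min(x)) / (max(x) - min(x))

# Rank: replace each value by its position in the sorted order.
ranked <- rank(x)

# -Median/MAD: subtract the median and divide by the
# median absolute deviation, which is less affected by outliers.
robust <- (x - median(x)) / mad(x)

round(scaled01, 3)
ranked   # 1.0 2.5 2.5 4.0 5.0
```

The outlier of 30 pulls the mean and standard deviation noticeably, whereas the median and MAD are unaffected by how extreme it is.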

For a Rank ordering, if all the values are different and there are no NAs, then the ranks simply run from 1 to the number of observations. Equal values (ties) need to be resolved in some way, and the R function base::rank() provides several approaches. The default assigns to every observation sharing a common value the average of the ranks they would jointly occupy, which retains a consistent distribution for the ranks. Other approaches can, for example, ensure a unique rank for every observation, even where the original values were the same. NAs are handled specially, being placed at the end of the ranking, with the rank incrementing for each one.
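A short sketch of tie handling with base::rank() on a made-up vector follows. The argument `na.last = TRUE` is passed explicitly here so that NAs are placed at the end as described. Whether Rattle passes the same arguments is an assumption.

```r
# Hypothetical values with two pairs of ties and two NAs.
x <- c(30, 10, 20, 10, NA, 20, NA)

# Ties share the average of the ranks they jointly occupy.
avg <- rank(x, ties.method = "average", na.last = TRUE)
avg   # 5.0 1.5 3.5 1.5 6.0 3.5 7.0

# Ties are broken by order of appearance, so every rank is unique.
fst <- rank(x, ties.method = "first", na.last = TRUE)
fst   # 5 1 3 2 6 4 7

# Ties all receive the smallest rank in their group.
mn <- rank(x, ties.method = "min", na.last = TRUE)
mn    # 5 1 3 1 6 3 7
```

In each case the two NAs receive ranks 6 and 7, following the five observed values.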

The approach Rattle takes to normalising (and to transforming) our data keeps the original data without modification. Instead, a new variable is created whose name is the original variable's name with a prefix indicating the kind of transformation. Prefixes include RRC_, R01_, RMD_, RLG_, and R10_.
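The same pattern can be sketched in base R. The helper `add_rescaled()` and the dataset `ds` are hypothetical and not part of Rattle. The pairing of R01_ with the [0,1] scaling and RMD_ with the median/MAD rescaling is an assumption based on the prefix names.

```r
# Hypothetical dataset.
ds <- data.frame(age    = c(23, 35, 47, 59),
                 income = c(30000, 45000, 52000, 120000))

# Hypothetical helper: add a transformed copy of `var` under a
# prefixed name, leaving the original column untouched.
add_rescaled <- function(data, var, prefix, fn) {
  data[[paste0(prefix, var)]] <- fn(data[[var]])
  data
}

ds <- add_rescaled(ds, "income", "R01_",
                   function(x) (x - min(x)) / (max(x) - min(x)))
ds <- add_rescaled(ds, "income", "RMD_",
                   function(x) (x - median(x)) / mad(x))

names(ds)   # "age" "income" "R01_income" "RMD_income"
```

Because the original column is retained, we can always compare a transformed variable against its source or drop the transformation later.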

We can see the effect of the different normalisations by comparing the summary distributions on the second page of the display panel. It is also informative to compare the distributions visually through the Visual feature of the Explore tab.
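Outside the Rattle interface a similar comparison can be made with `summary()`. This is a minimal sketch on made-up values, with column names chosen here only to echo the prefix convention.

```r
# Hypothetical income values and two rescaled versions.
income <- c(30000, 45000, 52000, 120000)

cmp <- data.frame(
  income     = income,
  RRC_income = (income - mean(income)) / sd(income),
  R01_income = (income - min(income)) / (max(income) - min(income))
)

# One column of summary statistics per variable.
smry <- sapply(cmp, summary)
round(smry, 3)
```

The shape of the distribution is unchanged by either rescaling. Only its location and spread differ, which shows up in the minimum, mean, and maximum rows.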



Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0