10.15 Normalise Variables

20200912 To rename variables in a dataset we can use dplyr::rename_with(), which applies a function such as rattle::normVarNames() to the variable names and replaces those names with the function's result. A tidy alternative is janitor::clean_names() with the option numerals="right", which replicates rattle::normVarNames().
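As a minimal sketch of the rename_with() pattern, the toy data frame below (a hypothetical stand-in, not the book's dataset) has its names passed through a function. The base function tolower() is used so the example runs without rattle. If rattle is installed, the same call can be made with rattle::normVarNames.

```r
library(dplyr)

# Hypothetical toy data frame standing in for a real dataset.
toy <- data.frame(MinTemp = c(8.0, 14.0), RainToday = c("No", "Yes"))

# rename_with() applies the supplied function to the column names
# and replaces the names with the function's result.
toy_lower <- rename_with(toy, tolower)
names(toy_lower)

# With rattle available, the same pattern applies normVarNames().
if (requireNamespace("rattle", quietly = TRUE)) {
  toy_norm <- rename_with(toy, rattle::normVarNames)
  print(names(toy_norm))
}
```

Note that tolower() only changes case. It does not insert underscores between words, which is what normVarNames() and clean_names() add.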

The choice of variable naming style is suggested in Chapter 27: all variable names are lowercase with words separated by an underscore. This normalisation is useful when different upper/lower case conventions are intermixed inconsistently in names like Incm_tax_PyBl. Remembering how to capitalise when interactively exploring data with thousands of such variables can be quite a cognitive load. Yet we often see such variable names in practice, especially when we import data from databases, which are often case insensitive.

The example below shows the transformation into the preferred normalised form.

# Normalise variable names.

library(janitor)      # Cleanup: clean_names().
library(magrittr)     # Pipelines: %<>%.

names(ds)
##  [1] "Date"          "Location"      "MinTemp"       "MaxTemp"      
##  [5] "Rainfall"      "Evaporation"   "Sunshine"      "WindGustDir"  
##  [9] "WindGustSpeed" "WindDir9am"    "WindDir3pm"    "WindSpeed9am" 
## [13] "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"   "Pressure9am"  
## [17] "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"      
## [21] "Temp3pm"       "RainToday"     "RISK_MM"       "RainTomorrow"
ds %<>% clean_names(numerals="right")

names(ds)
##  [1] "date"            "location"        "min_temp"        "max_temp"       
##  [5] "rainfall"        "evaporation"     "sunshine"        "wind_gust_dir"  
##  [9] "wind_gust_speed" "wind_dir_9am"    "wind_dir_3pm"    "wind_speed_9am" 
## [13] "wind_speed_3pm"  "humidity_9am"    "humidity_3pm"    "pressure_9am"   
## [17] "pressure_3pm"    "cloud_9am"       "cloud_3pm"       "temp_9am"       
## [21] "temp_3pm"        "rain_today"      "risk_mm"         "rain_tomorrow"

Notice the use of the assignment pipe here, as introduced in Chapter 2. Recall that the %<>% operator pipes the left-hand data to the function on the right-hand side and then assigns the result back to the left-hand side, overwriting the original contents of the memory referred to on the left-hand side.
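The behaviour of the assignment pipe can be illustrated with a small sketch on a plain numeric vector. The vectors x and y are hypothetical and chosen only to make the result easy to check. The first form uses %<>%, the second spells out the equivalent pipe followed by an explicit assignment.

```r
library(magrittr)

x <- c(4, 9, 16)
y <- x

# Assignment pipe: pipe x into sqrt() and overwrite x with the result.
x %<>% sqrt()

# Equivalent long form using the ordinary pipe and an assignment.
y <- y %>% sqrt()

identical(x, y)
```

The assignment pipe saves repeating the name of the dataset, at the cost of overwriting the original object, so the untransformed data is no longer available under that name.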

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0