10.46 Ignore IDs and Outputs

20180723 The identifiers and any risk variable (which is an output variable) should be ignored in any predictive modelling. Always watch out for treating output variables as inputs to modelling: this is a surprisingly common trap for beginners. We will build a vector of the names of the variables to ignore. Above we have already recorded the id variables and (optionally) the risk. Here we join them together into a new vector using generics::union(), which performs a set union operation. That is, it joins the two arguments together and removes any repeated variable names.

# Initialise ignored variables: identifiers and risk.

ignore <- union(id, risk) %T>% print()
## [1] "date"     "location" "risk_mm"
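
The set union behaviour can be seen in isolation with a minimal sketch in base R. The values assigned to id and risk below are assumptions, chosen only to match the output shown above, and base::union() is used here in place of generics::union() so that no packages need to be attached.

```r
# Assumed stand-ins for the id and risk vectors recorded earlier.
id   <- c("date", "location")
risk <- "risk_mm"

# union() joins the two vectors and keeps each name only once.
ignore <- union(id, risk)
print(ignore)
## [1] "date"     "location" "risk_mm"

# A name that appears in both arguments is not repeated in the result.
overlap <- union(c("date", "location"), c("location", "risk_mm"))
print(overlap)
## [1] "date"     "location" "risk_mm"
```

Because repeated names are dropped, the same call can safely be rerun without growing the vector.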

We might also check for any variable that has a unique value for every observation. Such variables are often identifiers and, if so, they are candidates for ignoring. We select the vars from the dataset and pipe them through base::sapply() to count the distinct values of each variable, keeping the names of those that have as many distinct values as there are observations. In our case there are no further candidate identifiers, as indicated by the empty result, character(0).

# Heuristic for candidate identifiers to possibly ignore.

ds[vars] %>%
  sapply(function(x) x %>% unique() %>% length()) %>%
  equals(nrow(ds)) %>%
  which() %>%
  names() %T>%
  print() ->
ids
## character(0)

# Add them to the variables to be ignored for modelling.

ignore <- union(ignore, ids) %T>% print()
## [1] "date"     "location" "risk_mm"
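
Since the heuristic returns nothing for our dataset, a small sketch on invented data may help show what it does when it fires. The toy data frame and its column names are hypothetical and are not part of the dataset used in this chapter. The sketch uses base R only rather than the pipeline above.

```r
# Toy data frame, invented purely for illustration.
toy <- data.frame(
  row_id   = c("a1", "a2", "a3", "a4"),
  location = c("Sydney", "Sydney", "Perth", "Perth"),
  rain_mm  = c(0.0, 1.2, 0.4, 3.4),
  stringsAsFactors = FALSE
)

# Count the distinct values in each column.
n_unique <- sapply(toy, function(x) length(unique(x)))

# Columns with as many distinct values as rows are candidate identifiers.
candidates <- names(which(n_unique == nrow(toy)))
print(candidates)
## [1] "row_id"  "rain_mm"
```

Here the heuristic flags rain_mm as well as row_id, because a continuous measurement can also happen to be unique for every observation. This is why the result is treated as a list of candidates to review rather than as variables to drop automatically.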


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0