20.2 Decision Trees Modelling Setup

20200815

For the rattle::weatherAUS dataset we similarly define the template variables (Williams 2017) used for predictive modelling. The code assumes that ds, vars, target, and nobs have already been defined as part of that template. See Chapter 8 for details.

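# Identify the risk variable and the identifiers, which are not model inputs.
risk   <- "risk_mm"
id     <- c("date", "location")
ignore <- c(risk, id)
# Keep the remaining variables; the inputs are these less the target.
vars   <- setdiff(vars, ignore)
inputs <- setdiff(vars, target)

# Formula modelling the target on all other variables (%s+% is from stringi).
form   <- formula(target %s+% " ~ .")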

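# Impute missing values: median for numerics, mode for factors (randomForest).
ds[vars] %<>% na.roughfix()

# Proportions for the training, tuning, and testing partitions.
SPLIT <- c(0.70, 0.15, 0.15)

# Randomly sample row indices for each partition without overlap.
nobs %>% sample(SPLIT[1]*nobs)                               -> tr
nobs %>% seq_len() %>% setdiff(tr) %>% sample(SPLIT[2]*nobs) -> tu
nobs %>% seq_len() %>% setdiff(tr) %>% setdiff(tu)           -> te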

# Record the actual target values for each partition.
ds %>% slice(tr) %>% pull(target) -> actual_tr
ds %>% slice(tu) %>% pull(target) -> actual_tu
ds %>% slice(te) %>% pull(target) -> actual_te

# Record the risk values for each partition.
ds %>% slice(tr) %>% pull(risk) -> risk_tr
ds %>% slice(tu) %>% pull(risk) -> risk_tu
ds %>% slice(te) %>% pull(risk) -> risk_te
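
The same setup can be tried on a small synthetic data frame. The following is a minimal sketch in base R, so that it runs without magrittr, dplyr, stringi, or randomForest. The data frame toy, the seed value, and the helper impute_rough() are illustrative assumptions and are not part of the template above. In particular, impute_rough() only approximates the median/mode imputation performed by randomForest::na.roughfix().

```r
# Illustrative only: synthetic data standing in for the weather dataset.
set.seed(42)  # Assumed seed so that the sketch is repeatable.

toy <- data.frame(
  date          = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
  location      = "Canberra",
  min_temp      = c(NA, rnorm(99, mean = 10, sd = 3)),
  rain_today    = factor(sample(c("No", "Yes"), 100, replace = TRUE)),
  risk_mm       = rexp(100),
  rain_tomorrow = factor(sample(c("No", "Yes"), 100, replace = TRUE))
)

# Variable roles, following the template.
target <- "rain_tomorrow"
risk   <- "risk_mm"
id     <- c("date", "location")
ignore <- c(risk, id)
vars   <- setdiff(names(toy), ignore)
inputs <- setdiff(vars, target)

# Base R equivalent of building the formula from the target name.
form <- as.formula(paste(target, "~ ."))

# Hypothetical helper: median for numeric columns, most frequent level for factors.
impute_rough <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  } else if (is.factor(x)) {
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}
toy[vars] <- lapply(toy[vars], impute_rough)

# 70/15/15 split of the row indices.
SPLIT <- c(0.70, 0.15, 0.15)
nobs  <- nrow(toy)
tr <- sample(nobs, SPLIT[1] * nobs)
tu <- sample(setdiff(seq_len(nobs), tr), SPLIT[2] * nobs)
te <- setdiff(setdiff(seq_len(nobs), tr), tu)

# The three partitions should be disjoint and together cover every row.
stopifnot(length(intersect(tr, tu)) == 0,
          length(intersect(tr, te)) == 0,
          length(intersect(tu, te)) == 0,
          setequal(c(tr, tu, te), seq_len(nobs)))
```

The final stopifnot() makes the key property of the split explicit: no observation appears in more than one of the training, tuning, and testing partitions.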

The 226,868 observations from the dataset have been randomly partitioned into a training dataset with 158,807 observations, a tuning dataset with 34,030 observations, and a testing dataset with 34,031 observations. The target variable (rain_tomorrow) has the classes No (177,939) and Yes (48,929).
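
The reported partition sizes are consistent with a 70/15/15 split in which a fractional sample size is truncated to a whole number and the testing dataset takes the remainder. A quick check of the arithmetic, using only the counts quoted in the text (the variable names n_tr, n_tu, n_te, and counts are introduced here for illustration):

```r
nobs  <- 226868
SPLIT <- c(0.70, 0.15, 0.15)

n_tr <- floor(SPLIT[1] * nobs)  # 158807.6 truncates to 158807
n_tu <- floor(SPLIT[2] * nobs)  # 34030.2 truncates to 34030
n_te <- nobs - n_tr - n_tu      # the remainder, 34031

c(train = n_tr, tune = n_tu, test = n_te)

# Class balance of the target, from the reported counts.
counts <- c(No = 177939, Yes = 48929)
round(100 * counts / sum(counts), 1)
```

The class counts sum to the 226,868 observations, with roughly 78% No and 22% Yes, so the target is noticeably imbalanced.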

References

Williams, Graham J. 2017. The Essentials of Data Science: Knowledge Discovery Using R. The R Series. CRC Press.


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0