8.12 Train, Tune, and Test Datasets
The final data wrangling step is to partition the dataset into three separate datasets: the training, tuning, and test datasets. The training dataset will be used to fit a model to the data. The tuning dataset is used to tune the parameters of the model building process. Whilst these observations are not directly modelled they do guide the model fitting process. The test dataset is then only used to assess the performance of the final fitted and tuned model. This provides an unbiased estimate of the performance of the model on new observations.
To build the datasets we will use a random selection process.
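For repeatability of the random partitioning we might first set the seed for the random number generator. A minimal sketch, noting that the particular seed value here is an arbitrary choice:

seed <- 42     # An arbitrary seed, recorded so the split can be reproduced.
set.seed(seed)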
We are now ready to partition the dataset into three subsets. The first is a 70% random sample of the observations for building the model (the training dataset), the second is a 15% random sample for tuning the model (the tuning dataset), and the remaining 15% is used to evaluate the performance of the model (the test dataset).
PTR <- 0.7 # Proportion for training
PTU <- 0.15 # Proportion for tuning
PTE <- 0.15 # Proportion for testing
# nobs is the number of observations in the dataset (defined earlier).
tr <- sample(nobs, PTR*nobs) %T>%             # 70% random sample of the row indices.
  {length(.) %>% print()}
## [1] 70
tu <- setdiff(seq_len(nobs), tr) %>%          # 15% sample of the remaining rows.
  sample(PTU*nobs) %T>%
  {length(.) %>% print()}
## [1] 15
te <- setdiff(seq_len(nobs), c(tr, tu)) %T>%  # The final 15% of the rows.
  {length(.) %>% print()}
## [1] 15
head(tr)
## [1] 49 65 25 74 18 100
Any model building we do will be based on the 70% training dataset. Our model may then be quite good at predicting these observations. The model’s performance on the data on which it was trained will be a very optimistic (or biased) estimate of the true performance of the model on other datasets. We might thus ask how the model will perform when we use it to predict outcomes for other, yet unseen, observations.
The test dataset is a hold-out dataset in that it has not been used at all in building the model. When we apply the model to this dataset we expect its performance to be lower (e.g., a higher error rate). This is what we will generally observe, as we will see in the following sections.
The overall error rate measured on the training dataset will be shown to be less than the error rate calculated on the test dataset. The error rate (or any performance measure in general) calculated on the test dataset is closer to what we will obtain when we begin to use the model. It is an unbiased estimate of the true performance of the model.
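As a minimal sketch of this, assuming ds is our dataset with a factor target named rain_tomorrow (a hypothetical column name for illustration) and using rpart() as a representative model builder, we can compare the two error rates:

library(rpart)                                   # Recursive partitioning trees.

model <- rpart(rain_tomorrow ~ ., data=ds[tr,])  # Fit on the training rows only.

pr_tr <- predict(model, ds[tr,], type="class")   # Predict the training rows.
pr_te <- predict(model, ds[te,], type="class")   # Predict the hold-out rows.

mean(pr_tr != ds$rain_tomorrow[tr])              # Optimistic training error rate.
mean(pr_te != ds$rain_tomorrow[te])              # Unbiased test error rate.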
We also record the actual target values and the risks. These will be used when evaluating the performance of the models.
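For example, a minimal sketch, assuming the target and the risk variable are the hypothetical columns rain_tomorrow and risk_mm of ds:

target_tr <- ds$rain_tomorrow[tr]   # Actual outcomes for the training rows.
target_tu <- ds$rain_tomorrow[tu]   # Actual outcomes for the tuning rows.
target_te <- ds$rain_tomorrow[te]   # Actual outcomes for the test rows.

risk_tr <- ds$risk_mm[tr]           # Risk values for the training rows.
risk_tu <- ds$risk_mm[tu]           # Risk values for the tuning rows.
risk_te <- ds$risk_mm[te]           # Risk values for the test rows.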