14.13 Biased Estimate from the Training Dataset
We noted above that evaluating a model on the dataset on which it was built will result in overly optimistic performance outcomes. We can compare the performance of the randomForest::randomForest() model on the testing dataset presented above with its performance on the training dataset. As expected, the performance on the training dataset is wildly optimistic. In fact, it is common for the randomForest (Breiman et al. 2018) model to predict perfectly over the training dataset, as we see from the confusion matrix.
predict_tr <- predict(model, newdata=ds[tr, vars], type="class")
con(predict_tr, actual_tr)
##          Predicted
## Actual    No Yes
##     No    76   3
##     Yes   14   7
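A quick way to summarise such a confusion matrix is the overall accuracy, the proportion of the total count that lies on the diagonal. A minimal base-R sketch using the counts printed above (the variable names here are our own):

```r
# Rebuild the confusion matrix as printed (rows = actual, columns = predicted).
cm <- matrix(c(76, 3, 14, 7), nrow=2, byrow=TRUE,
             dimnames=list(Actual=c("No", "Yes"), Predicted=c("No", "Yes")))

# Overall accuracy: correctly classified observations over all observations.
accuracy <- sum(diag(cm)) / sum(cm)
accuracy   # 0.83
```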
Similarly, Figure @ref(fig:memplate:rf_riskchart_tr) illustrates the problem of evaluating a model based on the training dataset. Again we see perfect performance when we evaluate the model on the training dataset. The performance line (the Recall, plotted as the green line) follows the best achievable, which is the grey line.
pr_tr <- predict(model, newdata=ds[tr, vars], type="prob")[,2]
riskchart(pr_tr, actual_tr, risk_tr, title.size=14) +
  labs(title="Risk Chart - " %s+% mtype %s+% " - Training Dataset")
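The same optimism can be demonstrated in a few lines of base R, independently of randomForest. A 1-nearest-neighbour classifier fit to pure noise scores perfectly on its own training rows yet close to chance on held-out rows. This is a self-contained sketch and all names in it are our own:

```r
set.seed(42)

# Pure-noise data: the labels carry no signal, so any apparent
# skill on the training rows is an artefact of memorisation.
n <- 100
x <- matrix(rnorm(2*n), ncol=2)
y <- sample(c("No", "Yes"), n, replace=TRUE)

# A 1-nearest-neighbour classifier in base R: predict the label of
# the closest training point (squared Euclidean distance).
knn1 <- function(train_x, train_y, new_x)
  apply(new_x, 1, function(p)
    train_y[which.min(colSums((t(train_x) - p)^2))])

tr <- 1:70
te <- 71:100

acc_tr <- mean(knn1(x[tr, ], y[tr], x[tr, ]) == y[tr])
acc_te <- mean(knn1(x[tr, ], y[tr], x[te, ]) == y[te])

acc_tr   # 1: every training point is its own nearest neighbour
acc_te   # close to chance (0.5) on unseen rows
```

Perfect training-set performance here says nothing about the model; it is a property of the evaluation, which is the point of this section.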
Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0.