10.17 Impute Mean/Media/Mode

Often a simple, if not always satisfactory, choice for missing values that are known not to be zero is to use some ``central’’ value of the variable. This is often the mean, median, or mode, and thus usually has limited impact on the distribution. We might choose to use the mean, for example, if the variable is otherwise generally normally distributed (and in particular does not have any skewness). If the data does exhibit some skewness though (e.g., there are a small number of very large values) then the median might be a better choice. The mode can also be used for numeric variables when a single value is extremely dominant (e.g., number of stripes on a given snake species, where there is a dominant number that can serve as the default, and the number of bones in a vertebrate, similarly.

For categoric variables, there is, of course, no mean nor median, and so in such cases we might choose to use the mode (the most frequent value) as the default to fill in for the otherwise missing values. The mode can also be used for numeric variables.

Whilst this is a simple and computationally quick approach, it is a very blunt approach to imputation and can lead to poor performance from the resulting models.

Your donation will support ongoing availability and give you access to the PDF version of this book. Desktop Survival Guides include Data Science, GNU/Linux, and MLHub. Books available on Amazon include Data Mining with Rattle and Essentials of Data Science. Popular open source software includes rattle, wajig, and mlhub. Hosted by Togaware, a pioneer of free and open source software since 1984. Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0