9.6 Missing Values in Rattle

20250812

Missing values present challenges to data mining and modelling in general. There can be many reasons for missing values. The data may be hard to collect and so not always available (e.g., the results of an expensive medical test), or it may simply not be recorded because it is in fact 0 (e.g., spouse income for someone without a spouse). Knowing why the data is missing is important in deciding how to deal with the missing value.
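As a minimal sketch of why the reason for missingness matters, the following R code separates values that are missing because they are really 0 from values that are genuinely unknown. The survey data frame and its HasSpouse and SpouseIncome columns are hypothetical, invented here for illustration only.

```r
# Hypothetical survey data; the variable names are invented for illustration.
survey <- data.frame(
  HasSpouse    = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  SpouseIncome = c(42000, NA, NA, NA, 55000)
)

# Missing because there is no spouse: the value is really 0.
structural <- is.na(survey$SpouseIncome) & !survey$HasSpouse

# Missing although a spouse exists: the value is genuinely unknown.
unknown <- is.na(survey$SpouseIncome) & survey$HasSpouse

# Fill only the structural zeros and leave the unknown values as NA.
survey$SpouseIncome[structural] <- 0

print(survey)
```

Only the first kind of missing value is safely replaced by 0. The second kind remains NA, to be handled by whatever strategy suits the analysis.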

The Missing feature of the Explore tab provides various analyses of missing values in our dataset.

A particularly useful summary of the missing values is the Patterns of Missing Data - Textual page. This uses mice::md.pattern() to generate a table of the missing patterns (combinations of variables with missing values).

This table of patterns of missing values is presented with the variables from the dataset listed along the top. Each row corresponds to a pattern of missing values of the variables. A 1 indicates a value is present whereas a 0 indicates a value is missing, and the pattern covers a set of observations.

The first (left hand) column is the count of the observations in the dataset that have this pattern of variables with missing values. The sum of this column (not shown in the output) will be the total number of observations in our dataset.

The final (right hand) column is the count of the variables with missing values in the dataset for this particular pattern. So the first row corresponds to observations with no missing values for any variables. The right hand value for that row is 0, as no variables have missing values.

The final (bottom) row is the count of the observations with missing values over the whole dataset for each particular variable. The total number of missing values over the whole dataset is then the sum of this row and is recorded at the bottom right (560).
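The layout described above can be sketched in base R on a small toy data frame. This is an illustration of how such a table is put together, not the code that Rattle or mice runs. The data frame ds and its values are invented, and the count of observations is placed in a column named Count rather than in the row names.

```r
# Toy data invented for illustration; not the dataset discussed in this section.
ds <- data.frame(
  Age     = c(25, NA, 40, 31, NA, 52),
  Marital = c("Married", "Single", NA, "Single", NA, "Married"),
  Income  = c(50000, 42000, 61000, 38000, 45000, 70000)
)

# 1 where a value is present, 0 where it is missing.
present <- 1L - is.na(ds)

# Label every observation with its pattern, e.g. "0 1 1".
key <- apply(present, 1, paste, collapse = " ")

# One row per distinct pattern, with the count of observations that share it.
patterns  <- unique(present)
n_obs     <- as.vector(table(key)[apply(patterns, 1, paste, collapse = " ")])
n_missing <- rowSums(patterns == 0)

# Left column: observations per pattern. Right column: variables missing.
# Order so that the pattern with the fewest missing variables comes first.
pattern_table <- cbind(Count = n_obs, patterns, Missing = n_missing)
pattern_table <- pattern_table[order(n_missing), , drop = FALSE]

# Bottom row: missing values per variable, with the grand total at the right.
var_missing <- colSums(is.na(ds))
full_table  <- rbind(pattern_table, c(NA, var_missing, sum(var_missing)))

print(full_table)
```

If the mice package is installed, calling mice::md.pattern(ds, plot = FALSE) on the same data frame should give a comparable table, though it also reorders the columns by the number of missing values.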

Generally, the first row records the number of observations that have no missing values, as is the case here, where 1575 observations are complete.

The second row corresponds to a pattern of missing values for just the variable Age. There are 39 observations that have just Age missing (and there are 42 observations that have Age missing, overall). This particular row’s pattern has just a single variable missing, as indicated by the 1 in the final column.
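The difference between the two counts (39 with only Age missing, 42 with Age missing overall) implies that 42 - 39 = 3 observations have Age missing together with at least one other variable. The same distinction can be checked directly in base R. The sketch below uses an invented toy data frame, not the dataset discussed here.

```r
# Toy data invented for illustration; not the dataset discussed in this section.
ds <- data.frame(
  Age     = c(25, NA, 40, 31, NA, 52),
  Marital = c("Married", "Single", NA, "Single", NA, "Married"),
  Income  = c(50000, 42000, 61000, 38000, 45000, 70000)
)

# Number of variables missing in each observation.
n_missing <- rowSums(is.na(ds))

# Observations with Age missing, regardless of the other variables.
age_missing <- sum(is.na(ds$Age))

# Observations where Age is the only variable missing.
only_age <- sum(is.na(ds$Age) & n_missing == 1)

# Observations where Age is missing along with something else.
age_with_others <- age_missing - only_age

cat("Age missing overall:", age_missing, "\n")
cat("Only Age missing:   ", only_age, "\n")
cat("Age and others:     ", age_with_others, "\n")
```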

The final row indicates that there are, for example, 37 missing values for the variable Marital, and that there are 560 missing values altogether in this dataset.



Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0