10.46 Ignore Missing

20180723 We next remove any variable where all of the values are missing. The weather dataset has none, but datasets with thousands of variables often contain some. Here we first count the number of missing values for each variable and then list the names of those variables whose count equals the number of rows, that is, the variables with no observed values.

# Identify variables with only missing values.

ds[vars] %>%
  sapply(function(x) x %>% is.na() %>% sum()) %>%
  equals(nrow(ds)) %>%
  which() %>%
  names() %T>%
  print() ->
missing
## character(0)

# Add them to the variables to be ignored for modelling.

ignore <- union(ignore, missing) %T>% print()
## [1] "date"     "location" "risk_mm"
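
Because the weather dataset has no entirely missing variables, the result above is empty. The following minimal sketch, using base R only, shows the same check returning a non-empty result. The data frame `toy` and its columns are hypothetical and are not part of the weather dataset.

```r
# Hypothetical toy data frame for illustration only.
toy <- data.frame(
  temp = c(21.5, NA, 19.0, 23.1),
  rain = c(NA, NA, NA, NA),
  wind = c(10, 12, NA, 8)
)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy))

# Keep the names of columns where every row is missing.
all_missing <- names(which(na_counts == nrow(toy)))

print(all_missing)
## [1] "rain"
```

Only `rain` is reported, since it is the only column whose count of missing values equals the number of rows.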

It is also useful to identify variables that are very sparse, that is, those with mostly missing values. We can choose a threshold on the proportion of missing values above which a variable is ignored as unlikely to add much value to our analysis. For example, we may want to ignore variables with more than 70% of their values missing:

# Set the threshold above which the proportion missing is fatal.

missing.threshold <- 0.7

# Identify variables that are mostly missing.

ds[vars] %>%
  sapply(function(x) x %>% is.na() %>% sum()) %>%
  '>'(missing.threshold*nrow(ds)) %>%
  which() %>%
  names() %T>%
  print() ->
mostly
## character(0)

# Add them to the variables to be ignored for modelling.

ignore <- union(ignore, mostly) %T>% print()
## [1] "date"     "location" "risk_mm"
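
The threshold check can also be sketched on a small example where one variable is sparse enough to be flagged. This is a base R sketch working with proportions directly through `colMeans()`, rather than the count comparison used above. The data frame `toy` and the names `na_prop` and `mostly_missing` are hypothetical.

```r
# Hypothetical toy data frame for illustration only.
toy <- data.frame(
  temp = c(21.5, NA, 19.0, 23.1, 20.2),  # 1 of 5 missing (20%)
  evap = c(NA, NA, 4.2, NA, NA),         # 4 of 5 missing (80%)
  wind = c(10, 12, NA, 8, NA)            # 2 of 5 missing (40%)
)

missing_threshold <- 0.7

# Proportion of missing values in each column.
na_prop <- colMeans(is.na(toy))

# Names of columns whose proportion missing exceeds the threshold.
mostly_missing <- names(na_prop)[na_prop > missing_threshold]

print(mostly_missing)
## [1] "evap"
```

Note that a variable with all values missing has a proportion of 1, so any threshold below 1 also captures the entirely missing variables identified in the first step.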


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0