9.5 Correlated Numeric Variables

20200814 It is often useful to identify highly correlated variables. Such variables typically record the same information in different ways and commonly arise when we combine data from different sources.

The correlation is calculated by dplyr::select()ing the numeric columns from the dataset and passing them through to stats::cor(). With use="complete.obs" the matrix of pairwise correlations is based on only the complete observations, so that any observation with a missing value is ignored.
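As a minimal sketch of how complete observations are handled, the following uses a small made-up data frame (toy and its values are hypothetical, not from the dataset used in this chapter). It suggests that use="complete.obs" gives the same result as removing the incomplete rows before calling stats::cor().

```r
# Hypothetical toy data with one missing value in column y.
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2, 4, NA, 8, 10),
                  z = c(5, 3, 4, 1, 2))

# With use="complete.obs" the third row is dropped for every pair of
# variables, so all correlations use the same four observations.
m <- cor(toy, use = "complete.obs")

# The same matrix results from removing the incomplete rows first.
all.equal(m, cor(na.omit(toy)))
```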

We set the upper triangle of the correlation matrix to NA using base::upper.tri(), as its values mirror those in the lower triangle and are thus redundant. We also specify diag=TRUE so that the diagonal is set to NA, since each variable is always perfectly correlated with itself.
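The masking can be illustrated on a small hand-made matrix. This is only a sketch: the matrix values are hypothetical, and it masks by index assignment rather than base::ifelse(), which has the side effect of keeping the row and column names.

```r
# Hypothetical 3x3 correlation matrix for illustration only.
m <- matrix(c( 1.0, 0.8, -0.3,
               0.8, 1.0,  0.5,
              -0.3, 0.5,  1.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Mask the upper triangle and the diagonal; only the three
# correlations below the diagonal remain.
m[upper.tri(m, diag = TRUE)] <- NA
m
```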

The processing continues by making all values positive using base::abs(). After converting the matrix with base::data.frame() and then tibble::as_tibble(), the column names need to be reset using magrittr::set_colnames(). We add a new column naming the variable for each row using dplyr::mutate(), reshape the dataset into long format using tidyr::pivot_longer() from tidyr (Wickham, Vaughan, and Girlich 2024), and then omit the missing correlations using stats::na.omit(). Finally the rows are dplyr::arrange()’d with the highest absolute correlations appearing first.

# For the numeric variables generate a table of correlations.
# Assumes ds is the dataset and numc names its numeric columns.
# Requires dplyr, tidyr, tibble, and magrittr (set_colnames(), %T>%).

ds %>%
  select(all_of(numc)) %>%
  cor(use="complete.obs") %>%
  ifelse(upper.tri(., diag=TRUE), NA, .) %>%
  abs() %>%
  data.frame() %>%
  as_tibble() %>%
  set_colnames(numc) %>%
  mutate(var1=numc) %>%
  tidyr::pivot_longer(-var1, names_to="var2", values_to="cor") %>%
  na.omit() %>%
  arrange(-abs(cor)) %T>%
  print() ->
mc
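To try the idea without the chapter's dataset, here is a self-contained sketch. It assumes the built-in mtcars data stands in for ds, derives numc from it, and uses an arbitrary threshold of 0.85. It is a variant of the pipeline above rather than a copy: the masking is done by index assignment so that the variable names are retained.

```r
library(dplyr)
library(tidyr)

# Assumption: mtcars stands in for ds; numc holds its numeric column names.
ds   <- tibble::as_tibble(mtcars)
numc <- ds %>% select(where(is.numeric)) %>% names()

# Correlation matrix with the upper triangle and diagonal masked.
m <- ds %>% select(all_of(numc)) %>% cor(use = "complete.obs")
m[upper.tri(m, diag = TRUE)] <- NA

# Long table of absolute correlations, largest first.
mc <- m %>%
  abs() %>%
  as.data.frame() %>%
  tibble::rownames_to_column("var1") %>%
  tibble::as_tibble() %>%
  pivot_longer(-var1, names_to = "var2", values_to = "cor") %>%
  na.omit() %>%
  arrange(desc(cor))

# Pairs above the (arbitrary) 0.85 threshold.
mc %>% filter(cor > 0.85)
```

Each pair of variables appears once in mc, so the table can be scanned from the top to find candidates for removal.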

The pipeline above is rather involved. The corrr package offers a more concise alternative:

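# Print the lower triangle of the correlation matrix, formatted for reading.

ds %>%
  correlate() %>%
  shave() %>%
  fashion()

# Plot the correlations with highly correlated variables placed together.

ds %>%
  correlate() %>%
  rearrange() %>%
  rplot()

# List the pairs of variables with an absolute correlation above 0.90.

ds %>%
  corrr::correlate() %>%
  shave() %>%
  stretch() %>%
  filter(abs(r) > 0.90)

# Focus on selected variables, which are to be named within focus().

ds %>%
  correlate() %>%
  focus()

# Plot the correlations as a network.

ds %>%
  correlate() %>%
  network_plot()    # Fail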
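As a hedged illustration of the corrr functions, the sketch below assumes the built-in mtcars data stands in for ds (all of its columns are numeric) and uses mpg and a threshold of 0.85 purely as examples.

```r
library(dplyr)
library(corrr)

# Assumption: mtcars stands in for ds.
cm <- mtcars %>% correlate()

# Long format of the lower triangle, keeping strongly correlated pairs.
high <- cm %>%
  shave() %>%
  stretch(na.rm = TRUE) %>%
  filter(abs(r) > 0.85)

# Correlations of every other variable with mpg.
with_mpg <- cm %>% focus(mpg)
```

Here stretch() returns the columns x, y, and r, so the filter on r plays the same role as the filter on the cor column in the earlier table.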

References

Wickham, Hadley, Davis Vaughan, and Maximilian Girlich. 2024. Tidyr: Tidy Messy Data. https://tidyr.tidyverse.org.


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0