10.42 Special Case Variable Name Transformations

20180721 When reviewing the variables of a dataset we often notice other changes that could be made to the variable names. This might be to simplify the names or to clarify the meaning of a variable. The string processing functions provided by stringr (Wickham 2023) come in handy for such processing.

In the following example we remove the prefix of each variable name, where the prefix consists of all characters up to and including the first underscore. This is useful where a dataset has prefixed each variable with a sequential number or some other code and we have no real use for such a prefix in our processing.

# Assumes magrittr (for %<>%) and stringr are attached.
names(ds) %<>% str_replace("^[^_]*_", "")

This will take a variable name like ab123_tax_payable and convert it to tax_payable.

str_replace("ab123_tax_payable", "^[^_]*_", "") 
## [1] "tax_payable"
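As a minimal sketch of the same transformation applied to a whole dataset, consider a small hypothetical data frame whose column names carry a code prefix. The data frame ds and its column names are invented here for illustration. This version uses plain assignment rather than the %<>% operator, so only stringr needs to be attached.

```r
library(stringr)

# Hypothetical data frame; the column names are invented for illustration.
ds <- data.frame(ab123_tax_payable = c(100, 250),
                 cd456_income      = c(1000, 2500),
                 id                = c(1, 2))

# Remove everything up to and including the first underscore.
names(ds) <- str_replace(names(ds), "^[^_]*_", "")

names(ds)
```

A name with no underscore, such as id, does not match the pattern and is left unchanged. Note that if two variables differ only in their prefix, stripping the prefix will leave duplicate names, so it is worth checking the result before continuing.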

The odd looking characters in the pattern argument to stringr::str_replace() are a regular expression. Regular expressions are a very powerful concept and can get quite complex. The reader is referred to the many resources online that cover regular expressions.

The regular expression is a pattern used to match some part of the variable name. The pattern begins with ^ which anchors the match to the beginning of the variable name. Next comes [^_], a list of characters enclosed in square brackets. Since this list begins with ^ the listed characters are excluded from the matching, so [^_] matches any single character that is not an underscore. The * that follows specifies that the preceding pattern can be repeated zero or more times. The third component of the match is then an actual underscore. Combined, this regular expression matches any sequence (including an empty sequence) of characters other than an underscore that is at the beginning of the variable name and is followed by an underscore.
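To see which part of a name the pattern actually matches, stringr::str_extract() can be used in place of the replacement. The following is a small sketch, and the example names are invented for illustration.

```r
library(stringr)

pattern <- "^[^_]*_"

# Invented example names covering the main cases.
examples <- c("ab123_tax_payable",  # ordinary prefix
              "_hidden",            # empty sequence before the underscore
              "income",             # no underscore, so no match
              "a_b_c")              # only the first underscore is reached

# str_extract() returns the matched text, or NA where nothing matches.
matched <- str_extract(examples, pattern)
matched
```

The matches are "ab123_", "_", NA and "a_" respectively. Because [^_] can never consume an underscore, the match always stops at the first underscore in the name.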

The final argument to stringr::str_replace() is the replacement string. In this case we are replacing the matched pattern with an empty string, which removes the prefix.
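The replacement need not be empty. As a brief sketch, the prefix could instead be swapped for a different one (the src_ prefix here is purely illustrative). When the aim is only to delete the match, stringr::str_remove() is a shorthand for replacing with an empty string.

```r
library(stringr)

# Swap the original prefix for an illustrative one.
renamed <- str_replace("ab123_tax_payable", "^[^_]*_", "src_")

# str_remove() is equivalent to str_replace() with "" as the replacement.
removed <- str_remove("ab123_tax_payable", "^[^_]*_")

renamed
removed
```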

The example here is just one of the many possible transformations we become used to in cleaning our datasets. The aim in transforming the variable names is to make them easier to use and to understand, both for ourselves and for others.
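A few other transformations of this kind might be sketched as follows. The variable names and the choice of steps are assumptions made for illustration, not a recommended recipe: each dataset will call for its own clean-up.

```r
library(stringr)

# Invented variable names with mixed case, dots, spaces and a suffix.
vars <- c("Tax.Payable", "net_income_amt", "Region Code")

vars <- str_to_lower(vars)                  # consistent lower case
vars <- str_replace_all(vars, "[. ]+", "_") # dots and spaces to underscores
vars <- str_remove(vars, "_amt$")           # drop an uninformative suffix

vars
```

Here the $ anchors the match to the end of the name, in the same way that ^ anchors it to the beginning.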

References

Wickham, Hadley. 2023. Stringr: Simple, Consistent Wrappers for Common String Operations. https://stringr.tidyverse.org.


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0