10.37 Introducing Template Variables

20180721 A reference to the original dataset can be created using a template (or generic) variable. The new variable will be called ds (short for dataset).

# Reference the dataset through a generic variable.

ds <- weatherAUS

Both ds and weatherAUS now reference the same dataset within the computer’s memory. R avoids making copies of datasets unnecessarily, so a simple assignment does not create a new copy. Modifications made through ds affect only the data referenced by ds and leave weatherAUS unchanged. As the two diverge, extra memory is used to store the columns that differ between them, so in effect a second copy of the dataset grows in memory as we change the data from its original form.
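The behaviour described above can be checked with a minimal sketch. It assumes a small stand-in data frame, here called orig (a hypothetical name, not part of the weatherAUS data), so that it runs without any extra packages.

```r
# A small stand-in data frame (hypothetical; not the weatherAUS data).
orig <- data.frame(id = 1:3, temp = c(13.4, 7.4, 12.9))

# Simple assignment: ds and orig now refer to the same data.
ds <- orig
same_before <- identical(ds, orig)   # TRUE

# Modify the data through ds only.
ds$temp <- ds$temp + 1

# The original is untouched; only ds carries the change.
same_after <- identical(ds, orig)    # FALSE
orig$temp
ds$temp
```

The unchanged id column illustrates the point about memory: only the column that differs needs separate storage.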

From here on we refer to the dataset as ds rather than weatherAUS. This keeps the following analyses and processing rather generic, turning the R code into a template that requires only minor modification when a different dataset is assigned to ds.

Often we will find that we can simply load a different dataset into memory, store it as ds and the remaining steps of our analyses and processing will essentially work unchanged.
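As a rough illustration, the sketch below applies the same template lines to two different datasets. It assumes the mtcars and iris datasets that ship with base R as stand-ins, and the variables nobs and vars are hypothetical names used only for this example.

```r
# First dataset: mtcars ships with base R and stands in for our data.
ds   <- mtcars

# Template steps, written once in terms of ds only.
nobs <- nrow(ds)
vars <- names(ds)
nobs_first <- nobs

# A different dataset: only the assignment to ds changes.
ds   <- iris

# The same template steps, unchanged.
nobs <- nrow(ds)
vars <- names(ds)
nobs_second <- nobs
```

Nothing after the assignment to ds mentions a specific dataset, which is what allows the steps to be reused.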

The first few steps of our template then create the reference to the dataset and present our initial view of it.

# Prepare for a templated analysis and processing.

dsname <- "weatherAUS"
ds     <- get(dsname)
ds %<>% clean_names(numerals="right")
glimpse(ds)
## Rows: 226,868
## Columns: 24
## $ date            <date> 2008-12-01, 2008-12-02, 2008-12-03, 2008-12-04, 2008-…
## $ location        <chr> "Albury", "Albury", "Albury", "Albury", "Albury", "Alb…
## $ min_temp        <dbl> 13.4, 7.4, 12.9, 9.2, 17.5, 14.6, 14.3, 7.7, 9.7, 13.1…
## $ max_temp        <dbl> 22.9, 25.1, 25.7, 28.0, 32.3, 29.7, 25.0, 26.7, 31.9, …
## $ rainfall        <dbl> 0.6, 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0,…
## $ evaporation     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ sunshine        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ wind_gust_dir   <ord> W, WNW, WSW, NE, W, WNW, W, W, NNW, W, N, NNE, W, SW, …
## $ wind_gust_speed <dbl> 44, 44, 46, 24, 41, 56, 50, 35, 80, 28, 30, 31, 61, 44…
## $ wind_dir_9am    <ord> W, NNW, W, SE, ENE, W, SW, SSE, SE, S, SSE, NE, NNW, W…
## $ wind_dir_3pm    <ord> WNW, WSW, WSW, E, NW, W, W, W, NW, SSE, ESE, ENE, NNW,…
## $ wind_speed_9am  <dbl> 20, 4, 19, 11, 7, 19, 20, 6, 7, 15, 17, 15, 28, 24, 4,…
## $ wind_speed_3pm  <dbl> 24, 22, 26, 9, 20, 24, 24, 17, 28, 11, 6, 13, 28, 20, …
## $ humidity_9am    <int> 71, 44, 38, 45, 82, 55, 49, 48, 42, 58, 48, 89, 76, 65…
## $ humidity_3pm    <int> 22, 25, 30, 16, 33, 23, 19, 19, 9, 27, 22, 91, 93, 43,…
## $ pressure_9am    <dbl> 1007.7, 1010.6, 1007.6, 1017.6, 1010.8, 1009.2, 1009.6…
## $ pressure_3pm    <dbl> 1007.1, 1007.8, 1008.7, 1012.8, 1006.0, 1005.4, 1008.2…
## $ cloud_9am       <int> 8, NA, NA, NA, 7, NA, 1, NA, NA, NA, NA, 8, 8, NA, NA,…
## $ cloud_3pm       <int> NA, NA, 2, NA, 8, NA, NA, NA, NA, NA, NA, 8, 8, 7, NA,…
## $ temp_9am        <dbl> 16.9, 17.2, 21.0, 18.1, 17.8, 20.6, 18.1, 16.3, 18.3, …
## $ temp_3pm        <dbl> 21.8, 24.3, 23.2, 26.5, 29.7, 28.9, 24.6, 25.5, 30.2, …
## $ rain_today      <fct> No, No, No, No, No, No, No, No, No, Yes, No, Yes, Yes,…
## $ risk_mm         <dbl> 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0, 2.2,…
## $ rain_tomorrow   <fct> No, No, No, No, No, No, No, No, Yes, No, Yes, Yes, Yes…

We are being a little tricky here: we record the dataset name in the variable dsname and then use the function base::get() to look up the dataset by that name and link it to the generic variable ds. We could simply assign the data to ds directly, as we saw above. Either way the generic variable ds refers to the same dataset. Using base::get() makes the template a little more generic, since only the string stored in dsname needs to change from one dataset to the next.
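The equivalence of the two approaches can be sketched as follows, assuming the base R dataset mtcars as a stand-in. The variables ds_direct and outfile are hypothetical and appear only in this illustration.

```r
# Look the dataset up by name, as in the template.
dsname <- "mtcars"
ds     <- get(dsname)

# Direct assignment gives the same result.
ds_direct <- mtcars
same <- identical(ds, ds_direct)   # TRUE

# One possible benefit: the name is available as a string, for
# example to label output (the file name here is illustrative only).
outfile <- paste0(dsname, "_summary.txt")
outfile
```

Note that base::get() searches for an object with the given name and raises an error if none is found, so the dataset must already be loaded.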

The use of generic variables within a template for the tasks we perform on each new dataset will have obvious advantages but we need to be careful. A disadvantage is that we may be working with several datasets and accidentally overwrite previously processed datasets referenced using the same generic variable (ds). The processing of the dataset might take some time and so accidentally losing it is not an attractive proposition. Care needs to be taken to avoid this.
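One possible precaution, offered as a sketch rather than as the approach used in this book, is to store each processed dataset under its own name before ds is reused. The list processed and the derived column ratio are hypothetical, and the base R datasets mtcars and iris stand in for real data.

```r
# Hypothetical store for finished datasets, keyed by dataset name.
processed <- list()

dsname   <- "mtcars"
ds       <- get(dsname)
ds$ratio <- ds$hp / ds$wt      # Stand-in for a lengthy processing step.
processed[[dsname]] <- ds      # Keep the result under its own name.

dsname <- "iris"
ds     <- get(dsname)          # ds is reused; the mtcars result is safe.
processed[[dsname]] <- ds

names(processed)
```

For results that are expensive to recompute, writing them to disk with base::saveRDS() is another option.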



Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0