10.37 Introducing Template Variables
20180721 A reference to the original dataset can be created
using a template (or generic) variable. The new variable will be called
ds (short for dataset). Both ds and weatherAUS will now reference the
same dataset within the computer’s memory. As we modify ds those
modifications will only affect the data referenced by ds, leaving
weatherAUS unchanged. Effectively, an extra copy of the dataset in the
computer’s memory will start to grow as we change the data from its
original form. R avoids making copies of datasets unnecessarily and
so a simple assignment does not create a new copy. Only as modifications
are made through one or the other of the two variables is extra
memory used, to store the columns that differ between the datasets.
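The behaviour described above can be checked with a small experiment. The following is a minimal sketch that uses a tiny stand-in data frame, here called original (an illustrative name, not part of the weather data), rather than weatherAUS itself.

```r
# A small stand-in dataset; the name 'original' is illustrative only.
original <- data.frame(min_temp=c(13.4, 7.4, 12.9),
                       max_temp=c(22.9, 25.1, 25.7))

# A simple assignment: ds and original now refer to the same data.
ds <- original
same_before <- identical(ds, original)

# Modify a column through ds only.
ds$max_temp <- ds$max_temp + 1

# The data referenced by original is unchanged.
same_after <- identical(ds, original)
print(c(same_before, same_after))
print(original$max_temp)
```

Where R has been built with memory profiling support, base::tracemem() can additionally report the point at which a copy is actually made.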
From here on we no longer refer to the dataset as weatherAUS
but as ds. This allows the following analyses and
processing to be rather generic, turning the R code into a
template and so requiring only minor modification when used
with a different dataset assigned into ds.
Often we will find that we can simply load a different dataset into
memory, store it as ds, and the remaining steps of our
analyses and processing will essentially work unchanged.
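As a sketch of this reuse, the same generic steps can be run against two datasets that ship with R (iris and mtcars, chosen here only for illustration) by changing nothing but the dataset name. The helper template_summary() is hypothetical and simply stands in for the remaining steps of a template.

```r
# Hypothetical helper standing in for the generic steps of a template.
template_summary <- function(dsname)
{
  ds <- get(dsname)   # Link the named dataset to the generic variable.
  list(name=dsname,
       rows=nrow(ds),
       cols=ncol(ds),
       vars=names(ds))
}

# Only the dataset name changes between the two runs.
iris_summary   <- template_summary("iris")
mtcars_summary <- template_summary("mtcars")

print(iris_summary$rows)
print(mtcars_summary$vars)
```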
The first few steps of our template are then captured as creating the reference to the dataset and presenting our initial view of the dataset.
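# Prepare for a templated analysis and processing.
library(magrittr) # Provides the %<>% assignment pipe.
library(janitor)  # Provides clean_names().
library(dplyr)    # Provides glimpse().
dsname <- "weatherAUS"
ds <- get(dsname)
ds %<>% clean_names(numerals="right")
glimpse(ds)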
## Rows: 226,868
## Columns: 24
## $ date <date> 2008-12-01, 2008-12-02, 2008-12-03, 2008-12-04, 2008-…
## $ location <chr> "Albury", "Albury", "Albury", "Albury", "Albury", "Alb…
## $ min_temp <dbl> 13.4, 7.4, 12.9, 9.2, 17.5, 14.6, 14.3, 7.7, 9.7, 13.1…
## $ max_temp <dbl> 22.9, 25.1, 25.7, 28.0, 32.3, 29.7, 25.0, 26.7, 31.9, …
## $ rainfall <dbl> 0.6, 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0,…
## $ evaporation <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ sunshine <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ wind_gust_dir <ord> W, WNW, WSW, NE, W, WNW, W, W, NNW, W, N, NNE, W, SW, …
## $ wind_gust_speed <dbl> 44, 44, 46, 24, 41, 56, 50, 35, 80, 28, 30, 31, 61, 44…
## $ wind_dir_9am <ord> W, NNW, W, SE, ENE, W, SW, SSE, SE, S, SSE, NE, NNW, W…
## $ wind_dir_3pm <ord> WNW, WSW, WSW, E, NW, W, W, W, NW, SSE, ESE, ENE, NNW,…
## $ wind_speed_9am <dbl> 20, 4, 19, 11, 7, 19, 20, 6, 7, 15, 17, 15, 28, 24, 4,…
## $ wind_speed_3pm <dbl> 24, 22, 26, 9, 20, 24, 24, 17, 28, 11, 6, 13, 28, 20, …
## $ humidity_9am <int> 71, 44, 38, 45, 82, 55, 49, 48, 42, 58, 48, 89, 76, 65…
## $ humidity_3pm <int> 22, 25, 30, 16, 33, 23, 19, 19, 9, 27, 22, 91, 93, 43,…
## $ pressure_9am <dbl> 1007.7, 1010.6, 1007.6, 1017.6, 1010.8, 1009.2, 1009.6…
## $ pressure_3pm <dbl> 1007.1, 1007.8, 1008.7, 1012.8, 1006.0, 1005.4, 1008.2…
## $ cloud_9am <int> 8, NA, NA, NA, 7, NA, 1, NA, NA, NA, NA, 8, 8, NA, NA,…
## $ cloud_3pm <int> NA, NA, 2, NA, 8, NA, NA, NA, NA, NA, NA, 8, 8, 7, NA,…
## $ temp_9am <dbl> 16.9, 17.2, 21.0, 18.1, 17.8, 20.6, 18.1, 16.3, 18.3, …
## $ temp_3pm <dbl> 21.8, 24.3, 23.2, 26.5, 29.7, 28.9, 24.6, 25.5, 30.2, …
## $ rain_today <fct> No, No, No, No, No, No, No, No, No, Yes, No, Yes, Yes,…
## $ risk_mm <dbl> 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0, 2.2,…
## $ rain_tomorrow <fct> No, No, No, No, No, No, No, No, Yes, No, Yes, Yes, Yes…
We are a little tricky here in recording the dataset name in the
variable dsname and then using the function
base::get() to make a copy of the dataset reference and
link it to the generic variable ds. We could simply assign the
data to ds directly as we saw above. Either way the generic variable
ds refers to the same dataset. The use of
base::get() allows us to be a little more generic in our
template.
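The equivalence of the two approaches can be confirmed directly. This is a minimal sketch using the built-in mtcars dataset as a stand-in. The variable names ds_direct and ds_byname are illustrative only.

```r
dsname <- "mtcars"   # Illustrative; any dataset in memory would do.

ds_direct <- mtcars        # Direct assignment.
ds_byname <- get(dsname)   # Lookup by the name stored as a string.

# Both variables refer to the same data.
print(identical(ds_direct, ds_byname))

# exists() can guard against a misspelt or unloaded dataset name.
print(exists("no_such_dataset_xyz"))
```

Because the name is held as a string, dsname can also be reused later, for example in plot titles or file names.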
The use of generic variables within a
template for the tasks we perform on each new dataset
will have obvious advantages but we need to be careful. A
disadvantage is that we may be working with several datasets and
accidentally overwrite previously processed datasets referenced using
the same generic variable (ds). The
processing of the dataset might take some time and so accidentally
losing it is not an attractive proposition. Care needs to be taken to
avoid this.
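One possible safeguard, offered here as an assumption rather than as the approach taken in this book, is to keep each processed dataset under its own name before ds is reused. The sketch below stores the results in a named list called processed (a hypothetical name), using built-in datasets and a trivial stand-in for the real processing.

```r
processed <- list()   # Hypothetical store for processed datasets.

for (dsname in c("iris", "mtcars"))
{
  ds <- get(dsname)
  names(ds) <- tolower(names(ds))   # Stand-in for real processing.
  processed[[dsname]] <- ds         # Keep the result before ds is reused.
}

print(names(processed))
print(names(processed$iris))
```

An alternative along the same lines would be base::assign() with a name built from dsname, so that each processed dataset also remains available as its own variable.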
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0