10.68 Identify Variable Types

20180726 Metadata is data about the data. We now record data about our dataset that we use later in further processing and analysing our data. In one sense the metadata is simply a convenient store.

We identify the variables that will be used to build analytic models that provide different kinds of insight into our data. Above we identified the variable roles such as the target, the risk variable and the ignored variables. From an analytic modelling perspective we now identify the variables that are the model inputs. We record them both as a vector of characters (the variable names) and a vector of integers (the variable indices).

inputs <- setdiff(vars, target) %T>% print()

##  [1] "min_temp"        "max_temp"        "rainfall"        "evaporation"    
##  [5] "sunshine"        "wind_gust_dir"   "wind_gust_speed" "wind_dir_9am"   
##  [9] "wind_dir_3pm"    "wind_speed_9am"  "wind_speed_3pm"  "humidity_9am"   
## [13] "humidity_3pm"    "pressure_9am"    "cloud_9am"       "cloud_3pm"      
## [17] "rain_today"

The integer indices are determined from the base::names() of the variables in the original dataset. Note the use of USE.NAMES=FALSE in base::sapply() to turn off the inclusion of names in the result, which keeps it as a simple unnamed vector.

inputi <- sapply(inputs,
                 function(x) which(x == names(ds)),
                 USE.NAMES=FALSE)
inputi

##  [1]  3  4  5  6  7  8  9 10 11 12 13 14 15 16 18 19 22
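As a cross-check, the same index lookup can also be written with base::match(). The following is a minimal sketch on a small made-up data frame: toy_ds and toy_inputs are hypothetical names introduced only for illustration and are not the weather dataset used above.

```r
# Hypothetical toy data frame standing in for ds (illustration only).
toy_ds <- data.frame(date          = 1:3,
                     min_temp      = c(8, 14, 13),
                     max_temp      = c(24, 27, 23),
                     rain_tomorrow = c("no", "yes", "no"))

# Hypothetical input variable names.
toy_inputs <- c("min_temp", "max_temp")

# The sapply() approach used in the text.
toy_inputi <- sapply(toy_inputs,
                     function(x) which(x == names(toy_ds)),
                     USE.NAMES=FALSE)

# An alternative using match(), which returns the position of each
# name within names(toy_ds).
toy_matchi <- match(toy_inputs, names(toy_ds))

# Both approaches should agree on an unnamed integer vector.
stopifnot(identical(toy_inputi, toy_matchi))
toy_inputi
```

One difference worth keeping in mind: for a name that is not in the dataset, match() returns NA, whereas which() returns an empty vector and base::sapply() then falls back to returning a list rather than a simple vector.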

For convenience we record the number of observations:

nobs <- nrow(ds) %T>% print()

## [1] 220094

Here we simply report on the dimensions of various data subsets, primarily to confirm that the datasets appear as we expect:

dim(ds)

## [1] 220094     24

dim(ds[vars])

## [1] 220094     18

dim(ds[inputs])

## [1] 220094     17
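Checks like these can also be expressed as assertions so that a script stops when the dimensions are not as expected. This is only a sketch under assumed names: toy_ds, toy_target, toy_vars and toy_inputs are hypothetical stand-ins for the objects used in this chapter.

```r
# Hypothetical toy data frame and variable roles (illustration only).
toy_ds <- data.frame(date          = 1:3,
                     min_temp      = c(8, 14, 13),
                     max_temp      = c(24, 27, 23),
                     rain_tomorrow = c("no", "yes", "no"))
toy_target <- "rain_tomorrow"
toy_vars   <- c("min_temp", "max_temp", "rain_tomorrow")

# The inputs are the modelling variables without the target.
toy_inputs <- setdiff(toy_vars, toy_target)

# Stop with an error if any expectation about the subsets fails.
stopifnot(
  ncol(toy_ds[toy_vars])   == length(toy_vars),      # one column per variable
  ncol(toy_ds[toy_inputs]) == length(toy_vars) - 1,  # the target is dropped
  nrow(toy_ds[toy_inputs]) == nrow(toy_ds),          # no observations lost
  !(toy_target %in% toy_inputs)                      # target is not an input
)
```

base::stopifnot() is silent when every condition holds, so it can sit in a data preparation script without adding to the output.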

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0