10.44 ID Variables
20180723 From our observations so far we note that the variable date acts as an identifier, as does the variable location: given a date and a location we have an observation of the remaining variables. These two variables are therefore so-called identifiers. Identifiers would not usually be used as independent (input) variables when building predictive analytics models.
# Note any identifiers.
id <- c("date", "location")
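The two claims above, that a date and a location together pick out one observation and that identifiers are left out of the model inputs, can be checked and applied directly. The following is a minimal sketch that assumes a small hypothetical data frame (toy, with made-up columns min_temp and rainfall) standing in for ds. The names toy, dups and vars are illustrative only.

```r
library(dplyr)

# Hypothetical toy dataset standing in for ds (values are made up).
toy <- data.frame(
  date     = as.Date(c("2018-07-01", "2018-07-02", "2018-07-01", "2018-07-02")),
  location = c("Albury", "Albury", "Brisbane", "Brisbane"),
  min_temp = c(2.1, 3.4, 11.0, 12.3),
  rainfall = c(0.0, 1.2, 0.0, 5.6)
)

# Note any identifiers.
id <- c("date", "location")

# Count the rows for each date/location pair; a count above 1
# would mean the identifiers do not single out one observation.
dups <- toy %>%
  count(date, location) %>%
  filter(n > 1)

nrow(dups) == 0

# Candidate input variables are all columns except the identifiers.
vars <- setdiff(names(toy), id)
vars
```

If dups has no rows then each date and location pair occurs exactly once, which is what we expect of identifiers.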
We can get a sense of how this works with the following, which lists a random sample of 10 locations together with how long observations have been collected for each.
ds[id] %>%
  group_by(location) %>%
  count() %>%
  rename(days=n) %>%
  mutate(years=round(days/365)) %>%
  as.data.frame() %>%
  sample_n(10)
##         location days years
## 1    Witchcliffe 3648    10
## 2       Brisbane 3833    11
## 3      Melbourne 3833    11
## 4       Richmond 3649    10
## 5       Portland 3649    10
## 6         Albury 3680    10
## 7     Wollongong 3680    10
## 8  SydneyAirport 3649    10
## 9     Launceston 3680    10
## 10     GoldCoast 3680    10
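Because the sample is random, the locations listed will differ from run to run. The following is a minimal sketch, assuming a hypothetical data frame toy built from a few of the rows above, of fixing the seed so that the sample is repeatable. It uses slice_sample(), which recent versions of dplyr (1.0.0 and later) recommend in place of sample_n(). The seed value is arbitrary.

```r
library(dplyr)

# Hypothetical per-location summary standing in for the grouped counts.
toy <- data.frame(
  location = c("Albury", "Brisbane", "Melbourne", "Portland", "Richmond"),
  days     = c(3680, 3833, 3833, 3649, 3649)
)

# Setting the same seed before each draw gives the same sample.
set.seed(42)
first <- toy %>% slice_sample(n = 3)

set.seed(42)
second <- toy %>% slice_sample(n = 3)

identical(first, second)
```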
Summarising over all locations, the data for each location ranges in length from 6 years up to 11 years, though most locations have about 10 years of data.
ds[id] %>%
  group_by(location) %>%
  count() %>%
  rename(days=n) %>%
  mutate(years=round(days/365)) %>%
  ungroup() %>%
  select(years) %>%
  summary()
##      years
##  Min.   : 6.000
##  1st Qu.:10.000
##  Median :10.000
##  Mean   : 9.878
##  3rd Qu.:10.000
##  Max.   :11.000
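The quartiles suggest which record length is most common, but a direct tabulation of locations per number of years makes it explicit. The following is a minimal sketch on a hypothetical data frame toy with three made-up locations (A, B and C) rather than ds. The name argument to count() is used here in place of the separate rename() step.

```r
library(dplyr)

# Hypothetical identifiers: two locations with 730 daily
# observations and one location with 365.
toy <- data.frame(
  date = c(seq(as.Date("2010-01-01"), by = "day", length.out = 730),
           seq(as.Date("2010-01-01"), by = "day", length.out = 730),
           seq(as.Date("2011-01-01"), by = "day", length.out = 365)),
  location = rep(c("A", "B", "C"), times = c(730, 730, 365))
)

# Days per location, rounded to years, then locations per year count.
years_tab <- toy %>%
  count(location, name = "days") %>%
  mutate(years = round(days / 365)) %>%
  count(years, name = "locations")

years_tab
```

Applied to ds[id], the same pipeline would list how many locations have each number of years of data.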
Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0
