3.1 A Data Frame as a Dataset
20210103 A data frame is essentially a rectangular table
(or matrix) of data consisting of rows (observations) and
columns (variables). We can use
base::print.data.frame() to view a table, here choosing the
first 10 observations of the first 6 variables of the ds
dataset.
## date location min_temp max_temp rainfall evaporation
## 1 2008-12-01 Albury 13.4 22.9 0.6 NA
## 2 2008-12-02 Albury 7.4 25.1 0.0 NA
## 3 2008-12-03 Albury 12.9 25.7 0.0 NA
## 4 2008-12-04 Albury 9.2 28.0 0.0 NA
## 5 2008-12-05 Albury 17.5 32.3 1.0 NA
## 6 2008-12-06 Albury 14.6 29.7 0.2 NA
## 7 2008-12-07 Albury 14.3 25.0 0.0 NA
## 8 2008-12-08 Albury 7.7 26.7 0.0 NA
## 9 2008-12-09 Albury 9.7 31.9 0.0 NA
## 10 2008-12-10 Albury 13.1 30.1 1.4 NA
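A view like the one above can be obtained by indexing the first 10 rows and first 6 columns before printing. The following is only a minimal sketch using base R: ds_demo is a hypothetical stand-in for ds, filled with made-up values, and the sunshine column is an assumed seventh variable added so that the column selection has something to drop.

```r
# Hypothetical stand-in for ds: 12 observations of 7 variables.
# All values are made up for illustration only.
set.seed(42)
ds_demo <- data.frame(
  date        = as.Date("2008-12-01") + 0:11,
  location    = "Albury",
  min_temp    = round(runif(12, 5, 18), 1),
  max_temp    = round(runif(12, 20, 33), 1),
  rainfall    = round(runif(12, 0, 2), 1),
  evaporation = NA_real_,
  sunshine    = NA_real_
)

# Keep the first 10 observations (rows) of the first 6 variables (columns).
first_view <- ds_demo[1:10, 1:6]

# Print as a plain data frame rather than with any other print method.
print.data.frame(first_view)
```

The same row and column indexing applies to any data frame, so ds could be substituted for ds_demo once it has been loaded.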
Alternatively we might sample 10 random observations (dplyr::sample_n()) of 5 random variables (dplyr::select()):
# Display a random selection of observations and variables.
ds %>%
  sample_n(10) %>%
  select(sample(1:ncol(ds), 5)) %>%
  print.data.frame()
## humidity_3pm min_temp wind_gust_speed max_temp rainfall
## 1 75 11.5 39 16.8 20.2
## 2 68 13.6 59 17.9 8.2
## 3 23 21.0 74 34.1 0.0
## 4 54 15.9 46 26.0 6.8
## 5 43 1.8 22 20.0 0.0
## 6 36 9.9 43 26.8 0.0
## 7 16 18.9 39 28.2 0.0
## 8 40 12.9 35 31.0 0.0
## 9 12 6.5 37 32.2 0.0
## 10 44 10.2 46 18.4 0.4
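Because the observations and variables are drawn at random, each run will generally show a different selection. The sketch below does the same kind of sampling in base R and fixes the random seed so that the selection can be repeated. It makes several assumptions: ds_demo is a hypothetical stand-in for ds with made-up values, var1 to var8 are placeholder variable names, and the seed values are arbitrary.

```r
# Hypothetical stand-in for ds: 20 observations of 8 numeric variables.
set.seed(123)
ds_demo <- as.data.frame(matrix(round(runif(160, 0, 40), 1), nrow = 20))
names(ds_demo) <- paste0("var", 1:8)

# Fix the seed so the same rows and columns are drawn on every run.
set.seed(2021)
rows <- sample(nrow(ds_demo), 10)   # 10 random observations
cols <- sample(ncol(ds_demo), 5)    # 5 random variables

# Subset the data frame and print it as a plain data frame.
random_view <- ds_demo[rows, cols]
print.data.frame(random_view)
```

Removing the second set.seed() call gives a fresh selection on each run, which matches the behaviour of the pipeline shown earlier.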
This tabular form, with rows for observations and columns for variables, is common in data science, and we refer to it as our dataset.
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0