3.1 A Data Frame as a Dataset

20210103 A data frame is essentially a rectangular table (or matrix) of data consisting of rows (observations) and columns (variables). We can use base::print.data.frame() to view the table, here choosing the first 10 observations of the first 6 variables of the ds dataset.

# Display the table structure of the ingested dataset.

ds[1:10,1:6] %>% print.data.frame()
##          date location min_temp max_temp rainfall evaporation
## 1  2008-12-01   Albury     13.4     22.9      0.6          NA
## 2  2008-12-02   Albury      7.4     25.1      0.0          NA
## 3  2008-12-03   Albury     12.9     25.7      0.0          NA
## 4  2008-12-04   Albury      9.2     28.0      0.0          NA
## 5  2008-12-05   Albury     17.5     32.3      1.0          NA
## 6  2008-12-06   Albury     14.6     29.7      0.2          NA
## 7  2008-12-07   Albury     14.3     25.0      0.0          NA
## 8  2008-12-08   Albury      7.7     26.7      0.0          NA
## 9  2008-12-09   Albury      9.7     31.9      0.0          NA
## 10 2008-12-10   Albury     13.1     30.1      1.4          NA
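
To make the row and column structure concrete, the following minimal sketch builds a small data frame by hand and inspects its shape. The data frame toy and its values are hypothetical, invented purely for illustration, and are not part of the ds dataset.

```r
# A hypothetical three-observation data frame (illustrative values only).
toy <- data.frame(
  location = c("A", "B", "C"),
  min_temp = c(13.4, 7.4, 12.9),
  rain     = c(TRUE, FALSE, FALSE)
)

# Each row is an observation and each column is a variable.
nrow(toy)    # number of observations
ncol(toy)    # number of variables
names(toy)   # variable names

# Columns need not share a type; check the class of each one.
sapply(toy, class)

# Index by [rows, columns], in the same way as ds[1:10,1:6].
sub <- toy[1:2, 1:2]
print.data.frame(sub)
```

The same [rows, columns] indexing applies to a data frame of any size, which is why the first 10 observations of the first 6 variables can be chosen with a single expression.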

Alternatively, we might sample 10 random observations (dplyr::sample_n()) of 5 random variables (dplyr::select()):

# Display a random selection of observations and variables.

ds %>%
  sample_n(10) %>%
  select(sample(1:ncol(ds), 5)) %>%
  print.data.frame()
##    wind_gust_speed max_temp rainfall wind_speed_3pm rain_today
## 1               30     25.1      0.2             11         No
## 2               72     30.7      1.0             33         No
## 3               56     14.9      1.6             20        Yes
## 4               33     28.8      0.0             20         No
## 5               37     31.3      0.0             20         No
## 6               35     35.7      0.0             15         No
## 7               24     15.5      0.0             15         No
## 8               22     22.5      0.0              6         No
## 9               31     20.0      0.4              4         No
## 10              35     21.9      0.0             17         No
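
Because the selection is random, each run will generally show different observations and variables. A minimal sketch of a repeatable version follows, assuming a fixed seed set with base::set.seed(). The built-in mtcars dataset stands in for ds so that the example is self-contained, and the seed value 42 is arbitrary; both are assumptions for illustration. Base R indexing is used here in place of the dplyr functions.

```r
# mtcars stands in for ds so that the example is self-contained.
df <- mtcars

# Fix the random seed so the same sample is drawn on every run.
set.seed(42)
rows <- sample(nrow(df), 10)   # 10 random observations
cols <- sample(ncol(df), 5)    # 5 random variables
first <- df[rows, cols]

# Resetting the seed and repeating the draws reproduces the selection.
set.seed(42)
rows2 <- sample(nrow(df), 10)
cols2 <- sample(ncol(df), 5)
second <- df[rows2, cols2]

print.data.frame(first)
identical(first, second)
```

Fixing the seed is useful when a random sample needs to be shared or discussed, since everyone running the code sees the same table.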

This tabular form, with rows as observations and columns as variables, is common in data science, and we refer to it as our dataset.



Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0.