9.3 Random Sample
20200317
A common task is to randomly sample rows from a dataset. The dplyr::sample_frac() function will randomly choose a specified fraction (e.g. 20%) of the rows of the dataset:
## # A tibble: 45,374 × 24
## date location min_temp max_temp rainfall evaporation sunshine
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-10-21 Portland 8.5 22.4 0 4.6 12
## 2 2014-02-21 GoldCoast 24.7 29.8 0 NA NA
## 3 2017-06-04 PerthAirport 12.8 22.6 0 1.8 9.1
## 4 2012-09-16 Richmond 4.6 23.3 0 NA NA
## 5 2019-07-21 WaggaWagga 2.5 17 0 NA NA
## 6 2018-07-07 AliceSprings 0.3 17.4 0 NA NA
## 7 2023-03-01 Ballarat 9.9 21.7 0.2 NA NA
## 8 2018-10-10 NorfolkIsland 16.3 21 0.2 3.2 NA
## 9 2009-02-04 Nuriootpa 12.8 35.6 0 11.4 13.1
## 10 2011-07-08 WaggaWagga -4.7 8.3 0 1.2 2.4
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## # wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## # wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## # pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## # temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## # rain_tomorrow <fct>
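The call behind output like the above is not shown here, so the following is only a minimal sketch of how such a sample might be drawn. It assumes the dataset is held in a data frame called `ds`; the small tibble built below is a hypothetical stand-in, not the weather data itself.

```r
library(dplyr)

# Hypothetical stand-in for the dataset `ds` (an assumption for illustration).
ds <- tibble(
  id       = 1:1000,
  location = sample(c("Perth", "Bendigo", "Walpole"), 1000, replace = TRUE),
  rainfall = round(runif(1000, 0, 20), 1)
)

# Randomly choose 20% of the rows, without replacement.
smpl <- ds %>% sample_frac(0.2)

nrow(ds)    # 1000
nrow(smpl)  # 200
```

Recent dplyr releases mark sample_frac() as superseded by slice_sample(), so `ds %>% slice_sample(prop = 0.2)` should be an equivalent way to write the same step.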
The next time you randomly sample the dataset, without fixing the random seed, the resulting sample will be different:
## # A tibble: 45,374 × 24
## date location min_temp max_temp rainfall evaporation sunshine
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015-04-27 Watsonia 9.8 15 4.4 2.8 3.3
## 2 2012-09-12 SydneyAirport 10.8 22.9 0 3.8 10.2
## 3 2019-05-21 Katherine NA NA NA NA NA
## 4 2022-07-09 Bendigo 1.5 12.7 0.4 NA NA
## 5 2013-11-09 Newcastle 19.5 36.2 0 NA NA
## 6 2023-03-24 WaggaWagga 13.2 28.6 16.6 NA NA
## 7 2014-09-06 Penrith 11.2 17.4 1.2 NA NA
## 8 2012-04-23 Perth 11.3 23.1 0 5 10.5
## 9 2020-07-28 Walpole 10.4 16 20 NA NA
## 10 2021-06-07 Dartmoor 12.5 18.4 2.2 NA NA
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## # wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## # wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## # pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## # temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## # rain_tomorrow <fct>
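As a rough illustration of this behaviour, the sketch below draws two samples in a row without setting a seed and compares them. The tibble `ds` and its `id` column are hypothetical stand-ins introduced only for this example.

```r
library(dplyr)

# Hypothetical stand-in for the dataset `ds` (an assumption for illustration).
ds <- tibble(id = 1:1000)

# Two consecutive draws with no seed set in between.
first  <- ds %>% sample_frac(0.2)
second <- ds %>% sample_frac(0.2)

# Same sample size, but the selected rows will almost certainly differ.
nrow(first) == nrow(second)
identical(first$id, second$id)
```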
To obtain the same random sample each time, set the seed with base::set.seed() before sampling:
## # A tibble: 45,374 × 24
## date location min_temp max_temp rainfall evaporation sunshine
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019-12-14 MelbourneAirport 13.4 22 0 9.4 2.5
## 2 2014-02-02 SalmonGums 16.4 26.2 7.2 NA NA
## 3 2014-11-10 MountGambier 6.8 20.4 0 5.6 9.9
## 4 2020-12-05 BadgerysCreek 16.4 24.7 NA NA NA
## 5 2022-10-13 SydneyAirport 15.4 22.3 0 3.2 3.4
## 6 2010-04-02 Adelaide 16.1 26.7 0 NA 10.8
## 7 2010-10-08 Walpole 11 27.4 0 NA NA
## 8 2014-01-23 Bendigo 13.8 33.9 0 8.2 NA
## 9 2018-05-03 MountGinini 5.4 13.7 0 NA NA
## 10 2018-12-06 Witchcliffe 11.9 20.5 15.4 NA NA
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## # wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## # wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## # pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## # temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## # rain_tomorrow <fct>
Setting the same seed again and resampling returns the identical rows:
## # A tibble: 45,374 × 24
## date location min_temp max_temp rainfall evaporation sunshine
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019-12-14 MelbourneAirport 13.4 22 0 9.4 2.5
## 2 2014-02-02 SalmonGums 16.4 26.2 7.2 NA NA
## 3 2014-11-10 MountGambier 6.8 20.4 0 5.6 9.9
## 4 2020-12-05 BadgerysCreek 16.4 24.7 NA NA NA
## 5 2022-10-13 SydneyAirport 15.4 22.3 0 3.2 3.4
## 6 2010-04-02 Adelaide 16.1 26.7 0 NA 10.8
## 7 2010-10-08 Walpole 11 27.4 0 NA NA
## 8 2014-01-23 Bendigo 13.8 33.9 0 8.2 NA
## 9 2018-05-03 MountGinini 5.4 13.7 0 NA NA
## 10 2018-12-06 Witchcliffe 11.9 20.5 15.4 NA NA
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## # wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## # wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## # pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## # temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## # rain_tomorrow <fct>
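The seeded sampling can be sketched as follows. The seed value 42 is arbitrary and the tibble `ds` is a hypothetical stand-in; the seed actually used to produce the output above is not shown in this section.

```r
library(dplyr)

# Hypothetical stand-in for the dataset `ds` (an assumption for illustration).
ds <- tibble(id = 1:1000)

# Set the seed immediately before each draw: the random number generator's
# state advances with every call, so the seed must be reset to repeat a sample.
set.seed(42)  # arbitrary seed chosen for this example
a <- ds %>% sample_frac(0.2)

set.seed(42)
b <- ds %>% sample_frac(0.2)

identical(a$id, b$id)  # TRUE
```

Recording the seed alongside the analysis is what makes the sample reproducible later; a different seed would generally select a different set of rows.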
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0