9.3 Random Sample
20200317
A common task is to randomly sample rows from a dataset. The dplyr::sample_frac() function will randomly choose a specified fraction (e.g. 20%) of the rows of the dataset:
ds %>% sample_frac(0.2)
## # A tibble: 45,374 × 24
## date location min_temp max_temp rainfall evaporation sunshine
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2018-08-06 Williamtown 4.9 16.7 0 NA NA
## 2 2023-03-11 MountGambier 8.3 20.9 0.2 NA NA
## 3 2010-01-25 Ballarat 10 27.9 0 NA NA
## 4 2009-04-13 Albany 17.5 23 0 4 9.1
## 5 2017-11-28 WaggaWagga 17.4 32.7 0.6 5.2 NA
## 6 2010-12-11 Witchcliffe 9.3 23.5 0 NA NA
## 7 2014-07-30 Wollongong 12.7 22.4 NA NA NA
## 8 2015-12-17 Hobart 15.4 26.5 0 9.4 10.9
## 9 2012-10-02 Bendigo 2.3 22 0 NA NA
## 10 2010-07-01 Penrith 1.4 16.6 0 NA NA
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## # wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## # wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## # pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## # temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## # rain_tomorrow <fct>
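As a small self-contained check of the idea above, the following sketch samples 20% of a toy dataset and confirms the number of rows returned. The data frame `toy` and its columns `id` and `value` are made up for illustration and stand in for `ds`. The sketch also shows `dplyr::slice_sample()`, which dplyr 1.0.0 and later offer as the newer interface for the same task (`sample_frac()` still works but is marked as superseded).

```r
library(dplyr)

# Hypothetical toy data standing in for ds: 100 rows with an id column.
toy <- tibble(id = 1:100, value = id^2)

# Randomly choose 20% of the rows, as in the text.
smpl <- toy %>% sample_frac(0.2)
nrow(smpl)

# The same task using the newer slice_sample() interface (dplyr >= 1.0.0).
smpl2 <- toy %>% slice_sample(prop = 0.2)
nrow(smpl2)
```

Both functions sample without replacement by default, so no row of `toy` should appear more than once in either sample.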
Each call draws a new random sample, so the next time you sample the dataset the result will be different:
ds %>% sample_frac(0.2)
## # A tibble: 45,374 × 24
## date location min_temp max_temp rainfall evaporation sunshine
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2017-11-12 Cobar 17.3 32.2 0 NA NA
## 2 2011-03-03 NorahHead 18.4 30.8 0 NA NA
## 3 2019-03-09 MountGinini 8.9 19.4 3 NA NA
## 4 2020-07-06 Hobart 7.9 13.2 0 1 NA
## 5 2018-03-29 Portland 12.2 21 0 NA NA
## 6 2010-11-08 GoldCoast 20.8 26.6 2.2 NA NA
## 7 2021-02-03 Sale 9 20.2 0 NA NA
## 8 2008-09-12 Adelaide 10.6 21.3 0 4.4 9.4
## 9 2014-03-31 Townsville 24.3 30.4 0 5.8 10.9
## 10 2014-01-03 WaggaWagga 19.4 35.1 0 6.4 6
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## # wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## # wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## # pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## # temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## # rain_tomorrow <fct>
To ensure the same random sample each time, use base::set.seed():
set.seed(72346)
ds %>% sample_frac(0.2)
## # A tibble: 45,374 × 24
## date location min_temp max_temp rainfall evaporation sunshine
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019-12-14 MelbourneAirport 13.4 22 0 9.4 2.5
## 2 2014-02-02 SalmonGums 16.4 26.2 7.2 NA NA
## 3 2014-11-10 MountGambier 6.8 20.4 0 5.6 9.9
## 4 2020-12-05 BadgerysCreek 16.4 24.7 NA NA NA
## 5 2022-10-13 SydneyAirport 15.4 22.3 0 3.2 3.4
## 6 2010-04-02 Adelaide 16.1 26.7 0 NA 10.8
## 7 2010-10-08 Walpole 11 27.4 0 NA NA
## 8 2014-01-23 Bendigo 13.8 33.9 0 8.2 NA
## 9 2018-05-03 MountGinini 5.4 13.7 0 NA NA
## 10 2018-12-06 Witchcliffe 11.9 20.5 15.4 NA NA
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## # wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## # wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## # pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## # temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## # rain_tomorrow <fct>
Setting the same seed again reproduces exactly the same sample:
set.seed(72346)
ds %>% sample_frac(0.2)
## # A tibble: 45,374 × 24
## date location min_temp max_temp rainfall evaporation sunshine
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019-12-14 MelbourneAirport 13.4 22 0 9.4 2.5
## 2 2014-02-02 SalmonGums 16.4 26.2 7.2 NA NA
## 3 2014-11-10 MountGambier 6.8 20.4 0 5.6 9.9
## 4 2020-12-05 BadgerysCreek 16.4 24.7 NA NA NA
## 5 2022-10-13 SydneyAirport 15.4 22.3 0 3.2 3.4
## 6 2010-04-02 Adelaide 16.1 26.7 0 NA 10.8
## 7 2010-10-08 Walpole 11 27.4 0 NA NA
## 8 2014-01-23 Bendigo 13.8 33.9 0 8.2 NA
## 9 2018-05-03 MountGinini 5.4 13.7 0 NA NA
## 10 2018-12-06 Witchcliffe 11.9 20.5 15.4 NA NA
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## # wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## # wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## # pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## # temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## # rain_tomorrow <fct>
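Rather than comparing the printed rows by eye, the effect of the seed can be checked programmatically. The following is a minimal sketch, assuming a made-up data frame `toy` in place of `ds`. The object names `first`, `second` and `third` are illustrative only.

```r
library(dplyr)

# Hypothetical toy data standing in for ds.
toy <- tibble(id = 1:100)

# Seed the random number generator, then sample.
set.seed(72346)
first <- toy %>% sample_frac(0.2)

# Sample again WITHOUT resetting the seed: the generator has moved on.
second <- toy %>% sample_frac(0.2)

# Reset to the same seed and sample once more.
set.seed(72346)
third <- toy %>% sample_frac(0.2)

identical(first, third)   # same seed, same sample
identical(first, second)  # no reseed, a different sample is expected
```

The first comparison should return TRUE. The second will almost certainly return FALSE, since the chance of drawing the same 20 rows in the same order twice is negligible. Note that the seed must be set immediately before each sampling call that needs to be reproduced.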
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0