9.3 Random Sample

20200317

A common task is to randomly sample rows from a dataset. The dplyr::sample_frac() function will randomly choose a specified fraction (e.g. 20%) of the rows of the dataset:

ds %>% sample_frac(0.2)
## # A tibble: 35,349 x 24
##    date       location     min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>           <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2010-09-08 Cairns           23.5     30.2      0           7        7.8
##  2 2011-10-25 Mildura           8.7     21.5     13.4         4.4     12.2
##  3 2017-06-24 Launceston       -0.5     11.4      8.4        NA       NA  
##  4 2014-09-26 Katherine        20.1     37.5      0          11       NA  
##  5 2015-02-05 Nhil              8.1     32.8      0          NA       NA  
##  6 2015-02-03 CoffsHarbour     18.1     22.3    104.         NA       NA  
##  7 2009-05-20 Cobar            12.5     15.6     17.6         0        0  
##  8 2010-01-19 Hobart           11.7     19.4      0.6         5        8.1
##  9 2018-12-31 Sydney           21.7     30.5      0           8        4.1
## 10 2011-12-19 Mildura          15.7     28.4      0           5.8     12.9
## # … with 35,339 more rows, and 17 more variables: wind_gust_dir <ord>,
## #   wind_gust_speed <dbl>, wind_dir_9am <ord>, wind_dir_3pm <ord>,
## #   wind_speed_9am <dbl>, wind_speed_3pm <dbl>, humidity_9am <int>,
## #   humidity_3pm <int>, pressure_9am <dbl>, pressure_3pm <dbl>,
## #   cloud_9am <int>, cloud_3pm <int>, temp_9am <dbl>, temp_3pm <dbl>,
## #   rain_today <fct>, risk_mm <dbl>, rain_tomorrow <fct>
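
To check how many rows a given fraction returns, it can help to try the call on a small table first. The following is a minimal sketch, assuming a hypothetical 100-row tibble named toy standing in for ds (it is not part of the weather dataset used above):

```r
library(dplyr)

# Hypothetical stand-in for ds: 100 rows with a unique id.
toy <- tibble(id = 1:100)

# Draw 20% of the rows. Sampling is without replacement by default,
# so no row can appear twice in the result.
smp <- toy %>% sample_frac(0.2)

nrow(smp)            # 20 rows, i.e. 20% of 100
n_distinct(smp$id)   # 20 distinct ids, confirming no repeats
```

From dplyr 1.0.0 onwards sample_frac() is marked as superseded by dplyr::slice_sample(), called as slice_sample(prop = 0.2). The sample_frac() call used here still works.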

The next time you randomly sample the dataset the resulting sample will be different:

ds %>% sample_frac(0.2)
## # A tibble: 35,349 x 24
##    date       location      min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>            <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2011-03-22 BadgerysCreek     19.2     32       13.2        NA       NA  
##  2 2011-05-26 Newcastle          9.9     17.2      0          NA       NA  
##  3 2013-07-04 AliceSprings       0.3     23.2      0           4       10.5
##  4 2017-06-23 Cairns            19.7     27.3      1          NA       NA  
##  5 2019-01-16 Uluru             30.1     45.3      0          NA       NA  
##  6 2013-06-12 Uluru              3.1     17.3      0.2        NA       NA  
##  7 2019-09-02 Brisbane          11.6     29.3      0           5       10.1
##  8 2012-10-28 Richmond          13.9     21.9      0          NA       NA  
##  9 2010-11-26 Melbourne         16       25.1      1.8         0.8      4.6
## 10 2012-03-09 NorfolkIsland     19.1     25.2      0.2         5.2      6.6
## # … with 35,339 more rows, and 17 more variables: wind_gust_dir <ord>,
## #   wind_gust_speed <dbl>, wind_dir_9am <ord>, wind_dir_3pm <ord>,
## #   wind_speed_9am <dbl>, wind_speed_3pm <dbl>, humidity_9am <int>,
## #   humidity_3pm <int>, pressure_9am <dbl>, pressure_3pm <dbl>,
## #   cloud_9am <int>, cloud_3pm <int>, temp_9am <dbl>, temp_3pm <dbl>,
## #   rain_today <fct>, risk_mm <dbl>, rain_tomorrow <fct>

To obtain the same random sample each time, set the seed of the random number generator with base::set.seed() before sampling:

set.seed(72346)
ds %>% sample_frac(0.2)
## # A tibble: 35,349 x 24
##    date       location    min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>          <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2010-07-20 Brisbane        13.4     20.7      1           0.8     NA  
##  2 2015-08-24 Walpole          6       15.7      3.2        NA       NA  
##  3 2012-08-02 Cobar            1.6     17.2      0           2.4     NA  
##  4 2010-03-16 Canberra         9.9     26.6      0           4.6     11.5
##  5 2010-08-03 SalmonGums       0.8     20        0          NA       NA  
##  6 2010-06-20 Portland         9.5     14.1      2.6         0.6      4  
##  7 2014-01-11 Melbourne       17.8     22.7      0          12.2      6.7
##  8 2018-01-29 Uluru           26.1     40.8      0.2        NA       NA  
##  9 2011-01-22 Williamtown     14.8     29.5      0           7.8     12.8
## 10 2012-07-14 Cairns          22.6     28.5      0.8         2        2.5
## # … with 35,339 more rows, and 17 more variables: wind_gust_dir <ord>,
## #   wind_gust_speed <dbl>, wind_dir_9am <ord>, wind_dir_3pm <ord>,
## #   wind_speed_9am <dbl>, wind_speed_3pm <dbl>, humidity_9am <int>,
## #   humidity_3pm <int>, pressure_9am <dbl>, pressure_3pm <dbl>,
## #   cloud_9am <int>, cloud_3pm <int>, temp_9am <dbl>, temp_3pm <dbl>,
## #   rain_today <fct>, risk_mm <dbl>, rain_tomorrow <fct>

Resetting the seed to the same value and sampling again returns exactly the same rows:

set.seed(72346)
ds %>% sample_frac(0.2)
## # A tibble: 35,349 x 24
##    date       location    min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>          <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2010-07-20 Brisbane        13.4     20.7      1           0.8     NA  
##  2 2015-08-24 Walpole          6       15.7      3.2        NA       NA  
##  3 2012-08-02 Cobar            1.6     17.2      0           2.4     NA  
##  4 2010-03-16 Canberra         9.9     26.6      0           4.6     11.5
##  5 2010-08-03 SalmonGums       0.8     20        0          NA       NA  
##  6 2010-06-20 Portland         9.5     14.1      2.6         0.6      4  
##  7 2014-01-11 Melbourne       17.8     22.7      0          12.2      6.7
##  8 2018-01-29 Uluru           26.1     40.8      0.2        NA       NA  
##  9 2011-01-22 Williamtown     14.8     29.5      0           7.8     12.8
## 10 2012-07-14 Cairns          22.6     28.5      0.8         2        2.5
## # … with 35,339 more rows, and 17 more variables: wind_gust_dir <ord>,
## #   wind_gust_speed <dbl>, wind_dir_9am <ord>, wind_dir_3pm <ord>,
## #   wind_speed_9am <dbl>, wind_speed_3pm <dbl>, humidity_9am <int>,
## #   humidity_3pm <int>, pressure_9am <dbl>, pressure_3pm <dbl>,
## #   cloud_9am <int>, cloud_3pm <int>, temp_9am <dbl>, temp_3pm <dbl>,
## #   rain_today <fct>, risk_mm <dbl>, rain_tomorrow <fct>
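
Rather than comparing the printed rows by eye, the two samples can be compared programmatically. This is a minimal sketch, assuming a hypothetical 1,000-row tibble named toy in place of ds. The names first, second and third are illustrative only:

```r
library(dplyr)

# Hypothetical stand-in for ds.
toy <- tibble(id = 1:1000)

# Set the same seed immediately before each call:
# the two samples then match row for row.
set.seed(72346)
first <- toy %>% sample_frac(0.2)

set.seed(72346)
second <- toy %>% sample_frac(0.2)

identical(first, second)   # TRUE

# Without resetting the seed the generator has moved on,
# so a further draw will almost certainly differ from the first.
third <- toy %>% sample_frac(0.2)
identical(first, third)
```

Note that the seed has to be set immediately before each sampling call. Setting it once at the top of a script makes the whole script reproducible from start to finish, but it does not make successive calls return the same sample.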


Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0.