10.43 Data Review
20180721 Having ingested the dataset and normalised the variable names, we can now explore it further. Using dplyr::glimpse() gives us some insight:
## Rows: 226,868
## Columns: 24
## $ date <date> 2008-12-01, 2008-12-02, 2008-12-03, 2008-12-04, 2008-…
## $ location <chr> "Albury", "Albury", "Albury", "Albury", "Albury", "Alb…
## $ min_temp <dbl> 13.4, 7.4, 12.9, 9.2, 17.5, 14.6, 14.3, 7.7, 9.7, 13.1…
## $ max_temp <dbl> 22.9, 25.1, 25.7, 28.0, 32.3, 29.7, 25.0, 26.7, 31.9, …
## $ rainfall <dbl> 0.6, 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0,…
## $ evaporation <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ sunshine <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ wind_gust_dir <ord> W, WNW, WSW, NE, W, WNW, W, W, NNW, W, N, NNE, W, SW, …
## $ wind_gust_speed <dbl> 44, 44, 46, 24, 41, 56, 50, 35, 80, 28, 30, 31, 61, 44…
## $ wind_dir_9am <ord> W, NNW, W, SE, ENE, W, SW, SSE, SE, S, SSE, NE, NNW, W…
## $ wind_dir_3pm <ord> WNW, WSW, WSW, E, NW, W, W, W, NW, SSE, ESE, ENE, NNW,…
## $ wind_speed_9am <dbl> 20, 4, 19, 11, 7, 19, 20, 6, 7, 15, 17, 15, 28, 24, 4,…
## $ wind_speed_3pm <dbl> 24, 22, 26, 9, 20, 24, 24, 17, 28, 11, 6, 13, 28, 20, …
## $ humidity_9am <int> 71, 44, 38, 45, 82, 55, 49, 48, 42, 58, 48, 89, 76, 65…
## $ humidity_3pm <int> 22, 25, 30, 16, 33, 23, 19, 19, 9, 27, 22, 91, 93, 43,…
## $ pressure_9am <dbl> 1007.7, 1010.6, 1007.6, 1017.6, 1010.8, 1009.2, 1009.6…
## $ pressure_3pm <dbl> 1007.1, 1007.8, 1008.7, 1012.8, 1006.0, 1005.4, 1008.2…
## $ cloud_9am <int> 8, NA, NA, NA, 7, NA, 1, NA, NA, NA, NA, 8, 8, NA, NA,…
## $ cloud_3pm <int> NA, NA, 2, NA, 8, NA, NA, NA, NA, NA, NA, 8, 8, 7, NA,…
## $ temp_9am <dbl> 16.9, 17.2, 21.0, 18.1, 17.8, 20.6, 18.1, 16.3, 18.3, …
## $ temp_3pm <dbl> 21.8, 24.3, 23.2, 26.5, 29.7, 28.9, 24.6, 25.5, 30.2, …
## $ rain_today <fct> No, No, No, No, No, No, No, No, No, Yes, No, Yes, Yes,…
## $ risk_mm <dbl> 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0, 2.2,…
## $ rain_tomorrow <fct> No, No, No, No, No, No, No, No, Yes, No, Yes, Yes, Yes…
Observe the variety of data types here, ranging from Date (date), through character (chr) and numeric (dbl), to integer (int), factor (fct) and ordered factor (ord). The data mostly looks as expected, though it is odd that evaporation and sunshine appear to be all missing, at least in the first 10 or so observations. We begin to question other aspects of the data too. For example, is date an ongoing sequence of days, as it appears to be here? Does location have values other than Albury? What is the distribution of the different variables?
These are all questions we will start asking ourselves in the context of "living and breathing" our data. Our aim should be to glean all we can about the data that we are dealing with. Data science is very much about understanding, not blindly processing. The excitement is in the discovery of patterns in the data and the narrative the data is seeking to tell.
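The questions raised above can each be checked with a line or two of R. What follows is only a minimal sketch: the tibble toy is a hypothetical stand-in for the weather dataset, built by hand with a deliberate gap in the dates and an all-missing evaporation column, and the object names date_gaps, loc_counts and missing_share are chosen purely for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the weather dataset (an assumption, not
# the real data): one date is skipped for Albury and evaporation is
# entirely missing.
toy <- tibble(
  date        = as.Date(c("2008-12-01", "2008-12-02", "2008-12-04",
                          "2008-12-01", "2008-12-02")),
  location    = c("Albury", "Albury", "Albury", "Sydney", "Sydney"),
  min_temp    = c(13.4, 7.4, 9.2, 19.5, 19.7),
  evaporation = c(NA_real_, NA, NA, NA, NA)
)

# Is date an unbroken sequence of days? Within each location, the
# largest difference between consecutive sorted dates should be 1.
date_gaps <- toy %>%
  group_by(location) %>%
  summarise(max_gap_days = max(as.integer(diff(sort(date)))),
            .groups = "drop")
print(date_gaps)

# Does location have values other than Albury?
loc_counts <- count(toy, location)
print(loc_counts)

# What is the distribution of a numeric variable?
print(summary(toy$min_temp))

# Which variables are largely or wholly missing? This gives the
# proportion of NA values in each column.
missing_share <- colMeans(is.na(toy))
print(missing_share)
```

Applied to the full dataset in place of toy, a max_gap_days above 1 would flag a location with skipped days, and a missing_share of 1 would confirm that a column holds no observations at all.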
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0