8 Data Template

The business understanding phase of a data science project aims to understand the business problem and to then liaise with the business data technicians to identify the data available. This is followed by the data understanding phase where we work with the business data technicians to access and ingest the data into R. We are then in a position to initiate our journey of discovery driven by the data. By living and breathing the data in the context of the business problem we gain our bearings and feed our intuitions as we journey.

In this chapter we present the common series of steps that initialise the data phase of data science—the data setup. Through this chapter we extract the basic shape and characteristics of the dataset. We prepare the dataset for exploration and wrangling. At the end of this chapter we will have a template for the repeatable end-to-end processing of the data. As you become proficient with R and data science you will develop your own habits and idiosyncrasies which you will incorporate into your own template.
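As a minimal sketch of what extracting the basic shape and characteristics of a dataset can look like, the following uses only base R. The built-in iris dataset and the variable name ds are assumptions chosen for illustration; they stand in for whatever data has been ingested.

```r
# Stand-in dataset (assumption): iris ships with base R.
ds <- iris

# Basic shape: number of observations (rows) and variables (columns).
dim(ds)

# Variable names and the class of each variable.
names(ds)
sapply(ds, class)

# Compact overview of the structure, then per-variable summaries.
str(ds)
summary(ds)
```

Run interactively, each call prints its result to the console, which is usually enough to get our bearings before any wrangling begins.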

The template concept, developed extensively by Graham J. Williams (2017), consists of canonical programming code that can be reused with little or no modification on a new dataset. The intention is that to get started with a new dataset only a few lines of code at the top of the template need to be modified. Minimal change, if any, is then required for the remainder of the code. For the software engineer, the concept of a template is a stepping stone toward developing functions in R that are general and reusable. For us though, rather than delving into the intricacies of the R language, we immerse ourselves in using R to achieve our outcomes, learning R as we proceed and moving into more sophisticated software engineering practices.
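The idea of a template can be illustrated with a short sketch in base R. This is not the template developed in this chapter; the variable names (dsname, target, inputs, numc, catc), the helper data_setup(), and the choice of the built-in iris and mtcars datasets are all hypothetical, chosen only to show the pattern of a few dataset-specific lines followed by generic code.

```r
# ---- Dataset-specific settings: the only lines edited for a new dataset ----
dsname <- "iris"      # Name of the dataset (assumption: a built-in data frame).
target <- "Species"   # Variable of interest (hypothetical choice).

# ---- Generic code: intended to run unchanged on a new dataset ----
ds     <- get(dsname)             # Fetch the data frame by its name.
vars   <- names(ds)               # All variable names.
inputs <- setdiff(vars, target)   # Every variable except the target.
nobs   <- nrow(ds)                # Number of observations.

# Separate the numeric inputs from the remaining (categoric) inputs.
numc <- inputs[sapply(ds[inputs], is.numeric)]
catc <- setdiff(inputs, numc)

# ---- Stepping stone: the same steps wrapped as a reusable function ----
data_setup <- function(ds, target) {
  inputs <- setdiff(names(ds), target)
  numc   <- inputs[sapply(ds[inputs], is.numeric)]
  list(nobs      = nrow(ds),
       target    = target,
       inputs    = inputs,
       numeric   = numc,
       categoric = setdiff(inputs, numc))
}

# The function is applied to a different dataset without editing its body.
meta <- data_setup(mtcars, target = "mpg")
```

Moving to a new dataset means changing only the two assignments at the top. Wrapping the generic part in a function, as in the last section of the sketch, removes even that edit by turning the settings into arguments.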

References

Williams, Graham J. 2017. The Essentials of Data Science: Knowledge Discovery Using R. The R Series. CRC Press.


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0