28 Literate Data Science

20200602

A data scientist aims to tell stories supported by data. The narratives we build form the key deliverables and must be well supported by the data.

In telling the narrative the analysis needs to be transparent, repeatable, and reproducible. We also capture and share our activities for quality assurance and for peer review. We will find ourselves repeating our work on other datasets in other scenarios and with other organisations. Documenting what we do helps when we come back to the code at a later time. Others will also want to reproduce our work and we should do all we can to facilitate that process. In short, we need to clearly communicate what we do so that we and others can understand and can continue the journey.

A general rule of thumb is that we should spend about a quarter of our time documenting our projects, capturing what we have done. Even more important is to capture this as we do the work, rather than leaving it as a chore to write up later. This does present an overhead and risks interrupting the flow of our work, but the investment pays off in the longer term. Tools can support the capture of our work with minimal interruption to our workflow.

To support the narrative and to encourage our efforts to be transparent, repeatable and reproducible we introduce the concept of literate programming (Knuth 1984). The concept is to intermix our narrative with the underlying analyses of the data (our code) within the one document. By introducing the concept here we aim to provide a solid foundation for the data scientist. We won’t always have the time or the patience to deliver a carefully crafted narrative telling the story derived from the data but we should strive to do so.

We will use knitr (Xie 2025) to support literate data science. This package combines the document typesetting power of the free and open source LaTeX software with the statistical power of R. Literate data science is also well supported by RStudio which is able to process the source document into a beautifully formatted PDF. This book is itself produced using knitr (Xie 2025).

In addition to these packages we also need to install the LaTeX software. LaTeX is a typesetting markup language which, combined with knitr (Xie 2025), allows us to intermix R code with our narrative and to program certain parts of the narrative using R. LaTeX is free and open source software, and installation instructions are available from the LaTeX Project.
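To make the idea of intermixing narrative and code concrete, here is a minimal sketch, assuming the knitr package is installed. The file name example.Rnw, the chunk label summary, and the use of R's built-in cars dataset are illustrative choices only and do not come from this book's own sources. The R code writes a small LaTeX document containing one R code chunk and one inline R expression, then asks knitr to run the R code and weave the results into a .tex file.

```r
library(knitr)

# A tiny literate document: LaTeX narrative, one R chunk (between
# <<...>>= and @), and one inline R expression via \Sexpr{}.
# The content is illustrative only.
rnw <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The built-in cars dataset has \\Sexpr{nrow(cars)} observations.",
  "<<summary, echo=TRUE>>=",
  "summary(cars$speed)",
  "@",
  "\\end{document}"
)

# Write the source document to a temporary location.
src <- file.path(tempdir(), "example.Rnw")
writeLines(rnw, src)

# knit() evaluates the R code and writes a LaTeX file with the
# results embedded; it returns the path of the output file.
tex <- knit(src, output = file.path(tempdir(), "example.tex"), quiet = TRUE)

# With a LaTeX installation available, knit2pdf(src) would go one
# step further and compile the result to PDF.
```

In the generated .tex file the inline expression is replaced by its value, and the chunk is replaced by the typeset code and its output, so the narrative always reflects the data as analysed.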

See the GNU/Linux Desktop Survival Guide for using markdown instead of LaTeX.

References

Knuth, Donald E. 1984. “Literate Programming.” The Computer Journal (British Computer Society) 27 (2): 97–111. http://www.literateprogramming.com/knuthweb.pdf.

Xie, Yihui. 2025. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0