7.3 CSV Data Reading

20211124 The Comma Separated Values (CSV) format remains one of the simplest and most common ways of sharing data. Its simplicity has made it a standard file format for exchanging data between many different applications.

Numerous spreadsheet, database, and analytics applications can export and import csv files, including Rattle, LibreOffice Calc, Gnumeric, MS/Excel, SAS/Enterprise Miner, Teradata, Netezza, and many others.

The downside of the csv format is that the file contains no explicit metadata (i.e., data about the data), such as the data types of the columns. Typically we would like to know whether a column holds numeric or character data. If numeric, are the values dates or dollars? If character, are they categoric (factors)? Without this metadata the R commands for reading the data have to guess, and will sometimes determine the wrong data type for a column. Options are available to reduce this possibility or to provide the metadata.
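As a minimal sketch of how the guessed types can be inspected, the following writes a small, made-up csv file to a temporary location, reads it back, and asks readr what it decided. The file contents and column names are assumptions for illustration only. The guess_max= argument controls how many rows are used for guessing, and raising it is one way to reduce the chance of a wrong guess.

```r
library(readr)        # Read/write delimited data: read_csv(), spec().

# A small hypothetical dataset written to a temporary csv file.
fname <- tempfile(fileext=".csv")
writeLines(c("id,joined,amount",
             "1,2021-11-24,10.50",
             "2,2021-11-25,7.25"),
           fname)

# Read the file, letting read_csv() guess each column's type.
# A larger guess_max= bases the guess on more rows of the file.
ds <- read_csv(fname, guess_max=10000)

# Report the column specification that was guessed.
spec(ds)
```

If a reported type is not the one intended, it can be overridden when reading the file rather than corrected afterwards.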

Reading csv files is straightforward using readr::read_csv().

library(magrittr)     # Pipeline operator: %>%.
library(readr)        # Read/write delimited data: read_csv().

"mydata.csv" %>%
  read_csv() ->
ds

Column types can be specified using col_types=.

"mydata.csv" %>%
  read_csv(col_types="inffcD") ->
ds

The string has one character per column, each of which can be any of c (character), i (integer), n (number), d (double), l (logical), f (factor), D (date), T (date time), t (time), ? (guess), or _ or - (skip the column).
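The compact string depends on the order of the columns. As an alternative sketch, readr's cols() specification names each column explicitly. The file contents and the column names id, gender, and joined below are hypothetical, chosen only to illustrate the approach.

```r
library(readr)        # Read/write delimited data: read_csv(), cols().

# A small hypothetical dataset written to a temporary csv file.
fname <- tempfile(fileext=".csv")
writeLines(c("id,gender,joined",
             "1,female,2021-11-24",
             "2,male,2021-11-25"),
           fname)

# Specify the type of each column by name rather than by position.
ds <- read_csv(fname,
               col_types=cols(id=col_integer(),
                              gender=col_factor(),
                              joined=col_date()))

# Check the resulting class of each column.
sapply(ds, class)
```

For this file the same specification could be written as col_types="ifD", which is shorter but has to be updated whenever the column order changes.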

To name the columns as the data is read use col_names=.

"mydata.csv" %>%
  read_csv(col_names=c("a", "b", "c")) ->
ds

When col_names= is supplied, read_csv() treats the first row of the file as data rather than as a header. If the file does have a header row and the intent is to override its column names (rather than to supply missing ones), be sure to skip that first row using skip=.

"mydata.csv" %>%
  read_csv(skip=1, col_names=c("a", "b", "c")) ->
ds
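The effect of skip= can be seen by reading the same file both ways. This is a small sketch using a made-up file with a header row. The names with_header and no_header are introduced here only for illustration.

```r
library(readr)        # Read/write delimited data: read_csv().

# A hypothetical csv file with a header row and two rows of data.
fname <- tempfile(fileext=".csv")
writeLines(c("x,y,z",
             "1,2,3",
             "4,5,6"),
           fname)

# Without skip= the original header row is read in as a row of data.
with_header <- read_csv(fname, col_names=c("a", "b", "c"))

# With skip=1 the header row is dropped and only the data remains.
no_header <- read_csv(fname, skip=1, col_names=c("a", "b", "c"))

nrow(with_header)     # 3 rows: the old header plus the two data rows.
nrow(no_header)       # 2 rows: just the data.
```

Because the stray header row contains text, leaving it in also forces the columns to be read as character rather than numeric.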


Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0