21.3 Corpus Sources and Readers

A variety of document sources are supported by tm. We can use tm::getSources() to list them.

getSources()
## [1] "DataframeSource" "DirSource"       "URISource"       "VectorSource"   
## [5] "XMLSource"       "ZipSource"
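As a minimal sketch of how a source is used, the following builds a small corpus from an in-memory character vector with VectorSource. The example text and the variable names docs and corpus are made up for illustration, and the tm package is assumed to be installed.

```r
library(tm)

# Two short in-memory documents (hypothetical example text).
docs <- c("Text mining with R.",
          "Sources tell tm where the documents live.")

# VectorSource treats each element of the character vector as one document.
corpus <- VCorpus(VectorSource(docs))

length(corpus)          # number of documents in the corpus
content(corpus[[1]])    # text of the first document
```

The other sources are used in the same way: the source object describes where the documents are, and the corpus constructor pulls them in.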

In addition to coming from different kinds of sources, our documents for text analysis will come in many different formats. A variety of formats are supported by tm, each through its own reader function:

getReaders()
##  [1] "readDataframe"           "readDOC"                
##  [3] "readPDF"                 "readPlain"              
##  [5] "readRCV1"                "readRCV1asPlain"        
##  [7] "readReut21578XML"        "readReut21578XMLasPlain"
##  [9] "readTagged"              "readXML"
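A reader is normally handed to the corpus constructor through its readerControl argument. The sketch below assumes a temporary directory holding two plain text files. The directory name, file names, and file contents are invented for illustration. DirSource locates the files and readPlain reads each one.

```r
library(tm)

# Write two small text files to a temporary directory (hypothetical data).
dir <- file.path(tempdir(), "tm_reader_demo")
dir.create(dir, showWarnings = FALSE)
writeLines("First plain text document.", file.path(dir, "a.txt"))
writeLines("Second plain text document.", file.path(dir, "b.txt"))

# DirSource lists the matching files; readPlain turns each file
# into a plain text document.
corpus <- VCorpus(DirSource(dir, pattern = "\\.txt$"),
                  readerControl = list(reader = readPlain, language = "en"))

length(corpus)          # number of documents read
content(corpus[[1]])    # lines of the first document
```

For other formats the reader would be swapped, for example readPDF or readDOC. Those readers rely on external tools being installed, so they are not exercised here.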


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0