21.4 Text Documents
We load a sample corpus of text documents. Our corpus consists of a
collection of research papers, all stored in the folder identified
below. To work along with us in this module, create your own folder
called corpus/txt and place a collection of text documents in it. You
do not need as many documents as we use here, but a reasonable number
makes the analysis more interesting.
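A minimal sketch of how the folder path might be constructed is given here. The variable name cname is an assumption chosen for illustration; base::file.path() joins the path components into a single string.

```r
# Build the path to the corpus folder from its components.
# NOTE: the variable name `cname` is an assumption for illustration.
cname <- file.path(".", "corpus", "txt")

# Print the resulting path.
cname
```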
## [1] "./corpus/txt"
We can count the files in the folder and list their names.
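One way to obtain such a count and listing is with base::dir(), an alias for base::list.files(). This is a sketch that assumes the folder path is held in a variable we call cname (a name chosen here for illustration). If the folder does not exist, dir() simply returns an empty character vector.

```r
# Path to the corpus folder; `cname` is an assumed, illustrative name.
cname <- file.path(".", "corpus", "txt")

# Names of the files found in the folder.
fnames <- dir(cname)

# How many documents are there, and what are they called?
length(fnames)
fnames
```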
## [1] 46
## [1] "acnn96.txt" "adm02.txt"
## [3] "ai02.txt" "ai03.txt"
## [5] "ai97.txt" "atobmars.txt"
## [7] "ausdm07.txt" "ctac99.txt"
## [9] "dawak02.txt" "dawak02w.txt"
## [11] "dmkd03.txt" "eJHI06.txt"
## [13] "gjwthesis.txt" "hdm05.txt"
## [15] "his03.txt" "hwrf12.txt"
## [17] "icdm02.txt" "icdm08.txt"
## [19] "ijdwm2012.txt" "karisk.txt"
## [21] "kdd00.txt" "kdd05.txt"
## [23] "kddmodel.txt" "kddrisk.txt"
## [25] "kes05.txt" "kes05full.txt"
## [27] "lca04.txt" "medinfo04.txt"
## [29] "miningmodels.txt" "pakdd01.txt"
## [31] "pakdd01w.txt" "pakdd03.txt"
## [33] "pakdd04.txt" "pakdd07.txt"
## [35] "pakdd99.txt" "pepnet.txt"
## [37] "rfpakdd12.txt" "RJournal_2009-1_Guazzelli+et+al.txt"
## [39] "RJournal_2009-2_Williams.txt" "seal98.txt"
## [41] "story.txt" "templates.txt"
## [43] "thim.txt" "titb08.txt"
## [45] "tkde05.txt" "tr02102.txt"
There are 46 documents in this particular corpus.
After loading the tm (Feinerer and Hornik 2024) package from the R
library into the session, we are ready to load the corpus. We identify
the directory as the source of the files making up the corpus using
tm::DirSource(). The source object is passed on to tm::Corpus(), which
loads the documents. We save the resulting collection of documents in
memory, in a variable called docs.
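The loading step can be sketched as follows, assuming the tm package is installed. So that the sketch runs anywhere, it builds a tiny throwaway corpus in a temporary folder. The folder, the two file names, and their contents are made up for illustration. In practice the path to your own corpus/txt folder would be passed to tm::DirSource() instead.

```r
library(tm)

# Create a small throwaway corpus so the example is self-contained.
# The folder, file names, and contents are hypothetical stand-ins
# for a real corpus/txt folder.
cname <- file.path(tempdir(), "corpus", "txt")
dir.create(cname, recursive = TRUE, showWarnings = FALSE)
writeLines("Data mining discovers patterns in data.",
           file.path(cname, "a.txt"))
writeLines("Text mining applies data mining to text.",
           file.path(cname, "b.txt"))

# Identify the directory as the source, then load the documents.
docs <- Corpus(DirSource(cname))

# Inspect the corpus, its class, the class of one document,
# and a per-document summary.
docs
class(docs)
class(docs[[1]])
summary(docs)
```

Run against the corpus used in this section, commands like these would be expected to report 46 documents rather than 2.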
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 46
## [1] "SimpleCorpus" "Corpus"
## [1] "PlainTextDocument" "TextDocument"
## Length Class Mode
## acnn96.txt 2 PlainTextDocument list
## adm02.txt 2 PlainTextDocument list
## ai02.txt 2 PlainTextDocument list
## ai03.txt 2 PlainTextDocument list
## ai97.txt 2 PlainTextDocument list
## atobmars.txt 2 PlainTextDocument list
## ausdm07.txt 2 PlainTextDocument list
## ctac99.txt 2 PlainTextDocument list
## dawak02.txt 2 PlainTextDocument list
## dawak02w.txt 2 PlainTextDocument list
## dmkd03.txt 2 PlainTextDocument list
## eJHI06.txt 2 PlainTextDocument list
## gjwthesis.txt 2 PlainTextDocument list
## hdm05.txt 2 PlainTextDocument list
## his03.txt 2 PlainTextDocument list
## hwrf12.txt 2 PlainTextDocument list
## icdm02.txt 2 PlainTextDocument list
## icdm08.txt 2 PlainTextDocument list
## ijdwm2012.txt 2 PlainTextDocument list
## karisk.txt 2 PlainTextDocument list
## kdd00.txt 2 PlainTextDocument list
## kdd05.txt 2 PlainTextDocument list
## kddmodel.txt 2 PlainTextDocument list
## kddrisk.txt 2 PlainTextDocument list
## kes05.txt 2 PlainTextDocument list
## kes05full.txt 2 PlainTextDocument list
## lca04.txt 2 PlainTextDocument list
## medinfo04.txt 2 PlainTextDocument list
## miningmodels.txt 2 PlainTextDocument list
## pakdd01.txt 2 PlainTextDocument list
## pakdd01w.txt 2 PlainTextDocument list
## pakdd03.txt 2 PlainTextDocument list
## pakdd04.txt 2 PlainTextDocument list
## pakdd07.txt 2 PlainTextDocument list
## pakdd99.txt 2 PlainTextDocument list
## pepnet.txt 2 PlainTextDocument list
## rfpakdd12.txt 2 PlainTextDocument list
## RJournal_2009-1_Guazzelli+et+al.txt 2 PlainTextDocument list
## RJournal_2009-2_Williams.txt 2 PlainTextDocument list
## seal98.txt 2 PlainTextDocument list
## story.txt 2 PlainTextDocument list
## templates.txt 2 PlainTextDocument list
## thim.txt 2 PlainTextDocument list
## titb08.txt 2 PlainTextDocument list
## tkde05.txt 2 PlainTextDocument list
## tr02102.txt 2 PlainTextDocument list
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0