21.4 Text Documents

We begin by loading a sample corpus of text documents. Our corpus is a collection of research papers, all stored in the folder identified below. To work along with us in this module, create your own folder called corpus/txt and place a collection of text documents in it. You do not need as many documents as we use here, but a reasonable number makes the exercise more interesting.

cname <- file.path(".", "corpus", "txt")
cname
## [1] "./corpus/txt"
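If you do not yet have a corpus/txt folder, the following is a minimal sketch of one way to create it and add a few documents from within R. The file names and their text are hypothetical placeholders, not the research papers used in this module, and the sketch writes under a temporary directory (demo_cname) so that it cannot overwrite existing files.

```r
# Build the folder under a throwaway location. To follow the text exactly,
# use file.path(".", "corpus", "txt") instead of demo_cname.
base <- tempfile("corpus-demo-")
demo_cname <- file.path(base, "corpus", "txt")
dir.create(demo_cname, recursive = TRUE)

# Hypothetical sample documents: file name -> text.
samples <- c(
  "first.txt"  = "Data mining finds patterns in large collections of data.",
  "second.txt" = "Text mining applies similar ideas to written documents.",
  "third.txt"  = "A corpus is a collection of documents analysed together."
)

# Write each sample to its own plain text file.
for (f in names(samples)) {
  writeLines(samples[[f]], file.path(demo_cname, f))
}

# List the files that were created.
dir(demo_cname)
```

In practice you would copy your own plain text files into the folder rather than generate them.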

We can count the files in the folder and list their names.

length(dir(cname))
## [1] 46
dir(cname)
##  [1] "acnn96.txt"                          "adm02.txt"                          
##  [3] "ai02.txt"                            "ai03.txt"                           
##  [5] "ai97.txt"                            "atobmars.txt"                       
##  [7] "ausdm07.txt"                         "ctac99.txt"                         
##  [9] "dawak02.txt"                         "dawak02w.txt"                       
## [11] "dmkd03.txt"                          "eJHI06.txt"                         
## [13] "gjwthesis.txt"                       "hdm05.txt"                          
## [15] "his03.txt"                           "hwrf12.txt"                         
## [17] "icdm02.txt"                          "icdm08.txt"                         
## [19] "ijdwm2012.txt"                       "karisk.txt"                         
## [21] "kdd00.txt"                           "kdd05.txt"                          
## [23] "kddmodel.txt"                        "kddrisk.txt"                        
## [25] "kes05.txt"                           "kes05full.txt"                      
## [27] "lca04.txt"                           "medinfo04.txt"                      
## [29] "miningmodels.txt"                    "pakdd01.txt"                        
## [31] "pakdd01w.txt"                        "pakdd03.txt"                        
## [33] "pakdd04.txt"                         "pakdd07.txt"                        
## [35] "pakdd99.txt"                         "pepnet.txt"                         
## [37] "rfpakdd12.txt"                       "RJournal_2009-1_Guazzelli+et+al.txt"
## [39] "RJournal_2009-2_Williams.txt"        "seal98.txt"                         
## [41] "story.txt"                           "templates.txt"                      
## [43] "thim.txt"                            "titb08.txt"                         
## [45] "tkde05.txt"                          "tr02102.txt"

There are 46 documents in this particular corpus.
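A folder often holds files other than the documents of interest. As a sketch, base::list.files() accepts a regular expression through its pattern argument, so the listing can be restricted to names ending in .txt. The folder and file names below (demo_dir, paper.txt, README.md) are hypothetical and exist only to make the example self-contained.

```r
# Hypothetical folder holding one text document and one unrelated file.
demo_dir <- tempfile("listing-demo-")
dir.create(demo_dir)
writeLines("Some document text.", file.path(demo_dir, "paper.txt"))
writeLines("Notes about the folder.", file.path(demo_dir, "README.md"))

# Keep only file names that end in .txt (case-insensitive).
txt_files <- list.files(demo_dir, pattern = "\\.txt$", ignore.case = TRUE)

length(txt_files)      # number of matching files
head(txt_files, 10)    # first few names only, useful for large folders
```

The same pattern argument is also accepted by tm::DirSource(), so the filter can be applied when the corpus itself is built.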

With the tm (Feinerer and Hornik 2024) package loaded from the R library, we are ready to build the corpus. We use tm::DirSource() to identify the directory as the source of the files making up the corpus. The source object is passed on to tm::Corpus(), which loads the documents. We save the resulting collection of documents in memory, in a variable called docs.

library(tm)
docs <- Corpus(DirSource(cname))
docs
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 46
class(docs)
## [1] "SimpleCorpus" "Corpus"
class(docs[[1]])
## [1] "PlainTextDocument" "TextDocument"
summary(docs)
##                                     Length Class             Mode
## acnn96.txt                          2      PlainTextDocument list
## adm02.txt                           2      PlainTextDocument list
## ai02.txt                            2      PlainTextDocument list
## ai03.txt                            2      PlainTextDocument list
## ai97.txt                            2      PlainTextDocument list
## atobmars.txt                        2      PlainTextDocument list
## ausdm07.txt                         2      PlainTextDocument list
## ctac99.txt                          2      PlainTextDocument list
## dawak02.txt                         2      PlainTextDocument list
## dawak02w.txt                        2      PlainTextDocument list
## dmkd03.txt                          2      PlainTextDocument list
## eJHI06.txt                          2      PlainTextDocument list
## gjwthesis.txt                       2      PlainTextDocument list
## hdm05.txt                           2      PlainTextDocument list
## his03.txt                           2      PlainTextDocument list
## hwrf12.txt                          2      PlainTextDocument list
## icdm02.txt                          2      PlainTextDocument list
## icdm08.txt                          2      PlainTextDocument list
## ijdwm2012.txt                       2      PlainTextDocument list
## karisk.txt                          2      PlainTextDocument list
## kdd00.txt                           2      PlainTextDocument list
## kdd05.txt                           2      PlainTextDocument list
## kddmodel.txt                        2      PlainTextDocument list
## kddrisk.txt                         2      PlainTextDocument list
## kes05.txt                           2      PlainTextDocument list
## kes05full.txt                       2      PlainTextDocument list
## lca04.txt                           2      PlainTextDocument list
## medinfo04.txt                       2      PlainTextDocument list
## miningmodels.txt                    2      PlainTextDocument list
## pakdd01.txt                         2      PlainTextDocument list
## pakdd01w.txt                        2      PlainTextDocument list
## pakdd03.txt                         2      PlainTextDocument list
## pakdd04.txt                         2      PlainTextDocument list
## pakdd07.txt                         2      PlainTextDocument list
## pakdd99.txt                         2      PlainTextDocument list
## pepnet.txt                          2      PlainTextDocument list
## rfpakdd12.txt                       2      PlainTextDocument list
## RJournal_2009-1_Guazzelli+et+al.txt 2      PlainTextDocument list
## RJournal_2009-2_Williams.txt        2      PlainTextDocument list
## seal98.txt                          2      PlainTextDocument list
## story.txt                           2      PlainTextDocument list
## templates.txt                       2      PlainTextDocument list
## thim.txt                            2      PlainTextDocument list
## titb08.txt                          2      PlainTextDocument list
## tkde05.txt                          2      PlainTextDocument list
## tr02102.txt                         2      PlainTextDocument list
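The summary reports a length of 2 for every document. A reasonable reading is that each PlainTextDocument is a list with two components, the text content and its metadata. The sketch below checks this on a small, hypothetical two-document corpus built in a temporary folder, so that it runs without the 46 papers used above. The names tmp, small and doc are introduced here for illustration only.

```r
library(tm)

# Hypothetical two-document corpus in a fresh temporary folder.
tmp <- file.path(tempfile("tm-demo-"), "corpus", "txt")
dir.create(tmp, recursive = TRUE)
writeLines(c("Data mining finds patterns.", "It scales to large data."),
           file.path(tmp, "a.txt"))
writeLines("Text mining works on documents.", file.path(tmp, "b.txt"))

# Same construction as in the text, restricted to .txt files.
small <- Corpus(DirSource(tmp, pattern = "\\.txt$"))
length(small)            # number of documents in the corpus

# Extract the first document and look inside it.
doc <- small[[1]]
names(unclass(doc))      # the components of the underlying list
content(doc)             # the text of the document
meta(doc, "id")          # the identifier, taken from the file name
meta(doc)                # all of the document-level metadata
```

The accessors tm::content() and tm::meta() are the intended way to reach these components, rather than indexing into the list directly.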

References

Feinerer, Ingo, and Kurt Hornik. 2024. tm: Text Mining Package. https://tm.r-forge.r-project.org/.


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0