20.39 Review Preparing the Corpus
The code for preparing a corpus in a text mining project is collected here in one sequence. Not every step is needed for every project, so pick and choose as appropriate to your situation.
library(tm)

# Locate and load the Corpus.

cname <- file.path(".", "corpus", "txt")
docs <- Corpus(DirSource(cname))

docs
summary(docs)
inspect(docs)

# Transforms

# Replace "/", "@" and "|" with a space.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/|@|\\|")

docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("own", "stop", "words"))
docs <- tm_map(docs, stripWhitespace)

# Replace one specific string with another.
toString <- content_transformer(function(x, from, to) gsub(from, to, x))
docs <- tm_map(docs, toString, "specific transform", "ST")
docs <- tm_map(docs, toString, "other specific transform", "OST")

# Stemming uses the SnowballC package, which must be installed.
docs <- tm_map(docs, stemDocument)
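The two custom transforms are thin wrappers around gsub(), so their effect can be checked on plain strings before running them over a whole corpus. The following is a minimal sketch in base R only: the helper names to_space and to_string and the sample strings are hypothetical, introduced here for illustration, and stand in for the functions passed to content_transformer() above.

```r
# Plain functions equivalent to those wrapped by content_transformer().
# The names to_space and to_string are hypothetical, for illustration only.
to_space  <- function(x, pattern) gsub(pattern, " ", x)
to_string <- function(x, from, to) gsub(from, to, x)

# Hypothetical sample strings, not taken from any real corpus.
x <- "support@example.com/docs|faq"
y <- "the other specific transform"

# "/|@|\\|" matches a slash, an at sign, or a literal vertical bar.
spaced <- to_space(x, "/|@|\\|")
print(spaced)    # "support example.com docs faq"

# Same order as the sequence above: the shorter phrase is replaced first,
# so the longer phrase no longer occurs when the second call runs.
in_order <- to_string(to_string(y, "specific transform", "ST"),
                      "other specific transform", "OST")
print(in_order)  # "the other ST"

# Reversed order: the longer phrase is replaced first.
reversed <- to_string(to_string(y, "other specific transform", "OST"),
                      "specific transform", "ST")
print(reversed)  # "the OST"
```

With these placeholder strings the order of the two replacements matters, because one phrase contains the other. When adapting the sequence to real phrases that overlap in this way, it is probably safer to apply the longer replacement first.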
Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0.