21.14 Remove Own Stop Words

docs <- tm_map(docs, removeWords, c("department", "email"))

viewDocs(docs, 16)

## hybrid weighted random forests 
## classifying  highdimensional data
## baoxun xu  joshua zhexue huang  graham williams 
## yunming ye
## 
## 
##   computer science harbin institute  technology shenzhen graduate
## school shenzhen  china
## 
## shenzhen institutes  advanced technology chinese academy  sciences shenzhen
##  china
##  amusing gmailcom
## random forests   popular classification method based   ensemble  
## single type  decision trees  subspaces  data   literature 
##  many different types  decision tree algorithms including c cart 
## chaid  type  decision tree algorithm may capture different information
##  structure  paper proposes  hybrid weighted random forest algorithm
## simultaneously using  feature weighting method   hybrid forest method 
## classify  high dimensional data  hybrid weighted random forest algorithm
## can effectively reduce subspace size  improve classification performance
## without increasing  error bound  conduct  series  experiments  eight
## high dimensional datasets  compare  method  traditional random forest
## methods   classification methods  results show   method
## consistently outperforms  traditional methods
## keywords random forests hybrid weighted random forest classification decision tree
## 
....

Previously we used the English stopwords provided by tm (Feinerer and Hornik 2025). We could instead or in addition remove our own stop words as we have done above. We have chosen here two words, simply for illustration. The choice might depend on the domain of discourse, and might not become apparent until we’ve done some analysis.

References

Feinerer, Ingo, and Kurt Hornik. 2025. Tm: Text Mining Package. https://tm.r-forge.r-project.org/.

Your donation will support ongoing availability and give you access to the PDF version of this book. Desktop Survival Guides include Data Science, GNU/Linux, and MLHub. Books available on Amazon include Data Mining with Rattle and Essentials of Data Science. Popular open source software includes rattle, wajig, and mlhub. Hosted by Togaware, a pioneer of free and open source software since 1984. Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0