21.22 Removing Sparse Terms
We are often not interested in terms that appear in only a few of our documents. Such "sparse" terms can be removed from the document term matrix quite easily using tm::removeSparseTerms():
## [1] 46 6508
## [1] 46 6
This has removed most terms: only 6 of the original 6508 remain.
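As a minimal sketch of the step above, the following builds a small document term matrix and prunes it with tm::removeSparseTerms(). The toy corpus, the variable names (docs, dtm, dtms) and the threshold of 0.6 are assumptions chosen for illustration, not the values used to produce the output shown in this section. The sparse argument is the maximum proportion of documents a term may be absent from: a term is kept only if it occurs in more than (1 - sparse) of the documents.

```r
library(tm)

# Hypothetical toy corpus of four short documents (not the book's corpus).
docs <- c("data mining with data",
          "data science time",
          "data use rare",
          "data time use")

# Build a document term matrix with tm's default settings.
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
dim(dtm)

# Keep only terms found in more than 40% of the documents (sparse = 0.6).
# The threshold is an assumption for this toy example.
dtms <- removeSparseTerms(dtm, 0.6)
dim(dtms)

# List the terms that survived.
Terms(dtms)
```

A threshold closer to 0 is stricter and keeps only terms found in nearly every document, while a threshold closer to 1 removes only the very rarest terms.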
## <<DocumentTermMatrix (documents: 46, terms: 6)>>
## Non-/sparse entries: 257/19
## Sparsity           : 7%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##                               Terms
## Docs                           data graham inform time use william
##   adm02.txt                     189      1     12   45  31       1
##   his03.txt                     176     17      1    7   7      25
##   ijdwm2012.txt                 110      2     65    4  65       2
##   karisk.txt                     83      1     24    6  55       3
##   kddmodel.txt                  127      1     23    1  44       3
##   pakdd07.txt                   161      1      6    2  26       3
##   RJournal_2009-2_Williams.txt  122      2      8    9  62      10
##   seal98.txt                    171      2     18   13  31       2
##   story.txt                     205     11      4   18  38      35
##   templates.txt                  67      1     55   25  34       6
We can see the effect by looking at the terms we have left:
##    data  graham  inform    time     use william
##    3101     108     467     483    1366     236

## freq
##  108  236  467  483 1366 3101
##    1    1    1    1    1    1
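Term frequencies like those above can be computed by summing the columns of the reduced matrix. This is a sketch under assumptions: the toy corpus, the threshold of 0.6 and the names dtms and freq are illustrative and are not taken from the original code.

```r
library(tm)

# Hypothetical toy corpus (not the book's corpus).
docs <- c("data mining with data",
          "data science time",
          "data use rare",
          "data time use")

dtm  <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
dtms <- removeSparseTerms(dtm, 0.6)  # threshold is an assumption

# Total count of each remaining term across all documents.
freq <- colSums(as.matrix(dtms))
freq

# How many terms share each total count.
table(freq)
```

Converting to a dense matrix with as.matrix() is fine for a handful of terms. For a large, unpruned matrix it can use a lot of memory, which is one practical reason to remove sparse terms first.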
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0