21.22 Removing Sparse Terms

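We are often not interested in infrequent terms in our documents. Such "sparse" terms can be removed from the document term matrix quite easily using tm::removeSparseTerms(). The second argument is the maximum sparsity allowed: with 0.1, only terms that are missing from fewer than 10% of the documents are kept: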

dim(dtm)

## [1]   46 6508

dtms <- removeSparseTerms(dtm, 0.1)
dim(dtms)

## [1] 46  6

This has removed most terms: only 6 of the original 6508 remain.
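
The rule behind the threshold can be sketched in base R. This is a minimal illustration on a made-up count matrix, not the tm implementation: the matrix m and the helper keep_dense_terms() are hypothetical names introduced here, and the sketch assumes a term's sparsity is the fraction of documents in which it does not occur.

```r
# Toy document-term counts (hypothetical data, not the corpus used in this chapter).
m <- matrix(c(3, 0, 1,
              2, 0, 4,
              5, 1, 0,
              1, 0, 2,
              4, 0, 1),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("doc", 1:5), c("data", "rare", "use")))

# Sparsity of each term: the fraction of documents with a zero count.
sparsity <- colMeans(m == 0)

# Hypothetical helper: keep only terms whose sparsity is strictly below `sparse`.
keep_dense_terms <- function(m, sparse) {
  m[, colMeans(m == 0) < sparse, drop = FALSE]
}

dense <- keep_dense_terms(m, 0.25)
colnames(dense)
```

Here "rare" is absent from four of the five documents, so it is dropped at a threshold of 0.25, while "data" and "use" are kept.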

inspect(dtms)

## <<DocumentTermMatrix (documents: 46, terms: 6)>>
## Non-/sparse entries: 257/19
## Sparsity           : 7%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##                               Terms
## Docs                           data graham inform time use william
##   adm02.txt                     189      1     12   45  31       1
##   his03.txt                     176     17      1    7   7      25
##   ijdwm2012.txt                 110      2     65    4  65       2
##   karisk.txt                     83      1     24    6  55       3
##   kddmodel.txt                  127      1     23    1  44       3
##   pakdd07.txt                   161      1      6    2  26       3
##   RJournal_2009-2_Williams.txt  122      2      8    9  62      10
##   seal98.txt                    171      2     18   13  31       2
##   story.txt                     205     11      4   18  38      35
##   templates.txt                  67      1     55   25  34       6

We can see the effect by looking at the terms we have left:

freq <- colSums(as.matrix(dtms))
freq

##    data  graham  inform    time     use william 
##    3101     108     467     483    1366     236

table(freq)

## freq
##  108  236  467  483 1366 3101 
##    1    1    1    1    1    1
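
To get a feel for how the choice of threshold changes the result, it can help to count the remaining terms over a range of values. The sketch below assumes the tm package is installed and uses a small made-up corpus (texts and toy_dtm are hypothetical names) rather than the documents used in this chapter.

```r
library(tm)

# Hypothetical toy corpus standing in for a real collection of documents.
texts <- c("data mining uses data",
           "data science and time series",
           "time to use data",
           "graham writes about data")
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Count the terms that survive at each sparsity threshold.
thresholds <- c(0.1, 0.3, 0.6, 0.9)
kept <- sapply(thresholds, function(s) nTerms(removeSparseTerms(toy_dtm, s)))
names(kept) <- thresholds
kept
```

A larger threshold tolerates sparser terms, so the counts never decrease as the threshold grows.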

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0