21.18 Creating a Document Term Matrix

A document term matrix is a matrix with documents as the rows and terms as the columns. Each cell holds the number of times that term occurs in that document. We use tm::DocumentTermMatrix() to create the matrix:

dtm <- DocumentTermMatrix(docs)

dtm
## <<DocumentTermMatrix (documents: 46, terms: 6508)>>
## Non-/sparse entries: 30059/269309
## Sparsity           : 90%
## Maximal term length: 56
## Weighting          : term frequency (tf)
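To make the layout concrete, here is a minimal sketch on a hypothetical three-document corpus. The name toy and its sentences are assumptions made for illustration only and are not the docs corpus used in this chapter. Every word is at least three letters long because tm drops shorter words by default.

```r
library(tm)

# Hypothetical three-document corpus, used only for illustration.
toy <- VCorpus(VectorSource(c(
  "data mining finds patterns inside data",
  "text mining turns text into data",
  "patients follow one clinical path"
)))

# Rows are documents, columns are terms, cells are term counts.
toy_dtm <- DocumentTermMatrix(toy)

# The matrix is tiny, so converting it to a dense matrix is safe here.
toy_counts <- as.matrix(toy_dtm)
print(toy_counts)

# Counts of the term "data" in the first two documents.
toy_counts[1:2, "data"]
```

Converting to a dense matrix with as.matrix() is only sensible for small examples like this one. For a corpus the size of docs it is usually better to keep the sparse object.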

We can inspect the document term matrix using tm::inspect(). Here, to avoid too much output, we inspect only a subset of the matrix: the first five documents and the terms in columns 1000 to 1005.

inspect(dtm[1:5, 1000:1005])
## <<DocumentTermMatrix (documents: 5, terms: 6)>>
## Non-/sparse entries: 9/21
## Sparsity           : 70%
## Maximal term length: 8
## Weighting          : term frequency (tf)
## Sample             :
##             Terms
## Docs         path patholog patient pbs per perhap
##   acnn96.txt    0        0       0   0   0      0
##   adm02.txt     0        0       0   0   0      0
##   ai02.txt      4        6       9   4   4      1
##   ai03.txt      0        0      11  11   0      0
##   ai97.txt      0        0       3   0   0      0
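Column positions such as 1000:1005 depend on the vocabulary, so it can help to look up a position by term. The following is a sketch only, again using a hypothetical toy corpus (toy, toy_dtm and pos are illustrative names, not objects from this chapter). It uses tm::Terms() to find the column of a chosen term and inspects the columns that follow it.

```r
library(tm)

# Hypothetical three-document corpus, used only for illustration.
toy <- VCorpus(VectorSource(c(
  "data mining finds patterns inside data",
  "text mining turns text into data",
  "patients follow one clinical path"
)))
toy_dtm <- DocumentTermMatrix(toy)

# Terms() returns the column labels, so match() gives a column position.
pos <- match("path", Terms(toy_dtm))

# Subsetting keeps the sparse class, so inspect() works on the slice.
toy_slice <- toy_dtm[, pos:(pos + 2)]
inspect(toy_slice)
```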

The document term matrix is quite sparse (that is, mostly zeros): only 30059 of its 299368 cells are non-zero. It is therefore stored internally in a much more compact representation. We can still get the row and column counts.

class(dtm)
## [1] "DocumentTermMatrix"    "simple_triplet_matrix"
dim(dtm)
## [1]   46 6508
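The reported sparsity can be checked by hand, and the compact representation can be examined directly. The sketch below first redoes the arithmetic with the figures printed for dtm above, then looks inside a hypothetical toy matrix (toy and toy_dtm are illustrative names). It assumes the simple_triplet_matrix layout from the slam package, a list whose components i, j and v hold the row index, column index and value of each non-zero cell.

```r
library(tm)

# Figures reported for dtm in the output above.
nonsparse <- 30059
sparse    <- 269309

# Together they account for every cell of the 46 x 6508 matrix.
total_cells <- nonsparse + sparse
stopifnot(total_cells == 46 * 6508)

# Percentage of empty cells, rounded as in the printed summary.
sparsity_pct <- round(100 * sparse / total_cells)
sparsity_pct

# Hypothetical three-document corpus, used only for illustration.
toy <- VCorpus(VectorSource(c(
  "data mining finds patterns inside data",
  "text mining turns text into data",
  "patients follow one clinical path"
)))
toy_dtm <- DocumentTermMatrix(toy)

# Only the non-zero cells are stored: row index, column index, value.
str(unclass(toy_dtm)[c("i", "j", "v", "nrow", "ncol")])

# Number of stored cells versus the number of cells in the full matrix.
stored_cells <- length(toy_dtm$v)
full_cells   <- prod(dim(toy_dtm))
c(stored = stored_cells, full = full_cells)
```

Storing three short vectors instead of the full grid is what keeps a matrix with roughly 300000 cells, about 90% of them empty, manageable.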

The transpose, with terms as the rows and documents as the columns, is created using tm::TermDocumentMatrix():

tdm <- TermDocumentMatrix(docs)
tdm
## <<TermDocumentMatrix (terms: 6508, documents: 46)>>
## Non-/sparse entries: 30059/269309
## Sparsity           : 90%
## Maximal term length: 56
## Weighting          : term frequency (tf)
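The relationship between the two matrices can be checked on a small example. This sketch assumes a hypothetical toy corpus (toy, toy_dtm and toy_tdm are illustrative names) and uses the t() method that tm provides for these classes.

```r
library(tm)

# Hypothetical three-document corpus, used only for illustration.
toy <- VCorpus(VectorSource(c(
  "data mining finds patterns inside data",
  "text mining turns text into data",
  "patients follow one clinical path"
)))

toy_dtm <- DocumentTermMatrix(toy)
toy_tdm <- TermDocumentMatrix(toy)

# The dimensions are swapped between the two forms.
dim(toy_dtm)
dim(toy_tdm)

# Transposing the document term matrix gives a term document matrix
# holding the same counts (compared here as dense matrices).
toy_flipped <- t(toy_dtm)
same_counts <- all.equal(unname(as.matrix(toy_flipped)),
                         unname(as.matrix(toy_tdm)))
same_counts
```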

We will use the document term matrix for the remainder of the chapter.
