8.4 apriori itemsets

20220216

Itemsets are a basic concept for association rules. An itemset is a set of items, and a frequent itemset is one whose items occur together in many baskets.

ml itemsets apriori [options] [file.csv]
                   --id=<name>          The id column name.
     -o <file>     --output=<file>      Save itemsets to .csv or .rds file.
     -s <0-1>      --support=<0-1>      Minimum support threshold.

The input file is a two-column CSV file. One column is the basket id and the other is an item in that basket. The item column can have any name. If no data file is supplied, the data is read from stdin, often as part of a pipeline. An example input data file might be:

id,item
u1234567,comp1234
u1234567,comp2345
u1234567,comp3456
u1234567,comp4567
u1234568,comp1234
u1234568,comp4567
...
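
The input layout described above can be sketched in Python. This is an illustration only and not part of the ml command: the function name read_baskets and its id_col parameter are hypothetical, with id_col chosen to mirror the --id option. It assumes exactly two columns and treats whichever column is not the id column as the item column.

```python
import csv
import io
from collections import defaultdict


def read_baskets(stream, id_col="id"):
    """Group items by basket id and return {basket_id: set_of_items}.

    Hypothetical helper: `stream` is any text file object, for example
    an open file or sys.stdin.
    """
    reader = csv.DictReader(stream)
    fields = reader.fieldnames or []
    # The item column may have any name: it is the one that is not the id.
    item_cols = [name for name in fields if name != id_col]
    if id_col not in fields or len(item_cols) != 1:
        raise ValueError(
            "expected two columns, one of them named %r, got %r" % (id_col, fields)
        )
    item_col = item_cols[0]
    baskets = defaultdict(set)
    for row in reader:
        baskets[row[id_col]].add(row[item_col])
    return dict(baskets)


# The six sample rows shown above, without the elided remainder.
sample = io.StringIO(
    "id,item\n"
    "u1234567,comp1234\n"
    "u1234567,comp2345\n"
    "u1234567,comp3456\n"
    "u1234567,comp4567\n"
    "u1234568,comp1234\n"
    "u1234568,comp4567\n"
)
baskets = read_baskets(sample)
for basket_id in sorted(baskets):
    print(basket_id, sorted(baskets[basket_id]))
```

Passing sys.stdin as the stream would correspond to the case where no data file is supplied.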

By default the output is written to stdout, with one row for each itemset that meets the support threshold, giving its frequency count and support:

$ ml itemsets apriori mcomp.csv
pattern,freq,support
comp1234:comp4567,145,0.75
comp2345,123,0.45
...

Use --output= (or -o) to save the output to a named file. A filename ending in .csv saves the itemsets as a CSV file. A filename ending in .rds instead saves the result as a single object in the named file.

$ ml itemsets apriori -o itemsets.csv mcomp.csv
$ ml itemsets apriori -o itemsets.rds mcomp.csv

Output can be restricted to those itemsets whose support is at least a specified threshold, given with --support= (or -s). The default support threshold is 10% (0.1). The support for an itemset is the proportion of baskets that contain all items in the itemset.

$ ml itemsets apriori --support=0.5 mcomp.csv
pattern,freq,support
comp1234:comp4567,145,0.75
...
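
The definition of support, and the way a minimum support threshold prunes the search, can be sketched in Python. This is a minimal level-wise (Apriori-style) illustration under stated assumptions, not the implementation behind ml itemsets apriori: the function names support and frequent_itemsets and the four toy baskets are hypothetical and are not taken from mcomp.csv.

```python
from itertools import combinations


def support(itemset, baskets):
    """Proportion of baskets that contain every item in `itemset`."""
    hits = sum(1 for items in baskets if itemset <= items)
    return hits / len(baskets)


def frequent_itemsets(baskets, min_support=0.1):
    """Return {itemset: freq} for itemsets with support >= min_support."""
    n = len(baskets)
    # Level 1 candidates: every single item seen in any basket.
    candidates = {frozenset([item]) for items in baskets for item in items}
    result = {}
    k = 1
    while candidates:
        # Count how many baskets contain each candidate.
        counts = {c: sum(1 for items in baskets if c <= items) for c in candidates}
        frequent = {c: f for c, f in counts.items() if f / n >= min_support}
        result.update(frequent)
        # Join frequent k-itemsets into (k+1)-itemsets, then keep only those
        # whose every k-item subset is itself frequent (the Apriori pruning).
        k += 1
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
    return result


# Hypothetical toy data: four baskets of course codes.
toy_baskets = [
    {"comp1234", "comp2345", "comp3456", "comp4567"},
    {"comp1234", "comp4567"},
    {"comp1234", "comp2345"},
    {"comp3456"},
]

found = frequent_itemsets(toy_baskets, min_support=0.5)
print("pattern,freq,support")
for itemset in sorted(found, key=lambda s: (len(s), sorted(s))):
    print(":".join(sorted(itemset)), found[itemset], found[itemset] / len(toy_baskets), sep=",")
```

Raising min_support shrinks the result, because an itemset can only be frequent if all of its subsets are frequent.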

By default the basket identifier is expected in a column named id. Any other column can serve as the identifier, for example one named ID:

ID,Course
u1234567,comp1234
u1234568,comp2345
...

To use a column with a different name as the identifier, specify it with --id=:

$ ml itemsets apriori --id=ID mcomp.csv
pattern,freq,support
comp1234:comp4567,145,0.75
...

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0