8.5 apriori train

UNDER DEVELOPMENT 20211217

ml train apriori [options] [file.csv]
     -c <0-1>      --confidence=<0-1>   Minimum confidence threshold.
                   --id=<name>          The id column name.
     -o <file>     --output=<file>      Save itemsets to .csv or .rds file.
     -s <0-1>      --support=<0-1>      Minimum support threshold.

The input is a two-column csv file. One column is the basket id (named with --id) and the other is a single item in that basket, so a basket with several items spans several rows. The item column can have any name. If no data file is supplied, the data is read from stdin, often as part of a pipeline. An example input data file might be:

id,item
u1234567,comp1234
u1234567,comp2345
u1234567,comp3456
u1234567,comp4567
u1234568,comp1234
u1234568,comp4567
...
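
To make the layout concrete, the following Python sketch groups rows of this shape into one set of items per basket. It is an illustration only and not part of mlhub: the function name read_baskets and its id_column parameter are hypothetical, and the sample rows are the ones shown above.

```python
import csv
import io
from collections import defaultdict


def read_baskets(stream, id_column="id"):
    """Group a two-column basket csv into {basket id: set of items}.

    Hypothetical helper for illustration; not part of mlhub.
    """
    reader = csv.DictReader(stream)
    # The item column can have any name: take whichever column is not the id.
    item_column = next(name for name in reader.fieldnames if name != id_column)
    baskets = defaultdict(set)
    for row in reader:
        baskets[row[id_column]].add(row[item_column])
    return dict(baskets)


# The example rows from the text, supplied in memory rather than from a file.
sample = io.StringIO(
    "id,item\n"
    "u1234567,comp1234\n"
    "u1234567,comp2345\n"
    "u1234567,comp3456\n"
    "u1234567,comp4567\n"
    "u1234568,comp1234\n"
    "u1234568,comp4567\n"
)
baskets = read_baskets(sample)
for basket_id, items in baskets.items():
    print(basket_id, sorted(items))
```

Each basket becomes a set, so an item repeated within the same basket is counted once.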

By default the output is written to stdout, with one row for each association rule together with several measures of the quality of the rule:

$ ml train apriori mcomp.csv
rule,support,confidence,coverage,lift,count
comp1234:comp2345=>comp4567,0.6,1.0,0.6,1.4,6
...

The count for a rule A=>B is the number of transactions that contain all of the items in both A and B. The support is the proportion of transactions that contain A and B (i.e., the count over the total number of transactions in the dataset). The confidence is the probability of B being in a transaction whenever A is in the transaction (i.e., the count over the number of transactions that contain A). The coverage is the proportion of transactions that contain A. The lift is the confidence divided by the proportion of transactions that contain B, and measures the correlation between A and B. A lift of 1 means there is no correlation, above 1 is a positive correlation, and below 1 is a negative correlation.
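
The definitions above can be checked with a short Python sketch. It is not how mlhub computes the measures; the function name rule_measures is hypothetical, and the toy transactions are invented so that the results land close to the example row shown earlier (they are not the contents of mcomp.csv).

```python
def rule_measures(transactions, lhs, rhs):
    """Return the measures for the rule lhs => rhs (A => B in the text).

    Hypothetical helper for illustration; transactions is a list of sets.
    """
    lhs, rhs = set(lhs), set(rhs)
    total = len(transactions)
    # Transactions containing all of A and B, all of A, and all of B.
    count = sum(1 for t in transactions if (lhs | rhs) <= t)
    count_lhs = sum(1 for t in transactions if lhs <= t)
    count_rhs = sum(1 for t in transactions if rhs <= t)
    support = count / total
    coverage = count_lhs / total
    # Assumes A and B each occur in at least one transaction.
    confidence = count / count_lhs
    lift = confidence / (count_rhs / total)
    return {
        "support": support,
        "confidence": confidence,
        "coverage": coverage,
        "lift": lift,
        "count": count,
    }


# Invented toy data: 10 transactions, 6 containing A and B, 7 containing B.
transactions = (
    [{"comp1234", "comp2345", "comp4567"}] * 6
    + [{"comp4567"}]
    + [{"comp3456"}] * 3
)
measures = rule_measures(transactions, ["comp1234", "comp2345"], ["comp4567"])
print(measures)
```

With this toy data the count is 6, the support and coverage are both 0.6, the confidence is 1.0, and the lift is 10/7, or 1.4 to one decimal place.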

As with itemsets, the output can be saved to a named csv file using --output= (or -o), with the argument being a filename including the .csv extension. If the filename extension is instead .rds then the result is saved as a single object in the named file.

ml train apriori -o rules.csv mcomp.csv
ml train apriori -o rules.rds mcomp.csv
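
A saved csv file can then be read back for further processing. The Python sketch below assumes the saved file has the same columns as the stdout output shown earlier, that items on either side of a rule are separated by ":" and the two sides by "=>", and that item names contain neither separator. The function name parse_rules is hypothetical, and the row is the example row from the text rather than real results.

```python
import csv
import io


def parse_rules(stream):
    """Parse rows of rule,support,confidence,coverage,lift,count.

    Hypothetical helper for illustration; assumes the layout shown in the text.
    """
    rules = []
    for row in csv.DictReader(stream):
        # Split "a:b=>c" into its left-hand and right-hand item lists.
        lhs, rhs = row["rule"].split("=>")
        rules.append({
            "lhs": lhs.split(":"),
            "rhs": rhs.split(":"),
            "support": float(row["support"]),
            "confidence": float(row["confidence"]),
            "coverage": float(row["coverage"]),
            "lift": float(row["lift"]),
            "count": int(row["count"]),
        })
    return rules


# Stand-in for open("rules.csv"), using the example row from the text.
saved = io.StringIO(
    "rule,support,confidence,coverage,lift,count\n"
    "comp1234:comp2345=>comp4567,0.6,1.0,0.6,1.4,6\n"
)
rules = parse_rules(saved)
for rule in rules:
    print(rule["lhs"], "=>", rule["rhs"], rule["confidence"])
```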

Other options are similar to itemsets with the addition of --confidence (-c) as a minimum threshold for the confidence of the rules that we are interested in.



Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0