SUMMARIZATION WITH ITEMSETS

This is the C++ source code accompanying the paper:

"Tell me what I need to know: succinctly summarizing data using itemsets",
Michael Mampaey, Nikolaj Tatti, and Jilles Vreeken. 
Proceedings of the ACM SIGKDD international conference on knowledge discovery
and data mining, 2011.



INSTRUCTIONS:

The code requires the GNU MPFR library, which can be found at 
http://www.mpfr.org/
To compile, simply type 'make'.



USAGE:
-h      --help                  display this help and exit
-f      --file                  data filename, obligatory
-k      --k                     number of itemsets in summary [default=10]
-s      --minsup                minimum support threshold [default=0.25]
-m      --maxsize               maximum itemset size threshold [default=0]
-o      --output                summary filename
-i      --items                 use item probs as background info [default]
-j      --no-items              do not use item probabilities
-r      --rowmargins            use row margins as background info
-a      --no-rowmargins         do not use row margins [default]
-y      --fly                   mine frequencies on the fly [default]
-x      --no-fly                mine and store all frequencies beforehand
-p      --penalty               type of penalty [1=BIC, 3=MDL; default=3]
-l      --low-stop              stop when score increases [default=false]
-t      --time                  stop after x seconds have elapsed [default=0]
-c      --seed                  initialize summary with itemsets from a file
-d      --dictionary            dictionary file containing item names
-e      --estimate              estimate itemsets in file and save to file
-w      --maxwidth              maximum number of items per group [default=0]
-g      --maxgroupsize          maximum number of itemsets per group [default=0]
-v      --verbosity             verbosity level [default=1]
-q      --quiet                 quiet mode [verbosity=0]

The input file must be in 'fimi' format, i.e. each line is one transaction,
given as a list of non-negative integers separated by spaces. Item numbers do
not have to be consecutive and may start from zero. When using a seed file
(-c), each line must contain the frequency of an itemset followed by the
itemset itself. When estimating frequencies (-e), each line must contain a
single itemset.
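
To make the two line formats concrete, here is a minimal sketch of how such
lines could be parsed. It is not taken from the accompanying source code: the
names Itemset, SeedEntry, parse_itemset_line and parse_seed_line are
hypothetical, the code assumes C++17 (for std::optional), and it assumes the
seed frequency can be read as a floating-point number (the text above does not
say whether it is a fraction or a count).

```cpp
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical type: the item numbers found on one line.
using Itemset = std::vector<unsigned long>;

// Parse one 'fimi' line: non-negative integers separated by whitespace.
// Returns std::nullopt if any token is not a plain decimal number.
std::optional<Itemset> parse_itemset_line(const std::string& line) {
    std::istringstream in(line);
    Itemset items;
    std::string token;
    while (in >> token) {
        // Reject signs, letters, and anything else that is not a digit.
        if (token.find_first_not_of("0123456789") != std::string::npos) {
            return std::nullopt;
        }
        try {
            items.push_back(std::stoul(token));
        } catch (const std::out_of_range&) {
            return std::nullopt;  // number too large for unsigned long
        }
    }
    return items;
}

// Hypothetical type: one seed-file line, i.e. a frequency and an itemset.
struct SeedEntry {
    double frequency;  // assumption: readable as a double
    Itemset items;
};

// Parse one seed line: the frequency first, then the itemset itself.
std::optional<SeedEntry> parse_seed_line(const std::string& line) {
    std::istringstream in(line);
    double frequency = 0.0;
    if (!(in >> frequency)) {
        return std::nullopt;
    }
    std::string rest;
    std::getline(in, rest);  // remainder of the line holds the items
    std::optional<Itemset> items = parse_itemset_line(rest);
    if (!items || items->empty()) {
        return std::nullopt;
    }
    return SeedEntry{frequency, *items};
}
```

With these helpers, parse_itemset_line("0 3 17") yields the three items 0, 3
and 17, while a line containing a negative number is rejected. A file for the
-e option would be read with parse_itemset_line, since each of its lines holds
only an itemset.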



EXAMPLE:

./summarizer -f mushroom.dat -s 0.05 --fly -k 10 -v 2
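
The -s value in the command above reads most naturally as a relative support,
i.e. the fraction of transactions that contain an itemset; this is an
assumption based on the fractional default of 0.25, not something the option
list states. Under that assumption, the quantity compared against the
threshold could be computed as in the sketch below. The names Itemset and
relative_support are hypothetical and do not come from the accompanying
source code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical type: item numbers, sorted in ascending order.
using Itemset = std::vector<unsigned long>;

// Fraction of transactions that contain every item of `itemset`.
// Both the transactions and the itemset must be sorted ascending,
// because std::includes requires sorted ranges.
double relative_support(const std::vector<Itemset>& transactions,
                        const Itemset& itemset) {
    if (transactions.empty()) {
        return 0.0;
    }
    std::size_t hits = 0;
    for (const Itemset& t : transactions) {
        if (std::includes(t.begin(), t.end(), itemset.begin(), itemset.end())) {
            ++hits;
        }
    }
    return static_cast<double>(hits) / static_cast<double>(transactions.size());
}
```

For the four transactions {1,2,3}, {1,3}, {2,3} and {1,2}, the itemset {1,3}
occurs in two of them, so its relative support is 0.5 and it would pass a
threshold of 0.05.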
