Let \(X\), \(Y\) be two itemsets: \(X,Y \subseteq I\) and \(X \cap Y=\emptyset\).
Association rules represent implications of the form \(X \rightarrow Y\)
Support \(s\) of a rule: the fraction of transactions containing \(X \cup Y\) \[s(X \rightarrow Y)=s(X \cup Y)\]
Confidence \(c\) of a rule: the fraction of transactions containing \(X \cup Y\) in the set of transactions containing \(X\). \[c(X \rightarrow Y)=\frac{s(X \cup Y)}{s(X)}\]
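The two definitions above can be computed directly. A minimal sketch, assuming transactions are represented as Python sets (the toy database here is a hypothetical three-transaction example, not the one used later):

```python
# Toy transaction database: each transaction is a set of items (assumption).
db = [
    {"chips", "coke", "whiskey"},
    {"beer", "chips"},
    {"chips", "ice"},
]

def support(itemset, db):
    """s(itemset): fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """c(X -> Y) = s(X u Y) / s(X)."""
    return support(x | y, db) / support(x, db)

print(support({"chips"}, db))               # chips is in all 3 transactions -> 1.0
print(confidence({"chips"}, {"coke"}, db))  # (1/3) / (3/3) -> 0.333...
```

Note that support is a property of the itemset \(X \cup Y\) alone, while confidence additionally normalises by \(s(X)\), so \(X \rightarrow Y\) and \(Y \rightarrow X\) share the same support but generally differ in confidence.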
Association Rule Mining (ARM)
Given:
Set of items \(I\)
Transaction database over \(I\)
Minimum support threshold \(t_s\) and a minimum confidence threshold \(t_c\)
Goal: Find all association rules \(X\rightarrow Y\) in the DB that meet the minimum support and minimum confidence thresholds, i.e.: \[\{X\rightarrow Y \mid s(X \cup Y) \geq t_s,\; c(X\rightarrow Y)\geq t_c \}\] Such rules are called strong.
In general, for \(|I|\) items there are \(\binom{|I|}{1} + \binom{|I|}{2} + \dots + \binom{|I|}{|I|} = 2^{|I|}-1\) possible (non-empty) itemsets.
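The count can be verified by brute-force enumeration; a quick sketch for the five items of the later example:

```python
from itertools import combinations

items = ["beer", "chips", "coke", "ice", "whiskey"]  # |I| = 5

# Enumerate all non-empty itemsets and compare the count with 2^|I| - 1.
itemsets = [c for k in range(1, len(items) + 1)
            for c in combinations(items, k)]
print(len(itemsets))        # 31
print(2 ** len(items) - 1)  # 31
```

This exponential growth is exactly why a naive exhaustive search is infeasible and pruning strategies such as Apriori are needed.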
Apriori Algorithm
Reduces the number of candidate itemsets to be tested: if an itemset is frequent, then all of its subsets must also be frequent; conversely, if an itemset is infrequent, none of its supersets can be frequent, so they need not be tested.
1. Initialise \(k=1\). Scan the DB to find the frequent 1-itemsets.
2. Repeat:
   1. Set \(k=k+1\)
   2. Generate length-\(k\) candidate itemsets from the length-\((k-1)\) frequent itemsets
   3. Test the candidates against the DB to get the frequent \(k\)-itemsets
3. Stop when step 2 generates no candidate or no frequent itemset.
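The loop above can be sketched compactly. A minimal, non-optimised implementation, assuming transactions are Python sets; the join here simply unions any two frequent \((k-1)\)-itemsets whose union has size \(k\) (equivalent, after pruning, to the prefix-based join described later):

```python
from itertools import combinations

def apriori(db, t_s):
    """Return {frozenset: support} of all frequent itemsets (support >= t_s)."""
    n = len(db)
    supp = lambda s: sum(s <= t for t in db) / n
    # k = 1: scan DB for frequent single items
    items = {i for t in db for i in t}
    freq = {s: v for s in (frozenset([i]) for i in items)
            if (v := supp(s)) >= t_s}
    result, k = dict(freq), 2
    while freq:
        prev = list(freq)
        # self-join: unions of two frequent (k-1)-itemsets of size k
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        # prune: drop candidates with an infrequent (k-1)-subset
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # test candidates against the DB
        freq = {c: v for c in cands if (v := supp(c)) >= t_s}
        result.update(freq)
        k += 1
    return result

# The 9-transaction database from the example below, t_s = 2/9 (~22%)
db = [
    {"chips", "coke", "whiskey"}, {"beer", "chips"}, {"chips", "ice"},
    {"beer", "chips", "coke"}, {"coke", "ice"}, {"chips", "ice"},
    {"coke", "ice"}, {"chips", "ice", "coke", "whiskey"},
    {"chips", "coke", "ice"},
]
freq = apriori(db, 2 / 9)
print(freq[frozenset({"chips", "ice"})])  # 4/9 ~ 0.444
```

For this database the algorithm terminates at \(k=4\), since no 4-itemset candidate survives the pruning step.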
Example
Database with 9 transactions
Minimum support \(t_{s}=22\%\)
Minimum confidence \(t_c=70\%\)
Identify frequent itemsets using Apriori
Identify association rules
Database

tid | items
--- | -----
1 | {chips, coke, whiskey}
2 | {beer, chips}
3 | {chips, ice}
4 | {beer, chips, coke}
5 | {coke, ice}
6 | {chips, ice}
7 | {coke, ice}
8 | {chips, ice, coke, whiskey}
9 | {chips, coke, ice}
Example - \(k=1\)
Candidates \(C_1\)

itemset | \(s\)
------- | -----
{coke} | 67%
{chips} | 78%
{ice} | 67%
{beer} | 22%
{whiskey} | 22%
Frequent itemsets \(L_1\)

itemset | \(s\)
------- | -----
{coke} | 67%
{chips} | 78%
{ice} | 67%
{beer} | 22%
{whiskey} | 22%
Generate candidates
\(C_k\) is generated by
Self-joining \(L_{k-1}\): \(L_{k-1} \cdot L_{k-1}\). Two \((k-1)\)-itemsets are joined if they agree in their first \((k-2)\) items (with items in a fixed, e.g. lexicographic, order)
Pruning all \(k\)-itemsets with a \((k-1)\)-subset that is not frequent, i.e. not in \(L_{k-1}\)
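The join and prune steps can be sketched as follows, assuming items within each itemset are kept in sorted order so that the "first \(k-2\) items" comparison is well defined. The \(L_2\) input below is the frequent 2-itemsets of the running example (supports omitted):

```python
from itertools import combinations

def generate_candidates(L_prev, k):
    """C_k from L_{k-1}: join itemsets agreeing on their first k-2 items,
    then prune candidates that have an infrequent (k-1)-subset."""
    prev = sorted(tuple(sorted(s)) for s in L_prev)
    prev_set = set(prev)
    cands = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:k - 2] == b[:k - 2]:                # join step
                cand = tuple(sorted(set(a) | set(b)))
                # prune step: every (k-1)-subset must be in L_{k-1}
                if all(sub in prev_set for sub in combinations(cand, k - 1)):
                    cands.add(cand)
    return cands

L2 = [{"chips", "coke"}, {"chips", "ice"}, {"coke", "ice"},
      {"beer", "chips"}, {"chips", "whiskey"}, {"coke", "whiskey"}]
print(generate_candidates(L2, 3))
```

Here the join produces, among others, {chips, ice, whiskey}, but pruning discards it because its subset {ice, whiskey} is not in \(L_2\); only {chips, coke, ice} and {chips, coke, whiskey} survive as candidates in \(C_3\).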