1 Introduction
-
If the support threshold is set too high, very few (or even no) frequent itemsets can be discovered, and the users cannot find enough useful information for their decisions.
-
If the support threshold is set too small, a significantly large number of frequent itemsets will be generated, which not only degrades the efficiency seriously, but also overwhelms the users by enormous size of the results.
1.1 Gap Analysis of Top-k FIM
1.2 Our Contributions
-
This paper develops a novel prefix-partitioning-based PTF algorithm to mine top-k frequent itemsets on massive data efficiently, which only retrieves a small fraction of transaction table for computing the results.
-
Hybrid vertical storage mode is proposed in this paper to keep the vertical representation of the items in one of three storage formats adaptively and reduce the involved I/O cost significantly.
-
Vertical mining is performed on the processed partitions of vertical representation, where the itemsets are generated with the high-support-first principle to speed up the in-memory computation.
-
The experimental results on synthetic and real-life data sets show that PTF outperforms the existing algorithms significantly.
2 Related Works
2.1 Frequent Itemset Mining
2.2 Top-k Frequent Itemset Mining
3 Preliminaries
3.1 Problem Definition
Symbol | Meaning |
---|---|
\(\mathcal {I}\) | the set of items \(\{ x_1, x_2, \ldots , x_d \}\) |
k | The number of the required results |
\(FI_k\) | The k most frequent itemsets |
MH | Min heap to keep the k most frequent itemsets so far |
rmsup | The kth highest support of the itemsets so far |
\(P_i\) | The prefix-based partition corresponding to \(x_i\) |
\(AR_i\) | The array to maintain the promising items in \(P_i\) |
QE | The priority queue to keep the candidates |
3.2 Problem Analysis
-
Although minsup is not specified explicitly, it is used throughout the execution to discard the candidate itemsets by anti-monotone property. The existing algorithms set minsup equal to zero initially, and gradually raise minsup when more itemsets are generated. Some pre-computation is required to raise the initial minsup properly at a low cost. This can improve the efficiency partly.
-
On massive data, the I/O cost of a single sequential scan is already relatively high. In order to achieve high efficiency, a sub-linear algorithm is more desirable, which only involves a small fraction of the transactions. This requires designing the more resultful pruning strategy to reduce the number of the involved transactions.
-
Besides the pruning of the whole transactions, it is also desirable to narrow down the used items in each involved transaction correspondingly.
4 Prefix-Based Partitioning
4.1 Partitioning Strategy
4.2 The Space Consumption in Direct Form
4.3 Some Pre-constructed Structures
5 PTF Algorithm
5.1 The Overview of the Algorithm
5.2 Basic Implementation
-
If \(HT(X y_2) \ne null\), the tid-set \(HT(X y_1)\) of \(X y_1\) and the tid-set \(HT(X y_2)\) of \(X y_2\) are intersected to generate the tid-set of the itemset \(X y_1 y_2\) (line 11). If \(\sigma (X y_1 y_2) > rmsup\), PTF keeps \(X y_1 y_2\) and its tid-sets in HT, and also inserts it in QE for the further exploration (line 13 and line 14 of Algorithm 2).
-
Otherwise, \(HT(X y_2)\) = null, as proved in Theorem 3, \(\sigma (X y_2) \le rmsup\), PTF proceeds with the next item in \(AR_i\) to expand \(X_{rt}\).
5.3 Improvement 1: Hybrid Vertical Storage Mode
5.3.1 Storage Mode
-
The first value is the size of tid-list \(L(x_j)\), which is a list to keep the tids of the transactions containing \(x_j\).
-
The second value is the size of the dif-list \(D(x_j)\), i.e. \(\{1,2, \ldots , n_i\} - L(x_j)\), which is a list to keep the tids of the transactions not containing \(x_j\).
-
The third value is the size of the bit-vector \(BV(x_j)\) of \(n_i\) bits. The ath bit in \(BV(x_j)\) is 1 if the ath transaction contains \(x_j\), otherwise, the ath bit in \(BV(x_j)\) is 0.
5.3.2 The Intersection of the Tid-Sets
5.4 Improvement 2: Candidate Pruning
5.5 Analysis
-
If \(|AR_i| = 0\), the support of \(\{ x_i \}\) is no larger than rmsup, supports of any itemsets generated in \(P_i\) are no larger than rmsup.
-
If \(|AR_i| = 1\) or 2, only 1-itemsets or 2-itemsets are needed to be checked in \(P_i\). Considering that in the initialization of rmsup, MH has maintained the 1-itemsets and 2-itemsets that are possible to be top-k frequent itemsets. PTF still does not need to deal with \(P_i\).
6 Performance Evaluation
Parameter | Used values |
---|---|
Transaction number (\(10^7\)) (syn) | 1–100 |
Result number (syn) | 1000–10,000 |
The number of items (syn) | 500–2500 |
The average transaction width (syn) | 10–30 |
Result size (real-life sparse data set) | 1000–10,000 |
Result size (real-life dense data set) | 1000–5000 |
Dataset | \(\#\)Trans | \(\#\)Items | Avg.Tr.Width | Type |
---|---|---|---|---|
Chainstore | 1,112,949 | 46,086 | 7.23 | Sparse |
OnlineRetail | 540,455 | 2603 | 4.37 | Sparse |
Accidents | 340,183 | 468 | 33.8 | Dense |
Connect | 67,557 | 129 | 43 | Dense |
Pumsb | 49,046 | 2113 | 74 | Dense |
Susy | 5,000,000 | 179 | 19 | Dense |
ratio | Chain | Online | Accid | Conn | Pum | Susy |
---|---|---|---|---|---|---|
\(r_{hvs}\) | 16.94 | 0.55 | 3.26 | 1.25 | 5.86 | 1.21 |
\(r_{drt}\) | 9.31 | 3.11 | 18 | 22.47 | 37.99 | 10.25 |
\(r_{crm|mtk}\) | 1 | 1 | 1 | 1 | 1 | 1 |