Article

Utilising a statistical inequality for efficiently finding term sets

Information Processing & Management, 52 (6): 1086--1121 (November 2016)
DOI: 10.1016/j.ipm.2016.04.011

Abstract

Information Retrieval (IR) systems aim to find sets of terms that discriminate documents, and they often exploit frequency as evidence that a set of terms is non-random. Frequent Itemset (FI) mining refers to a class of algorithms that can be applied to IR to find non-random sets of terms. Finding FIs is computationally very expensive because the number of itemsets is exponential. To reduce this cost, many approaches to mining FIs are based on the monotonicity property that an itemset is frequent only if all its subsets are frequent. However, an itemset whose subsets are all frequent is not guaranteed to be frequent itself, so additional scans of the data are required, which adds computational cost. We introduce a statistical inequality called the Bell-Wigner Inequality (BWI) as a conceptual enhancement of monotonicity to predict with certainty when an itemset is frequent and when it is infrequent. Using both data mining datasets and a large IR test collection, an empirical validation shows that the BWI can significantly reduce computational cost.
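The monotonicity-based pruning that the abstract takes as its baseline can be illustrated with a minimal Apriori-style sketch. This is not the article's BWI method. The function name `frequent_itemsets`, the `min_support` parameter, and the set-based transaction format are assumptions made for illustration only.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {itemset: support count}."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # One pass over all transactions for each candidate itemset.
        return sum(1 for t in transactions if itemset <= t)

    def keep_frequent(candidates):
        counted = {c: support(c) for c in candidates}
        return {c: n for c, n in counted.items() if n >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = keep_frequent(frozenset([i]) for i in items)

    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (monotonicity): discard candidates with an infrequent subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Survivors are only possibly frequent, so each one still needs a scan.
        level = keep_frequent(candidates)
    return frequent
```

With the five transactions {a, b, c}, {a, b}, {a, c}, {b, c}, {a, b, c} and `min_support=3`, every pair is frequent, so {a, b, c} survives the prune step, yet the follow-up scan finds it in only two transactions. That extra scan is the cost the abstract refers to.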
