Abstract
Heterogeneous and dirty data is abundant. It is stored under different,
often opaque schemata, it represents identical real-world objects
multiple times, causing duplicates, and it has missing values and
conflicting values. The Humboldt Merger (HumMer) is a tool that allows
ad-hoc, declarative fusion of such data using a simple extension
to SQL. Guided by a query against multiple tables, HumMer proceeds
in three fully automated steps: First, instance-based schema matching
bridges schematic heterogeneity of the tables by aligning corresponding
attributes. Next, duplicate detection techniques find multiple representations
of identical real-world objects. Finally, data fusion and conflict
resolution merge duplicates into a single, consistent, and clean
representation.
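
The three steps can be illustrated with a toy sketch in Python. It is not HumMer's implementation and does not use its SQL extension: the function names (match_schema, find_duplicates, fuse), the sample tables, and the similarity threshold are all assumptions made for illustration. Value-set overlap stands in for instance-based schema matching, string similarity on a single attribute stands in for duplicate detection, and "take the first non-null value" stands in for conflict resolution.

```python
from difflib import SequenceMatcher


def match_schema(rows_a, rows_b):
    """Step 1 (toy): map each attribute of table B to the attribute of
    table A whose values overlap most (Jaccard similarity of value sets)."""
    def values(rows, attr):
        return {r[attr] for r in rows if r[attr] is not None}

    def jaccard(x, y):
        return len(x & y) / len(x | y) if x | y else 0.0

    attrs_a = list(rows_a[0])
    return {
        attr_b: max(
            attrs_a,
            key=lambda a: jaccard(values(rows_a, a), values(rows_b, attr_b)),
        )
        for attr_b in rows_b[0]
    }


def find_duplicates(rows, key, threshold=0.8):
    """Step 2 (toy): greedily cluster rows whose `key` value is similar
    to the first row of an existing cluster. The threshold is an assumption."""
    clusters = []
    for row in rows:
        for cluster in clusters:
            if SequenceMatcher(None, row[key], cluster[0][key]).ratio() >= threshold:
                cluster.append(row)
                break
        else:
            clusters.append([row])
    return clusters


def fuse(cluster):
    """Step 3 (toy): resolve conflicts by taking the first non-null value
    of each attribute within a cluster of duplicates."""
    return {
        attr: next((r[attr] for r in cluster if r[attr] is not None), None)
        for attr in cluster[0]
    }


# Hypothetical sample data: two tables with different schemata,
# a misspelled duplicate, and a missing value.
table_a = [
    {"name": "Ann Lee", "city": "Berlin"},
    {"name": "Bob Ray", "city": None},
]
table_b = [
    {"fullname": "Ann Lee", "town": "Berlin"},
    {"fullname": "Bob Rey", "town": "Paris"},
]

mapping = match_schema(table_a, table_b)
aligned_b = [{mapping[k]: v for k, v in row.items()} for row in table_b]
clusters = find_duplicates(table_a + aligned_b, key="name")
fused = [fuse(c) for c in clusters]
print(fused)
```

In this sketch the four input rows collapse into two fused records, with the missing city for "Bob Ray" filled from the other table. A real system would need more robust matching and a choice of conflict resolution functions per attribute.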