How to wring a table dry: entropy compression of relations and querying of compressed relations
V. Raman, и G. Swart. VLDB '06: Proceedings of the 32nd international conference on Very large data bases, стр. 858--869. VLDB Endowment, (2006)
Аннотация
We present a method to compress relations close to their entropy while still allowing efficient queries. Column values are encoded into variable length codes to exploit skew in their frequencies. The codes in each tuple are concatenated and the resulting tuplecodes are sorted and delta-coded to exploit the lack of ordering in a relation. Correlation is exploited either by co-coding correlated columns, or by using a sort order that leverages the correlation. We prove that this method leads to near-optimal compression (within 4.3 bits/tuple of entropy), and in practice, we obtain up to a 40 fold compression ratio on vertical partitions tuned for TPC-H queries.We also describe initial investigations into efficient querying over compressed data. We present a novel Huffman coding scheme, called segregated coding, that allows range and equality predicates on compressed data, without accessing the full dictionary. We also exploit the delta coding to speed up scans, by reusing computations performed on nearly identical records. Initial results from a prototype suggest that with these optimizations, we can efficiently scan, tokenize and apply predicates on compressed relations.
%0 Conference Paper
%1 1164201
%A Raman, Vijayshankar
%A Swart, Garret
%B VLDB '06: Proceedings of the 32nd international conference on Very large data bases
%D 2006
%I VLDB Endowment
%K compression database retrieval vldb
%P 858--869
%T How to wring a table dry: entropy compression of relations and querying of compressed relations
%U http://portal.acm.org/citation.cfm?id=1164127.1164201
%X We present a method to compress relations close to their entropy while still allowing efficient queries. Column values are encoded into variable length codes to exploit skew in their frequencies. The codes in each tuple are concatenated and the resulting tuplecodes are sorted and delta-coded to exploit the lack of ordering in a relation. Correlation is exploited either by co-coding correlated columns, or by using a sort order that leverages the correlation. We prove that this method leads to near-optimal compression (within 4.3 bits/tuple of entropy), and in practice, we obtain up to a 40 fold compression ratio on vertical partitions tuned for TPC-H queries.We also describe initial investigations into efficient querying over compressed data. We present a novel Huffman coding scheme, called segregated coding, that allows range and equality predicates on compressed data, without accessing the full dictionary. We also exploit the delta coding to speed up scans, by reusing computations performed on nearly identical records. Initial results from a prototype suggest that with these optimizations, we can efficiently scan, tokenize and apply predicates on compressed relations.
@inproceedings{1164201,
abstract = {We present a method to compress relations close to their entropy while still allowing efficient queries. Column values are encoded into variable length codes to exploit skew in their frequencies. The codes in each tuple are concatenated and the resulting tuplecodes are sorted and delta-coded to exploit the lack of ordering in a relation. Correlation is exploited either by co-coding correlated columns, or by using a sort order that leverages the correlation. We prove that this method leads to near-optimal compression (within 4.3 bits/tuple of entropy), and in practice, we obtain up to a 40 fold compression ratio on vertical partitions tuned for TPC-H queries.We also describe initial investigations into efficient querying over compressed data. We present a novel Huffman coding scheme, called segregated coding, that allows range and equality predicates on compressed data, without accessing the full dictionary. We also exploit the delta coding to speed up scans, by reusing computations performed on nearly identical records. Initial results from a prototype suggest that with these optimizations, we can efficiently scan, tokenize and apply predicates on compressed relations.},
added-at = {2007-12-06T05:24:55.000+0100},
author = {Raman, Vijayshankar and Swart, Garret},
biburl = {https://www.bibsonomy.org/bibtex/2de490a1d1ac2bda72d623075316dbf0d/jhammerb},
booktitle = {VLDB '06: Proceedings of the 32nd international conference on Very large data bases},
description = {How to wring a table dry},
interhash = {12225afb0f18b87a4654c25ac63082a1},
intrahash = {de490a1d1ac2bda72d623075316dbf0d},
keywords = {compression database retrieval vldb},
location = {Seoul, Korea},
pages = {858--869},
publisher = {VLDB Endowment},
timestamp = {2007-12-06T05:24:55.000+0100},
title = {How to wring a table dry: entropy compression of relations and querying of compressed relations},
url = {http://portal.acm.org/citation.cfm?id=1164127.1164201},
year = 2006
}