Lookup tables are a fundamental structure in many data processing and systems applications. Examples include tokenized text in NLP, quantized embedding collections in recommendation systems, integer sketches for streaming data, and hash-based string representations in genomics. With the increasing size of web-scale data, such applications often require compression techniques that support fast $O(1)$ random lookup of individual parameters directly on the compressed data (i.e., without blockwise decompression in RAM). While the community has proposed a number of succinct data structures that support queries over compressed representations, these approaches do not fully leverage the low-entropy structure prevalent in real-world workloads to reduce space. Inspired by recent advances in static function construction techniques, we propose a space-efficient representation of immutable key-value data, called CARAMEL, specifically designed for the case where the values are multi-sets. By carefully combining multiple compressed static functions, CARAMEL occupies space proportional to the data entropy, with low memory overhead and minimal lookup cost. We demonstrate 1.25-16x compression on practical lookup tasks drawn from real-world systems, improving upon established techniques, including a production-grade read-only database widely used for development within Amazon.com.
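To make the phrase "space proportional to the data entropy" concrete, the following minimal sketch compares the empirical (zeroth-order) entropy of a skewed value column with the bits per value that a fixed-width encoding would use. It is not CARAMEL's construction; the function names and the toy value distribution are assumptions chosen purely for illustration.

```python
import math
from collections import Counter


def empirical_entropy_bits(values):
    """Zeroth-order empirical entropy of the values, in bits per value."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def fixed_width_bits(values):
    """Bits per value for a fixed-width code over the distinct values."""
    distinct = len(set(values))
    return max(1, math.ceil(math.log2(distinct)))


# Hypothetical low-entropy workload: one value dominates the table.
values = [0] * 900 + [1] * 50 + [2] * 30 + [3] * 20

h = empirical_entropy_bits(values)
w = fixed_width_bits(values)
print(f"entropy: {h:.3f} bits/value, fixed width: {w} bits/value")
```

On skewed data like this, the entropy falls well below the fixed-width cost, which is the gap an entropy-proportional representation aims to close.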