Lookup tables are a fundamental structure in many data processing and systems applications. Examples include tokenized text in NLP, quantized embedding collections in recommendation systems, integer sketches for streaming data, and hash-based string representations in genomics. As web-scale data grows, such applications often require compression techniques that support fast $O(1)$ random lookup of individual parameters directly on the compressed data (i.e., without blockwise decompression in RAM). While the community has proposed a number of succinct data structures that support queries over compressed representations, these approaches do not fully exploit the low-entropy structure prevalent in real-world workloads to reduce space. Inspired by recent advances in static function construction techniques, we propose CARAMEL, a space-efficient representation of immutable key-value data designed specifically for the case where the values are multi-sets. By carefully combining multiple compressed static functions, CARAMEL occupies space proportional to the data entropy, with low memory overhead and minimal lookup cost. We demonstrate 1.25-16x compression on practical lookup tasks drawn from real-world systems, improving upon established techniques, including a production-grade read-only database widely used for development within Amazon.com.
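To give a sense of what "space proportional to the data entropy" can mean in practice, the sketch below compares the empirical Shannon entropy of a value column with the cost of a fixed-width code over the same values. It is an illustration only and not CARAMEL's construction: the helper names `empirical_entropy_bits` and `fixed_width_bits` and the skewed example column are assumptions made here, and the sketch ignores keys, multi-set structure, and any per-structure overhead.

```python
import math
from collections import Counter


def empirical_entropy_bits(values):
    """Shannon entropy of the empirical value distribution, in bits per value."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def fixed_width_bits(values):
    """Bits per value when every distinct value gets a code of the same length."""
    distinct = len(set(values))
    return max(1, math.ceil(math.log2(distinct)))


# Hypothetical low-entropy column: one value dominates the table.
values = [0] * 900 + [1] * 50 + [2] * 30 + [3] * 20

h = empirical_entropy_bits(values)  # entropy lower bound, bits per value
w = fixed_width_bits(values)        # fixed-width cost, bits per value

print(f"fixed-width: {w} bits/value, entropy: {h:.3f} bits/value")
print(f"idealized headroom: {w / h:.2f}x")
```

On a skewed column like this one, the entropy falls well below the fixed-width cost, which is the kind of headroom an entropy-proportional representation aims to recover. The ratio printed here is an idealized bound for this toy input, not a measurement of any real system.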