CARAMEL: A Succinct Read-Only Lookup Table via Compressed Static Functions

From arXiv, 8 pages. The first version of this paper included an additional theorem related to Bloom filter pre-filtering. This result was removed in subsequent versions and significantly improved upon in a follow-up paper, arXiv:2603.24882.

Lookup tables are a fundamental structure in many data processing and systems applications. Examples include tokenized text in NLP, quantized embedding collections in recommendation systems, integer sketches for streaming data, and hash-based string representations in genomics. With the increasing size of web-scale data, such applications often require compression techniques that support fast random $O(1)$ lookup of individual parameters directly on the compressed data (i.e., without blockwise decompression in RAM). While the community has proposed a number of succinct data structures that support queries over compressed representations, these approaches do not fully leverage the low-entropy structure prevalent in real-world workloads to reduce space. Inspired by recent advances in static function construction techniques, we propose a space-efficient representation of immutable key-value data, called CARAMEL, specifically designed for the case where the values are multi-sets. By carefully combining multiple compressed static functions, CARAMEL occupies space proportional to the data entropy with low memory overheads and minimal lookup costs. We demonstrate 1.25-16x compression on practical lookup tasks drawn from real-world systems, improving upon established techniques, including a production-grade read-only database widely used for development within Amazon.com.
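To make the phrase "space proportional to the data entropy" concrete, the following minimal sketch compares the empirical Shannon entropy of a value column with the cost of a plain fixed-width array. It is not CARAMEL's construction; the helper names (`empirical_entropy_bits`, `fixed_width_bits`) and the sample column are assumptions chosen purely for illustration.

```python
import math
from collections import Counter


def empirical_entropy_bits(values):
    """Shannon entropy, in bits per value, of the empirical value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def fixed_width_bits(values):
    """Bits per value when every distinct value gets a code of the same width."""
    distinct = len(set(values))
    return max(1, math.ceil(math.log2(distinct)))


# Hypothetical skewed column: one value dominates, as in low-entropy workloads.
values = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2

entropy = empirical_entropy_bits(values)  # lower bound a compressed encoding can approach
width = fixed_width_bits(values)          # cost of an uncompressed fixed-width array

print(f"entropy: {entropy:.2f} bits/value, fixed width: {width} bits/value")
```

The gap between the two numbers is the headroom an entropy-aware representation can exploit; the more skewed the value distribution, the larger that gap becomes.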


Related content

Lookup table


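In computer science, a lookup table is an array that replaces runtime computation with a simpler array indexing operation. The savings in processing time can be significant, because retrieving a value from memory is often faster than performing an "expensive" computation or an input/output operation. Such tables can be precomputed and stored in static program storage, computed (or "pre-fetched") as part of a program's initialization phase (memoization), or even stored in hardware on application-specific platforms. Lookup tables are also widely used to validate input values by matching them against a list of valid (or invalid) items in an array, and in some programming languages they may contain pointer functions (or offsets to labels) for processing the matched input. FPGAs also make extensive use of reconfigurable, hardware-implemented lookup tables to provide programmable hardware functionality.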
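As a small illustration of replacing runtime computation with array indexing, the sketch below precomputes the number of set bits for every byte value and then answers 32-bit queries with four table lookups. The names `POPCOUNT_TABLE` and `popcount32` are hypothetical and chosen only for this example.

```python
# Precompute the number of set bits for every possible byte value (0..255).
POPCOUNT_TABLE = [bin(b).count("1") for b in range(256)]


def popcount32(x):
    """Count set bits in a 32-bit unsigned integer using four table lookups."""
    return (
        POPCOUNT_TABLE[x & 0xFF]
        + POPCOUNT_TABLE[(x >> 8) & 0xFF]
        + POPCOUNT_TABLE[(x >> 16) & 0xFF]
        + POPCOUNT_TABLE[(x >> 24) & 0xFF]
    )


print(popcount32(0b1011))
```

The table costs 256 entries of memory once, at initialization, and every later query is a handful of index operations rather than a bit-by-bit loop.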
