We study the problem of building space-efficient, in-memory indexes for massive key-value datasets with highly skewed value distributions. This challenge arises in many data-intensive domains and is particularly acute in computational genomics, where $k$-mer count tables can contain billions of entries dominated by a single frequent value. Recent work has proposed addressing this problem by augmenting compressed static functions (CSFs) with pre-filters, but existing approaches rely on complex heuristics and lack formal guarantees. In this paper, we introduce AutoCSF, a principled algorithm for combining CSFs with pre-filtering that provably handles skewed distributions with near-optimal space usage. We improve upon prior CSF pre-filtering constructions by (1) deriving a mathematically rigorous decision criterion for when filter augmentation is beneficial; (2) presenting a general algorithmic framework for integrating CSFs with modern set membership data structures beyond the classic Bloom filter; and (3) establishing theoretical guarantees on the overall space usage of the resulting indexes. Our open-source implementation of AutoCSF demonstrates space savings over baseline methods while maintaining low query latency.
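To make the pre-filtering idea concrete, the following is a minimal sketch of a filter placed in front of a value store, and it is not the AutoCSF algorithm itself. It assumes a toy Bloom filter as the membership structure and a plain Python `dict` standing in for the CSF; the class names `BloomFilter` and `PrefilteredIndex` and the parameters `num_bits` and `num_hashes` are hypothetical and chosen only for illustration. The decision criterion for whether a filter should be used at all is not modeled here.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Toy Bloom filter; a stand-in for any approximate membership structure."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive one bit position per hash function by salting BLAKE2b.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                key.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


class PrefilteredIndex:
    """Filter in front of a value store (a dict here, in place of a real CSF)."""

    def __init__(self, table: dict, num_bits: int, num_hashes: int) -> None:
        # The single most frequent value is answered without touching the store.
        self.dominant = Counter(table.values()).most_common(1)[0][0]
        self.filter = BloomFilter(num_bits, num_hashes)
        for key, value in table.items():
            if value != self.dominant:
                self.filter.add(key)
        # Store every key that passes the filter: all minority keys plus the
        # dominant-value keys that happen to be false positives.
        self.store = {k: v for k, v in table.items() if k in self.filter}

    def get(self, key: str):
        if key not in self.filter:
            return self.dominant
        # Keys outside the original table may return an arbitrary value,
        # as with a static function; here they fall back to the dominant one.
        return self.store.get(key, self.dominant)


# Skewed toy table: 950 keys map to 1, 50 keys map to other counts.
table = {f"kmer{i}": 1 for i in range(950)}
table.update({f"rare{i}": 2 + i % 5 for i in range(50)})
index = PrefilteredIndex(table, num_bits=1024, num_hashes=3)
```

In this sketch only the keys that pass the filter reach the store, so the store shrinks as the value distribution becomes more skewed, at the cost of the filter's own bits.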