Membership (membership query / membership testing) is a fundamental problem across databases, networks, and security. However, previous research has focused primarily on either approximate solutions, such as Bloom Filters, or exact methods, such as perfect hashing and dictionaries, without attempting to develop a unified theory. In this paper, we propose a unified and complete theory, the chain rule, for general membership problems, which encompasses both approximate and exact membership as extreme cases. Building on the chain rule, we introduce a straightforward yet versatile algorithm framework, ChainedFilter, which combines different elementary filters without losing information. Our evaluation results demonstrate that ChainedFilter performs well in many applications: (1) it requires only 26% additional space over the theoretical lower bound for implicit static dictionaries, (2) it requires only 0.22 additional bits per item over the theoretical lower bound for lossless data compression, (3) it reduces external memory accesses by up to 31% compared with raw Cuckoo Hashing, (4) it reduces P99 tail point query latency by up to 36% compared with a Bloom Filter under the same space cost in the RocksDB database, and (5) it reduces filter space by up to 99.1% compared with the original Learned Bloom Filter.
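To make the idea of combining elementary filters without losing information more concrete, the following is a minimal sketch, not the paper's ChainedFilter algorithm. It assumes a static setting in which both the positive items and the candidate negative items are known in advance. A small Bloom filter serves as the first stage, and a second stage records the negatives that the first stage wrongly accepts. The class names `BloomFilter` and `TwoStageFilter`, the parameters `num_bits` and `num_hashes`, and the use of a plain Python set as the second stage are all illustrative assumptions; a space-efficient design would presumably use another compact filter there.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter (hypothetical helper, one byte per bit for clarity)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


class TwoStageFilter:
    """Assumed two-stage chain: approximate stage, then a false-positive stage."""

    def __init__(self, positives, negatives, num_bits=64, num_hashes=2):
        # Stage 1: approximate filter built over the positive items.
        self.stage1 = BloomFilter(num_bits, num_hashes)
        for item in positives:
            self.stage1.add(item)
        # Stage 2: the known negatives that stage 1 wrongly accepts.
        # A set is used here only as a stand-in for a compact filter.
        self.stage2 = {item for item in negatives if item in self.stage1}

    def __contains__(self, item):
        # Accept only if stage 1 accepts and stage 2 does not veto.
        return item in self.stage1 and item not in self.stage2


positives = [f"key{i}" for i in range(20)]
negatives = [f"other{i}" for i in range(200)]
chained = TwoStageFilter(positives, negatives)

print(all(item in chained for item in positives))
print(any(item in chained for item in negatives))
```

Under these assumptions the combination answers exactly on the known universe: every positive passes both stages, and every known negative is rejected either by the first stage or by the second. Items outside the known universe can still be false positives, which is why the sketch applies only to the static case.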