A filter is a widely used data structure for storing an approximation of a given set $S$ of elements from some universe $U$ (a countable set). It represents a superset $S'\supseteq S$ that is "close to $S$" in the sense that for $x\not\in S$, the probability that $x\in S'$ is bounded by some $\varepsilon > 0$. The advantage of using a Bloom filter, when some false positives are acceptable, is that the space usage becomes smaller than what is required to store $S$ exactly. Though filters are well understood from a worst-case perspective, it is clear that state-of-the-art constructions may not be close to optimal for particular distributions of data and queries. Suppose, for instance, that some elements are in $S$ with probability close to 1. Then it would make sense to always include them in $S'$, saving space by not having to represent these elements in the filter. Questions of this kind have been raised in the context of Weighted Bloom filters (Bruck, Gao and Jiang, ISIT 2006) and Bloom filter implementations that make use of access to learned components (Vaidya, Knorr, Mitzenmacher, and Kraska, ICLR 2021). In this paper, we present a lower bound on the expected space that such a filter requires. We also show that the lower bound is asymptotically tight by exhibiting a filter construction that executes queries and insertions in worst-case constant time, and has a false positive rate of at most $\varepsilon$ with high probability over input sets drawn from a product distribution. We also present a Bloom filter alternative, which we call the $\textit{Daisy Bloom filter}$, that executes operations faster and uses significantly less space than the standard Bloom filter.
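The motivating example, always reporting elements that are very likely to be in $S$ as present instead of storing them, can be illustrated with a minimal sketch. This is not the Daisy Bloom filter or any construction from the paper: it wraps a textbook Bloom filter, and the names `BloomFilter`, `PriorAwareFilter`, and the predicate `is_likely` (standing in for "the prior probability of membership is close to 1") are assumptions introduced purely for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Textbook Bloom filter with m bits and k hash functions."""

    def __init__(self, capacity: int, eps: float):
        # Standard sizing: m = n ln(1/eps) / (ln 2)^2 and k = (m/n) ln 2.
        self.m = max(1, math.ceil(capacity * math.log(1 / eps) / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting one hash with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))


class PriorAwareFilter:
    """Hypothetical wrapper: likely elements are always in S' and never stored."""

    def __init__(self, is_likely, capacity: int, eps: float):
        self.is_likely = is_likely  # assumed oracle for "in S with probability close to 1"
        self.inner = BloomFilter(capacity, eps)

    def add(self, item: str) -> None:
        if not self.is_likely(item):
            self.inner.add(item)  # only unlikely elements consume filter space

    def __contains__(self, item: str) -> bool:
        return self.is_likely(item) or item in self.inner


if __name__ == "__main__":
    f = PriorAwareFilter(lambda x: x.startswith("hot"), capacity=100, eps=0.01)
    f.add("cold-key")
    print("cold-key" in f, "hot-key" in f)
```

Both queries in the usage lines return true: `cold-key` because it was inserted (a Bloom filter has no false negatives), and `hot-key` because the wrapper places every likely element in $S'$ without spending any bits on it. The cost of this shortcut is that a likely element that happens not to be in $S$ is always a false positive, which is why the trade-off depends on the distribution of data and queries.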