Filters are ubiquitous in computer science, enabling space-efficient approximate membership testing. Since Bloom filters were introduced in 1970, decades of work have improved their space efficiency and performance. Recently, three new paradigms have emerged that offer orders-of-magnitude improvements in false positive rates (FPRs) by exploiting information beyond the input set: (1) learned filters train a model to distinguish members from non-members, (2) stacked filters use samples of the negative query workload to build cascading layers, and (3) adaptive filters update their internal representation in response to false-positive feedback. Yet each paradigm targets specific use cases, introduces complex configuration tuning, and has been evaluated in isolation, leaving the trade-offs unclear and a gap in understanding how these approaches compare and when each is most appropriate. This paper presents the first comprehensive evaluation of learned, stacked, and adaptive filters across real-world datasets and query workloads. Our results reveal critical trade-offs: (1) Learned filters achieve FPRs up to 10^2 times lower but exhibit high variance and lack robustness under skewed or dynamic workloads; critically, model inference overhead yields query latencies up to 10^4 times higher than those of stacked or adaptive filters. (2) Stacked filters reliably achieve FPRs up to 10^3 times lower on skewed workloads but require workload knowledge. (3) Adaptive filters are robust across settings, achieving FPRs up to 10^3 times lower under adversarial queries without workload assumptions. Based on our analysis, learned filters suit stable workloads where input features enable effective model training and space constraints are paramount; stacked filters excel when reliable query distributions are known; and adaptive filters are the most generalizable, providing robust, theoretically bounded guarantees even in dynamic or adversarial environments.
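To ground the notion of space-efficient approximate membership testing that all three paradigms build on, the following is a minimal Bloom filter sketch (not any specific implementation evaluated in the paper; the class name, parameters, and salted-hash scheme are illustrative choices). It shows the defining one-sided error property: queries may return false positives, but never false negatives.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array.
    Membership queries may return false positives, never false negatives."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = [False] * m_bits

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # True means "possibly a member"; False means "definitely not a member".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for key in ["apple", "banana", "cherry"]:
    bf.add(key)
print("apple" in bf)  # → True (inserted keys are always reported present)
```

Learned, stacked, and adaptive filters each replace or augment this fixed hash-based representation: a learned filter substitutes a trained model for the bit array's first line of defense, a stacked filter chains several such structures using sampled negative queries, and an adaptive filter mutates the representation when a false positive is observed.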