CANDY: A Benchmark for Continuous Approximate Nearest Neighbor Search with Dynamic Data Ingestion

Approximate K Nearest Neighbor (AKNN) algorithms play a pivotal role in various AI applications, including information retrieval, computer vision, and natural language processing. Although numerous AKNN algorithms and benchmarks have been developed recently to evaluate their effectiveness, the dynamic nature of real-world data presents significant challenges that existing benchmarks fail to address. Traditional benchmarks primarily assess retrieval effectiveness in static contexts and often overlook update efficiency, which is crucial for handling continuous data ingestion. This limitation results in an incomplete assessment of an AKNN algorithm's ability to adapt to changing data patterns, thereby restricting insight into its performance in dynamic environments. To address these gaps, we introduce CANDY, a benchmark tailored for Continuous Approximate Nearest Neighbor Search with Dynamic Data Ingestion. CANDY comprehensively assesses a wide range of AKNN algorithms, integrating advanced optimizations such as machine learning-driven inference to supplant traditional heuristic scans, and improved distance computation methods to reduce computational overhead. Our extensive evaluations across diverse datasets demonstrate that simpler AKNN baselines often surpass more complex alternatives in terms of recall and latency. These findings challenge established beliefs about the necessity of algorithmic complexity for high performance. Furthermore, our results underscore existing challenges and illuminate future research opportunities. We have made the datasets and implementation methods available at: https://github.com/intellistream/candy.

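To make the idea of evaluating under continuous ingestion concrete, the following is a minimal sketch and not CANDY's actual implementation. It ingests vectors in batches and, after each batch, records insert latency, query latency, and recall against an exact scan of everything ingested so far. All names here (`BruteForceIndex`, `recall_at_k`, `run_stream`) and the choice of squared Euclidean distance are illustrative assumptions; any index exposing `insert` and `query` could be substituted for the one under test.

```python
import random
import time


class BruteForceIndex:
    """Exact linear-scan index (hypothetical); baseline and ground truth."""

    def __init__(self):
        self.vectors = []

    def insert(self, batch):
        # Append the new vectors; ids are positions in arrival order.
        self.vectors.extend(batch)

    def query(self, q, k):
        # Squared Euclidean distance from q to every stored vector.
        dists = [(sum((a - b) ** 2 for a, b in zip(v, q)), i)
                 for i, v in enumerate(self.vectors)]
        dists.sort()
        return [i for _, i in dists[:k]]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the index under test returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def run_stream(index, data, queries, batch_size, k):
    """Ingest data batch by batch; measure the index after each batch."""
    truth = BruteForceIndex()  # exact reference over the ingested prefix
    report = []
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]

        t0 = time.perf_counter()
        index.insert(batch)
        insert_seconds = time.perf_counter() - t0
        truth.insert(batch)

        t0 = time.perf_counter()
        results = [index.query(q, k) for q in queries]
        query_seconds = time.perf_counter() - t0

        recalls = [recall_at_k(ids, truth.query(q, k))
                   for q, ids in zip(queries, results)]
        report.append({
            "ingested": len(truth.vectors),
            "insert_seconds": insert_seconds,
            "query_seconds": query_seconds,
            "recall": sum(recalls) / len(recalls),
        })
    return report


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.random() for _ in range(8)] for _ in range(200)]
    queries = [[rng.random() for _ in range(8)] for _ in range(10)]
    for row in run_stream(BruteForceIndex(), data, queries, batch_size=50, k=5):
        print(row)
```

Passing the exact index as the index under test is only a sanity check: its recall is 1.0 by construction. An approximate index would trade some recall for lower insert or query latency, which is the trade-off a dynamic benchmark needs to expose.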