Vector databases are critical infrastructure in AI systems, and average recall is the dominant metric for evaluating them. Both users and researchers rely on it to choose and optimize their systems. We show that relying on average recall is problematic: it hides variability across queries, allowing systems with strong mean performance to underperform significantly on hard queries. These tail cases confuse users and can lead to failures in downstream applications such as RAG. We argue that robustness, meaning consistently achieving acceptable recall across queries, is crucial to vector database evaluation. We propose Robustness-$δ$@K, a new metric that captures the fraction of queries with recall above a threshold $δ$. This metric offers a deeper view of the recall distribution, helps select a vector index according to application needs, and guides the optimization of tail performance. We integrate Robustness-$δ$@K into existing benchmarks and evaluate mainstream vector indexes, revealing significant robustness differences. More robust vector indexes yield better application performance, even at the same average recall. We also identify design factors that influence robustness, providing guidance for improving real-world performance.
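The difference between average recall and Robustness-$δ$@K can be illustrated with a minimal sketch. It assumes recall@K is computed per query against exact ground-truth neighbors and that "above a threshold" is inclusive (recall ≥ $δ$); the function names and the toy recall values are hypothetical and are not taken from the benchmark integration.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[int], ground_truth: Sequence[int], k: int) -> float:
    """Fraction of the true top-k neighbor ids found among the first k retrieved ids."""
    truth = set(ground_truth[:k])
    if not truth:
        raise ValueError("ground_truth must contain at least one id")
    found = set(retrieved[:k])
    return len(truth & found) / len(truth)


def average_recall(recalls: Sequence[float]) -> float:
    """Mean of per-query recall@K values."""
    if not recalls:
        raise ValueError("recalls must not be empty")
    return sum(recalls) / len(recalls)


def robustness_at_k(recalls: Sequence[float], delta: float) -> float:
    """Fraction of queries whose recall@K is at least delta.

    Assumption: the threshold is inclusive (recall >= delta).
    """
    if not recalls:
        raise ValueError("recalls must not be empty")
    return sum(1 for r in recalls if r >= delta) / len(recalls)


# Hypothetical per-query recall@K values for two indexes.
# Both have the same mean, but index_b fails badly on one query.
index_a = [0.9, 0.9, 0.9, 0.9]
index_b = [1.0, 1.0, 1.0, 0.6]

for name, recalls in (("index_a", index_a), ("index_b", index_b)):
    print(name, average_recall(recalls), robustness_at_k(recalls, delta=0.8))
```

In this toy case both indexes have an average recall of about 0.9, yet only three of the four queries on `index_b` reach a recall of 0.8, which is the kind of tail behavior the mean conceals.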