In recent years, the dominant accuracy metric for vector search has been the recall of a fixed-size result list (top-k retrieval), with the exact vector retrieval results taken as ground truth. Although convenient to compute, this metric is only distantly related to the end-to-end accuracy of a full system that integrates vector search. In this paper we focus on the common case where a hard decision must be made based on the vector retrieval results, for example, deciding whether a query image matches a database image or not. We formulate this as a range search task, where all vectors within a certain radius of the query are returned. We show that the value of a range search result can be modeled rigorously based on the query-to-vector distance. This yields a metric for range search, RSM, that is both principled and easy to compute without running an end-to-end evaluation. We apply this metric to the case of image retrieval. We show that indexing methods that are adapted for top-k retrieval do not necessarily maximize the RSM. In particular, for inverted-file-based indexes, we show that visiting a limited set of clusters and encoding vectors compactly yields near-optimal results.
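To make the contrast between top-k retrieval and range search concrete, here is a minimal brute-force sketch in NumPy. It is an illustration only, not the indexing methods or the RSM metric studied in the paper: the function names, the use of L2 distance, and the toy data are all assumptions introduced for this example.

```python
import numpy as np


def exact_top_k(query, database, k):
    """Indices of the k nearest database vectors (L2), nearest first."""
    dists = np.sum((database - query) ** 2, axis=1)
    return np.argsort(dists, kind="stable")[:k]


def exact_range_search(query, database, radius):
    """Indices of all database vectors within `radius` (L2) of the query."""
    dists = np.sum((database - query) ** 2, axis=1)
    return np.flatnonzero(dists <= radius ** 2)


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k result that an approximate result recovers."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


if __name__ == "__main__":
    # Hypothetical toy database of four 2-D vectors.
    database = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])

    near_query = np.array([0.0, 0.0])
    far_query = np.array([10.0, 10.0])

    # Top-k always returns exactly k results, even for a query with no real match.
    print(exact_top_k(far_query, database, k=2))

    # Range search returns a variable-size set, possibly empty,
    # which maps directly onto a match / no-match decision.
    print(exact_range_search(near_query, database, radius=1.5))
    print(exact_range_search(far_query, database, radius=0.1))
```

The point of the sketch is that a top-k result list has a fixed size regardless of how far its entries are from the query, whereas the size of a range search result depends on the distances themselves, which is why a distance-based metric is a natural fit for hard decisions.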