The ever-increasing size of language models curtails their widespread availability to the community, which has prompted many companies to offer access to large language models through APIs. One particular type, suitable for dense retrieval, is a semantic embedding service that builds vector representations of input text. Given the growing number of publicly available APIs, our goal in this paper is to analyze existing offerings in realistic retrieval scenarios, to assist practitioners and researchers in finding suitable services according to their needs. Specifically, we investigate the capabilities of existing semantic embedding APIs on domain generalization and multilingual retrieval. For this purpose, we evaluate these services on two standard benchmarks, BEIR and MIRACL. We find that re-ranking BM25 results using the APIs is a budget-friendly approach and is most effective in English, in contrast to the standard practice of employing them as first-stage retrievers. For non-English retrieval, re-ranking still improves the results, but a hybrid model with BM25 works best, albeit at a higher cost. We hope our work lays the groundwork for evaluating semantic embedding APIs, which are critical in search and, more broadly, in information access.
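The two setups compared above, re-ranking BM25 candidates with embeddings and fusing BM25 with embedding scores, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy vectors stand in for embeddings returned by an API, the BM25 scores are invented, and the fusion rule (min-max normalization followed by linear interpolation with a hypothetical weight `alpha`) is one common choice, not necessarily the one used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(query_vec, candidates, doc_vecs):
    """Re-score BM25 candidates by embedding similarity only.

    candidates: list of (doc_id, bm25_score) from the first stage.
    doc_vecs:   dict doc_id -> embedding (assumed to come from an API).
    """
    scored = [(doc_id, cosine(query_vec, doc_vecs[doc_id])) for doc_id, _ in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def min_max(scores):
    """Scale a dict of scores to [0, 1]; all zeros if every score is equal."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def hybrid(bm25_scores, dense_scores, alpha=0.5):
    """Linear interpolation of normalized scores (assumed fusion rule).

    Documents missing from one list contribute 0 for that component.
    """
    b, d = min_max(bm25_scores), min_max(dense_scores)
    doc_ids = set(b) | set(d)
    fused = {k: alpha * b.get(k, 0.0) + (1 - alpha) * d.get(k, 0.0) for k in doc_ids}
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Toy data: invented BM25 scores and 2-d stand-ins for API embeddings.
query_vec = [1.0, 0.0]
doc_vecs = {"d1": [0.0, 1.0], "d2": [1.0, 0.0], "d3": [1.0, 1.0]}
bm25_candidates = [("d1", 12.0), ("d3", 9.0), ("d2", 6.0)]

reranked = rerank(query_vec, bm25_candidates, doc_vecs)
fused = hybrid(dict(bm25_candidates), dict(reranked), alpha=0.5)

print([doc_id for doc_id, _ in reranked])
print(fused[0][0])
```

In the re-ranking setup only the BM25 candidates and the query need embeddings at query time, which is consistent with the abstract's description of it as budget-friendly. A hybrid that also runs a dense first stage would additionally require embeddings for the whole corpus, which this sketch does not model.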