Embeddings extracted by pre-trained Large Language Models (LLMs) have significant potential to improve information retrieval and search. Beyond the zero-shot setup in which they are conventionally used, leveraging the information in relevant query-corpus paired data can further boost LLM capabilities. In this paper, we propose Search-Adaptor, a novel method for customizing LLMs for information retrieval in an efficient and robust way. Search-Adaptor modifies the embeddings generated by pre-trained LLMs and can be integrated with any LLM, including those available only via prediction APIs. On multiple English, multilingual, and multimodal retrieval datasets, we show consistent and significant performance gains for Search-Adaptor -- e.g., more than 5% improvement in nDCG@10 for Google Embedding APIs, averaged over 14 BEIR datasets.
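To illustrate the core idea of adapting frozen embeddings, the following is a minimal sketch, not the paper's actual method: the architecture, training objective, and hyperparameters of Search-Adaptor are not specified in this abstract. Here we assume a small residual linear adapter applied on top of fixed (e.g., API-returned) embeddings, tuned on toy query-document pairs so that a relevant document scores higher than an irrelevant one; a gradient-free hill-climb stands in for real gradient training.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding dimension; real API embeddings are much larger

def adapt(emb, W):
    """Hypothetical residual linear adapter: normalize(emb + emb @ W).
    With W = 0 it is the identity, so tuning starts from the zero-shot
    behavior of the frozen embeddings."""
    out = emb + emb @ W
    return out / np.linalg.norm(out, axis=-1, keepdims=True)

def score(q, d, W):
    """Cosine similarity between adapted query and document embeddings."""
    return float(adapt(q, W) @ adapt(d, W).T)

# Toy frozen embeddings standing in for prediction-API outputs.
query = rng.normal(size=(1, DIM))
pos_doc = query + 0.1 * rng.normal(size=(1, DIM))  # relevant document
neg_doc = rng.normal(size=(1, DIM))                # irrelevant document

def margin(W):
    """Pairwise objective: relevant score minus irrelevant score."""
    return score(query, pos_doc, W) - score(query, neg_doc, W)

# Random hill-climb on W, accepting only improvements (a stand-in for
# gradient-based training on query-corpus paired data).
W = np.zeros((DIM, DIM))
best = margin(W)
for _ in range(200):
    cand = W + 0.05 * rng.normal(size=W.shape)
    m = margin(cand)
    if m > best:
        W, best = cand, m
```

Because the adapter is residual and starts at W = 0, the tuned model can only match or improve on the zero-shot pairwise margin on this toy data; the base LLM itself is never modified, which is what makes this style of adaptation compatible with embedding models exposed only through prediction APIs.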