The choice of embedding model is a crucial step in the design of Retrieval-Augmented Generation (RAG) systems. Given the sheer volume of available options, identifying clusters of similar models streamlines model selection. Benchmark performance scores alone allow only a weak assessment of model similarity. In this study, we therefore evaluate the similarity of embedding models within the context of RAG systems. Our assessment is twofold. First, we use Centered Kernel Alignment (CKA) to compare embeddings pairwise. Second, because it is especially pertinent to RAG systems, we evaluate the similarity of retrieval results between these models using Jaccard and rank similarity. We compare different families of embedding models, including proprietary ones, across five datasets from the popular Benchmarking Information Retrieval (BEIR) benchmark. Our experiments identify clusters of models that correspond to model families but, interestingly, also some inter-family clusters. Furthermore, our analysis of top-k retrieval similarity reveals high variance at low values of k. We also identify possible open-source alternatives to proprietary models, with Mistral exhibiting the highest similarity to OpenAI models.
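The two similarity measures named in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the abstract does not state which kernel is used for CKA, so a linear kernel is assumed here, and the function names (`linear_cka`, `jaccard_at_k`) and the toy data are hypothetical. The rank similarity measure is not reproduced because the abstract does not define it.

```python
import math

def gram_linear(X):
    """Linear-kernel Gram matrix K[i][j] = <x_i, x_j> over the rows of X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def center(K):
    """Double-center a square Gram matrix (subtract row and column means, add grand mean)."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]

def hsic(Kc, Lc):
    """Unnormalised HSIC: elementwise inner product of two centered Gram matrices."""
    return sum(k * l for rk, rl in zip(Kc, Lc) for k, l in zip(rk, rl))

def linear_cka(X, Y):
    """CKA between two embedding sets of the same n texts (dimensions may differ).

    Assumption: linear kernel; the study may use a different one.
    """
    Kc, Lc = center(gram_linear(X)), center(gram_linear(Y))
    return hsic(Kc, Lc) / math.sqrt(hsic(Kc, Kc) * hsic(Lc, Lc))

def jaccard_at_k(ranking_a, ranking_b, k):
    """Jaccard similarity of the top-k retrieved document ids (requires k >= 1)."""
    a, b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(a & b) / len(a | b)

# Toy embeddings of four texts from "model A".
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]
# "Model B" is model A rotated by 90 degrees and scaled by 3.
Y = [[-3.0 * x[1], 3.0 * x[0]] for x in X]
cka_xy = linear_cka(X, Y)

# Toy retrieval results for one query from two models.
run_a = ["d1", "d2", "d3", "d4"]
run_b = ["d2", "d1", "d5", "d3"]
jaccard_by_k = {k: jaccard_at_k(run_a, run_b, k) for k in (1, 3, 4)}
```

In this toy setup `cka_xy` is 1 up to floating-point error, because linear CKA is invariant to rotation and uniform scaling of the embedding space. The Jaccard values swing from 0 at k = 1 to 0.5 at k = 3 and 0.6 at k = 4, a small-scale illustration of why top-k overlap can vary strongly at low k.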