Dense Retrieval Models (DRMs) are a prominent development in Information Retrieval (IR). A key challenge with these neural Transformer-based models is that they often struggle to generalize beyond the specific tasks and domains they were trained on. To address this challenge, prior research in IR incorporated the Mixture-of-Experts (MoE) framework within each Transformer layer of a DRM, which, though effective, substantially increases the model's parameter count. In this paper, we propose a more efficient design, which introduces a single MoE block (SB-MoE) after the final Transformer layer. To assess the retrieval effectiveness of SB-MoE, we perform an empirical evaluation across three IR tasks. Our experiments involve two evaluation setups, aiming to assess both in-domain effectiveness and zero-shot generalizability. In the first setup, we fine-tune SB-MoE with four different underlying DRMs on seven IR benchmarks and evaluate the models on their respective test sets. In the second setup, we fine-tune SB-MoE on MSMARCO and perform zero-shot evaluation on thirteen BEIR datasets. Additionally, we analyze SB-MoE's sensitivity to its hyperparameters, i.e., the number of employed and activated experts, and investigate how varying them affects its performance. The obtained results show that SB-MoE is particularly effective for DRMs with lightweight base models, such as TinyBERT and BERT-Small, consistently outperforming standard fine-tuning across benchmarks. For DRMs with more parameters, such as BERT-Base and Contriever, our model requires a larger number of training samples to achieve improved retrieval performance. Our code is available online at: https://github.com/FaySokli/SB-MoE.
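To make the design concrete, the following is a minimal, illustrative sketch of a single top-k MoE block applied to a final-layer embedding. It is written in plain Python and is not the authors' implementation: the dimensions, the linear experts, the softmax gating, and the renormalization over selected experts are all assumptions for illustration.

```python
import math
import random

random.seed(0)

DIM, NUM_EXPERTS, TOP_K = 4, 3, 2  # hypothetical sizes, not from the paper

# Each expert is modeled as a simple linear map (a DIM x DIM weight matrix).
experts = [
    [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
    for _ in range(NUM_EXPERTS)
]
# Gating network: one score vector per expert.
gate_w = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sb_moe(h):
    """Route the final-layer embedding h to the top-k experts and
    return the gate-weighted sum of their outputs."""
    scores = softmax([sum(w * x for w, x in zip(g, h)) for g in gate_w])
    top = sorted(range(NUM_EXPERTS), key=lambda i: -scores[i])[:TOP_K]
    norm = sum(scores[i] for i in top)  # renormalize over selected experts
    out = [0.0] * DIM
    for i in top:
        y = matvec(experts[i], h)
        out = [o + (scores[i] / norm) * yi for o, yi in zip(out, y)]
    return out

embedding = [0.5, -1.0, 0.25, 2.0]  # stand-in for a pooled query/document embedding
print(len(sb_moe(embedding)))  # prints 4: the block preserves the embedding dimension
```

Because the block sits only after the final Transformer layer, it adds just NUM_EXPERTS small expert networks plus a gate, rather than replicating experts in every layer.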