Current research practice often trains dense retrievers on existing large datasets such as MSMARCO and then explores ways to improve zero-shot generalization to unseen domains. Prior work has tackled this challenge through resource-intensive steps such as data augmentation, architectural modifications, larger models, or further pretraining of the base model, but comparatively little work has examined whether the training procedure itself can be improved to yield models that generalize better. In this work, we recommend a simple recipe for training dense encoders: train on MSMARCO with parameter-efficient methods, such as LoRA, and use in-batch negatives unless well-constructed hard negatives are available. We validate these recommendations on the BEIR benchmark and find that the results hold across choices of dense encoder and base model size, and that they are complementary to other resource-intensive strategies for out-of-domain generalization such as architectural modifications or additional pretraining. We hope that this thorough and impartial study of training techniques, which complements those resource-intensive methods, offers practical insights for developing dense retrieval models that generalize effectively even when trained on a single dataset.
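To make the "in-batch negatives" part of the recipe concrete, the following is a minimal sketch of the loss in plain Python. It is a generic illustration, not the authors' implementation: the function names, the dot-product similarity, and the `temperature` parameter are assumptions chosen for the example, and the LoRA part of the recipe is not shown. For each query, the passage at the same batch index is treated as the positive and every other passage in the batch as a negative.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_negatives_loss(query_vecs, passage_vecs, temperature=1.0):
    """Mean cross-entropy over a batch of (query, positive passage) pairs.

    passage_vecs[i] is the positive for query_vecs[i]; all other passages
    in the batch serve as negatives for that query. Illustrative sketch only.
    """
    if len(query_vecs) != len(passage_vecs) or not query_vecs:
        raise ValueError("need a non-empty batch of aligned query/passage pairs")

    losses = []
    for i, q in enumerate(query_vecs):
        # Score the query against every passage in the batch.
        scores = [dot(q, p) / temperature for p in passage_vecs]
        # Numerically stable log-sum-exp over the scores.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability assigned to the positive passage.
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


# Toy batch: each query matches the passage at the same index.
queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negatives_loss(queries, passages)
misaligned = in_batch_negatives_loss(queries, list(reversed(passages)))
```

On this toy batch the loss is lower when positives line up with their queries than when they are swapped. In practice the vectors would come from the dense encoder being trained, and the same loss would typically be computed with a tensor library rather than Python lists.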