Retrieval-augmented generation (RAG) has become the backbone for grounding Large Language Models (LLMs), enabling knowledge updates and reducing hallucinations. Recently, LLM-based retriever models have achieved state-of-the-art performance in RAG applications. However, how to adapt general-purpose LLMs into effective domain-specific retrievers remains underexplored, especially in specialized domains such as biomedicine. We present Synthesize-Train-Merge (STM), a modular framework that enhances decoder-only LLMs with synthetic hard negatives, retrieval prompt optimization, and model merging. Experiments on a subset of 12 medical and general tasks from the MTEB benchmark show that STM improves task-specific experts by up to 23.5\% (7.5\% on average) and produces merged models that outperform both individual experts and strong baselines, without extensive pretraining. Our results demonstrate a scalable and efficient path for turning general-purpose LLMs into high-performing, domain-specialized retrievers that preserve general-domain capabilities while excelling at specialized tasks.
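Since the abstract names model merging as the final step of STM without specifying the method, the sketch below illustrates one common, simple approach: uniform weight averaging of task-expert checkpoints (a "model soup"). The checkpoint filenames, the `merge_experts` helper, and the choice of uniform weights are illustrative assumptions, not the paper's actual procedure.

```python
# Hypothetical sketch: merge several fine-tuned expert checkpoints by
# uniform weight averaging. All file names below are made up for
# illustration; the paper's actual merging method is not specified here.
import torch


def merge_experts(checkpoint_paths):
    """Average the parameters of several expert checkpoints element-wise."""
    merged = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += v.float()
    n = len(checkpoint_paths)
    return {k: v / n for k, v in merged.items()}


if __name__ == "__main__":
    # Assumed expert checkpoints, e.g. one per retrieval task.
    experts = ["expert_medqa.pt", "expert_trec_covid.pt", "expert_nfcorpus.pt"]
    merged_state = merge_experts(experts)
    torch.save(merged_state, "merged_retriever.pt")
```

Uniform averaging is only one point in the design space; weighted or layer-wise merging schemes are equally plausible readings of "model merging" here.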