Despite advances in generative large language models (LLMs), practical deployment of specialized conversational AI agents remains constrained by computation costs, latency requirements, and the need for precise domain-specific relevance measures. While existing embedding models address the first two constraints, they underperform on information retrieval in specialized domains such as finance. This paper introduces a scalable pipeline that trains specialized models from an unlabeled corpus using a general-purpose retrieval embedding model as its foundation. Our method yields an average improvement of 27.7% in MRR$\texttt{@}$5 and 44.6% in mean DCG$\texttt{@}$5 across 14 financial filing types, measured over 21,800 query-document pairs, and improves NDCG on 3 of 4 document classes in FinanceBench. We adapt a retrieval embedding model (a bi-encoder) for retrieval-augmented generation (RAG), not an LLM generator, using LLM-judged relevance to distill domain knowledge into a compact retriever. Prior work pairs synthetically generated queries with real passages to fine-tune the retrieval model directly. Our pipeline differs by introducing an interaction between student and teacher models that interleaves retrieval-based mining of hard positive and negative examples from the unlabeled corpus with iterative retraining of the student model's weights on those examples. Each retrieval iteration uses the refined student model to mine the corpus for progressively harder training examples for the subsequent training iteration. The methodology provides a cost-effective way to bridge the gap between general-purpose models and specialized domains without labor-intensive human annotation.
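The interleaved mine-then-retrain loop described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `featurize` stands in for a real transformer encoder, `judge` stands in for the LLM teacher's relevance call, and the contrastive update on a linear projection stands in for full fine-tuning of the bi-encoder. All function names and parameters here are hypothetical.

```python
import numpy as np

DIM, PROJ = 128, 32  # toy feature and embedding sizes

def featurize(texts):
    """Toy bag-of-characters features; a stand-in for a transformer encoder."""
    X = np.zeros((len(texts), DIM))
    for i, t in enumerate(texts):
        for c in t.lower():
            X[i, ord(c) % DIM] += 1.0
    return X

def embed(texts, W):
    """Student bi-encoder: project features and L2-normalize."""
    Z = featurize(texts) @ W
    return Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-9)

def judge(query, doc):
    """Stand-in for the LLM teacher's relevance judgment (here: plain
    token overlap instead of a real LLM call)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def mine_hard_examples(queries, corpus, W, k=3, thresh=0.5):
    """One mining pass: retrieve top-k with the current student, then let
    the teacher split the hits into hard positives and hard negatives."""
    sims = embed(queries, W) @ embed(corpus, W).T
    triples = []
    for i, q in enumerate(queries):
        top = np.argsort(-sims[i])[:k]
        pos = [j for j in top if judge(q, corpus[j]) >= thresh]
        neg = [j for j in top if judge(q, corpus[j]) < thresh]
        if pos and neg:  # hard negative: retrieved by the student, judged irrelevant
            triples.append((i, pos[0], neg[0]))
    return triples

def train_step(W, queries, corpus, triples, lr=1e-3):
    """Toy contrastive update: push sim(q, pos) up and sim(q, neg) down.
    A real pipeline would fine-tune the encoder weights instead."""
    Xq, Xd = featurize(queries), featurize(corpus)
    for qi, pi, ni in triples:
        xq, xp, xn = Xq[qi], Xd[pi], Xd[ni]
        # gradient of (xq W)·(xp W) - (xq W)·(xn W) with respect to W
        g = (np.outer(xq, xp @ W) + np.outer(xp, xq @ W)
             - np.outer(xq, xn @ W) - np.outer(xn, xq @ W))
        W = W + lr * g
    return W
```

Iterating `mine_hard_examples` and `train_step` reproduces the student-teacher interaction in miniature: after each retraining round the refined student retrieves different (and progressively harder) candidates, which the teacher then relabels for the next round.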