Recent studies show the growing significance of document retrieval in retrieval-augmented generation (RAG) for LLMs in the scientific domain, where retrieved documents bridge the models' knowledge gaps. However, dense retrievers often struggle with domain-specific retrieval and complex query-document relationships, particularly when segments of a query correspond to different parts of a document. To alleviate these prevalent challenges, this paper introduces $\texttt{MixGR}$, a zero-shot approach that improves dense retrievers' awareness of query-document matching across multiple levels of granularity in queries and documents. $\texttt{MixGR}$ fuses metrics computed at these granularities into a unified score that reflects comprehensive query-document similarity. Our experiments demonstrate that $\texttt{MixGR}$ outperforms previous document retrieval by 24.7%, 9.8%, and 6.9% on nDCG@5 with unsupervised, supervised, and LLM-based retrievers, respectively, averaged over queries containing multiple subqueries from five scientific retrieval datasets. Moreover, its efficacy on two downstream scientific question-answering tasks highlights the advantage of $\texttt{MixGR}$ in boosting the application of LLMs in the scientific domain. The code and experimental datasets are available.
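The abstract does not specify how the granularity-level metrics are fused into a unified score; a common zero-shot choice for combining rankings from multiple scoring views is reciprocal rank fusion (RRF). The sketch below is purely illustrative under that assumption: each ranking stands in for one (query granularity, document granularity) pairing, e.g. full query vs. full document or subquery vs. document proposition, and the function name and example rankings are hypothetical.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: combine several rankings of the same
    document ids into one score per document. Higher is better."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

# Hypothetical per-granularity rankings of three documents, one ranking
# per (query granularity, document granularity) combination.
rankings = [
    ["d1", "d2", "d3"],  # full query vs. full document
    ["d2", "d1", "d3"],  # subquery 1 vs. document propositions
    ["d1", "d3", "d2"],  # subquery 2 vs. document propositions
]
fused = rrf_fuse(rankings)
best = max(fused, key=fused.get)  # "d1": ranked first in two of three views
```

Because RRF operates on ranks rather than raw similarity values, it needs no score calibration across the different granularity-level retrievers, which suits a zero-shot setting.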