Existing scientific document retrieval (SDR) methods primarily rely on document-centric representations learned from inter-document relationships for document-document (doc-doc) retrieval. However, the rise of LLMs and RAG has shifted SDR toward question-driven retrieval, where documents are retrieved in response to natural-language questions (question-document, or q-doc, retrieval). This shift exposes systematic mismatches between document-centric models and question-driven retrieval in (1) input granularity (long documents vs. short questions), (2) semantic focus (scientific discourse structure vs. specific question intent), and (3) training signals (citation-based similarity vs. question-oriented relevance). To address these mismatches, we propose UniFAR, a Unified Facet-Aware Retrieval framework that jointly supports doc-doc and q-doc SDR within a single architecture. UniFAR reconciles granularity differences through adaptive multi-granularity aggregation, aligns document structure with question intent via learnable facet anchors, and unifies doc-doc and q-doc supervision through joint training. Experimental results show that UniFAR consistently outperforms prior methods across multiple retrieval tasks and base models, confirming its effectiveness and generality.