Adapting general-domain retrievers to scientific domains is challenging because large-scale domain-specific relevance annotations are scarce and because vocabulary and information needs differ substantially between domains. Recent approaches address these issues along two independent directions, both of which leverage large language models (LLMs): (1) generating synthetic queries for fine-tuning, and (2) generating auxiliary contexts to support relevance matching. However, both directions overlook the diverse academic concepts embedded in scientific documents, and so often produce redundant or conceptually narrow queries and contexts. To address this limitation, we introduce an academic concept index, which extracts key concepts from papers and organizes them under the guidance of an academic taxonomy. This structured index serves as a foundation for improving both directions. First, we enhance synthetic query generation with concept coverage-based generation (CCQGen), which adaptively conditions LLMs on uncovered concepts to generate complementary queries with broader concept coverage. Second, we strengthen context augmentation with concept-focused auxiliary contexts (CCExpand), which leverages a set of document snippets that serve as concise responses to the concept-aware CCQGen queries. Extensive experiments show that incorporating the academic concept index into both query generation and context augmentation leads to higher-quality queries, better conceptual alignment, and improved retrieval performance.
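As a rough illustration of the coverage-based conditioning idea behind CCQGen, the sketch below picks the document concepts that earlier queries have not yet touched and places them in the next generation prompt. It is a minimal sketch under stated assumptions, not the method as implemented in the paper: the function names `uncovered_concepts` and `build_prompt`, the concept weights, the prompt wording, and the use of case-insensitive substring matching as a coverage test are all hypothetical choices made for illustration.

```python
from typing import Dict, List


def uncovered_concepts(
    doc_concepts: Dict[str, float],
    queries: List[str],
    top_k: int = 2,
) -> List[str]:
    """Return the top_k highest-weight concepts that no existing query mentions.

    Assumption: a concept counts as "covered" if its name appears as a
    case-insensitive substring of any query. This is a crude stand-in for
    whatever coverage estimate a real system would use.
    """
    lowered = [q.lower() for q in queries]
    remaining = [
        (weight, concept)
        for concept, weight in doc_concepts.items()
        if not any(concept.lower() in q for q in lowered)
    ]
    # Highest weight first; ties broken alphabetically for determinism.
    remaining.sort(key=lambda wc: (-wc[0], wc[1]))
    return [concept for _, concept in remaining[:top_k]]


def build_prompt(document: str, targets: List[str]) -> str:
    """Build a hypothetical LLM prompt that steers generation toward targets."""
    concept_list = ", ".join(targets)
    return (
        "Write one search query that this paper would answer.\n"
        f"Focus on these concepts: {concept_list}\n\n"
        f"Paper:\n{document}"
    )


# Hypothetical concept weights for one paper (illustrative values only).
concepts = {
    "dense retrieval": 0.9,
    "synthetic queries": 0.7,
    "academic taxonomy": 0.5,
    "domain adaptation": 0.4,
}
previous = ["How does dense retrieval handle scientific vocabulary?"]

targets = uncovered_concepts(concepts, previous)
print(targets)  # ['synthetic queries', 'academic taxonomy']
print(build_prompt("<paper text>", targets))
```

Repeating this selection after each generated query would, under these assumptions, push successive queries toward concepts that remain uncovered, which is the complementary-coverage behaviour the abstract describes.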