With the rapid growth of scientific literature, scientific question answering (SciQA) has become increasingly critical for exploring and utilizing scientific knowledge. Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating knowledge from external sources, thereby providing credible evidence for scientific question answering. However, existing retrieval and reranking methods remain vulnerable to passages that are semantically similar but logically irrelevant, which often reduces factual reliability and amplifies hallucinations. To address this challenge, we propose a Deep Evidence Reranking Agent (DeepEra) that integrates step-by-step reasoning, enabling more precise evaluation of candidate passages beyond surface-level semantics. To support systematic evaluation, we construct SciRAG-SSLI (Scientific RAG - Semantically Similar but Logically Irrelevant), a large-scale dataset comprising about 300K SciQA instances across 10 subjects, built from a corpus of 10M scientific documents. The dataset combines naturally retrieved contexts with systematically generated distractors to test logical robustness and factual grounding. Comprehensive evaluations confirm that our approach achieves superior retrieval performance compared to leading rerankers. To our knowledge, this work is the first to comprehensively study and empirically validate the non-negligible SSLI issue in two-stage RAG frameworks.