Negation is Not Semantic: Diagnosing Dense Retrieval Failure Modes for Trade-offs in Contradiction-Aware Biomedical QA

Large Language Models (LLMs) have demonstrated strong capabilities in biomedical question answering, yet their tendency to generate plausible but unverified claims poses serious risks in clinical settings. To mitigate these risks, the TREC 2025 BioGen track mandates grounded answers that explicitly surface contradictory evidence (Task A) and narrative-driven, fully attributed responses (Task B). Because the target tasks lack ground-truth labels, we present a proxy-based development framework that uses the SciFact dataset to systematically optimize retrieval architectures. Our iterative evaluation revealed a "Simplicity Paradox": complex adversarial dense retrieval strategies failed catastrophically at contradiction detection (MRR 0.023) due to Semantic Collapse, where negation signals become indistinguishable in vector space. We further identify a Retrieval Asymmetry: filtering dense embeddings improves contradiction detection but degrades support recall, compromising reliability. We resolve this with a Decoupled Lexical Architecture built on a unified BM25 backbone, which balances semantic support recall (0.810) with precise contradiction surfacing (0.750). This approach achieves the highest Weighted MRR (0.790) on the proxy benchmark and is the only strategy that remains viable when scaling to the 30-million-document PubMed corpus. For answer generation, we introduce Narrative-Aware Reranking and One-Shot In-Context Learning, which improve citation coverage from 50% (zero-shot) to 100%. Official TREC results confirm our findings: our system ranks 2nd on Task A contradiction F1 and 3rd out of 50 runs on Task B citation coverage (98.77%), with a citation contradict rate of zero. Our work transforms LLMs from stochastic generators into honest evidence synthesizers, showing that epistemic integrity in biomedical AI requires precision and architectural scalability rather than isolated metric optimization.
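For readers unfamiliar with the ranking metric quoted above, the following is a minimal sketch of Mean Reciprocal Rank (MRR) and of one plausible way to combine a support score and a contradiction score into a single weighted figure. The function names, the toy document IDs, and the `support_weight` parameter are illustrative assumptions; the abstract does not state the weighting the authors used, so this should not be read as their exact formula.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple

# One run = (ranked document IDs for a query, set of relevant IDs for that query).
Run = Tuple[Sequence[Hashable], Set[Hashable]]


def reciprocal_rank(ranked_ids: Sequence[Hashable], relevant_ids: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Run]) -> float:
    """Average the reciprocal rank over all queries; an empty input scores 0.0."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


def weighted_mrr(support_mrr: float, contradiction_mrr: float, support_weight: float = 0.5) -> float:
    """Convex combination of two MRR scores.

    ASSUMPTION: support_weight is a hypothetical parameter chosen for
    illustration; the source does not specify the actual weights.
    """
    if not 0.0 <= support_weight <= 1.0:
        raise ValueError("support_weight must lie in [0, 1]")
    return support_weight * support_mrr + (1.0 - support_weight) * contradiction_mrr


# Toy data (hypothetical IDs): two support queries and one contradiction query.
support_runs = [(["d1", "d2"], {"d1"}), (["d3", "d4"], {"d4"})]
contradiction_runs = [(["d5", "d6", "d7", "d8"], {"d8"})]

support_score = mean_reciprocal_rank(support_runs)
contradiction_score = mean_reciprocal_rank(contradiction_runs)
combined_score = weighted_mrr(support_score, contradiction_score, support_weight=0.5)
```

Because reciprocal rank rewards only the first relevant hit, an MRR near zero (such as the 0.023 reported for the adversarial dense strategy) indicates that contradicting evidence almost never appeared near the top of the ranking.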
