Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where supporting evidence is scattered across biomedical knowledge graphs (KGs), literature documents, and web-accessible resources. However, existing biomedical QA benchmarks focus mainly on exam-style knowledge, literature comprehension, or short-range multi-hop inference, leaving source-conditioned graph reasoning and evidence topology construction underexplored. To fill this gap, we introduce BioMedHop, a multi-source, graph-grounded benchmark for evaluating biomedical reasoning over structured evidence topologies. BioMedHop contains 10,045 instances across KG, document, web, and hybrid evidence settings. It covers shared-neighbor matching, intersection reasoning, path-based reasoning, and counting, with questions rendered in option-based, open-ended, and numeric-count formats. Alongside the benchmark, we propose BioWeave, a source-aware reasoning framework that retrieves biomedical KG paths, gathers supporting clues from documents and web sources, assembles them into a unified evidence graph, and verifies answers through entity-level evidence support. Comprehensive experiments show that BioWeave achieves the best overall performance among the compared methods on BioMedHop, outperforming the strong hybrid baseline ToG-2 by 10.5% on the overall average. Moreover, BioWeave consistently improves performance across different LLM backbones and enables smaller models, such as Qwen3-4B, to achieve reasoning performance comparable to GPT-4-Turbo.
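To make the four reasoning types named in the abstract concrete, the following is a minimal sketch on a toy graph. It is not BioMedHop data and not the BioWeave implementation: the entity names (`drug_A`, `gene_X`, and so on), the edge list, and every function name are hypothetical, and the graph is treated as undirected purely for simplicity.

```python
from collections import deque

# Toy, made-up evidence graph (assumption for illustration only).
EDGES = [
    ("drug_A", "gene_X"),
    ("drug_B", "gene_X"),
    ("drug_A", "gene_Y"),
    ("drug_B", "gene_Z"),
    ("gene_X", "disease_P"),
    ("gene_Y", "disease_P"),
]


def build_graph(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def shared_neighbors(graph, a, b):
    """Shared-neighbor matching: entities linked to both a and b."""
    return graph.get(a, set()) & graph.get(b, set())


def intersect(graph, anchors):
    """Intersection reasoning: entities linked to every anchor."""
    neighbor_sets = [graph.get(a, set()) for a in anchors]
    return set.intersection(*neighbor_sets) if neighbor_sets else set()


def shortest_path(graph, start, goal):
    """Path-based reasoning: one shortest path found by breadth-first search."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(graph[node]):  # sorted for deterministic output
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def count_neighbors(graph, node, prefix):
    """Counting: number of neighbors whose name starts with a type prefix."""
    return sum(1 for n in graph.get(node, set()) if n.startswith(prefix))


GRAPH = build_graph(EDGES)
print(shared_neighbors(GRAPH, "drug_A", "drug_B"))
print(intersect(GRAPH, ["drug_A", "drug_B", "disease_P"]))
print(shortest_path(GRAPH, "drug_B", "disease_P"))
print(count_neighbors(GRAPH, "disease_P", "gene_"))
```

In this sketch each question type reduces to a different structural query over the same graph, which is one plausible reading of "evidence topology"; a real system would also have to build that graph from KG, document, and web evidence before any such query can run.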