Biomedical knowledge resources often either preserve evidence as unstructured text or compress it into flat triples that omit study design, provenance, and quantitative support. Here we present EvidenceNet, a disease-specific dataset of record-level evidence collections and corresponding graph representations derived from full-text biomedical literature. EvidenceNet uses a large language model (LLM)-assisted pipeline to extract experimentally grounded findings as structured evidence records, normalize biomedical entities, score evidence quality, and connect related records through typed semantic relations. We release EvidenceNet-HCC with 7,872 evidence records and a corresponding graph with 10,328 nodes and 49,756 edges, and EvidenceNet-CRC with 6,622 records and a corresponding graph with 8,795 nodes and 39,361 edges. Technical validation shows high component fidelity, including 98.3% field-level extraction accuracy, 100.0% high-confidence entity-link accuracy, 87.5% fusion integrity, and 90.0% semantic relation-type accuracy. Downstream analyses show that the data support retrieval-augmented question answering and graph-based tasks such as future link prediction and target prioritization. These results establish EvidenceNet as a disease-specific biomedical knowledge base dataset for evidence-aware analysis and reuse.
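To make the record-plus-graph organization described above concrete, the following is a minimal illustrative sketch, not the released EvidenceNet schema. All field names (`source_id`, `study_design`, `effect_size`, `quality_score`), the entity identifiers, and the relation labels (`mentions`, `supports`) are assumptions introduced here for illustration only; the abstract does not specify them.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class EvidenceRecord:
    """One extracted finding. Every field name here is hypothetical."""
    record_id: str
    source_id: str                 # provenance, e.g. an article identifier
    study_design: str              # e.g. "in vitro", "cohort"
    entities: Tuple[str, ...]      # normalized entity identifiers
    finding: str                   # short statement of the finding
    effect_size: Optional[float]   # quantitative support, if reported
    quality_score: float           # assumed to lie in [0, 1]


@dataclass
class EvidenceGraph:
    """Typed graph whose nodes are records and entities."""
    nodes: Set[str] = field(default_factory=set)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_record(self, rec: EvidenceRecord) -> None:
        # A record becomes a node and is linked to each entity it mentions.
        self.nodes.add(rec.record_id)
        for entity in rec.entities:
            self.nodes.add(entity)
            self.edges.append((rec.record_id, "mentions", entity))

    def relate(self, src: str, relation: str, dst: str) -> None:
        # Typed semantic relation between two nodes already in the graph.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must already be in the graph")
        self.edges.append((src, relation, dst))

    def neighbors(self, node: str, relation: Optional[str] = None) -> List[str]:
        # Outgoing neighbors, optionally restricted to one relation type.
        return [
            dst for src, rel, dst in self.edges
            if src == node and (relation is None or rel == relation)
        ]
```

Under these assumptions, a record-level collection corresponds to a list of `EvidenceRecord` objects, and the graph representation is obtained by calling `add_record` on each one and then `relate` for every typed relation between records. Keeping provenance, study design, and quantitative support on the record itself is what separates this layout from a flat triple store.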