Clinical decision-making requires nuanced reasoning over heterogeneous evidence and traceable justifications. While recent LLM multi-agent systems (MAS) show promise, they largely optimise for outcome accuracy while overlooking process-grounded reasoning aligned with clinical standards. A critical real-world instance of this is gene-disease validity curation, where experts must determine whether a gene is causally implicated in a disease by synthesising diverse biomedical evidence. We introduce an agent-as-tool reinforcement learning framework for this task with two objectives: (i) process-level supervision to ensure reasoning follows valid clinical pathways, and (ii) efficient coordination via a hierarchical multi-agent system. Our evaluation on the ClinGen dataset shows that with outcome-only rewards, a MAS with a GRPO-trained Qwen3-4B supervisor agent substantially improves final outcome accuracy from 0.195 (base-model supervisor) to 0.732, but yields poor process alignment (0.392 F1). Conversely, with combined process and outcome rewards, the MAS with a GRPO-trained supervisor achieves higher outcome accuracy (0.750) while significantly improving process fidelity to 0.520 F1. Our code is available at https://github.com/chaeeunlee-io/GeneDiseaseCurationAgents.