MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

Therapeutic decision-making in clinical medicine is a high-stakes domain in which AI guidance must account for complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse-effect prediction demand robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG). TxAgent employs a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), which integrates the FDA Drug API, OpenTargets, and Monarch resources to provide access to current therapeutic information. Unlike general-purpose RAG systems, medical applications impose stringent safety constraints, so the accuracy of both the reasoning trace and the sequence of tool invocations is critical. These considerations motivate evaluation protocols that treat token-level reasoning and tool-usage behaviors as explicit supervision signals. This work presents insights from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems using metrics that assess correctness, tool utilization, and reasoning quality. We analyze how the retrieval quality of function (tool) calls influences overall model performance and demonstrate performance gains achieved through improved tool-retrieval strategies. Our work received the Excellence Award in Open Science. Complete information can be found at https://curebench.ai/.
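To make the notion of tool retrieval concrete, the following is a minimal sketch, not TxAgent's actual retriever: it ranks a toy tool catalog against a query using bag-of-words cosine similarity and scores the result with recall. The tool names and descriptions, along with the helpers `retrieve_tools` and `recall_at_k`, are hypothetical and chosen only for illustration; they are not ToolUniverse entries.

```python
import math
import re
from collections import Counter

# Hypothetical tool catalog (assumption): names and descriptions are
# illustrative only and do not correspond to real ToolUniverse tools.
TOOLS = {
    "get_drug_warnings": "look up boxed warnings and adverse reactions on a drug label",
    "get_drug_indications": "look up approved indications and usage for a drug",
    "get_target_associations": "find gene targets associated with a disease",
    "get_phenotype_links": "find phenotypes linked to a disease or gene",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_tools(query, tools, k=2):
    """Return up to k tool names whose descriptions best match the query."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(query_vec, Counter(tokenize(desc))), name)
        for name, desc in tools.items()
    ]
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]


def recall_at_k(retrieved, relevant):
    """Fraction of the relevant tools that appear in the retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)


if __name__ == "__main__":
    question = "What adverse reactions and warnings does the drug label list?"
    candidates = retrieve_tools(question, TOOLS, k=2)
    print(candidates, recall_at_k(candidates, ["get_drug_warnings"]))
```

In an agentic loop, only the retrieved candidates would be shown to the model before it emits a function call, so a tool that the retriever misses cannot be invoked in that step. This is one way in which retrieval quality can bound downstream answer quality.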
