Task-specialized models form the backbone of agentic healthcare systems, enabling agents to answer clinical queries across tasks such as disease diagnosis, localization, and report generation. Yet a single "best" model rarely exists for a given task. In practice, each task is better served by multiple competing specialist models, with different models excelling on different data samples. As a result, for any given query, an agent must reliably select the right specialist model from a heterogeneous pool of tool candidates. To this end, we introduce ToolSelect, which adaptively learns to select among tools by minimizing a population risk over sampled specialist tool candidates, using a consistent surrogate of the task-conditional selection loss. Concretely, we propose an Attentive Neural Process-based selector that is conditioned on the query and on per-model behavioral summaries and chooses among the specialist models. Motivated by the absence of an established testbed, we introduce the first agentic chest X-ray environment equipped with a diverse suite of task-specialized models (17 for disease detection, 19 for report generation, 6 for visual grounding, and 13 for VQA), and we develop ToolSelectBench, a benchmark of 1448 queries. Our results demonstrate that ToolSelect consistently outperforms 10 state-of-the-art methods across four task families.
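To make the selection step concrete, the following is a minimal sketch and not the ToolSelect architecture itself. It assumes that the query and each candidate's behavioral summary are already embedded as fixed-length vectors, scores candidates with scaled dot-product attention, and uses softmax cross-entropy against the best-performing candidate as one common surrogate for the selection loss. All function names, the embedding dimension, and the toy values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def selection_scores(query, summaries):
    """Scaled dot-product score between the query embedding and each
    candidate's behavioral summary (assumed embedding, not the paper's)."""
    scale = math.sqrt(len(query))
    return [
        sum(q * s for q, s in zip(query, summary)) / scale
        for summary in summaries
    ]


def select_tool(query, summaries):
    """Return the index of the highest-scoring candidate."""
    scores = selection_scores(query, summaries)
    return max(range(len(scores)), key=scores.__getitem__)


def surrogate_loss(query, summaries, best_index):
    """Cross-entropy against the candidate that performed best on this
    query; an assumed stand-in for the paper's surrogate loss."""
    probs = softmax(selection_scores(query, summaries))
    return -math.log(probs[best_index])


# Toy example: one query and three candidate specialists (hypothetical values).
query = [1.0, 0.0]
summaries = [[0.2, 0.9], [0.9, 0.1], [0.5, 0.5]]

chosen = select_tool(query, summaries)
loss_if_correct = surrogate_loss(query, summaries, best_index=chosen)
loss_if_wrong = surrogate_loss(query, summaries, best_index=0)
```

In this sketch, averaging `surrogate_loss` over many queries and over randomly sampled candidate pools would give an empirical estimate of a population risk of the kind described above; a learned selector would replace the fixed dot product with trainable attention parameters.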