Can Modern LLMs Act as Agent Cores in Radiology Environments?

Advances in large language models (LLMs) have paved the way for LLM-based agent systems that offer improved accuracy and interpretability across many domains. Radiology, with its complex analytical requirements, is a natural field for applying such agents. This paper investigates a prerequisite question for building concrete radiology agents: "Can modern LLMs act as agent cores in radiology environments?" To answer it, we introduce RadABench, which makes three contributions. First, we present RadABench-Data, a comprehensive synthetic evaluation dataset for LLM-based agents, generated from an extensive taxonomy covering 6 anatomies, 5 imaging modalities, 10 tool categories, and 11 radiology tasks. Second, we propose RadABench-EvalPlat, a novel evaluation platform for agents that features a prompt-driven workflow and can simulate a wide range of radiology toolsets. Third, we assess 7 leading LLMs on our benchmark from 5 perspectives using multiple metrics. Our findings indicate that although current LLMs demonstrate strong capabilities in many areas, they are not yet advanced enough to serve as the central agent core of a fully operational radiology agent system. We also identify key factors that influence the performance of LLM-based agent cores, offering clinicians insight into how to apply agent systems effectively in real-world radiology practice. All of our code and data are open source and available at https://github.com/MAGIC-AI4Med/RadABench.
