Integrating tools into Large Language Models (LLMs) has facilitated their widespread application. In specialized downstream tasks, however, relying on tools alone is insufficient to address real-world complexity, which particularly restricts the effective deployment of LLMs in fields such as medicine. In this paper, we focus on the downstream task of medical calculators, which use standardized tests to assess an individual's health status. We introduce MeNTi, a universal agent architecture for LLMs. MeNTi integrates a specialized medical toolkit and employs meta-tool and nested calling mechanisms to enhance LLM tool utilization. Specifically, it achieves flexible tool selection and nested tool calling to address practical issues that arise in intricate medical scenarios, including calculator selection, slot filling, and unit conversion. To assess the capabilities of LLMs for quantitative assessment throughout the clinical process in calculator scenarios, we introduce CalcQA. This benchmark requires LLMs to use medical calculators to perform calculations and assess patient health status. CalcQA is constructed by professional physicians and includes 100 case-calculator pairs, complemented by a toolkit of 281 medical tools. The experimental results demonstrate significant performance improvements with our framework. This research opens new directions for applying LLMs in demanding medical scenarios.
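The three practical issues named in the abstract (calculator selection, slot filling, and unit conversion) can be illustrated with a minimal Python sketch. This is not the MeNTi implementation: the `Calculator` class, the `CONVERTERS` and `CALCULATORS` registries, and the `convert` and `run_calculator` functions are all hypothetical names introduced here, and calculator selection is reduced to a dictionary lookup where an agent would presumably rely on the LLM. The body mass index formula (weight in kg divided by height in m squared) stands in for a medical calculator. The nested call is the unit-conversion tool being invoked from inside slot filling before the calculator itself runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical unit-conversion tools, keyed by (from_unit, to_unit).
CONVERTERS: Dict[Tuple[str, str], Callable[[float], float]] = {
    ("lb", "kg"): lambda v: v * 0.45359237,
    ("in", "m"): lambda v: v * 0.0254,
    ("cm", "m"): lambda v: v / 100.0,
}


@dataclass
class Calculator:
    """A hypothetical calculator tool: named slots with required units."""
    name: str
    slots: Dict[str, str]  # slot name -> unit the formula expects
    compute: Callable[..., float]


# Hypothetical toolkit with a single entry (body mass index).
CALCULATORS: Dict[str, Calculator] = {
    "bmi": Calculator(
        name="bmi",
        slots={"weight": "kg", "height": "m"},
        compute=lambda weight, height: weight / height ** 2,
    ),
}


def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Unit-conversion tool; returns the value unchanged if units match."""
    if from_unit == to_unit:
        return value
    if (from_unit, to_unit) not in CONVERTERS:
        raise ValueError(f"no converter from {from_unit} to {to_unit}")
    return CONVERTERS[(from_unit, to_unit)](value)


def run_calculator(name: str, observations: Dict[str, Tuple[float, str]]) -> float:
    """Select a calculator, fill its slots, and compute the result.

    `observations` maps each slot name to a (value, unit) pair taken
    from the patient case.
    """
    calc = CALCULATORS[name]  # calculator selection (plain lookup in this sketch)
    filled = {}
    for slot, required_unit in calc.slots.items():
        value, given_unit = observations[slot]
        # Nested call: slot filling invokes the conversion tool when needed.
        filled[slot] = convert(value, given_unit, required_unit)
    return calc.compute(**filled)


if __name__ == "__main__":
    case = {"weight": (154.0, "lb"), "height": (170.0, "cm")}
    print(round(run_calculator("bmi", case), 2))
```

Keeping the required unit alongside each slot lets the conversion step be triggered by the calculator's own declaration, so the case text can report values in whatever units it uses.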