Benchmarking a foundation LLM on its ability to re-label structure names in accordance with the AAPM TG-263 report

Purpose: To introduce the use of large language models (LLMs) for re-labeling structure names in accordance with the American Association of Physicists in Medicine (AAPM) Task Group (TG)-263 standard, and to establish a benchmark for future studies to reference.

Methods and Materials: A Digital Imaging and Communications in Medicine (DICOM) storage server was implemented around the Generative Pre-trained Transformer (GPT)-4 application programming interface (API). Upon receiving a structure set DICOM file, the server prompts GPT-4 to re-label the structure names of both target volumes and normal tissues according to AAPM TG-263. Three disease sites (prostate, head and neck, and thorax) were selected for evaluation. For each disease site, 150 patients were randomly selected for manually tuning the instruction prompt (in batches of 50), and 50 patients were randomly selected for evaluation. Only structure names most likely to be relevant to studies that use structure contours from many patients were considered.

Results: The overall re-labeling accuracy for target volumes and normal tissues combined was 96.0%, 98.5%, and 96.9% for prostate, head and neck, and thorax cases, respectively. Re-labeling of target volumes alone was less accurate than the overall result except for prostate: 100%, 93.1%, and 91.1%, respectively.

Conclusions: Given the accuracy of GPT-4 in re-labeling structure names of both target volumes and normal tissues presented in this work, LLMs are poised to become the preferred method for standardizing structure names in radiation oncology, especially as LLM capabilities are likely to continue advancing rapidly.
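
The re-labeling and scoring steps described in the Methods and Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the instruction text, the function names (`build_prompt`, `relabel`, `accuracy`), the JSON reply format, and the example structure names are all assumptions made for illustration. The model call is passed in as a plain callable, so a stand-in (`fake_llm`) replaces the GPT-4 API and the example runs offline. The DICOM storage server itself is omitted.

```python
import json
from typing import Callable, Dict, List

# Hypothetical instruction text; the tuned prompt used in the study is not
# given in the abstract.
INSTRUCTIONS = (
    "Re-label each radiotherapy structure name according to AAPM TG-263. "
    "Reply with a JSON object mapping every original name to its new name."
)


def build_prompt(names: List[str]) -> str:
    """Combine the instructions with the structure names of one structure set."""
    return INSTRUCTIONS + "\nStructure names:\n" + "\n".join(names)


def relabel(names: List[str], ask_llm: Callable[[str], str]) -> Dict[str, str]:
    """Ask the model for new names; keep the original name when none is returned."""
    reply = ask_llm(build_prompt(names))
    try:
        mapping = json.loads(reply)
    except json.JSONDecodeError:
        mapping = {}
    if not isinstance(mapping, dict):
        mapping = {}
    return {name: str(mapping.get(name, name)) for name in names}


def accuracy(predicted: Dict[str, str], reference: Dict[str, str]) -> float:
    """Fraction of reference structures whose predicted name matches exactly."""
    if not reference:
        raise ValueError("reference must not be empty")
    hits = sum(predicted.get(key) == value for key, value in reference.items())
    return hits / len(reference)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption), so the sketch runs offline."""
    return json.dumps({"LT PAROTID": "Parotid_L", "cord": "SpinalCord"})


# Illustrative names only; these are not data from the study.
names = ["LT PAROTID", "cord", "ptv 70"]
reference = {
    "LT PAROTID": "Parotid_L",
    "cord": "SpinalCord",
    "ptv 70": "PTV_7000",
}

predicted = relabel(names, fake_llm)
print(predicted)
print(round(accuracy(predicted, reference), 3))
```

In this sketch a structure the model does not return keeps its original name and therefore counts as an error in the accuracy. This is one possible convention, and the study's own scoring rules may differ.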
