Exploring Multilingual Large Language Models for Enhanced TNM Classification of Radiology Reports in Lung Cancer Staging

Background: Structured radiology reporting remains underdeveloped because structuring is labor-intensive and reports are typically written in a narrative style. Deep learning, particularly large language models (LLMs) such as GPT-3.5, offers promise for automatically structuring radiology reports written in natural language. However, although LLMs have been reported to be less effective in languages other than English, their performance on radiological tasks has not been extensively studied. Purpose: This study aimed to investigate the accuracy of TNM classification based on radiology reports using GPT3.5-turbo (GPT3.5) and the utility of multilingual LLMs in both Japanese and English. Materials and Methods: Using GPT3.5, we developed a system that automatically generates TNM classifications from chest CT reports for lung cancer, and we evaluated its performance. We statistically analyzed the impact of providing full or partial TNM definitions in both languages using a generalized linear mixed model. Results: The highest accuracy was attained with full TNM definitions and radiology reports in English (M = 94%, N = 80%, T = 47%, and ALL = 36%). Providing the definition for each of the T, N, and M factors significantly improved the accuracy of that factor (T: odds ratio (OR) = 2.35, p < 0.001; N: OR = 1.94, p < 0.01; M: OR = 2.50, p < 0.001). Japanese reports showed lower N and M accuracies than English reports (N accuracy: OR = 0.74; M accuracy: OR = 0.21). Conclusion: This study underscores the potential of multilingual LLMs for automatic TNM classification of radiology reports. Even without additional model training, performance improved when TNM definitions were provided, indicating LLMs' relevance in radiology contexts.
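The setup described in Materials and Methods (a prompt that optionally includes the T, N, and M definitions, followed by per-factor scoring) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the answer format, the function names `build_prompt`, `parse_answer`, and `accuracy`, and the reading of "ALL" as "all three factors correct for the same report" are assumptions, and the call to the model itself is deliberately left out.

```python
import re
from typing import Dict, Iterable, List

FACTORS = ("T", "N", "M")

def build_prompt(report: str, definitions: Dict[str, str],
                 include: Iterable[str] = FACTORS) -> str:
    """Assemble a prompt from a report and the definitions of the selected factors.

    Passing a subset in `include` mimics the "partial definitions" condition.
    The wording below is hypothetical, not taken from the study.
    """
    selected = set(include)
    parts = ["Assign the T, N, and M factors for the lung cancer in the report below."]
    for factor in FACTORS:
        if factor in selected and factor in definitions:
            parts.append(f"Definition of {factor}:\n{definitions[factor]}")
    parts.append(f"Report:\n{report}")
    parts.append("Answer in the form: T=<value>, N=<value>, M=<value>")
    return "\n\n".join(parts)

# Matches fragments such as "T=2a" or "N = 0" in the (assumed) answer format.
ANSWER_RE = re.compile(r"\b([TNM])\s*=\s*([0-9A-Za-z]+)")

def parse_answer(text: str) -> Dict[str, str]:
    """Extract factor values from a model reply, lowercased for comparison."""
    return {factor: value.lower() for factor, value in ANSWER_RE.findall(text)}

def accuracy(predictions: List[Dict[str, str]],
             references: List[Dict[str, str]]) -> Dict[str, float]:
    """Per-factor accuracy plus 'ALL' (every factor correct for the same report)."""
    n = len(references)
    pairs = list(zip(predictions, references))
    scores = {f: sum(p.get(f) == r[f] for p, r in pairs) / n for f in FACTORS}
    scores["ALL"] = sum(all(p.get(f) == r[f] for f in FACTORS) for p, r in pairs) / n
    return scores

# Toy usage with placeholder definitions and made-up model replies.
definitions = {"T": "<T definition text>", "N": "<N definition text>",
               "M": "<M definition text>"}
prompt = build_prompt("<chest CT report text>", definitions, include=("T", "N"))
replies = ["T=2a, N=0, M=0", "T=1b, N=1, M=0"]
gold = [{"T": "2a", "N": "0", "M": "0"}, {"T": "1c", "N": "1", "M": "0"}]
scores = accuracy([parse_answer(r) for r in replies], gold)
```

Because "ALL" requires every factor to be correct at once, it can never exceed the lowest per-factor accuracy, which is consistent with the reported ALL = 36% sitting below T = 47%.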
