Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. Extending them to the medical domain, particularly to volumetric (3D) images, remains challenging because of high computational cost, volumetric dependencies, and the semantic gap between visual features and clinical terminology. Naively fine-tuning LLMs on limited medical data often leads to overfitting and clinical hallucination, in which linguistic fluency is prioritized over clinical factuality. In this study, we investigate parameter-efficient adaptation strategies for volumetric CT report generation and introduce RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that reduces the number of parameters that must be trained. The module combines image embeddings with multi-label diagnostic classification logits, preserving critical clinical details while bridging the semantic gap. Because the LLM stays frozen, our method requires few trainable parameters and mitigates the risk of overfitting on small, domain-specific datasets. Through a systematic study spanning LLMs from 96.1M to 1.6B parameters, we find that fine-tuning is most beneficial for smaller LLMs, whereas freezing larger (~1B+) LLMs and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency. Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and demonstrates strong out-of-domain generalization while using substantially fewer trainable parameters than fully fine-tuned alternatives.
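
The conditioning idea described above can be illustrated with a minimal sketch. It is not the RAD3D-Prefix implementation: the abstract does not specify how the image embedding and the diagnostic logits are fused, so plain concatenation followed by a single linear projection is an assumption made here, and every name and dimension (`build_prefix`, `IMAGE_DIM`, `NUM_LABELS`, `LLM_DIM`, `NUM_PREFIX_TOKENS`) is hypothetical. The sketch shows only that the trainable part can be a small projection whose output is a set of prefix vectors for a frozen LLM.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Randomly initialised weights (out_dim x in_dim) and a zero bias.

    Stands in for the lightweight trainable projection; no training happens here.
    """
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply y = W x + b to a single vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def build_prefix(image_embedding, diagnostic_logits, layer,
                 num_prefix_tokens, llm_dim):
    """Fuse image features with diagnostic logits and project to prefix vectors.

    Assumption: fusion is simple concatenation of the two vectors.
    """
    fused = list(image_embedding) + list(diagnostic_logits)
    flat = linear(fused, layer)
    # Split the flat projection into num_prefix_tokens vectors of size llm_dim.
    return [flat[i * llm_dim:(i + 1) * llm_dim]
            for i in range(num_prefix_tokens)]


def count_parameters(layer):
    """Number of scalar parameters in the projection (weights plus bias)."""
    weights, bias = layer
    return sum(len(row) for row in weights) + len(bias)


# Hypothetical toy sizes, far smaller than any real encoder or LLM.
IMAGE_DIM, NUM_LABELS, LLM_DIM, NUM_PREFIX_TOKENS = 16, 4, 8, 3

rng = random.Random(0)
projection = make_linear(IMAGE_DIM + NUM_LABELS,
                         NUM_PREFIX_TOKENS * LLM_DIM, rng)

image_embedding = [rng.gauss(0.0, 1.0) for _ in range(IMAGE_DIM)]
diagnostic_logits = [2.0, -1.5, 0.3, -3.0]  # made-up multi-label logits

prefix = build_prefix(image_embedding, diagnostic_logits, projection,
                      NUM_PREFIX_TOKENS, LLM_DIM)
trainable = count_parameters(projection)
```

In a setup like this, the prefix vectors would be prepended to the frozen LLM's input embeddings, so only the projection (504 parameters in this toy configuration) would receive gradient updates.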
