In this study, we uncover the unexpected efficacy of residual-based large language models (LLMs) as part of encoders for biomedical imaging tasks, a domain traditionally devoid of language or textual data. Our approach diverges from established methodologies by using a frozen transformer block, extracted from a pre-trained LLM, as an encoder layer that processes visual tokens directly. This strategy departs from standard multi-modal vision-language frameworks, which typically hinge on language-driven prompts and inputs. We find that these LLM blocks can boost performance across a spectrum of biomedical imaging applications, including both 2D and 3D visual classification tasks, serving as plug-and-play boosters. As a byproduct, the proposed framework sets new state-of-the-art results on the extensive, standardized MedMNIST-2D and MedMNIST-3D datasets. Through this work, we aim to open new avenues for employing LLMs in biomedical imaging and to enrich the understanding of their potential in this specialized domain.
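To make the data flow concrete, the following is a minimal PyTorch sketch of one way a frozen transformer block could be placed inside a visual encoder with a residual skip. It is an illustration under stated assumptions, not the implementation used in this work: the class name `FrozenBlockBooster`, the two linear projections, all dimensions, and the choice to add the block's output back onto the visual tokens are hypothetical, and a generic `nn.TransformerEncoderLayer` stands in for a block that would in practice be extracted from a pre-trained LLM.

```python
import torch
import torch.nn as nn


class FrozenBlockBooster(nn.Module):
    """Residual wrapper around a frozen transformer block (illustrative only)."""

    def __init__(self, vis_dim: int, llm_dim: int, frozen_block: nn.Module):
        super().__init__()
        # Assumption: trainable linear maps bridge the visual and LLM widths.
        self.proj_in = nn.Linear(vis_dim, llm_dim)
        self.proj_out = nn.Linear(llm_dim, vis_dim)
        self.block = frozen_block
        # Freeze the block: its weights receive no gradient updates.
        for p in self.block.parameters():
            p.requires_grad_(False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, vis_dim) produced by a visual backbone.
        h = self.proj_in(tokens)
        h = self.block(h)
        # Assumption: residual skip, so the block refines the visual tokens.
        return tokens + self.proj_out(h)


torch.manual_seed(0)
# Stand-in for a pre-trained LLM block; a real one would be loaded from a
# checkpoint and may need extra inputs (masks, positional information).
stand_in = nn.TransformerEncoderLayer(
    d_model=64, nhead=4, dim_feedforward=128, dropout=0.0, batch_first=True
)
booster = FrozenBlockBooster(vis_dim=32, llm_dim=64, frozen_block=stand_in)

tokens = torch.randn(2, 49, 32)  # e.g. 2 images, 7x7 patch tokens each
out = booster(tokens)
trainable = sorted(n for n, p in booster.named_parameters() if p.requires_grad)
```

In this sketch only the two projection layers are trainable, which is what would make such a module cheap to attach to an existing 2D or 3D encoder. No language prompt or text input appears anywhere in the forward pass.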