In this study, we uncover the unexpected efficacy of residual-based large language models (LLMs) as encoder components for biomedical imaging tasks, a domain traditionally devoid of language or textual data. Our approach diverges from established methodologies by using a frozen transformer block, extracted from a pre-trained LLM, as an encoder layer that directly processes visual tokens. This strategy represents a significant departure from standard multi-modal vision-language frameworks, which typically hinge on language-driven prompts and inputs. We find that these frozen LLM blocks serve as plug-and-play performance boosters across a spectrum of biomedical imaging applications, spanning both 2D and 3D visual classification tasks. More interestingly, as a byproduct, the proposed framework achieves superior performance, setting new state-of-the-art results on the extensive, standardized MedMNIST-2D and MedMNIST-3D benchmarks. Through this work, we aim to open new avenues for employing LLMs in biomedical imaging and to enrich the understanding of their potential in this specialized domain.
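The core idea described above can be sketched as follows: visual tokens are projected into the hidden width of a pre-trained LLM, passed through a frozen transformer block taken from that LLM with a residual connection around it, and then pooled for classification. This is a minimal illustrative sketch, not the paper's exact architecture; the module names, dimensions, and the use of `nn.TransformerEncoderLayer` as a stand-in for an actual LLM block are assumptions for demonstration.

```python
import torch
import torch.nn as nn


class FrozenLLMBlockEncoder(nn.Module):
    """Illustrative sketch: a frozen transformer block (standing in for a
    block extracted from a pre-trained LLM) inserted into the residual
    stream of a visual encoder. Sizes and layer choices are assumptions,
    not the authors' exact configuration."""

    def __init__(self, vis_dim=64, llm_dim=128, num_classes=9):
        super().__init__()
        # Trainable projection of visual tokens into the LLM hidden width.
        self.proj_in = nn.Linear(vis_dim, llm_dim)
        # Stand-in for a block taken from a pre-trained LLM; kept frozen.
        self.llm_block = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=4, dim_feedforward=256, batch_first=True
        )
        for p in self.llm_block.parameters():
            p.requires_grad = False  # frozen: excluded from gradient updates
        # Trainable classification head.
        self.head = nn.Linear(llm_dim, num_classes)

    def forward(self, tokens):  # tokens: (batch, num_tokens, vis_dim)
        x = self.proj_in(tokens)
        x = x + self.llm_block(x)   # residual connection around the frozen block
        return self.head(x.mean(dim=1))  # pool over tokens, then classify


model = FrozenLLMBlockEncoder()
logits = model(torch.randn(2, 16, 64))  # 2 images, 16 visual tokens each
print(logits.shape)
```

Because only `proj_in` and `head` receive gradients, the frozen block acts as the "plug-and-play booster" described above: it can be dropped into an existing visual encoder without fine-tuning the LLM weights.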