When LLMs Analyze Scars: From Images to Clinically Meaningful Features

Medical image classification faces a fundamental dilemma: deep learning models achieve remarkable performance at scale, yet real-world clinical scenarios often suffer from severe data scarcity due to annotation costs, privacy constraints, and disease rarity. The challenge is particularly pronounced in pathological scar classification, where differentiating keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. We propose ScaFE (Scar Feature Engineering), a framework that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers. Our key insight is that LLMs encode rich medical knowledge that can be externalized as executable feature extraction code, transforming high-dimensional images into low-dimensional, clinically interpretable representations. Specifically, we prompt an LLM with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale. This approach offers three advantages: (1) data efficiency, achieving robust performance with limited training samples by decoupling knowledge acquisition from statistical learning; (2) privacy preservation, as raw images are processed locally and never exposed to external LLMs; and (3) interpretability, through explicit features grounded in clinical reasoning. Extensive experiments on scar classification demonstrate that, under limited data conditions, our method consistently outperforms both end-to-end deep learning baselines and LLMs used as black-box classifiers, establishing a promising direction for integrating LLMs into data-efficient and clinically transparent medical AI systems.
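
To make the idea of deterministic, clinically aligned feature extraction concrete, the following is a minimal illustrative sketch and not the code generated in this work. It assumes an RGB image and a binary scar mask are already available, and it computes a few hypothetical image-based proxies loosely inspired by the vascularity and pigmentation items of the Vancouver Scar Scale. The function name `extract_scar_features`, the feature names, and the choice of proxies are all assumptions made for illustration.

```python
import numpy as np


def extract_scar_features(image, scar_mask):
    """Map an RGB image and a scar mask to a small interpretable feature dict.

    image:     H x W x 3 uint8 array in RGB order (assumed input format).
    scar_mask: H x W boolean array, True where the scar is (assumed given).

    All features are hypothetical proxies chosen for illustration only.
    """
    rgb = image.astype(np.float64) / 255.0
    scar = scar_mask.astype(bool)
    skin = ~scar
    if not scar.any() or not skin.any():
        raise ValueError("mask must contain both scar and surrounding-skin pixels")

    # Fraction of total intensity carried by the red channel (vascularity proxy).
    red_fraction = rgb[..., 0] / (rgb.sum(axis=2) + 1e-8)

    # Standard luma weights; used here as a simple pigmentation proxy.
    luminance = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

    return {
        # Positive when the scar is redder than the surrounding skin.
        "redness_contrast": float(red_fraction[scar].mean() - red_fraction[skin].mean()),
        # Negative when the scar is darker than the surrounding skin.
        "pigmentation_contrast": float(luminance[scar].mean() - luminance[skin].mean()),
        # Brightness variation inside the scar region.
        "texture_heterogeneity": float(luminance[scar].std()),
        # Share of the image covered by the scar mask.
        "area_fraction": float(scar.mean()),
    }


# Synthetic example: uniform skin tone with a redder, darker 4x4 patch.
demo_image = np.full((8, 8, 3), (200, 160, 140), dtype=np.uint8)
demo_mask = np.zeros((8, 8), dtype=bool)
demo_mask[2:6, 2:6] = True
demo_image[demo_mask] = (220, 100, 100)

demo_features = extract_scar_features(demo_image, demo_mask)
print(demo_features)
```

Because the function contains no learned parameters and no randomness, the same image always yields the same feature vector, and the image itself never leaves the local machine. In a pipeline of this kind, the resulting low-dimensional vector would then be passed to a small conventional classifier trained on the limited labeled data.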
