Concept-based models lend themselves naturally to inherently interpretable skin lesion diagnosis, since medical experts base their decisions on a set of visual patterns in the lesion. However, developing these models requires concept-annotated datasets, which are scarce because annotation demands specialized knowledge and expertise. In this work, we show that vision-language models can alleviate the dependence on a large number of concept-annotated samples. In particular, we propose an embedding learning strategy that adapts CLIP to the downstream task of skin lesion classification, using concept-based descriptions as textual embeddings. Our experiments reveal that vision-language models not only attain better accuracy when using concepts as textual embeddings, but also require fewer concept-annotated samples to reach performance comparable to approaches specifically devised for automatic concept generation.
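To make the idea of using concept descriptions as textual embeddings concrete, the following is a minimal sketch and not the embedding learning strategy proposed in this work. It assumes that image and concept-description embeddings have already been obtained from a CLIP-like encoder; here they are replaced by synthetic placeholder vectors. The function names, the binary class-to-concept matrix, and the dimensions are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def concept_scores(image_emb, concept_emb):
    """Cosine similarity between each image and each concept description."""
    return l2_normalize(image_emb) @ l2_normalize(concept_emb).T


def classify(image_emb, concept_emb, class_concept_weights):
    """Predict a class by aggregating per-concept similarities.

    class_concept_weights is a hypothetical (n_classes, n_concepts) matrix
    stating which concepts support which diagnosis.
    """
    scores = concept_scores(image_emb, concept_emb)   # (n_images, n_concepts)
    logits = scores @ class_concept_weights.T         # (n_images, n_classes)
    return logits.argmax(axis=1), scores


# Placeholder embeddings: 4 concepts in an 8-dimensional space.
# In practice these would come from a text encoder, not an identity matrix.
concept_emb = np.eye(4, 8)

# Hypothetical mapping: class 0 uses concepts 0 and 1, class 1 uses 2 and 3.
class_concept_weights = np.array([[1.0, 1.0, 0.0, 0.0],
                                  [0.0, 0.0, 1.0, 1.0]])

# Two synthetic "images" lying close to concept 0 and concept 3.
rng = np.random.default_rng(0)
image_emb = concept_emb[[0, 3]] + 0.01 * rng.normal(size=(2, 8))

pred, scores = classify(image_emb, concept_emb, class_concept_weights)
```

Because the prediction is a weighted sum of per-concept similarities, the row of `scores` for an image shows which concepts drove the decision, which is the sense in which such a model is interpretable.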