This study proposes a domain-specific, LLM-based Visual Explanation Evaluation Framework for assessing Grad-CAM explanations in facial skin disease diagnosis models. Previous work has focused primarily on improving classification performance through data augmentation, and few studies have systematically examined whether model explanations are grounded in clinically relevant lesion regions. Geometric, color-based, and mixed augmentation strategies were applied to facial skin disease classification models based on EfficientNet-B0, MobileNetV3, and ResNet18, and Grad-CAM was used to generate visual explanations of the models' decisions. An LLM-as-a-Judge evaluation framework was then designed using GPT-5.5, Gemini 3.5 Flash, and Claude Sonnet 4.6 to assess the Grad-CAM explanations in terms of lesion localization and explanation trustworthiness. To improve evaluation consistency and clinical grounding, a progressive prompt engineering strategy was introduced that incorporates evaluation rubrics, clinical knowledge, penalty rules, and structured output formats.
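The progressive prompt engineering strategy can be illustrated with a minimal sketch, assuming each prompt level adds one of the four components named above on top of the previous level. The abstract gives neither the prompt wording nor the exact layering, so `BASE_TASK`, `PromptStage`, `STAGES`, `build_progressive_prompts`, and all stage texts below are hypothetical placeholders rather than the prompts used in the study.

```python
from dataclasses import dataclass

# Hypothetical base instruction; the study's actual wording is not given.
BASE_TASK = "Evaluate the Grad-CAM heatmap overlaid on the facial skin image."


@dataclass(frozen=True)
class PromptStage:
    name: str  # component name, taken from the abstract
    text: str  # illustrative placeholder text, not the study's prompt


# The four components follow the abstract; their order and wording are assumptions.
STAGES = [
    PromptStage("rubric", "Score lesion localization and explanation trustworthiness."),
    PromptStage("clinical_knowledge", "Consider where lesions of the predicted disease usually appear."),
    PromptStage("penalty_rules", "Lower the score when the heatmap highlights non-lesion regions."),
    PromptStage("structured_output", "Return a JSON object with the scores and a short justification."),
]


def build_progressive_prompts(base: str, stages: list[PromptStage]) -> list[str]:
    """Return one prompt per level: the base task, then the base plus each added stage."""
    prompts = [base]
    for stage in stages:
        # Each level extends the previous one, so levels can be compared one component at a time.
        prompts.append(prompts[-1] + f"\n\n[{stage.name}]\n{stage.text}")
    return prompts


PROMPTS = build_progressive_prompts(BASE_TASK, STAGES)
```

Under this assumption, sending the same Grad-CAM image to each judge model with `PROMPTS[0]` through `PROMPTS[4]` would show how each added component changes scoring consistency.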