Vision-language models (VLMs) are increasingly important in medical applications; however, their evaluation in dermatology remains limited by datasets that focus primarily on image-level classification tasks such as lesion recognition. While valuable for recognition, such datasets cannot assess the full visual understanding, language grounding, and clinical reasoning capabilities of multimodal models. Visual question answering (VQA) benchmarks are needed to evaluate how models interpret dermatological images, reason over fine-grained morphology, and generate clinically meaningful descriptions. We introduce DermaBench, a clinician-annotated dermatology VQA benchmark built on the Diverse Dermatology Images (DDI) dataset. DermaBench comprises 656 clinical images from 570 unique patients spanning Fitzpatrick skin types I-VI. Using a hierarchical annotation schema with 22 main questions (single-choice, multi-choice, and open-ended), expert dermatologists annotated each image for diagnosis, anatomic site, lesion morphology, distribution, surface features, color, and image quality, together with open-ended narrative descriptions and summaries, yielding approximately 14,474 VQA-style annotations. DermaBench is released as a metadata-only dataset to respect upstream licensing and is publicly available at Harvard Dataverse.