In this paper, we introduce PASTA (Perceptual Assessment System for explanaTion of Artificial intelligence), a novel framework for human-centric evaluation of XAI techniques in computer vision. Our first key contribution is a human evaluation of XAI explanations on four diverse datasets (COCO, Pascal Parts, Cats Dogs Cars, and MonumAI), which constitutes the first large-scale benchmark dataset for XAI, with annotations at both the image and concept levels. This dataset enables robust evaluation of, and comparison across, various XAI methods. Our second major contribution is a data-driven metric for assessing the interpretability of explanations: it mimics human preferences, learned from the database of human evaluations of explanations in the PASTA dataset. With its dataset and metric, the PASTA framework provides consistent and reliable comparisons between XAI techniques, in a way that is scalable yet still aligned with human judgment. Additionally, our benchmark allows for comparisons between explanations across different modalities, an aspect previously unaddressed. Our findings indicate that humans tend to prefer saliency maps over other explanation types. Moreover, we provide evidence that human assessments show low correlation with existing XAI metrics that are computed numerically by probing the model.