Deep neural networks can predict human judgments, but this does not imply that they rely on human-like information or reveal the cues underlying those judgments. Prior work has addressed this issue using attribution heatmaps, but the explanatory value of such heatmaps depends on their robustness. Here we tested that robustness by evaluating whether models that predict human authenticity ratings also produce consistent explanations within and across architectures. We fit lightweight regression heads to multiple frozen pretrained vision models and generated attribution maps using Grad-CAM, LIME, and multiscale pixel masking. Several architectures predicted ratings well, reaching about 80% of the noise ceiling. VGG models achieved this by tracking image quality rather than authenticity-specific variance, which limits the relevance of their attributions. Among the remaining models, attribution maps were generally stable across random seeds within an architecture, especially for EfficientNetB3 and Barlow Twins, and consistency was higher for images judged as more authentic. Crucially, agreement in attribution across architectures was weak even when predictive performance was similar. To address this, we combined models into ensembles, which improved prediction of human authenticity judgments and enabled image-level attribution via pixel masking. We conclude that while deep networks can predict human authenticity judgments well, they do not produce identifiable explanations for those judgments. More broadly, our findings suggest that post hoc explanations from successful models of behavior should be treated as weak evidence for cognitive mechanism.
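To make the consistency check concrete, the following is a minimal sketch of multiscale pixel masking and of map agreement. It is not the procedure used in the study. The function names (`masking_attribution`, `map_agreement`), the patch sizes, the zero fill value, and the use of Pearson correlation as the agreement measure are all assumptions made for illustration, and toy linear scorers stand in for the frozen networks with regression heads.

```python
import numpy as np


def masking_attribution(image, predict, patch_sizes=(4, 8), fill=0.0):
    """Occlusion-style attribution (illustrative, not the paper's code).

    Each pixel's score is the drop in the predicted rating when the patch
    containing it is replaced by `fill`, averaged over patch sizes.
    """
    h, w = image.shape[:2]
    baseline = predict(image)
    total = np.zeros((h, w), dtype=float)
    for p in patch_sizes:
        scale_map = np.zeros((h, w), dtype=float)
        for top in range(0, h, p):
            for left in range(0, w, p):
                masked = image.copy()
                masked[top:top + p, left:left + p] = fill
                scale_map[top:top + p, left:left + p] = baseline - predict(masked)
        total += scale_map
    return total / len(patch_sizes)


def map_agreement(map_a, map_b):
    """Pearson correlation between two flattened attribution maps (assumed metric)."""
    return float(np.corrcoef(map_a.ravel(), map_b.ravel())[0, 1])


# Hypothetical stand-ins for trained models: linear scorers over pixels.
rng = np.random.default_rng(0)
image = np.ones((16, 16))

weights_a = np.zeros((16, 16))
weights_a[:8, :8] = 1.0                                       # "architecture A, seed 1"
weights_b = weights_a + rng.normal(0.0, 0.05, size=(16, 16))  # "architecture A, seed 2"
weights_c = np.zeros((16, 16))
weights_c[8:, 8:] = 1.0                                       # "architecture C"


def make_predict(weights):
    # Returns a scorer mapping an image to a scalar rating.
    return lambda img: float((weights * img).sum())


map_a = masking_attribution(image, make_predict(weights_a))
map_b = masking_attribution(image, make_predict(weights_b))
map_c = masking_attribution(image, make_predict(weights_c))

agree_same = map_agreement(map_a, map_b)  # same weights up to small noise
agree_diff = map_agreement(map_a, map_c)  # scorers relying on different regions
print(f"within: {agree_same:.3f}  across: {agree_diff:.3f}")
```

In this toy setup all three scorers assign the same rating to the all-ones image up to noise, yet the maps for `weights_a` and `weights_c` disagree, which mirrors the distinction between predictive performance and attribution agreement drawn above.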