Saliency maps can explain a neural model's predictions by identifying important input features. However, they are difficult for laypeople to interpret, especially for instances with many features. To make them more accessible, we formalize the underexplored task of translating saliency maps into natural language and compare methods that address two key challenges of this approach: what to verbalize and how to verbalize it. Using token-level attributions from text classification tasks, we compare two novel methods (search-based and instruction-based verbalizations) against conventional feature importance representations (heatmap visualizations and extractive rationales) in both automatic and human evaluation setups, measuring simulatability, faithfulness, helpfulness, and ease of understanding. Instructing GPT-3.5 to generate saliency map verbalizations yields plausible explanations that include associations, abstractive summarization, and commonsense reasoning, and achieves by far the highest human ratings; however, these verbalizations do not faithfully capture numeric information and are inconsistent in their interpretation of the task. In comparison, our search-based, model-free verbalization approach efficiently completes templated verbalizations and is faithful by design, but falls short in helpfulness and simulatability. Our results suggest that saliency map verbalization makes feature attribution explanations more comprehensible and less cognitively challenging for humans than conventional representations.
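To give a rough sense of what a templated, model-free verbalization might look like, the following Python sketch fills a fixed sentence template with the highest-scoring tokens of a saliency map. It is not the search-based method evaluated here: the function name `verbalize_saliency`, the template wording, the plain top-k selection (standing in for the search step), and the normalization by total positive attribution are all assumptions made for illustration. Because every number in the output is computed directly from the attribution scores, a template of this kind cannot misreport them, which is the sense in which such an approach can be faithful by design.

```python
from typing import Sequence


def verbalize_saliency(tokens: Sequence[str], scores: Sequence[float],
                       label: str, k: int = 3) -> str:
    """Fill a fixed template with the k highest-scoring tokens (hypothetical helper)."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")

    # Rank token positions by attribution score, highest first.
    # Assumption: a plain top-k ranking stands in for the search procedure.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)

    # Keep only positively attributed tokens among the top k.
    top = [i for i in ranked[:k] if scores[i] > 0]
    if not top:
        return f"No token contributed positively to the prediction '{label}'."

    # Express each score as a share of the total positive attribution.
    total = sum(s for s in scores if s > 0)
    parts = [f"'{tokens[i]}' ({scores[i] / total:.0%})" for i in top]

    # Join the parts into a readable enumeration.
    if len(parts) == 1:
        listing = parts[0]
    else:
        listing = ", ".join(parts[:-1]) + " and " + parts[-1]
    noun = "token" if len(parts) == 1 else "tokens"
    return f"The most important {noun} for the prediction '{label}': {listing}."


if __name__ == "__main__":
    # Made-up attribution scores for a toy sentiment example.
    example_tokens = ["the", "movie", "was", "surprisingly", "good"]
    example_scores = [0.02, 0.10, -0.01, 0.28, 0.60]
    print(verbalize_saliency(example_tokens, example_scores, "positive"))
```

With the made-up scores above, the sketch reports "good", "surprisingly", and "movie" together with their shares of the positive attribution. A fixed template like this trades fluency and abstraction for exact reporting of the numeric information.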