Interpretability Illusions in the Generalization of Simplified Models

A common method for studying deep learning systems is to use simplified model representations -- for example, using singular value decomposition to visualize the model's hidden states in a lower-dimensional space. This approach assumes that the results obtained from these simplified representations are faithful to the original model. Here, we illustrate an important caveat to this assumption: even if the simplified representations accurately approximate the full model on the training set, they may fail to capture the model's behavior out of distribution -- the understanding developed from simplified representations may be an illusion. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits. First, we train models on the Dyck balanced-parenthesis languages. We simplify these models using tools like dimensionality reduction and clustering, and then explicitly test how well these simplified proxies match the behavior of the original model on various out-of-distribution test sets. We find that the simplified proxies are generally less faithful out of distribution. In cases where the original model generalizes to novel structures or greater depths, the simplified versions may fail to generalize, or may generalize better than the original. This finding holds even if the simplified representations do not directly depend on the training distribution. Next, we study a more naturalistic task: predicting the next character in a dataset of computer code. We find similar generalization gaps between the original model and the simplified proxies, and conduct further analysis to investigate which aspects of the code completion task are associated with the largest gaps. Together, our results raise questions about the extent to which mechanistic interpretations derived using tools like SVD can reliably predict what a model will do in novel situations.
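The kind of faithfulness check described in the abstract can be sketched on toy data. The following is a minimal illustration, not the authors' experimental setup: the hidden states are random vectors rather than Transformer activations, the linear `readout` stands in for the model's output layer, and all function and variable names are hypothetical. It fits an SVD basis on "training" hidden states, replaces each hidden state by its rank-k reconstruction, and measures how often the simplified model's prediction agrees with the original, both in distribution and on a shifted test set.

```python
import numpy as np

def fit_svd_basis(train_hidden, k):
    """Return the mean and top-k right singular vectors of the centered training states."""
    mean = train_hidden.mean(axis=0)
    _, _, vt = np.linalg.svd(train_hidden - mean, full_matrices=False)
    return mean, vt[:k]

def simplify(hidden, mean, basis):
    """Project hidden states onto the rank-k subspace, then map back to the original space."""
    return (hidden - mean) @ basis.T @ basis + mean

def faithfulness(hidden, readout, mean, basis):
    """Fraction of inputs where the simplified and original predictions agree."""
    full_pred = (hidden @ readout).argmax(axis=1)
    simple_pred = (simplify(hidden, mean, basis) @ readout).argmax(axis=1)
    return float((full_pred == simple_pred).mean())

rng = np.random.default_rng(0)
d, n_classes = 16, 4
readout = rng.normal(size=(d, n_classes))  # hypothetical stand-in for an output layer

# Assumption: training states vary mostly along 4 directions,
# while the shifted (out-of-distribution) states vary along all 16.
scale = np.array([1.0] * 4 + [0.05] * 12)
train_hidden = rng.normal(size=(500, d)) * scale
ood_hidden = rng.normal(size=(500, d))

mean, basis = fit_svd_basis(train_hidden, k=4)
train_f = faithfulness(train_hidden, readout, mean, basis)
ood_f = faithfulness(ood_hidden, readout, mean, basis)
print(f"in-distribution agreement: {train_f:.2f}")
print(f"out-of-distribution agreement: {ood_f:.2f}")
```

In this toy setting the rank-4 proxy looks faithful on the training distribution because the discarded directions carry almost no variance there. On the shifted inputs those same directions matter, so agreement drops. The gap is built into the synthetic data and says nothing about the size of the effect in real models.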

