A common method for studying deep learning systems is to use simplified model representations, for example using singular value decomposition to visualize the model's hidden states in a lower-dimensional space. This approach assumes that the results obtained from these simplified representations are faithful to the original model. Here, we illustrate an important caveat to this assumption: even if the simplified representations accurately approximate the full model on the training set, they may fail to capture the model's behavior out of distribution. In that case, the understanding developed from simplified representations may be an illusion. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits. First, we train models on the Dyck balanced-parenthesis languages. We simplify these models using tools like dimensionality reduction and clustering, and then explicitly test how closely these simplified proxies match the behavior of the original model on various out-of-distribution test sets. We find that the simplified proxies are generally less faithful out of distribution. In cases where the original model generalizes to novel structures or greater depths, the simplified versions may fail to do so, or may generalize better than the original. This finding holds even if the simplified representations do not directly depend on the training distribution. Next, we study a more naturalistic task: predicting the next character in a dataset of computer code. We find similar generalization gaps between the original model and simplified proxies, and conduct further analysis to investigate which aspects of the code completion task are associated with the largest gaps. Together, our results raise questions about the extent to which mechanistic interpretations derived using tools like SVD can reliably predict what a model will do in novel situations.
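The kind of faithfulness check described above can be sketched on synthetic data. The example below is only an illustration under stated assumptions, not the setup used in this work: the hidden states are random stand-ins rather than Transformer activations, `readout` is a hypothetical linear output layer, and the helper names `fit_svd_basis` and `agreement` are invented for this sketch. The proxy projects hidden states onto the top-k right singular vectors fitted on training-like data, and faithfulness is measured as the fraction of inputs where the proxy and the full model predict the same token.

```python
import numpy as np

def fit_svd_basis(train_hidden, k):
    """Return the top-k right singular vectors (d x k) of the training hidden states."""
    # Rows are examples, columns are hidden dimensions; no centering, for simplicity.
    _, _, vt = np.linalg.svd(train_hidden, full_matrices=False)
    return vt[:k].T

def agreement(hidden, readout, basis):
    """Fraction of rows where the rank-k proxy predicts the same token as the full model."""
    full_pred = (hidden @ readout).argmax(axis=1)
    # Project onto the k-dimensional subspace, then map back to the original space.
    projected = hidden @ basis @ basis.T
    proxy_pred = (projected @ readout).argmax(axis=1)
    return float((full_pred == proxy_pred).mean())

rng = np.random.default_rng(0)
d, vocab = 16, 5
readout = rng.normal(size=(d, vocab))  # hypothetical output layer (assumption)

# Synthetic stand-ins for hidden states (assumption, not real model activations):
# the training-like data varies mostly along 4 directions, while the
# "out-of-distribution" data also varies along directions the training data barely used.
scales_train = np.array([1.0] * 4 + [0.05] * (d - 4))
train_hidden = rng.normal(size=(500, d)) * scales_train
ood_hidden = rng.normal(size=(500, d))

basis = fit_svd_basis(train_hidden, k=4)
print("in-distribution agreement:", agreement(train_hidden, readout, basis))
print("out-of-distribution agreement:", agreement(ood_hidden, readout, basis))
```

The synthetic data here is constructed so that a gap appears, so the printed numbers say nothing about real models. The point is the shape of the test: fit the simplification on one distribution, then measure agreement with the full model separately on in-distribution and out-of-distribution inputs.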