Generative Transformer-based models have achieved remarkable proficiency in solving diverse problems. However, their generalization ability is not fully understood and is not always satisfactory. Researchers use basic mathematical tasks such as n-digit addition and multiplication as a lens for investigating their generalization behavior. Curiously, when models are trained on n-digit operations (e.g., addition) in which both input operands are n digits long, they generalize successfully to unseen n-digit inputs (in-distribution (ID) generalization) but fail miserably and mysteriously on longer, unseen cases (out-of-distribution (OOD) generalization). Prior studies try to bridge this gap with workarounds such as modifying position embeddings, fine-tuning, and priming with more extensive or instructive data. However, without addressing the underlying mechanism, there is hardly any guarantee of the robustness of these solutions. We draw attention to this unexplained performance drop and ask whether it arises purely from random errors. To answer this question, we turn to the mechanistic line of research, which has had notable successes in model interpretability. We find that the strong ID generalization stems from structured representations, and that even behind the unsatisfactory OOD performance the models still exhibit clear learned algebraic structures. Specifically, these models map unseen OOD inputs to outputs with equivalence relations in the ID domain. These findings highlight the potential of the models to carry useful information for improved generalization.