Transformers generalize to novel compositions of structures and entities after being trained on a sufficiently complex dataset, but easily overfit on datasets of insufficient complexity. We observe that when the training set is sufficiently complex, the model encodes sentences that share a common syntactic structure using a systematic attention pattern. Inspired by this observation, we propose SQ-Transformer (Structurally Quantized) that explicitly encourages systematicity in the embeddings and attention layers, even with a training set of low complexity. At the embedding level, we introduce Structure-oriented Vector Quantization (SoVQ) to cluster word embeddings into several classes of structurally equivalent entities. At the attention level, we devise the Systematic Attention Layer (SAL) and an alternative, the Systematically Regularized Layer (SRL), both of which operate on the quantized word embeddings so that sentences of the same structure are encoded with invariant or similar attention patterns. Empirically, we show that SQ-Transformer achieves stronger compositional generalization than the vanilla Transformer on multiple low-complexity semantic parsing and machine translation datasets. In our analysis, we show that SoVQ indeed learns a syntactically clustered embedding space and that SAL/SRL induce generalizable attention patterns, which together lead to improved systematicity.
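The core idea of SoVQ, clustering word embeddings into classes of structurally equivalent entities, can be illustrated with a minimal vector-quantization sketch. This is a generic nearest-codebook assignment, not the paper's actual training objective; all names, shapes, and the random data are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch: assign each word embedding to its nearest
# codebook vector, treating each code as a "class" of structurally
# equivalent entities. Shapes and data are toy examples.
rng = np.random.default_rng(0)
num_words, dim, num_codes = 10, 4, 3

embeddings = rng.normal(size=(num_words, dim))  # word embedding table
codebook = rng.normal(size=(num_codes, dim))    # class prototypes

# Squared Euclidean distance from every embedding to every code.
dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = dists.argmin(axis=1)        # class index per word, shape (num_words,)

# Quantized embeddings: each word is replaced by its class prototype,
# so words in the same class share one representation downstream.
quantized = codebook[codes]
```

In a trained model, the codebook itself would be learned (e.g. with a quantization loss and a straight-through gradient), so that the induced classes align with syntactic roles rather than arbitrary geometry.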