Despite the rising prevalence of neural language models, recent empirical evidence suggests that they are deficient in compositional generalization. One of the current de facto solutions to this problem is compositional data augmentation, which aims to introduce additional compositional inductive bias. However, existing handcrafted augmentation strategies offer limited improvement when systematic generalization of neural language models requires multi-grained compositional bias (i.e., not limited to either lexical or structural biases alone) or when training sentences have an imbalanced difficulty distribution. To address these challenges, we first propose a novel compositional augmentation strategy called Component Substitution (CompSub), which enables multi-grained composition of substantial substructures across the entire training set. Furthermore, we introduce the Learning Component Substitution (LCS) framework, which learns the component substitution probabilities in CompSub end-to-end by maximizing the loss of neural language models, thereby prioritizing challenging compositions that involve elusive concepts and novel contexts. We extend the key ideas of CompSub and LCS to the recently emerging in-context learning scenarios of pre-trained large language models (LLMs), proposing the LCS-ICL algorithm to enhance the few-shot compositional generalization of state-of-the-art (SOTA) LLMs. Theoretically, we provide insights into why applying our algorithms to language models can improve compositional generalization performance. Empirically, our results on four standard compositional generalization benchmarks (SCAN, COGS, GeoQuery, and COGS-QL) demonstrate the superiority of CompSub, LCS, and LCS-ICL, with improvements of up to 66.5%, 10.3%, 1.4%, and 8.8% on the four benchmarks, respectively.
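To make the core idea concrete, the following is a minimal toy sketch of CompSub-style augmentation on SCAN-like (command, action) pairs, with an optional loss-derived weighting in the spirit of LCS. It is not the paper's implementation: the hand-listed components, string-level substitution, and the `difficulty` weights are illustrative assumptions only; the actual method discovers exchangeable substructures automatically and learns substitution probabilities end-to-end.

```python
import random
from dataclasses import dataclass

@dataclass
class Example:
    command: str   # natural-language command, e.g. "jump twice and walk"
    actions: str   # target action sequence, e.g. "JUMP JUMP WALK"

# Hypothetical aligned components: each maps a command fragment to its action
# fragment. CompSub itself extracts such multi-grained substructures from data.
COMPONENTS = [
    ("jump", "JUMP"),
    ("walk", "WALK"),
    ("jump twice", "JUMP JUMP"),
    ("walk twice", "WALK WALK"),
]

def comp_sub(example: Example, old: tuple, new: tuple) -> Example:
    """Swap one aligned component for another in both the input and the output."""
    (old_cmd, old_act), (new_cmd, new_act) = old, new
    return Example(
        command=example.command.replace(old_cmd, new_cmd),
        actions=example.actions.replace(old_act, new_act),
    )

def augment(train: list, n_new: int, difficulty: dict | None = None) -> list:
    """Create augmented examples by substituting components across the training set.

    `difficulty` optionally maps a component to a sampling weight; in the LCS
    spirit these weights would come from the current model's loss, so harder
    compositions are sampled more often. Uniform sampling recovers plain CompSub.
    """
    augmented = []
    while len(augmented) < n_new:
        ex = random.choice(train)
        present = [c for c in COMPONENTS if c[0] in ex.command]
        if not present:
            continue
        old = random.choice(present)
        candidates = [c for c in COMPONENTS if c != old]
        weights = [difficulty.get(c, 1.0) for c in candidates] if difficulty else None
        new = random.choices(candidates, weights=weights, k=1)[0]
        augmented.append(comp_sub(ex, old, new))
    return augmented

if __name__ == "__main__":
    train = [
        Example("jump twice and walk", "JUMP JUMP WALK"),
        Example("walk and jump", "WALK JUMP"),
    ]
    # Up-weight multi-word components to mimic loss-driven (LCS-style) sampling.
    loss_weights = {("jump twice", "JUMP JUMP"): 3.0, ("walk twice", "WALK WALK"): 3.0}
    for ex in augment(train, n_new=4, difficulty=loss_weights):
        print(ex.command, "->", ex.actions)
```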