Recent diagnostic datasets on compositional generalization, such as SCAN (Lake and Baroni, 2018) and COGS (Kim and Linzen, 2020), expose severe problems in models trained from scratch on these datasets. In contrast, state-of-the-art models trained on larger and more general datasets show better generalization ability. To reconcile this inconsistency, we conduct an empirical analysis by training Transformer models on a variety of training sets that differ in data factors such as dataset scale, pattern complexity, and example difficulty. First, we show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges. To further understand this improvement, we show two axes of the benefit from more complex datasets: they provide more diverse examples, so compositional understanding becomes more effective, and they reduce how often examples repeat, which prevents ungeneralizable memorization. Finally, we explore how training examples of different difficulty levels influence generalization differently. On synthetic datasets, simple examples invoke stronger compositionality than hard examples do. On larger-scale real language datasets, hard examples become more important, potentially to ensure decent data coverage, but a balanced mixture of simple and hard examples induces the strongest generalizability. The code and data for this work are available at https://github.com/owenzx/data4comp.