Compositional generalization is an important ability of language models and has many different manifestations. For data-to-text generation, previous research on this ability has been limited to a single manifestation, Systematicity, and has not considered large language models (LLMs), so it cannot fully cover practical application scenarios. In this work, we propose SPOR, a comprehensive and practical evaluation method for compositional generalization in data-to-text generation. SPOR covers four manifestations (Systematicity, Productivity, Order invariance, and Rule learnability) and, because it builds on existing datasets, allows high-quality evaluation without additional manual annotation. We demonstrate SPOR on two different datasets and evaluate several existing language models, including LLMs. We find that the models are deficient in various aspects of the evaluation and need further improvement. Our work shows the need for comprehensive research on the different manifestations of compositional generalization in data-to-text generation and provides a framework for evaluation.