Compositional generalization is an important ability of language models and has many different manifestations. For data-to-text generation, previous research on this ability is limited to a single manifestation, Systematicity, and does not consider large language models (LLMs), so it cannot fully cover practical application scenarios. In this work, we propose SPOR, a comprehensive and practical evaluation method for compositional generalization in data-to-text generation. SPOR covers four manifestations (Systematicity, Productivity, Order invariance, and Rule learnability) and builds on existing datasets, allowing high-quality evaluation without additional manual annotation. We demonstrate SPOR on two different datasets and evaluate several existing language models, including LLMs. We find that the models are deficient in various aspects of the evaluation and need further improvement. Our work shows the need for comprehensive research on the different manifestations of compositional generalization in data-to-text generation and provides a framework for evaluation.