Most data-to-text datasets are for English, so the difficulties of modelling data-to-text for low-resource languages are largely unexplored. In this paper we tackle data-to-text for isiXhosa, which is low-resource and agglutinative. We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of WebNLG, which presents a new linguistic context that shifts modelling demands towards subword-driven techniques. We also develop an evaluation framework for T2X that measures how accurately generated text describes the data, enabling future users of T2X to go beyond surface-level metrics in evaluation. On the modelling side we explore two classes of methods: dedicated data-to-text models trained from scratch and pretrained language models (PLMs). We propose a new dedicated architecture aimed at agglutinative data-to-text, the Subword Segmental Pointer Generator (SSPG). It jointly learns to segment words and copy entities, and outperforms existing dedicated models for two agglutinative languages (isiXhosa and Finnish). Our investigation of pretrained solutions for T2X reveals that standard PLMs fall short, and that fine-tuning machine translation models is the best method overall. These findings underscore the distinct challenge presented by T2X: neither well-established data-to-text architectures nor customary pretrained methodologies prove optimal. We conclude with a qualitative analysis of generation errors and an ablation study.