Pretrained language models (PLMs) for data-to-text (D2T) generation can use human-readable data labels such as column headings, keys, or relation names to generalize to out-of-domain examples. However, the models are known to produce semantically inaccurate outputs if these labels are ambiguous or incomplete, which is often the case in D2T datasets. In this paper, we expose this issue on the task of describing a relation between two entities. For our experiments, we collect a novel dataset for verbalizing a diverse set of 1,522 unique relations from three large-scale knowledge graphs (Wikidata, DBPedia, YAGO). We find that although PLMs for D2T generation fail on unclear cases, as expected, models trained with a large variety of relation labels are surprisingly robust in verbalizing novel, unseen relations. We argue that using data with a diverse set of clear and meaningful labels is key to training D2T generation systems capable of generalizing to novel domains.