Pretrained language models (PLMs) for data-to-text (D2T) generation can use human-readable data labels such as column headings, keys, or relation names to generalize to out-of-domain examples. However, these models are known to produce semantically inaccurate outputs when the labels are ambiguous or incomplete, which is often the case in D2T datasets. In this paper, we expose this issue on the task of describing a relation between two entities. For our experiments, we collect a novel dataset for verbalizing a diverse set of 1,522 unique relations from three large-scale knowledge graphs (Wikidata, DBPedia, YAGO). We find that although PLMs for D2T generation fail on unclear cases, as expected, models trained with a large variety of relation labels are surprisingly robust in verbalizing novel, unseen relations. We argue that using data with a diverse set of clear and meaningful labels is key to training D2T generation systems capable of generalizing to novel domains.