Existing data-to-text generation work mainly focuses on generating coherent text from non-linguistic input data, such as tables and attribute-value pairs, but overlooks that different application scenarios may require texts of different styles. Motivated by this gap, we define a new task, stylized data-to-text generation, which aims to generate coherent text for given non-linguistic data in a specified style. The task is non-trivial because of three challenges: ensuring the logic of the generated text, handling an unstructured style reference, and coping with biased training samples. To address these challenges, we propose a novel stylized data-to-text generation model, StyleD2T, comprising three components: logic planning-enhanced data embedding, mask-based style embedding, and unbiased stylized text generation. In the first component, we introduce a graph-guided logic planner that organizes attributes to ensure the logic of the generated text. In the second, we devise a feature-level mask-based style embedding to extract the essential style signal from the given unstructured style reference. In the third, pseudo triplet augmentation is used to achieve unbiased text generation, and a multi-condition-based confidence assignment function is designed to ensure the quality of the pseudo samples. Extensive experiments on a newly collected dataset from Taobao show that our model outperforms existing methods.
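To make the idea of a feature-level mask concrete, the following is a minimal sketch, not the StyleD2T implementation. The abstract does not specify how the mask is computed, so the sketch assumes a soft sigmoid gate over each feature dimension; the function name `masked_style_embedding` and the parameters `gate_weights` and `gate_bias` are hypothetical stand-ins for an encoder output and learned parameters.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def masked_style_embedding(style_features, gate_weights, gate_bias):
    """Apply a soft feature-level mask to a style-reference embedding.

    Assumption: the mask is a sigmoid gate computed from the features
    themselves. `gate_weights` (one row per feature) and `gate_bias`
    are hypothetical learned parameters, not names from the paper.
    """
    mask = []
    for row, bias in zip(gate_weights, gate_bias):
        logit = sum(w * f for w, f in zip(row, style_features)) + bias
        mask.append(sigmoid(logit))
    # Element-wise product: dimensions with a gate value near 0 are suppressed.
    masked = [m * f for m, f in zip(mask, style_features)]
    return masked, mask


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 4
    # Random values stand in for a real encoder output and trained weights.
    features = [rng.uniform(-1, 1) for _ in range(dim)]
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    bias = [0.0] * dim
    masked, mask = masked_style_embedding(features, weights, bias)
    print("mask:  ", [round(m, 3) for m in mask])
    print("masked:", [round(v, 3) for v in masked])
```

In this sketch, each gate value lies in (0, 1), so the mask scales individual feature dimensions rather than dropping whole tokens, which is one plausible reading of "feature-level".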