Table-to-text generation involves generating appropriate textual descriptions given structured tabular data. It has attracted increasing attention in recent years thanks to the popularity of neural network models and the availability of large-scale datasets. A common feature across existing methods is that they treat the input as a string, employing linearization techniques that do not always preserve information in the table, are verbose, and lack space efficiency. We propose to rethink data-to-text generation as a visual recognition task, removing the need to render the input in a string format. We present PixT3, a multimodal table-to-text model that overcomes the challenges of linearization and input size limitations encountered by existing models. PixT3 is trained with a new self-supervised learning objective to reinforce table structure awareness and is applicable to open-ended and controlled generation settings. Experiments on the ToTTo and Logic2Text benchmarks show that PixT3 is competitive with, and in some settings superior to, generators that operate solely on text.
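To make the contrast concrete, below is a minimal, hypothetical sketch of the two input encodings the abstract discusses: string linearization, as used by text-only generators, and image rendering, as in a pixel-based setup. The tag scheme (`<table>`, `<cell>`, `<col_header>`) loosely follows ToTTo-style linearization; the function names and rendering details are illustrative assumptions, not code from the paper.

```python
# Illustrative sketch only: contrasts string linearization with image
# rendering of a table. Tag scheme loosely follows ToTTo-style markup;
# nothing here is taken from the PixT3 implementation.
from typing import List

from PIL import Image, ImageDraw


def linearize_table(headers: List[str], rows: List[List[str]]) -> str:
    """Flatten a table into a tagged string.

    Row/column adjacency is only implicit in the tag sequence, and the
    token count grows quickly because every cell repeats its header.
    """
    parts = ["<table>"]
    for row in rows:
        for header, value in zip(headers, row):
            parts.append(f"<cell> {value} <col_header> {header} </col_header> </cell>")
    parts.append("</table>")
    return " ".join(parts)


def render_table(headers: List[str], rows: List[List[str]],
                 cell_w: int = 120, cell_h: int = 30) -> Image.Image:
    """Render the same table as an image, preserving its 2D layout."""
    grid = [headers] + rows
    img = Image.new("RGB", (cell_w * len(headers), cell_h * len(grid)), "white")
    draw = ImageDraw.Draw(img)
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            x, y = c * cell_w, r * cell_h
            draw.rectangle([x, y, x + cell_w - 1, y + cell_h - 1], outline="black")
            draw.text((x + 4, y + 8), str(value), fill="black")
    return img


headers = ["Year", "Champion"]
rows = [["2019", "Liverpool"], ["2020", "Bayern Munich"]]
print(linearize_table(headers, rows))
render_table(headers, rows).save("table.png")  # image fed to a vision encoder
```

The linearized string above already repeats each column header once per cell, which illustrates the verbosity and space-efficiency issues the abstract raises; the rendered image keeps the table's 2D structure without any such blow-up.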