Table-to-text generation involves generating appropriate textual descriptions given structured tabular data. It has attracted increasing attention in recent years thanks to the popularity of neural network models and the availability of large-scale datasets. Existing methods share a common feature: they treat the input as a string, applying linearization techniques that do not always preserve the information in the table, are verbose, and are space-inefficient. We propose to rethink data-to-text generation as a visual recognition task, removing the need for rendering the input in a string format. We present PixT3, a multimodal table-to-text model that overcomes the challenges of linearization and input size limitations encountered by existing models. PixT3 is trained with a new self-supervised learning objective to reinforce table structure awareness and is applicable to open-ended and controlled generation settings. Experiments on the ToTTo and Logic2Text benchmarks show that PixT3 is competitive and, in some settings, superior to generators that operate solely on text.
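To make the verbosity of linearization concrete, the following is a minimal sketch of how a text-only generator might flatten a table into a string. The tag scheme (`<table>`, `<row>`, `<cell>`, `<col_header>`), the function name `linearize_table`, and the toy table are illustrative assumptions, not the exact format used by any benchmark or by PixT3's baselines.

```python
from typing import List


def linearize_table(header: List[str], rows: List[List[str]]) -> str:
    """Flatten a table into one tagged string, cell by cell.

    Hypothetical tag scheme for illustration only: each cell repeats
    its column header so the row/column structure survives flattening.
    """
    parts = ["<table>"]
    for row in rows:
        parts.append("<row>")
        for col_name, value in zip(header, row):
            parts.append(
                f"<cell> {value} <col_header> {col_name} </col_header> </cell>"
            )
        parts.append("</row>")
    parts.append("</table>")
    return " ".join(parts)


# Made-up toy table (not taken from any dataset).
header = ["Year", "Title", "Role"]
rows = [
    ["2019", "Example Film", "Lead"],
    ["2021", "Another Film", "Supporting"],
]

linearized = linearize_table(header, rows)

# Characters of actual table content, ignoring all markup.
raw_chars = sum(len(h) for h in header) + sum(len(v) for r in rows for v in r)

print(linearized)
print(len(linearized), "characters for", raw_chars, "characters of cell content")
```

Even on this two-row table, the markup and repeated headers make the linearized string several times longer than the cell content itself, which illustrates why string inputs can run into model input-size limits on larger tables.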