Table-to-Text has traditionally been approached as a linear language-to-text problem. However, visually represented tables are rich in visual information and serve as a concise, effective way of representing data and its relationships. In text-based approaches, this information is either lost during linearization or represented in a space-inefficient manner. This inefficiency has remained a constant challenge for text-based approaches, making them struggle with large tables. In this paper, we demonstrate that image representations of tables are more space-efficient than typical textual linearizations, and that multimodal approaches are competitive in Table-to-Text tasks. We present PixT3, a multimodal table-to-text model that outperforms the state of the art (SotA) on the ToTTo benchmark in a pure Table-to-Text setting while remaining competitive in controlled Table-to-Text scenarios. It also generalizes better to unseen datasets, outperforming the ToTTo SotA in all generation settings. Additionally, we introduce a new intermediate training curriculum to reinforce table structural awareness, leading to improved generation and overall faithfulness of the models.