Assessing whether a generated table is of good quality is essential for using it to create or edit documents with automatic methods. In this work, we show that existing measures for table quality evaluation fail to capture the overall semantics of tables, and sometimes unfairly penalize good tables while rewarding bad ones. We propose TabEval, a novel table evaluation strategy that captures table semantics by first breaking a table down into a list of natural language atomic statements and then comparing them with ground truth statements using entailment-based measures. To validate our approach, we curate a dataset comprising text descriptions for 1,250 diverse Wikipedia tables covering a range of topics and structures, in contrast to the limited scope of existing datasets. We compare TabEval with existing metrics using unsupervised and supervised text-to-table generation methods, demonstrating its stronger correlation with human judgments of table quality across four datasets.