Table Question Answering (TQA) aims to compose an answer to a question based on tabular data. While prior research has shown that TQA models lack robustness, the underlying cause and nature of this issue remain largely unclear, posing a significant obstacle to the development of robust TQA systems. In this paper, we formalize three major desiderata for a fine-grained evaluation of the robustness of TQA systems. They should (i) answer questions regardless of alterations in table structure, (ii) base their responses on the content of relevant cells rather than on biases, and (iii) demonstrate robust numerical reasoning capabilities. To investigate these aspects, we create and publicly release a novel TQA evaluation benchmark in English. Our extensive experimental analysis reveals that none of the examined state-of-the-art TQA systems consistently excels in all three aspects. Our benchmark is a crucial instrument for monitoring the behavior of TQA systems and paves the way for the development of robust TQA systems.