Despite significant progress in question answering on tabular data (Table QA), it remains unclear whether, and to what extent, existing Table QA models are robust to task-specific perturbations, e.g., replacing key question entities or shuffling table columns. To systematically study the robustness of Table QA models, we propose a benchmark called RobuT, which builds upon existing Table QA datasets (WTQ, WikiSQL-Weak, and SQA) and includes human-annotated adversarial perturbations of the table header, table content, and question. Our results indicate that both state-of-the-art Table QA models and large language models (e.g., GPT-3) with few-shot learning falter on these adversarial sets. We propose to address this problem by using large language models to generate adversarial examples to enhance training, which significantly improves the robustness of Table QA models. Our data and code are publicly available at https://github.com/yilunzhao/RobuT.
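To make one of the perturbations named above concrete, the following is a minimal sketch of shuffling table columns. It is purely illustrative and is not the RobuT annotation procedure, which relies on human annotators; the function name `shuffle_columns`, the list-based table representation, and the toy table are all assumptions introduced here.

```python
import random
from typing import List, Tuple


def shuffle_columns(
    header: List[str], rows: List[List[str]], seed: int = 0
) -> Tuple[List[str], List[List[str]]]:
    """Return a copy of the table with its columns in a random order.

    Hypothetical helper for illustration only: the same permutation is
    applied to the header and to every row, so each cell stays under
    its original column name.
    """
    rng = random.Random(seed)  # local RNG keeps the perturbation reproducible
    order = list(range(len(header)))
    rng.shuffle(order)
    new_header = [header[i] for i in order]
    new_rows = [[row[i] for i in order] for row in rows]
    return new_header, new_rows


# Toy table (invented data, not taken from WTQ, WikiSQL-Weak, or SQA).
header = ["Year", "Winner", "Score"]
rows = [["2019", "Alice", "3-1"], ["2020", "Bob", "2-0"]]

new_header, new_rows = shuffle_columns(header, rows, seed=1)
```

Because the permutation preserves the header-to-cell mapping, the answer to any question about the table is unchanged, so a robust model should give the same prediction on the original and the perturbed table.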