Large Language Models (LLMs), already shown to excel at various text comprehension tasks, have also proven remarkably capable of tackling table comprehension tasks without task-specific training. While previous research has explored LLM capabilities on tabular dataset tasks, our study assesses the influence of \textit{in-context learning}, \textit{model scale}, \textit{instruction tuning}, and \textit{domain biases} on Tabular Question Answering (TQA). We evaluate the robustness of LLMs on three TQA datasets: the Wikipedia-based \textbf{WTQ}, the financial-report-based \textbf{TAT-QA}, and the scientific-claim-based \textbf{SCITAB}, focusing on their ability to interpret tabular data robustly under various augmentations and perturbations. Our findings indicate that instructions significantly enhance performance, with recent models exhibiting greater robustness than earlier versions. However, data contamination and practical reliability issues persist, especially with \textbf{WTQ}. We highlight the need for improved methodologies, including structure-aware self-attention mechanisms and better handling of domain-specific tabular data, to develop more reliable LLMs for table comprehension.