Large Language Models (LLMs), originally shown to excel at various text comprehension tasks, have also, remarkably, been shown to tackle table comprehension tasks without specific training. While previous research has explored LLM capabilities on tabular dataset tasks, our study assesses the influence of $\textit{in-context learning}$, $\textit{model scale}$, $\textit{instruction tuning}$, and $\textit{domain biases}$ on Tabular Question Answering (TQA). We evaluate the robustness of LLMs on the Wikipedia-based $\textbf{WTQ}$ and financial report-based $\textbf{TAT-QA}$ TQA datasets, focusing on their ability to interpret tabular data under various augmentations and perturbations. Our findings indicate that instructions significantly enhance performance, with recent models like Llama3 exhibiting greater robustness than earlier versions. However, data contamination and practical reliability issues persist, especially with WTQ. We highlight the need for improved methodologies, including structure-aware self-attention mechanisms and better handling of domain-specific tabular data, to develop more reliable LLMs for table comprehension.