GPT4Table: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study

Large language models (LLMs) are becoming attractive as few-shot reasoners for Natural Language (NL)-related tasks. However, there is still much to learn about how well LLMs understand structured data, such as tables. Although tables can be serialized and used as inputs to LLMs, there is a lack of comprehensive studies examining whether LLMs can truly comprehend such data. In this paper, we investigate this question by designing a benchmark to evaluate the structural understanding capabilities (SUC) of LLMs. The benchmark includes seven tasks, each with its own unique challenges, \eg, cell lookup, row retrieval, and size detection. We run a series of evaluations on GPT-3.5 and GPT-4 and find that performance varies with a number of input choices, including table input format, content order, role prompting, and partition marks. Drawing on the insights gained from the benchmark evaluations, we then propose \textit{self-augmentation} for effective structural prompting, \eg, critical value / range identification using LLMs' internal knowledge. When combined with carefully chosen input choices, these structural prompting methods lead to promising improvements in LLM performance on a variety of tabular tasks, \eg, TabFact($\uparrow2.31\%$), HybridQA($\uparrow2.13\%$), SQA($\uparrow2.72\%$), Feverous($\uparrow0.84\%$), and ToTTo($\uparrow5.68\%$). We believe that our benchmark and proposed prompting methods can serve as a simple yet generic choice for future research. The code and data are released at \url{https://anonymous.4open.science/r/StructuredLLM-76F3}.
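
To make the input choices above concrete, the following is a minimal Python sketch of how a table might be serialized and wrapped into a prompt for two of the benchmark-style tasks (size detection and cell lookup). It is an illustration only, not the released benchmark code: the function names, the prompt wording, the `<table>` partition marks, and the toy table are all assumptions introduced here.

```python
from typing import List, Sequence, Tuple


def serialize_markdown(header: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Serialize a table as a Markdown-style pipe table (one possible input format)."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def build_prompt(
    table_text: str,
    question: str,
    role_prompt: bool = True,
    partition_marks: bool = True,
) -> str:
    """Assemble a prompt; the two flags toggle hypothetical input choices."""
    parts: List[str] = []
    if role_prompt:
        # Assumed wording for a role prompt; not taken from the paper.
        parts.append("You are a helpful assistant that analyzes tables.")
    if partition_marks:
        # Assumed partition marks that delimit the table from the question.
        parts.append("<table>\n" + table_text + "\n</table>")
    else:
        parts.append(table_text)
    parts.append(question)
    return "\n\n".join(parts)


def size_detection_task(
    header: Sequence[str], rows: Sequence[Sequence[object]]
) -> Tuple[str, str]:
    """Return (question, gold answer) for a size-detection probe."""
    question = "How many rows and columns does the table have? Answer as 'rows, columns'."
    return question, f"{len(rows)}, {len(header)}"


def cell_lookup_task(
    header: Sequence[str], rows: Sequence[Sequence[object]], row_idx: int, col_idx: int
) -> Tuple[str, str]:
    """Return (question, gold answer) for a cell-lookup probe (0-based indices)."""
    question = (
        f"What is the value in data row {row_idx + 1} "
        f"under the column '{header[col_idx]}'?"
    )
    return question, str(rows[row_idx][col_idx])


if __name__ == "__main__":
    # Toy table made up for illustration.
    header = ["Item", "Count"]
    rows = [["apple", 3], ["pear", 5]]
    question, gold = cell_lookup_task(header, rows, row_idx=1, col_idx=0)
    print(build_prompt(serialize_markdown(header, rows), question))
    print("gold answer:", gold)
```

Because the gold answer is computed directly from the table, a model's response can be scored by exact match, and the same probe can be repeated with `role_prompt` or `partition_marks` switched off to observe how each input choice affects accuracy.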
