Large language models (LLMs) are increasingly attractive as few-shot reasoners for natural language (NL) tasks. However, much remains unknown about how well LLMs understand structured data such as tables. Although tables can be passed to LLMs as input through serialization, comprehensive studies of whether LLMs truly comprehend such data are lacking. In this paper, we investigate this question by designing a benchmark to evaluate the structural understanding capabilities (SUC) of LLMs. The benchmark includes seven tasks, each with its own challenges, \eg, cell lookup, row retrieval, and size detection. We conduct a series of evaluations on GPT-3.5 and GPT-4 and find that performance varies with several input choices, including table input format, content order, role prompting, and partition marks. Drawing on the insights gained from the benchmark evaluations, we propose \textit{self-augmentation} for effective structural prompting, such as critical value / range identification using LLMs' internal knowledge. When combined with carefully chosen input choices, these structural prompting methods lead to promising improvements in LLM performance on a variety of tabular tasks, \eg, TabFact ($\uparrow 2.31\%$), HybridQA ($\uparrow 2.13\%$), SQA ($\uparrow 2.72\%$), Feverous ($\uparrow 0.84\%$), and ToTTo ($\uparrow 5.68\%$). We believe that our benchmark and proposed prompting methods can serve as a simple yet generic choice for future research.
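To make the input choices named in the abstract concrete (table input format, content order, role prompting, and partition marks), the following is a minimal sketch of how a table might be serialized and wrapped into a prompt. It is not the paper's implementation: the function names `serialize_table` and `build_prompt`, the option names, the `<table_start>` / `<table_end>` mark strings, and the role text are all illustrative assumptions.

```python
import csv
import io
from html import escape


def serialize_table(header, rows, fmt="markdown"):
    """Serialize a table to text in one of three formats (hypothetical option names)."""
    if fmt == "markdown":
        lines = ["| " + " | ".join(str(h) for h in header) + " |",
                 "| " + " | ".join("---" for _ in header) + " |"]
        lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
        return "\n".join(lines)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)
        return buf.getvalue().rstrip("\n")
    if fmt == "html":
        head = "<tr>" + "".join(f"<th>{escape(str(h))}</th>" for h in header) + "</tr>"
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
            for row in rows
        )
        return f"<table>{head}{body}</table>"
    raise ValueError(f"unknown format: {fmt}")


def build_prompt(table_text, question, role=None, partition_marks=True, table_first=True):
    """Assemble a prompt; each keyword toggles one assumed input design choice."""
    # Partition marks: explicit delimiters around the table (mark strings are made up here).
    if partition_marks:
        table_block = f"<table_start>\n{table_text}\n<table_end>"
    else:
        table_block = table_text
    question_block = f"Question: {question}"
    # Content order: table before the question, or the reverse.
    ordered = [table_block, question_block] if table_first else [question_block, table_block]
    # Role prompting: an optional leading instruction.
    parts = [role] if role else []
    return "\n\n".join(parts + ordered)


if __name__ == "__main__":
    header = ["name", "year"]
    rows = [["Ada", 1815], ["Alan", 1912]]
    table = serialize_table(header, rows, fmt="html")
    print(build_prompt(table, "How many rows does the table have?",
                       role="You are an expert at reading tables."))
```

Varying `fmt`, `role`, `partition_marks`, and `table_first` one at a time yields the kind of prompt variants whose effect such a benchmark would compare.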