Evaluating and Enhancing Structural Understanding Capabilities of Large Language Models on Tables via Input Designs

Large language models (LLMs) are becoming attractive as few-shot reasoners for natural language (NL) tasks. However, there is still much to be learned about how well LLMs understand structured data such as tables. Although tables can be passed to LLMs through serialization, comprehensive studies examining whether LLMs truly comprehend such data are lacking. In this paper, we address this question by designing a benchmark to evaluate the structural understanding capabilities (SUC) of LLMs. The benchmark includes seven tasks, each with its own unique challenges, e.g., cell lookup, row retrieval, and size detection. We run a series of evaluations on GPT-3 family models (e.g., text-davinci-003) and find that performance varies with a number of input choices, including table input format, content order, role prompting, and partition marks. Drawing on the insights gained from the benchmark evaluations, we then propose self-augmentation for effective structural prompting, e.g., critical value / range identification using LLMs' internal knowledge. When combined with carefully chosen input choices, these structural prompting methods lead to promising improvements in LLM performance on a variety of tabular tasks, e.g., TabFact($\uparrow2.31\%$), HybridQA($\uparrow2.13\%$), SQA($\uparrow2.72\%$), Feverous($\uparrow0.84\%$), and ToTTo($\uparrow5.68\%$). We believe our benchmark and proposed prompting methods can serve as a simple yet generic selection for future research. The code and data are released at https://anonymous.4open.science/r/StructuredLLM-76F3.
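
To make the input choices named in the abstract concrete (table input format, content order, role prompting, partition marks), the following is a minimal Python sketch of how a table might be serialized and wrapped into a cell-lookup prompt. It is an illustration under stated assumptions, not the benchmark's actual implementation: the function names, the `<table_start>`/`<table_end>` marks, the example table, and the question wording are all hypothetical.

```python
from html import escape


def to_markdown(header, rows):
    """Serialize a table as a Markdown-style pipe table (one possible input format)."""
    lines = [
        "| " + " | ".join(str(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)


def to_html(header, rows):
    """Serialize the same table as HTML (another possible input format)."""
    head = "<tr>" + "".join(f"<th>{escape(str(h))}</th>" for h in header) + "</tr>"
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table>{head}{body}</table>"


def build_prompt(table_text, question, role_prompt="", table_first=True,
                 partition_marks=True):
    """Assemble a prompt while varying the input choices discussed in the abstract.

    The partition mark strings below are hypothetical placeholders.
    """
    if partition_marks:
        table_part = f"<table_start>\n{table_text}\n<table_end>"
    else:
        table_part = table_text
    question_part = f"Question: {question}"
    # Content order: table before the question, or the reverse.
    parts = [table_part, question_part] if table_first else [question_part, table_part]
    if role_prompt:
        parts.insert(0, role_prompt)
    return "\n\n".join(parts)


def cell_lookup_example(header, rows, row_idx, col_idx):
    """Build a (question, gold answer) pair for a simple cell-lookup probe."""
    question = (
        f"What is the value in row {row_idx + 1} "
        f"of the column '{header[col_idx]}'?"
    )
    return question, str(rows[row_idx][col_idx])


# Illustrative data only; not taken from the benchmark.
header = ["Player", "Team"]
rows = [["Ann", "Red"], ["Bob", "Blue"]]

question, gold = cell_lookup_example(header, rows, row_idx=1, col_idx=1)
prompt = build_prompt(
    to_markdown(header, rows),
    question,
    role_prompt="You are an expert at reading tables.",
)
```

Swapping `to_markdown` for `to_html`, flipping `table_first`, or disabling `partition_marks` yields the kind of prompt variants whose effect on accuracy such a benchmark could compare; the model's answer would then be checked against `gold`.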
