Large language models (LLMs) are becoming attractive as few-shot reasoners for natural language (NL) tasks. However, much remains unknown about how well LLMs understand structured data such as tables. Although tables can be serialized and passed to LLMs as input, comprehensive studies examining whether LLMs truly comprehend such data are lacking. In this paper, we address this gap by designing a benchmark to evaluate the structural understanding capabilities (SUC) of LLMs. The benchmark includes seven tasks, each with its own unique challenges, e.g., cell lookup, row retrieval, and size detection. We perform a series of evaluations on GPT-3.5 and GPT-4 and find that performance varies with several input choices, including table input format, content order, role prompting, and partition marks. Drawing on insights from the benchmark evaluations, we propose \textit{self-augmentation} for effective structural prompting, such as critical value / range identification using the internal knowledge of LLMs. When combined with carefully chosen input choices, these structural prompting methods lead to promising improvements in LLM performance on a variety of tabular tasks, e.g., TabFact ($\uparrow2.31\%$), HybridQA ($\uparrow2.13\%$), SQA ($\uparrow2.72\%$), Feverous ($\uparrow0.84\%$), and ToTTo ($\uparrow5.68\%$). We believe that our open-source benchmark and proposed prompting methods can serve as a simple yet generic choice for future research.
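To make the input choices named in the abstract concrete, the following is a minimal Python sketch of table serialization, role prompting, partition marks, and a two-stage self-augmentation call. It is an illustration under stated assumptions, not the benchmark's actual code: the two serialization formats, the `<table_start>` / `<table_end>` marker strings, the role-prompt wording, the hint question, and the `llm` callable are all hypothetical stand-ins chosen for this example.

```python
from typing import Callable, List

Table = List[List[str]]  # assumption: the first row is the header


def to_markdown(table: Table) -> str:
    """Serialize a table as pipe-delimited rows with a header separator."""
    header, *rows = table
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(" --- " for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def to_html(table: Table) -> str:
    """Serialize the same table with explicit HTML row and cell tags."""
    header, *rows = table
    head = "<tr>" + "".join(f"<th>{c}</th>" for c in header) + "</tr>"
    body = "".join(
        "<tr>" + "".join(f"<td>{c}</td>" for c in row) + "</tr>" for row in rows
    )
    return f"<table>{head}{body}</table>"


def build_prompt(
    table_text: str,
    question: str,
    role_prompt: bool = True,
    partition_marks: bool = True,
) -> str:
    """Assemble a prompt; both toggles are input choices that can be varied."""
    parts = []
    if role_prompt:
        # Hypothetical role-prompt wording.
        parts.append("You are a helpful assistant for table analysis.")
    if partition_marks:
        # Hypothetical marker strings that delimit the table region.
        parts.append("<table_start>\n" + table_text + "\n<table_end>")
    else:
        parts.append(table_text)
    parts.append("Question: " + question)
    return "\n".join(parts)


def self_augmented_answer(
    llm: Callable[[str], str], table_text: str, question: str
) -> str:
    """Two-stage prompting: elicit structural hints, then answer with them."""
    # Stage 1: ask the model for critical values and ranges in the table.
    hint = llm(
        build_prompt(
            table_text, "Identify the critical values and ranges in this table."
        )
    )
    # Stage 2: append the model's own hints to the prompt for the real task.
    final_prompt = build_prompt(table_text, question) + "\nStructural hints: " + hint
    return llm(final_prompt)
```

Here `llm` can be any function that maps a prompt string to a completion string, so the same sketch can be used to compare serialization formats or toggle the role prompt and partition marks without changing the rest of the pipeline.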