Large language models (LLMs) are becoming attractive as few-shot reasoners for natural language (NL) tasks. However, their capability to process structured data such as tables remains under-explored. While tables can be serialized as input to LLMs, comprehensive studies of whether LLMs genuinely comprehend such data are lacking. In this paper, we address this question by designing a benchmark that evaluates the structural understanding capabilities of LLMs through seven distinct tasks, e.g., cell lookup, row retrieval, and size detection. Specifically, we perform a series of evaluations on the most advanced recent LLMs, GPT-3.5 and GPT-4, and observe that performance varies with different input choices, including table input format, content order, role prompting, and partition marks. Drawing on the insights gained from the benchmark evaluations, we propose $\textit{self-augmentation}$ for effective structural prompting, such as critical value / range identification using the internal knowledge of LLMs. When combined with carefully chosen input choices, these structural prompting methods lead to promising improvements in LLM performance on a variety of tabular tasks, e.g., TabFact($\uparrow2.31\%$), HybridQA($\uparrow2.13\%$), SQA($\uparrow2.72\%$), Feverous($\uparrow0.84\%$), and ToTTo($\uparrow5.68\%$). We believe that our open-source benchmark and proposed prompting methods can serve as a simple yet generic choice for future research. The code and data of this paper are temporarily released at https://anonymous.4open.science/r/StructuredLLM-76F3/README.md and will later be replaced with an official release at https://github.com/microsoft/TableProvider.
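To make the self-augmentation idea concrete, the following is a minimal sketch of a two-stage prompting flow: first ask the model to surface structural knowledge (e.g., critical values / ranges) from a serialized table, then feed that self-generated knowledge back alongside the downstream question. The function names (`call_llm`, `serialize_table`, `self_augmented_prompt`) and the stubbed model response are illustrative assumptions, not the paper's actual implementation or API.

```python
# Hedged sketch of two-stage "self-augmentation" structural prompting.
# `call_llm` is a hypothetical stand-in for any chat-completion API
# (e.g., GPT-3.5 / GPT-4); it is stubbed here so the flow runs offline.

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM endpoint.
    # The canned response below is for demonstration only.
    if "identify the critical values" in prompt:
        return "Critical values: 'Sales' ranges from 10 to 250."
    return "(model answer)"

def serialize_table(headers, rows, sep="|"):
    # Serialize a table with explicit partition marks, one of the
    # input choices the benchmark evaluates.
    lines = [sep.join(headers)]
    lines += [sep.join(str(c) for c in row) for row in rows]
    return "\n".join(lines)

def self_augmented_prompt(table_text: str, question: str) -> str:
    # Stage 1: the LLM extracts structural knowledge from the table.
    aug = call_llm(
        f"{table_text}\n\nPlease identify the critical values and "
        f"ranges of the table above."
    )
    # Stage 2: prepend the self-generated knowledge to the task prompt.
    return f"{table_text}\n\n{aug}\n\nQuestion: {question}"

table = serialize_table(["Year", "Sales"], [[2021, 10], [2022, 250]])
prompt = self_augmented_prompt(table, "Which year had higher sales?")
print(prompt)
```

The design point is that no external tool or schema annotation is needed: the same model produces the auxiliary structural description that later conditions its own answer.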