We investigate how large language models (LLMs) fail when tabular data in an otherwise canonical representation is subjected to semantic and structural distortions. Our findings reveal that LLMs lack an inherent ability to detect and correct subtle distortions in table representations. Only when provided with an explicit prior, via a system prompt, do models partially adjust their reasoning strategies and correct some distortions, though not consistently or completely. To study this phenomenon, we introduce a small, expert-curated dataset that explicitly evaluates LLMs on table question answering (TQA) tasks requiring an additional error-correction step prior to analysis. Our results show systematic differences in how LLMs ingest and interpret tabular information under distortion, with even SoTA models such as GPT-5.2 exhibiting an accuracy drop of at least 22%. These findings raise important questions for future research, particularly regarding when and how models should autonomously decide to realign tabular inputs, analogous to human behavior, without relying on explicit prompts or tabular data pre-processing.