RADAR: Benchmarking Language Models on Imperfect Tabular Data

Ken Gu, Zhihan Zhang, Kate Lin, Yuwei Zhang, Akshay Paruchuri, Hong Yu, Mehran Kazemi, Kumar Ayush, A. Ali Heydari, Maxwell A. Xu, Girish Narayanswamy, Yun Liu, Ming-Zher Poh, Yuzhe Yang, Mark Malhotra, Shwetak Patel, Hamid Palangi, Xuhai Xu, Daniel McDuff, Tim Althoff, Xin Liu

From arXiv; NeurIPS 2025 Datasets and Benchmarks Track

Language models (LMs) are increasingly deployed to perform autonomous data analyses. However, their data awareness (the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies) remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework that simulates data artifacts via programmatic perturbations, enabling targeted evaluation of model behavior. RADAR comprises 2,980 table-query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds up as tables grow. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
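To make the idea of simulating data artifacts via programmatic perturbations concrete, here is a minimal sketch in Python. It is not the authors' framework: the function names (`inject_missing`, `inject_outliers`, `naive_mean`), the toy step-count table, and the choice of perturbations are all assumptions made for illustration, and only two artifact kinds named in the abstract (missing values and outliers) are covered.

```python
import copy
import random


def inject_missing(rows, column, n_cells, seed=0):
    """Return a copy of `rows` with `n_cells` values in `column` set to None.

    Hypothetical helper: a stand-in for a "missing value" perturbation.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    perturbed = copy.deepcopy(rows)  # never mutate the clean table
    for i in rng.sample(range(len(perturbed)), n_cells):
        perturbed[i][column] = None
    return perturbed


def inject_outliers(rows, column, n_cells, factor=100.0, seed=0):
    """Return a copy of `rows` with `n_cells` values in `column` scaled by `factor`.

    Hypothetical helper: a stand-in for an "outlier" perturbation.
    """
    rng = random.Random(seed)
    perturbed = copy.deepcopy(rows)
    for i in rng.sample(range(len(perturbed)), n_cells):
        perturbed[i][column] = perturbed[i][column] * factor
    return perturbed


def naive_mean(rows, column):
    """Mean that skips None but does nothing about implausible values."""
    values = [r[column] for r in rows if r[column] is not None]
    return sum(values) / len(values)


# Toy table (made-up data): ten days of step counts.
clean = [{"day": d, "steps": 8000 + 100 * d} for d in range(10)]

with_missing = inject_missing(clean, "steps", n_cells=2)
with_outliers = inject_outliers(clean, "steps", n_cells=1)

print("clean mean:        ", naive_mean(clean, "steps"))
print("mean with outlier: ", naive_mean(with_outliers, "steps"))
```

Because each perturbation returns a modified copy and leaves the clean table untouched, the same query can be answered on both versions and the answers compared. In this toy case a single scaled cell is enough to shift an analysis that does not check for implausible values.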

