Language models (LMs) are increasingly deployed to perform autonomous data analyses. However, their data awareness -- the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies -- remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework that simulates data artifacts via programmatic perturbations, enabling targeted evaluation of model behavior. RADAR comprises 2,980 table-query pairs grounded in real-world data spanning 9 domains and 5 data artifact types. Beyond evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds up as tables grow. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly once data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.