MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark

From arXiv. Full version of a paper accepted at NeurIPS 2025. Code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU

Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which have traditionally required expert users such as data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of these capabilities remains limited. In contrast to the extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce and focus narrowly on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits both our understanding of model capabilities and progress in this important area. In this work, we introduce MMTU, a large-scale benchmark with over 28K questions across 25 real-world table tasks, designed to comprehensively evaluate models' ability to understand, reason over, and manipulate real tables at the expert level. These tasks are drawn from decades of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU requires a combination of skills (table understanding, reasoning, and coding) that remains challenging for today's frontier models: even frontier reasoning models like OpenAI GPT-5 and DeepSeek R1 score only around 69% and 57%, respectively, suggesting significant room for improvement. We highlight key findings from our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.
