PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?

Data preparation is a central and time-consuming stage in data analysis workflows. Traditionally, commercial tools have relied on graphical user interfaces (GUIs) to simplify data preparation, allowing users to define transformations through visual operators and workflows. Recent advances in large language models (LLMs) raise the possibility of a paradigm shift toward natural language (NL)-driven data preparation, in which users can specify preparation intents in NL directly. However, it remains unclear how far current LLM-based agents are from this paradigm shift in practice. Existing code generation benchmarks do not capture key characteristics of data preparation, including ambiguous user intents, imperfect real-world data, and the need to translate code into interpretable workflows for validation. To bridge this gap, we present PrepBench, a benchmark designed to evaluate NL-driven data preparation along three core capabilities: interactive disambiguation, prep-code generation, and code-to-workflow translation. We crawl data from the Preppin' Data Challenges, and then extend it into a systematically designed benchmark. The benchmark covers diverse domains, and each task involves 3 to 18 data preparation steps. Nearly half of the tasks require over 100 lines of Python code, and the longest solutions approach 300 lines. Our evaluation shows that, despite recent progress, realizing this paradigm shift remains challenging for state-of-the-art LLMs. PrepBench provides a principled benchmark for measuring this gap and helps identify key challenges toward realizing NL-driven data preparation.
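To make the notions of "data preparation steps" and "code-to-workflow translation" concrete, the following is a minimal, hypothetical sketch in plain Python. It is not a PrepBench task or the authors' method. The input rows, the step functions, and the helpers `run_pipeline` and `describe_workflow` are all assumptions introduced for illustration. It shows a four-step preparation over imperfect data, with the same ordered step list doubling as a readable workflow that a user could inspect.

```python
from collections import defaultdict

# Hypothetical messy input; not taken from PrepBench.
RAW_ROWS = [
    {"region": " North ", "amount": "1,200"},
    {"region": "south", "amount": "300"},
    {"region": "North", "amount": ""},
    {"region": "South", "amount": "450"},
]


def clean_region(rows):
    """Trim whitespace and normalize capitalization of the region name."""
    return [{**r, "region": r["region"].strip().title()} for r in rows]


def drop_missing_amount(rows):
    """Remove rows whose amount field is empty."""
    return [r for r in rows if r["amount"].strip()]


def parse_amount(rows):
    """Convert amounts such as '1,200' to integers."""
    return [{**r, "amount": int(r["amount"].replace(",", ""))} for r in rows]


def total_by_region(rows):
    """Sum the amount per region."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["region"]] += r["amount"]
    return dict(totals)


# The ordered step list is the single source of truth for both views.
STEPS = [clean_region, drop_missing_amount, parse_amount, total_by_region]


def run_pipeline(rows, steps=STEPS):
    """Apply each preparation step in order and return the final result."""
    result = rows
    for step in steps:
        result = step(result)
    return result


def describe_workflow(steps=STEPS):
    """Render the steps as a numbered, human-readable workflow."""
    return [f"{i}. {s.__name__}: {s.__doc__}" for i, s in enumerate(steps, start=1)]


if __name__ == "__main__":
    print(run_pipeline(RAW_ROWS))
    print("\n".join(describe_workflow()))
```

Keeping each step as a small named function is one possible design choice: the executable code and the workflow view are derived from the same list, so they cannot drift apart. The benchmark's tasks are far larger, with 3 to 18 steps and solutions of up to roughly 300 lines.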
