ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities

Constructing Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and a high-impact target for AI automation. On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility. We revisit these results and identify two factors causing a substantial underestimation of agent capabilities. First, re-evaluating ELT-Bench with upgraded large language models reveals that the extraction and loading stage is largely solved, while transformation performance improves significantly. Second, we develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' kappa = 0.85) to audit benchmark quality. Applying this to ELT-Bench uncovers that most failed transformation tasks contain benchmark-attributable errors -- including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth -- that penalize correct agent outputs. Based on these findings, we construct ELT-Bench-Verified, a revised benchmark with refined evaluation logic and corrected ground truth. Re-evaluating on this version yields significant improvement attributable entirely to benchmark correction. Our results show that both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities. More broadly, our findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation. Systematic quality auditing should be standard practice for complex agentic tasks. We release ELT-Bench-Verified to provide a more reliable foundation for progress in AI-driven data engineering automation.
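The inter-annotator agreement statistic cited above, Fleiss' kappa, can be computed from a table of per-item label counts. The sketch below uses the standard textbook formula; the function name `fleiss_kappa`, the two label categories, and the toy counts are illustrative assumptions and do not come from the ELT-Bench-Verified annotation data or tooling.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j.

    Assumes every item was labeled by the same number of annotators (>= 2).
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two annotators per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of annotations")
    n_categories = len(counts[0])

    # Observed agreement: fraction of annotator pairs that agree, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_chance = sum(p * p for p in proportions)

    if p_chance == 1.0:
        # All labels fall in one category; kappa is undefined in this case.
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy data (hypothetical): 4 failed tasks, 3 annotators each, two categories:
# column 0 = "benchmark-attributable", column 1 = "agent-attributable".
toy_counts = [
    [3, 0],
    [3, 0],
    [0, 3],
    [2, 1],
]
kappa = fleiss_kappa(toy_counts)
print(f"Fleiss' kappa on toy data: {kappa:.3f}")
```

Values near 1 indicate agreement well beyond chance, and values near 0 indicate agreement no better than chance. The toy table is only meant to show the input shape.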
