Discovering insights from a real-world data lake that may contain unclean, semi-structured, and unstructured data requires a variety of data processing tasks, ranging from extraction and cleaning to integration, analysis, and modeling. This process often also demands domain knowledge and project-specific insight. While AI models have shown remarkable results in reasoning and code generation, their ability to design and execute the complex pipelines that solve these data-lake-to-insight challenges remains unclear. We introduce KramaBench, a benchmark of 104 manually curated and solved challenges spanning 1,700 files, 24 data sources, and 6 domains. KramaBench tests the end-to-end capabilities of AI systems on challenges that require automated orchestration of diverse data tasks. It also features a comprehensive evaluation framework that assesses both the pipeline-design and individual-task-implementation abilities of AI systems. We evaluate 8 LLMs using our single-agent reference framework, DS-Guru, alongside open- and closed-source single- and multi-agent systems, and find that while current agentic systems can handle isolated data-science tasks and generate plausible draft pipelines, they struggle to produce working end-to-end pipelines. On KramaBench, the best system reaches only 55% end-to-end accuracy in the full data-lake setting; even with perfect retrieval, accuracy tops out at 62%. Leading LLMs identify up to 42% of the important data tasks but fully implement only 20% of individual data tasks. Our code, reference framework, and data are available at https://github.com/mitdbg/KramaBench.