BatchBench: Toward a Workload-Aware Benchmark for Autoscaling Policies in Big Data Batch Processing -- A Proposed Framework

Autoscaling has become a baseline expectation for cloud-native big data processing, and the design space has expanded beyond rule-based heuristics to include learned controllers and, most recently, large language model (LLM) agents. Yet despite a growing body of work spanning these paradigms, the community lacks a shared benchmark for comparing them. Existing evaluations rely on synthetic TPC-style queries, vendor blog posts with proprietary baselines, or narrow trace replays. Each new policy reports favorable numbers against a different baseline, on a different workload, under a different cost model, which makes cross-paper comparison effectively impossible. This is a position paper. We propose BatchBench, an open benchmarking framework designed to place rule-based, learned, and agentic autoscaling policies on equal experimental footing. The contribution is the design of the framework, not empirical results. We contribute: (1) a workload taxonomy of six batch processing classes synthesized from published autoscaling benchmarks and publicly released cluster traces; (2) the design of a parameterized workload generator, with a validation methodology based on the two-sample Kolmogorov-Smirnov test and the earth mover's distance; (3) a five-axis evaluation harness specification covering cost, SLA attainment, scaling responsiveness, scaling thrash, and decision interpretability, with first-class accounting for LLM inference cost; and (4) a standardized agent interface that lets LLM-based and reinforcement-learning autoscalers be evaluated alongside rule-based controllers through a single API. We discuss the expected evaluation surface, identify open research questions the framework is designed to answer, and outline a roadmap for the empirical paper that will follow. BatchBench's reference implementation is in active development and will be released as open source.
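To make the validation idea in contribution (2) concrete, the following is a minimal sketch of how a generated workload sample could be compared against a reference trace using the two-sample Kolmogorov-Smirnov statistic and the one-dimensional earth mover's distance. It is not the BatchBench implementation: the function names, the `looks_faithful` helper, and its threshold parameters are hypothetical and chosen only for illustration. The sketch computes the KS statistic only, not a p-value.

```python
from bisect import bisect_right


def _ecdf_gaps(a, b):
    """Evaluate |F_a - F_b| at every distinct value in the merged sample.

    F_a and F_b are the empirical CDFs of the two samples.
    """
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    gaps = [
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in points
    ]
    return points, gaps


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    _, gaps = _ecdf_gaps(a, b)
    return max(gaps)


def earth_mover_distance(a, b):
    """1-D earth mover's distance: the area between the empirical CDFs."""
    points, gaps = _ecdf_gaps(a, b)
    # The CDF gap is constant on each interval [points[i], points[i + 1]).
    return sum(g * (hi - lo) for g, lo, hi in zip(gaps, points, points[1:]))


def looks_faithful(generated, reference, ks_max, emd_max):
    """Hypothetical acceptance check; both thresholds are assumptions."""
    return (
        ks_statistic(generated, reference) <= ks_max
        and earth_mover_distance(generated, reference) <= emd_max
    )


if __name__ == "__main__":
    # Illustrative job durations in seconds (made-up values, not trace data).
    reference = [12.0, 15.0, 20.0, 31.0, 44.0]
    generated = [11.0, 16.0, 22.0, 30.0, 47.0]
    print("KS statistic:", ks_statistic(generated, reference))
    print("EMD:", earth_mover_distance(generated, reference))
```

The two measures are complementary: the KS statistic is sensitive to the single largest disagreement between the distributions, while the earth mover's distance accumulates disagreement across the whole range and is expressed in the units of the measured quantity. In practice, SciPy's `scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance` provide library versions of both.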
