Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, so diverse and representative workloads are essential for measuring its effectiveness accurately. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address these limitations, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, whose samples are selected to prioritize semantic diversity. It also includes a Throughput data split, which allows speedup to be evaluated across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines such as vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors that other benchmarks often mask. We demonstrate this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size-dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.
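To make the notion of speedup across concurrencies concrete, the following is a minimal sketch of how the two quantities commonly reported for SD could be computed: the throughput speedup at each concurrency level, and the mean acceptance length (output tokens per target-model step). It is not SPEED-Bench's own code or API. The function names, the dictionary layout, and every number are hypothetical placeholders chosen for illustration, not measured results.

```python
# Illustrative sketch only: names and numbers are hypothetical assumptions,
# not part of SPEED-Bench and not measured results.

def speedup_by_concurrency(baseline_tps, speculative_tps):
    """Return {concurrency: speedup}, where speedup is the ratio of
    speculative-decoding throughput to baseline throughput (tokens/s).
    Only concurrency levels present in both inputs are compared."""
    shared = sorted(set(baseline_tps) & set(speculative_tps))
    return {c: speculative_tps[c] / baseline_tps[c] for c in shared}


def mean_acceptance_length(output_tokens, target_steps):
    """Average number of output tokens produced per target-model forward
    step. Plain autoregressive decoding yields exactly 1.0."""
    if target_steps <= 0:
        raise ValueError("target_steps must be positive")
    return output_tokens / target_steps


# Placeholder throughputs (tokens/s) keyed by number of concurrent requests.
baseline = {1: 100.0, 32: 2000.0}
speculative = {1: 250.0, 32: 3000.0}

speedups = speedup_by_concurrency(baseline, speculative)
acceptance = mean_acceptance_length(output_tokens=300, target_steps=100)
```

With these placeholder inputs the speedup is higher at concurrency 1 than at concurrency 32, which is the kind of concurrency-dependent behavior a throughput-oriented split is meant to expose. Real values have to come from runs on a serving engine.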