Large Language Models (LLMs) demonstrate remarkable emergent abilities across various tasks, yet fall short on complex reasoning and planning tasks. Tree-search-based reasoning methods address this limitation by encouraging exploration of intermediate reasoning steps, surpassing the capabilities of chain-of-thought prompting. However, such methods introduce significant inference latency due to the systematic exploration and evaluation of multiple thought paths. This paper introduces SeeD, a novel and efficient inference framework that optimizes runtime speed and GPU memory management concurrently. By employing scheduled speculative execution, SeeD efficiently handles the many iterations of thought generation and state evaluation, leveraging a rounds-scheduled strategy to manage draft-model dispatching. Extensive experimental evaluations on three reasoning datasets demonstrate the superior speedup of SeeD, providing a viable path toward batched inference in training-free speculative decoding.
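To make the rounds-scheduled idea concrete, the following is a minimal toy sketch, not SeeD's actual implementation: in each round, the cheap draft model is dispatched over every active tree branch as one batch, and the target model then verifies the drafted candidates. All function and variable names here are illustrative assumptions.

```python
from collections import deque

def draft_generate(branch, k=3):
    # Stand-in for the draft model: propose k cheap candidate steps.
    return [f"{branch}+d{i}" for i in range(k)]

def target_verify(candidates):
    # Stand-in for the target model: accept a prefix of the drafts.
    return candidates[: max(1, len(candidates) // 2)]

def rounds_scheduled_decode(branches, rounds=2):
    """Dispatch the draft model over all active branches in rounds,
    then verify each branch's drafts with the target model in a batch."""
    queue = deque(branches)
    accepted_steps = {}
    for _ in range(rounds):
        # Round phase 1: batch all draft-model calls for active branches.
        batch = [(b, draft_generate(b)) for b in queue]
        queue.clear()
        # Round phase 2: verify drafts and keep branches active.
        for branch, drafts in batch:
            accepted = target_verify(drafts)
            accepted_steps.setdefault(branch, []).extend(accepted)
            queue.append(branch)
    return accepted_steps

result = rounds_scheduled_decode(["state0", "state1"], rounds=2)
```

The point of the round structure is that draft-model calls for many branches can share a batch, amortizing dispatch overhead, rather than interleaving draft and target calls per branch.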