Large language models (LLMs) have seen wide adoption for natural language tasks, owing their success to massive numbers of model parameters (e.g., 70B+); however, LLM inference incurs significant computation and memory costs. Recent approaches propose parallel decoding strategies, such as Skeleton-of-Thought (SoT), which improve performance by breaking prompts down into sub-problems that can be decoded in parallel, but they often suffer from reduced response quality. Our key insight is that we can request additional information, specifically dependencies and difficulty, when generating the sub-problems to improve both response quality and performance. In this paper, we propose Skeleton Graph Decoding (SGD), which uses the dependencies exposed between sub-problems to forward information between dependent sub-problems for improved quality, while exposing parallelization opportunities for decoding independent sub-problems. Additionally, we leverage difficulty estimates for each sub-problem to select an appropriately sized model, improving performance without significantly reducing quality. Compared to standard autoregressive generation and SoT, SGD achieves a 1.69x speedup while improving quality by up to 51%.
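The scheduling idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `Node` structure, the 0-to-1 difficulty score, the 0.5 threshold, the model names "small" and "large", and the `fake_decode` stand-in for an LLM call are all assumptions made for illustration. The sketch also simplifies scheduling to level-by-level waves, where every sub-problem whose dependencies are finished is decoded in parallel and its output is forwarded to its dependents.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Node:
    """One sub-problem in the skeleton graph (hypothetical structure)."""
    name: str
    prompt: str
    deps: list = field(default_factory=list)  # names of prerequisite nodes
    difficulty: float = 0.0                   # assumed score in [0, 1]


def pick_model(difficulty, threshold=0.5):
    """Route hard sub-problems to a larger model (threshold is an assumption)."""
    return "large" if difficulty >= threshold else "small"


def fake_decode(model, prompt, context):
    """Stand-in for an LLM call; a real system would query `model` here."""
    suffix = f" (given: {'; '.join(context)})" if context else ""
    return f"[{model}] {prompt}{suffix}"


def decode_graph(nodes, decode=fake_decode, max_workers=4):
    """Decode sub-problems wave by wave, forwarding dependency outputs."""
    by_name = {n.name: n for n in nodes}
    done = {}
    remaining = set(by_name)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining:
            # A node is ready once all of its dependencies have been decoded.
            ready = sorted(
                n for n in remaining
                if all(d in done for d in by_name[n].deps)
            )
            if not ready:
                raise ValueError("dependency cycle or unknown dependency")
            futures = {}
            for name in ready:
                node = by_name[name]
                context = [done[d] for d in node.deps]  # information forwarding
                model = pick_model(node.difficulty)
                futures[name] = pool.submit(decode, model, node.prompt, context)
            for name, fut in futures.items():
                done[name] = fut.result()
            remaining -= set(ready)
    return done


if __name__ == "__main__":
    graph = [
        Node("a", "define terms", [], 0.2),
        Node("b", "give example", ["a"], 0.8),
        Node("c", "list caveats", [], 0.1),
    ]
    for name, text in decode_graph(graph).items():
        print(name, "->", text)
```

In this toy graph, nodes `a` and `c` have no dependencies and are decoded together in the first wave, while `b` waits for `a` and receives its output as context. A wave-based loop is the simplest way to show the idea; a finer-grained scheduler could start each node as soon as its own dependencies finish.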