This work aims at decreasing the end-to-end generation latency of large language models (LLMs). One of the major causes of the high generation latency is the sequential decoding approach adopted by almost all state-of-the-art LLMs. In this work, motivated by the thinking and writing process of humans, we propose Skeleton-of-Thought (SoT), which first guides LLMs to generate the skeleton of the answer, and then conducts parallel API calls or batched decoding to complete the contents of each skeleton point in parallel. Not only does SoT provide considerable speed-ups across 12 LLMs, but it can also potentially improve the answer quality on several question categories. SoT is an initial attempt at data-centric optimization for inference efficiency, and showcases the potential of eliciting high-quality answers by explicitly planning the answer structure in language.
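The two-stage process described above (skeleton first, then parallel expansion of each point) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `llm` function is a hypothetical stand-in for a real LLM API call, and the prompt wording is assumed for demonstration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a real LLM endpoint; SoT is model-agnostic,
# so any chat-completion API could be substituted here.
def llm(prompt: str) -> str:
    if "skeleton" in prompt:
        # A real model would return a short numbered outline.
        return "1. Reduce latency\n2. Decode in parallel\n3. Plan structure"
    return f"Expanded: {prompt}"

def skeleton_of_thought(question: str) -> str:
    # Stage 1: one sequential call produces the skeleton of the answer.
    outline = llm(f"Write a concise skeleton (numbered points) for: {question}")
    points = [line.split(". ", 1)[1]
              for line in outline.splitlines() if ". " in line]
    # Stage 2: expand every skeleton point with independent, parallel
    # calls (or, for local models, one batched decoding pass).
    with ThreadPoolExecutor() as pool:
        expansions = list(pool.map(
            lambda p: llm(f"Elaborate on the point: {p}"), points))
    # Reassemble the expansions in the original skeleton order.
    return "\n".join(f"{i + 1}. {p}\n{e}"
                     for i, (p, e) in enumerate(zip(points, expansions)))

answer = skeleton_of_thought("How can LLM generation latency be reduced?")
```

Because the point expansions are mutually independent, the wall-clock time of stage 2 is governed by the longest single expansion rather than the sum of all of them, which is the source of the speed-up.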