Chain-of-thought prompting has emerged as a powerful technique for enabling large language models (LLMs) to solve complex reasoning tasks. However, these reasoning chains can be verbose, raising concerns about efficiency. In response, recent works have sought to decrease response lengths through simple prompting strategies (e.g., 'be concise'). In this work, we conduct the first systematic study of the relationship between reasoning length and model performance across a diverse range of compression instructions (e.g., 'use 10 words or less' or 'remove all punctuation'). In doing so, we discover a universal tradeoff between reasoning length and accuracy that persists even across very distinct reasoning chains. We demonstrate that this tradeoff emerges from a sharp threshold behavior at the question level: each task has an intrinsic 'token complexity', the minimal number of tokens required for successful problem-solving. We show how token complexity enables us to compute information-theoretic limits on the accuracy-compression tradeoff, and find that prompt-based compression strategies operate far from these theoretical limits. This suggests there may be significant room for improvement, and our framework provides a benchmark against which researchers can evaluate progress in reasoning efficiency. Our work also highlights the importance of adaptive compression, i.e., giving shorter responses to easier questions, and we show that token complexity is a useful tool for measuring this capability.
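To make the threshold behavior concrete, one natural formalization (a sketch in our own notation, assumed for exposition rather than taken verbatim from the paper) is the following: let $\tau_i$ denote the token complexity of question $i$ and $\ell_i$ the length of the reasoning chain produced for it, with the question solved if and only if $\ell_i \ge \tau_i$. A compression strategy operating under an average token budget $B$ over $n$ questions then faces the limit

\[
\mathrm{Acc}^{*}(B) \;=\; \max_{\ell_1,\dots,\ell_n \ge 0} \; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{\ell_i \ge \tau_i\}
\quad \text{subject to} \quad \frac{1}{n}\sum_{i=1}^{n} \ell_i \le B,
\]

which is attained by spending exactly $\tau_i$ tokens on each question the strategy chooses to solve and none on the rest. This is precisely the adaptive behavior (shorter responses for easier questions) highlighted above, and the gap between a prompt's observed accuracy-length curve and $\mathrm{Acc}^{*}(B)$ indicates how far it operates from the theoretical limit.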