Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities, but it also makes them less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embodied assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with purely sequential interaction. In contrast, humans can listen, think, and act asynchronously: we begin thinking about a problem while reading it and continue thinking while formulating the answer. In this work, we augment reasoning-capable LLMs to operate in a similar way without additional training. Our method exploits the properties of positional embeddings to let LLMs built for sequential generation think, listen, and write outputs simultaneously. We evaluate our approach on math, commonsense, and safety reasoning: it allows models to generate accurate thinking-augmented answers while reducing the time to first non-thinking token from minutes to ${\le}$ 5s and overall real-time delay by up to $12{\times}$.
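The core idea, that a sequential model can host several logically concurrent streams if each stream keeps its own contiguous position indices, can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's implementation: the function name, the scheduling policy, and the per-stream position counters are all illustrative assumptions about how interleaved thinking and response tokens might be tracked.

```python
# Illustrative sketch (NOT the paper's actual method): interleave a
# "thinking" stream and a "response" stream while giving each stream its
# own monotonically increasing position ids, so a position-embedding-based
# model could treat each stream as its own sequential context.

def interleave_streams(thinking_tokens, response_tokens, schedule):
    """Merge two token streams according to a schedule of stream names.

    Returns a list of (stream, position_id, token) triples, where
    position_id counts independently within each stream.
    """
    iters = {"think": iter(thinking_tokens), "respond": iter(response_tokens)}
    positions = {"think": 0, "respond": 0}  # per-stream position counters
    merged = []
    for stream in schedule:
        token = next(iters[stream], None)
        if token is None:  # that stream is exhausted; skip this slot
            continue
        merged.append((stream, positions[stream], token))
        positions[stream] += 1
    return merged

merged = interleave_streams(
    ["t0", "t1"], ["r0"], ["think", "respond", "think"]
)
# Each stream's position ids stay contiguous even though the streams
# interleave in wall-clock order.
```

In a real model the per-stream position ids would be fed to the positional-embedding mechanism (e.g. as explicit `position_ids`) instead of being stored in tuples; the sketch only shows the bookkeeping.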