Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities, but it also makes models less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embodied assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with strictly sequential interaction. In contrast, humans can listen, think, and act asynchronously: we begin thinking about a problem while reading it and continue thinking while formulating the answer. In this work, we augment reasoning-capable LLMs to operate in a similar way without additional training. Our method uses the properties of positional embeddings to enable LLMs built for sequential generation to simultaneously think, listen, and write outputs. We evaluate our approach on math, commonsense, and safety reasoning: it allows models to generate accurate thinking-augmented answers while reducing the time to the first non-thinking token from minutes to $\le 5$s and the overall delay by up to $12\times$.