Large language models (LLMs) have demonstrated remarkable capabilities in chain-of-thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by the human ability to think while reading, we first design a \textit{\textbf{streaming thinking}} paradigm for LLMs, in which reasoning unfolds in the order of the input and further adjusts its depth once reading is complete. We instantiate this paradigm with \textit{StreamingThinker}, a framework that enables LLMs to think while reading by integrating streaming CoT generation, streaming-constrained training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encodings, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby keeping reading and reasoning aligned and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across mathematical reasoning, logical reasoning, and context-based QA tasks. Experimental results show that StreamingThinker achieves performance comparable to batch thinking while reducing token waiting before the onset of reasoning by 80\% and time-level latency to the final answer by more than 60\%, demonstrating the effectiveness of the streaming paradigm for LLM reasoning. Code will be released at \url{https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker}.
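To make the order-preserving constraint concrete, the sketch below illustrates one way a streaming attention mask of this kind could be built. It is a minimal illustration in PyTorch, not the released implementation: the function name \texttt{streaming\_attention\_mask}, the fixed-size chunks and units, and the layout that places all input positions before all reasoning positions are assumptions made purely for exposition. The idea it demonstrates is that reasoning unit $i$ may attend only to the input chunks read before it is generated, while input tokens never attend to reasoning tokens, so the input-side KV cache can be filled concurrently with generation.

\begin{verbatim}
import torch

def streaming_attention_mask(n_chunks, chunk_len, unit_len):
    # Illustrative sketch (assumed layout, not the paper's code).
    # True = attention allowed. Position layout:
    #   [chunk_0 ... chunk_{n-1} | unit_0 ... unit_{n-1}]
    n_in, n_th = n_chunks * chunk_len, n_chunks * unit_len
    mask = torch.zeros(n_in + n_th, n_in + n_th, dtype=torch.bool)

    # Input tokens attend causally to input only, never to reasoning,
    # so the input KV cache can be filled as text streams in.
    r = torch.arange(n_in)
    mask[:n_in, :n_in] = r[:, None] >= r[None, :]

    # Reasoning tokens attend causally within the reasoning stream ...
    t = torch.arange(n_th)
    mask[n_in:, n_in:] = t[:, None] >= t[None, :]

    # ... and reasoning unit i additionally sees input chunks 0..i,
    # exactly the prefix that has been read when unit i is generated.
    for i in range(n_chunks):
        rows = slice(n_in + i * unit_len, n_in + (i + 1) * unit_len)
        mask[rows, : (i + 1) * chunk_len] = True
    return mask

# Tiny example: 3 input chunks of 2 tokens, one 2-token unit per chunk.
print(streaming_attention_mask(3, 2, 2).int())
\end{verbatim}

Under these assumptions, the decoupling of the two KV caches falls out of the mask structure: input rows depend only on input columns, so encoding the incoming stream and generating reasoning units can proceed in parallel.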