StreamingThinker: Large Language Models Can Think While Reading

Large language models (LLMs) have demonstrated remarkable capabilities in chain-of-thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by the human ability to think while reading, we first design a \textit{\textbf{streaming thinking}} paradigm for LLMs, in which reasoning unfolds in the order of the input and its depth is further adjusted once reading is complete. We instantiate this paradigm with \textit{StreamingThinker}, a framework that enables LLMs to think while reading by integrating streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks. Experimental results show that StreamingThinker preserves performance comparable to batch thinking while yielding an 80\% reduction in token waiting before the onset of reasoning and a more than 60\% reduction in time-level latency for producing the final answer, demonstrating the effectiveness of the streaming paradigm for LLM reasoning. Code will be released at \href{https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker}{this repository}.
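To make the idea of order-preserving reasoning concrete, the following is a minimal sketch of what a streaming attention mask could look like. It is not the released implementation. The function name `streaming_mask`, the one-reasoning-unit-per-input-segment pairing, the `[input tokens | reasoning tokens]` layout, and the rule that input tokens never attend to reasoning tokens are all assumptions made for illustration.

```python
def streaming_mask(input_seg_lens, reasoning_unit_lens):
    """Build a boolean mask allowed[q][k] over [input tokens | reasoning tokens].

    Hypothetical layout: input segment u (e.g. one sentence) is paired with
    reasoning unit u, which is produced right after that segment is read.
    """
    if len(input_seg_lens) != len(reasoning_unit_lens):
        raise ValueError("expected one reasoning unit per input segment")

    n_in = sum(input_seg_lens)
    n_r = sum(reasoning_unit_lens)
    total = n_in + n_r

    # visible_inputs[u]: number of input tokens read once segment u is complete.
    visible_inputs = []
    seen = 0
    for seg_len in input_seg_lens:
        seen += seg_len
        visible_inputs.append(seen)

    # unit_of[r]: index of the reasoning unit that owns reasoning token r.
    unit_of = [u for u, n in enumerate(reasoning_unit_lens) for _ in range(n)]

    allowed = [[False] * total for _ in range(total)]

    # Input tokens: ordinary causal attention over the input only (assumption:
    # input encoding is independent of any reasoning tokens).
    for q in range(n_in):
        for k in range(q + 1):
            allowed[q][k] = True

    # Reasoning tokens: see only the input read so far, plus all reasoning
    # tokens up to and including themselves.
    for r in range(n_r):
        q = n_in + r
        for k in range(visible_inputs[unit_of[r]]):
            allowed[q][k] = True
        for k in range(n_in, q + 1):
            allowed[q][k] = True

    return allowed


if __name__ == "__main__":
    # Two input segments (2 and 3 tokens) and two reasoning units (1 and 2 tokens).
    for row in streaming_mask([2, 3], [1, 2]):
        print("".join("1" if cell else "0" for cell in row))
```

Under these assumptions, a reasoning token in the first unit cannot see the second input segment, which mimics reasoning that starts before the full input has arrived. Keeping input rows blind to reasoning columns is one way to let input encoding and reasoning generation use separate caches.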
