Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on-policy reinforcement learning (RL). Importantly, Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at https://github.com/GMLR-Penn/Multiplex-Thinking.
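As a rough illustration of the mechanism sketched above, the snippet below shows one possible multiplex thinking step: sample K candidate next tokens from the model's distribution and fuse their embeddings into a single continuous token. The probability-weighted averaging, the choice of K, and the function name are illustrative assumptions, not the paper's exact specification.

```python
# Minimal sketch of one multiplex thinking step (assumed aggregation rule:
# probability-weighted average of the K sampled candidates' embeddings).
import torch
import torch.nn.functional as F

def multiplex_step(logits: torch.Tensor,
                   embedding: torch.nn.Embedding,
                   k: int = 4,
                   temperature: float = 1.0):
    """Sample K candidate tokens and fuse their embeddings into one continuous token.

    logits:    (vocab_size,) next-token logits from the language model.
    embedding: the model's input embedding table (reuses the vocabulary embedding prior).
    Returns the multiplex token embedding (hidden_dim,) and the sampled token ids (k,).
    """
    probs = F.softmax(logits / temperature, dim=-1)
    # Sample K candidates, mirroring standard stochastic decoding.
    cand_ids = torch.multinomial(probs, num_samples=k, replacement=False)
    cand_probs = probs[cand_ids]
    weights = cand_probs / cand_probs.sum()          # renormalize over the K samples
    cand_embs = embedding(cand_ids)                  # (k, hidden_dim)
    multiplex_token = (weights.unsqueeze(-1) * cand_embs).sum(dim=0)
    # If the model is confident, one candidate carries nearly all the mass and the
    # multiplex token is close to a discrete token embedding; if it is uncertain,
    # the token mixes several plausible next steps without lengthening the sequence.
    return multiplex_token, cand_ids
```

Under this reading, feeding the returned continuous embedding back as the next input position keeps sequence length unchanged, and because the K candidates are sampled from the model's own distribution, the rollout probability remains tractable for on-policy RL.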