The Reasoning-Creativity Trade-off: Toward Creativity-Driven Problem Solving

State-of-the-art large language model (LLM) pipelines rely on bootstrapped reasoning loops: they sample diverse chains of thought and reinforce the highest-scoring ones, optimizing mainly for correctness. We analyze how this design choice makes training prone to collapsing the model's distribution over reasoning paths, which slashes semantic entropy and undermines creative problem-solving. To analyze this failure, we introduce Distributional Creative Reasoning (DCR), a unified variational objective that casts training as gradient flow through probability measures on solution traces. STaR, GRPO, DPO, entropy bonuses, and other methods all arise as special cases of the same loss. The framework delivers three core results: (i) the diversity decay theorem, which describes how correctness-based objectives lead to distinct modes of diversity decay for STaR, GRPO, and DPO; (ii) designs that ensure convergence to a stable and diverse policy, effectively preventing collapse; and (iii) simple, actionable recipes for achieving this in practice. DCR thus offers the first principled recipe for LLMs that remain both correct and creative.
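To make the collapse concrete, here is a minimal toy sketch. It is not the DCR objective, nor an implementation of STaR, GRPO, or DPO. It assumes a softmax policy over eight hypothetical reasoning paths, three of which are correct, and runs exact gradient ascent on expected correctness with an optional Shannon-entropy bonus. Shannon entropy over paths stands in for semantic entropy here, and the rewards, initial logits, `beta`, learning rate, and step count are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert logits to a probability vector (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(p):
    """Shannon entropy in nats; a stand-in for semantic entropy (assumption)."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def train(logits, rewards, beta=0.0, lr=0.5, steps=2000):
    """Exact gradient ascent on E[reward] + beta * H(p) over softmax logits."""
    logits = list(logits)
    for _ in range(steps):
        p = softmax(logits)
        expected_reward = sum(q * r for q, r in zip(p, rewards))
        h = entropy(p)
        for i, (q, r) in enumerate(zip(p, rewards)):
            # d E[r] / d logit_i = p_i * (r_i - E[r])
            grad_reward = q * (r - expected_reward)
            # d H / d logit_i = -p_i * (log p_i + H)
            grad_entropy = -q * (math.log(q) + h) if q > 0.0 else 0.0
            logits[i] += lr * (grad_reward + beta * grad_entropy)
    return softmax(logits)


# Hypothetical setup: paths 0-2 are correct, paths 3-7 are incorrect.
rewards = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
init_logits = [1.0, 0.0, -1.0, 0.5, 0.5, 0.0, 0.0, -0.5]

p_plain = train(init_logits, rewards, beta=0.0)  # correctness only
p_reg = train(init_logits, rewards, beta=0.5)    # correctness + entropy bonus

print("entropy, correctness only:", round(entropy(p_plain), 3))
print("entropy, with entropy bonus:", round(entropy(p_reg), 3))
```

In this toy setting the correctness-only run concentrates nearly all mass on the correct paths and amplifies the initial imbalance among them, while the entropy-regularized run settles near the Gibbs distribution proportional to `exp(reward / beta)` and keeps more entropy, at the cost of leaving some mass on incorrect paths. The sketch illustrates the tension between correctness and diversity only; it says nothing about the specific decay modes the paper derives.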
