Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning by imitating humans or non-Long-CoT LLMs. To understand why, we propose a unified view in which effective and learnable Long CoT trajectories exhibit stable, molecule-like structures formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals that these structures emerge from Long CoT fine-tuning rather than from keyword imitation. We introduce the notion of Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, whereas structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides the synthesis of effective Long CoT structures, improving performance and reinforcement learning (RL) stability across benchmarks.
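To make the entropy-convergence criterion concrete, the following is a minimal, hypothetical Python sketch (not the paper's implementation): it tracks the mean next-token predictive entropy of a trajectory across fine-tuning checkpoints, where rapid decay of this quantity would correspond to the fast entropy convergence the abstract associates with learnable bonds. The checkpoint paths and the placeholder trajectory are assumptions for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_token_entropy(model, tokenizer, text: str) -> float:
    """Average next-token predictive entropy (in nats) over `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (1, seq_len)
    return token_entropy.mean().item()

# Hypothetical checkpoint paths saved during Long CoT fine-tuning.
checkpoints = ["ckpt-500", "ckpt-1000", "ckpt-2000"]
trajectory = "<a distilled Long CoT trajectory>"  # placeholder text
tokenizer = AutoTokenizer.from_pretrained(checkpoints[0])
for path in checkpoints:
    model = AutoModelForCausalLM.from_pretrained(path).eval()
    print(path, round(mean_token_entropy(model, tokenizer, trajectory), 4))
```

Under this reading, a trajectory whose entropy curve flattens quickly across checkpoints would count as supporting stable Long CoT learning, while one that stays high or oscillates would signal structural competition.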