Reasoning-tuned LLMs using long Chain-of-Thought (CoT) excel at single-answer tasks, yet their ability to model Human Label Variation, which requires capturing probabilistic ambiguity rather than resolving it, remains underexplored. We investigate this through systematic disentanglement experiments on distribution-based tasks, employing Cross-CoT experiments to isolate the effect of the reasoning text from a model's intrinsic priors. We observe a distinct "decoupled mechanism": although CoT improves distributional alignment overall, final accuracy is dictated by the CoT content (which contributes 99% of the variance), whereas distributional ranking is governed by model priors (over 80%). Step-wise analysis further shows that while the influence of CoT on accuracy grows monotonically over the course of reasoning, the distributional structure is largely determined by the LLM's intrinsic priors. These findings suggest that long CoT acts as a decisive decision-maker for the top option but fails to function as a fine-grained distribution calibrator on ambiguous tasks.
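To make the Cross-CoT attribution concrete, the sketch below (not the paper's code; all numbers and names are hypothetical) shows how a 2x2 design, pairing each model's prior with each model's reasoning text, can be decomposed into variance shares for "CoT content" vs. "model prior" for an accuracy-style metric and a ranking-style metric.

```python
# Minimal sketch, assuming a 2x2 cross-CoT design: outcomes[i][j] is a metric
# obtained when model i (its prior) answers conditioned on the CoT written by
# model j. Values below are illustrative, not the paper's measurements.
import numpy as np

# Rows: model prior (A, B); columns: CoT source (A, B).
accuracy = np.array([[0.78, 0.62],
                     [0.77, 0.63]])   # varies mostly across columns -> CoT-driven
rank_corr = np.array([[0.71, 0.69],
                      [0.48, 0.51]])  # varies mostly across rows -> prior-driven

def variance_shares(outcomes: np.ndarray) -> dict:
    """Two-way main-effect decomposition of a 2x2 cross table."""
    grand = outcomes.mean()
    ss_prior = outcomes.shape[1] * ((outcomes.mean(axis=1) - grand) ** 2).sum()
    ss_cot = outcomes.shape[0] * ((outcomes.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((outcomes - grand) ** 2).sum()
    return {"cot_content": ss_cot / ss_total, "model_prior": ss_prior / ss_total}

print("accuracy  ->", variance_shares(accuracy))   # CoT content dominates (~99%)
print("rank corr ->", variance_shares(rank_corr))  # model prior dominates (>80%)
```

Under this toy decomposition, a metric whose value changes when the CoT is swapped but not when the model is swapped attributes nearly all of its variance to the CoT content, and vice versa, which is the pattern the abstract describes for accuracy and distributional ranking respectively.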