This paper presents a method for quantitatively assessing the alignment between multi-step, structured reasoning in large language models and human preferences. We introduce the Alignment Score, a semantic-level metric that compares model-produced chain-of-thought traces with a human-preferred reference by constructing semantic-entropy-based matrices over intermediate steps and measuring their divergence. Our analysis shows that the Alignment Score tracks task accuracy across models and hop depths, peaking at 2-hop reasoning. Empirical results further indicate that misalignment at greater reasoning depths is driven mainly by alignment errors such as thematic shift and redundant reasoning. Viewing chain sampling as drawing from a distribution over reasoning paths, we demonstrate a strong and consistent correlation between Alignment Score and task accuracy, supporting its use as a meaningful diagnostic signal for structured reasoning.
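To make the construction concrete, the following is a minimal sketch, not the paper's exact formulation: it assumes each reasoning step is already encoded as an embedding vector, builds a row-stochastic semantic matrix per chain from softmaxed cosine similarities, and maps a mean row-wise KL divergence into (0, 1]. The names `semantic_matrix` and `alignment_score`, the softmax construction, and the KL-based divergence are all illustrative assumptions; the paper's semantic-entropy matrices and divergence measure may be defined differently.

```python
import numpy as np

def semantic_matrix(step_embeddings: np.ndarray) -> np.ndarray:
    """Build a row-stochastic similarity matrix over reasoning steps.

    Row i is a softmax over cosine similarities, readable as a
    distribution over which steps step i is semantically closest to.
    (Hypothetical construction; the paper's matrix may differ.)
    """
    # Normalize embeddings to unit length for cosine similarity.
    norms = np.linalg.norm(step_embeddings, axis=1, keepdims=True)
    unit = step_embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    # Softmax each row into a probability distribution.
    exp = np.exp(sims - sims.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def alignment_score(model_steps: np.ndarray, reference_steps: np.ndarray) -> float:
    """Map the divergence between two semantic matrices into (0, 1].

    Assumes both chains have the same number of steps; a real
    implementation would need to align or pad unequal-length chains.
    """
    p = semantic_matrix(model_steps)
    q = semantic_matrix(reference_steps)
    # Mean row-wise KL divergence as the mismatch measure (one plausible choice).
    kl = np.sum(p * np.log((p + 1e-12) / (q + 1e-12)), axis=1).mean()
    return float(np.exp(-kl))  # 1.0 = identical semantic structure

# Usage: embeddings for a 3-step model chain and a 3-step reference chain.
rng = np.random.default_rng(0)
model = rng.normal(size=(3, 384))                     # e.g. sentence-embedding vectors
reference = model + 0.1 * rng.normal(size=(3, 384))   # slightly perturbed reference
print(alignment_score(model, reference))
```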