Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty

Large language models can arrive at the same answer through reasoning paths that are unstable, contradictory, or difficult to rank consistently -- a failure mode especially prevalent in multi-step deductive reasoning. Existing methods assess reliability primarily through output dispersion -- measuring how much sampled answers differ -- but this discards a complementary signal: whether the model can consistently rank competing reasoning candidates. We propose structural uncertainty, a consistency-aware framework derived from the stability of self-preference-induced rankings over sampled reasoning solutions. Given a query, we generate multiple candidate solutions and ask the model to judge pairwise preferences among its own outputs. We aggregate self-preferences into ranking distributions via Bradley-Terry modeling with PageRank, and decompose the signal into two entropy-based components: across-trial ranking instability and within-trial candidate ambiguity. Across five LLMs and eight benchmarks, structural signals provide information complementary to answer dispersion: on logical and mathematical reasoning tasks, the combination improves identification of unreliable instances, while on factual retrieval the structural signal collapses toward uniformity, diagnosing a regime boundary where reasoning-level consistency evaluation is uninformative. The two components relate differently to accuracy: within-trial ambiguity correlates positively with correctness -- consistent with settings where multiple plausible solution paths remain competitive -- while across-trial instability correlates negatively, signaling unreliable reasoning. Structural uncertainty is best understood not as a universal confidence estimator, but as a regime-sensitive evaluator of logical reasoning consistency.
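The pipeline described above (pairwise self-preferences, aggregation into a ranking distribution, entropy decomposition) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It uses a plain PageRank-style power iteration over the preference graph and omits Bradley-Terry fitting. The decomposition shown (entropy of the trial-averaged distribution, split into a mean per-trial entropy and a non-negative remainder) is one plausible, Jensen-Shannon-style reading of "within-trial ambiguity" and "across-trial instability"; the abstract does not give the exact definitions. The function names `pagerank_scores`, `entropy`, and `structural_uncertainty`, and the `wins[i][j]` input layout, are hypothetical.

```python
import math


def pagerank_scores(wins, damping=0.85, iters=200):
    """Turn one trial's pairwise preferences into a distribution over candidates.

    wins[i][j] is the number of times candidate i was preferred over j
    (assumed layout; the diagonal is expected to be zero).
    Each candidate j passes score to the candidates that beat it.
    """
    n = len(wins)
    # Total losses of candidate j = outgoing edge weight of node j.
    losses = [sum(wins[i][j] for i in range(n)) for j in range(n)]
    p = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for j in range(n):
            for i in range(n):
                if losses[j] == 0:
                    # Candidate j never lost: spread its score uniformly.
                    new[i] += damping * p[j] / n
                else:
                    new[i] += damping * p[j] * wins[i][j] / losses[j]
        p = new
    return p


def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    return -sum(x * math.log(x) for x in p if x > 0)


def structural_uncertainty(trials):
    """Decompose ranking uncertainty over repeated judging trials.

    trials is a list of win matrices, one per trial, all over the same
    candidates. Assumed decomposition (not taken from the source):
      within = mean entropy of the per-trial ranking distributions
      across = entropy of the averaged distribution minus `within`
    """
    dists = [pagerank_scores(w) for w in trials]
    n = len(dists[0])
    mean = [sum(d[i] for d in dists) / len(dists) for i in range(n)]
    within = sum(entropy(d) for d in dists) / len(dists)
    total = entropy(mean)
    # Non-negative by concavity of entropy; clamp floating-point noise.
    across = max(0.0, total - within)
    return {"within": within, "across": across, "total": total}


if __name__ == "__main__":
    # Toy data: three candidates, two judging trials that rank them in
    # opposite orders, so the across-trial term should be positive.
    forward = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
    reverse = [[0, 0, 0], [1, 0, 0], [1, 1, 0]]
    print(structural_uncertainty([forward, reverse]))
```

Under this reading, repeating an identical preference matrix across trials gives an across-trial term of zero, while a fully tied matrix pushes the within-trial term to its maximum of log(n), which mirrors the collapse toward uniformity that the abstract reports for factual retrieval.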
