Large language models achieve strong reasoning performance, but inference strategies such as Self-Consistency (SC) are computationally expensive, as they fully expand all reasoning traces. We introduce PoLR (Path of Least Resistance), the first inference-time method to leverage prefix consistency for compute-efficient reasoning. PoLR clusters short prefixes of reasoning traces, identifies the dominant cluster, and expands all paths in that cluster, preserving the accuracy benefits of SC while substantially reducing token usage and latency. Our theoretical analysis, framed via mutual information and entropy, explains why early reasoning steps encode strong signals predictive of final correctness. Empirically, PoLR consistently matches or exceeds SC across GSM8K, MATH500, AIME24/25, and GPQA-DIAMOND, reducing token usage by up to 60% and wall-clock latency by up to 50%. Moreover, PoLR is fully complementary to adaptive inference methods (e.g., Adaptive Consistency, Early-Stopping SC) and can serve as a drop-in pre-filter, making SC substantially more efficient and scalable without requiring model fine-tuning.
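The core selection step described above — cluster short prefixes, keep only the paths in the dominant cluster — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`normalize`, `polr_filter`) are hypothetical, and exact-match clustering on a normalized prefix stands in for whatever clustering the method actually uses (e.g., embedding-based similarity).

```python
from collections import Counter

def normalize(prefix: str) -> str:
    # Hypothetical normalization: lowercase and collapse whitespace so that
    # superficially different prefixes can fall into the same cluster.
    return " ".join(prefix.lower().split())

def polr_filter(prefixes):
    """Cluster short reasoning-trace prefixes by their normalized form and
    return the indices of paths in the dominant (largest) cluster; only
    these paths would then be expanded to full length."""
    labels = [normalize(p) for p in prefixes]
    dominant_label, _ = Counter(labels).most_common(1)[0]
    return [i for i, lab in enumerate(labels) if lab == dominant_label]

# Sample N short prefixes cheaply, then spend the full decoding budget
# only on the paths the filter keeps.
prefixes = [
    "First compute 3 * 4 = 12",
    "first  compute 3 * 4 = 12",
    "Let x be the unknown and set up an equation",
]
kept = polr_filter(prefixes)  # → [0, 1]
```

In this sketch the savings come from never fully decoding paths outside the dominant cluster; a majority vote (as in Self-Consistency) would then be taken over only the expanded paths.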