LLMs can solve complex tasks by generating long, multi-step reasoning chains. Test-time scaling (TTS) can further improve LLM performance by sampling multiple variants of intermediate reasoning steps, verifying their correctness, and strategically choosing the best steps for continuation. However, existing verification approaches, such as Process Reward Models (PRMs), are computationally expensive, limited to specific domains, and require large-scale human or model-generated annotations. We propose a lightweight alternative for step-level reasoning verification based on probing the internal states of LLMs. We train a transformer-based probe that uses the internal states of the frozen LLM to estimate the credibility of its reasoning steps during generation. Annotations can be generated either by a larger LLM (e.g., DeepSeek-R1) or in a self-supervised manner by the original model itself. The probes are both effective and lightweight, containing fewer than 10M parameters. Across multiple domains, including mathematics, planning, and general knowledge question answering, our probes match or even exceed the performance of PRMs that are up to 810x larger. Our findings suggest that the internal states of LLMs encode their confidence in reasoning processes and can serve as reliable signals for reasoning step verification, offering a promising direction towards scalable and generalizable TTS and introspective LLMs.
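To make the probe architecture concrete, below is a minimal PyTorch sketch of a small transformer probe that reads the frozen LLM's hidden states for the tokens of one reasoning step and predicts a scalar credibility score. The abstract does not specify implementation details, so the class name (`StepCredibilityProbe`), dimensions, pooling, and training objective here are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class StepCredibilityProbe(nn.Module):
    """Hypothetical sketch: a sub-10M-parameter transformer probe over the
    hidden states of a frozen LLM, scoring one reasoning step at a time."""

    def __init__(self, llm_hidden_size: int = 4096, probe_dim: int = 256,
                 num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        # Project the (large) LLM hidden size down to the probe's width.
        self.proj = nn.Linear(llm_hidden_size, probe_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=probe_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # Credibility head: mean-pool over the step's tokens, then emit
        # a probability that the step is correct.
        self.head = nn.Linear(probe_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, step_len, llm_hidden_size), extracted from
        # a frozen LLM (e.g., via output_hidden_states=True in Hugging Face
        # transformers); no gradients flow into the LLM itself.
        x = self.encoder(self.proj(hidden_states))
        return torch.sigmoid(self.head(x.mean(dim=1))).squeeze(-1)


# Assumed training setup: binary cross-entropy against step-level labels,
# which the abstract says can come from a larger LLM (e.g., DeepSeek-R1)
# or from the original model in a self-supervised manner.
probe = StepCredibilityProbe()
hidden = torch.randn(8, 32, 4096)          # 8 steps, 32 tokens each
labels = torch.randint(0, 2, (8,)).float()  # 1 = step judged correct
loss = nn.functional.binary_cross_entropy(probe(hidden), labels)
```

At inference time during TTS, such a probe would score each sampled candidate step from cached hidden states, so verification adds only a small forward pass on top of generation rather than a full PRM call.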