Are Latent Reasoning Models Easily Interpretable?

Latent reasoning models (LRMs) have attracted significant research interest due to their low inference cost (relative to explicit reasoning models) and their theoretical ability to explore multiple reasoning paths in parallel. However, these benefits come at the cost of reduced interpretability: LRMs are difficult to monitor because they do not reason in natural language. This paper investigates LRM interpretability by examining two state-of-the-art LRMs. First, we find that latent reasoning tokens are often unnecessary for LRMs' predictions; on logical reasoning datasets, LRMs can almost always produce the same final answers without using latent reasoning at all. This underutilization of reasoning tokens may partially explain why LRMs do not consistently outperform explicit reasoning methods, and it raises doubts about the role that prior work attributes to these tokens. Second, we demonstrate that when latent reasoning tokens are necessary for performance, we can decode gold reasoning traces for up to 65-93% of correctly predicted instances. This suggests that LRMs often implement the expected solution rather than an uninterpretable reasoning process. Finally, we present a method that decodes a verified natural language reasoning trace from latent tokens without knowing a gold reasoning trace a priori. We show that such a verified trace can be found for a majority of correct predictions but only a minority of incorrect ones. Our findings highlight that current LRMs largely encode interpretable processes, and that interpretability itself can be a signal of prediction correctness.
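As an illustration of the first finding, the check of whether an LRM produces "the same final answers without using latent reasoning" can be thought of as an answer-agreement rate. The sketch below is a minimal example under stated assumptions and is not the paper's evaluation code: the function `answer_agreement` and the two predictor callables are hypothetical names, and the toy predictors stand in for a real model run with and without its latent reasoning tokens.

```python
from typing import Callable, Sequence


def answer_agreement(
    inputs: Sequence[str],
    predict_with_latents: Callable[[str], str],
    predict_without_latents: Callable[[str], str],
) -> float:
    """Fraction of inputs whose final answer is unchanged when latent reasoning is skipped.

    Both predictors are hypothetical stand-ins: each maps a question to a final answer string.
    """
    if not inputs:
        raise ValueError("inputs must be non-empty")
    unchanged = sum(
        predict_with_latents(x).strip() == predict_without_latents(x).strip()
        for x in inputs
    )
    return unchanged / len(inputs)


# Toy predictors (assumptions for illustration only, not real models).
def toy_with_latents(question: str) -> str:
    return "yes" if "A" in question else "no"


def toy_without_latents(question: str) -> str:
    return "yes" if question.startswith("A") else "no"


questions = ["A implies B", "B implies A", "C", "A"]
rate = answer_agreement(questions, toy_with_latents, toy_without_latents)
print(f"answer agreement: {rate:.2f}")
```

An agreement rate close to 1.0 on a dataset would correspond to the case described above, where the latent reasoning tokens are largely unnecessary for the final prediction.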
