Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). However, its verbosity imposes substantial computational overhead, and recent efficiency-oriented works often focus exclusively on outcome alignment while leaving the intermediate reasoning process unsupervised, which makes the latent reasoning chain difficult to analyze. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual reasoning steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision-Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design enables plug-and-play deployment without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT, while maintaining performance competitive with existing methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT
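To make the core idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation) of rendering a textual reasoning step into an image that a ViT-style vision encoder could consume as a fixed number of patch tokens. The canvas size, patch size, and helper names (`render_step`, `visual_token_count`) are illustrative assumptions; Pillow is used only for rendering.

```python
# Hypothetical sketch of the RoT idea: a textual CoT step is rendered onto an
# image canvas, and a ViT-style encoder would then represent the whole image
# as a fixed grid of patch tokens, regardless of how long the text is.
from PIL import Image, ImageDraw


def render_step(step_text: str, width: int = 448, height: int = 448) -> Image.Image:
    """Render one chain-of-thought step onto a blank white canvas."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    # Naive fixed-width line wrapping; a real system would typeset properly.
    words, lines, line = step_text.split(), [], ""
    for w in words:
        if len(line) + len(w) + 1 > 40:
            lines.append(line)
            line = w
        else:
            line = f"{line} {w}".strip()
    lines.append(line)
    for i, text_line in enumerate(lines):
        draw.text((10, 10 + 18 * i), text_line, fill="black")
    return img


def visual_token_count(img: Image.Image, patch: int = 28) -> int:
    """Patch tokens a ViT-style encoder would emit for this image."""
    return (img.width // patch) * (img.height // patch)


step = "Let x be the number of apples. Then 3x + 2 = 14, so x = 4."
img = render_step(step)
print(img.size, visual_token_count(img))  # a 448x448 canvas yields 256 patch tokens at patch=28
```

Under these assumptions, the visual token budget is constant per rendered step, which is the intuition behind the reported token compression: however verbose the textual rationale, its rendered form costs a fixed number of vision tokens.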