Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs), but its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision of the intermediate reasoning process, which makes the latent reasoning chain difficult to analyze. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design allows plug-and-play implementation without additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. It also remains competitive with other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT