Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to the computational cost of processing long sequences. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models (VLMs) as "optical memory." We construct a training dataset based on OpenR1-Math-220K that achieves 3.4x token compression, and fine-tune representative VLMs, Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23, and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving a 2.7x speedup in end-to-end latency and highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.