Compressing long chains of thought (CoT) into compact latent tokens is crucial for efficient reasoning with large language models (LLMs). Recent studies achieve this with autoencoders that reconstruct the textual CoT from latent tokens, thereby encoding CoT semantics. However, using textual CoT as the reconstruction target forces latent tokens to preserve surface-level linguistic features (e.g., word choice and syntax), introducing a strong linguistic inductive bias that prioritizes linguistic form over reasoning structure and limits logical abstraction. We therefore propose ImgCoT, which changes the reconstruction target from the textual CoT to a visual CoT obtained by rendering the CoT into images. This substitution replaces the linguistic bias with a spatial inductive bias, i.e., a tendency to model the spatial layout of reasoning steps in the visual CoT, enabling latent tokens to better capture the global reasoning structure. However, although visual latent tokens encode abstract reasoning structure, they may blur fine-grained reasoning details. We therefore further propose loose ImgCoT, a hybrid reasoning scheme that augments visual latent tokens with a few key textual reasoning steps, selected by low token log-likelihood. This design allows LLMs to retain both the global reasoning structure and fine-grained reasoning details while using fewer tokens than the complete CoT. Extensive experiments across multiple datasets and LLMs demonstrate the effectiveness of both versions of ImgCoT.
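The selection criterion for loose ImgCoT can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-token log-probabilities for each reasoning step are already available from the LLM, and the function name and the toy inputs are hypothetical. It keeps the k steps with the lowest mean token log-likelihood, i.e., the steps the model finds least predictable and that are therefore most informative to preserve as text.

```python
def select_key_steps(step_token_logprobs, k):
    """Return indices of the k reasoning steps with the lowest mean
    token log-likelihood (least predictable -> kept as text).

    step_token_logprobs: list of lists, one inner list of per-token
    log-probabilities per reasoning step (assumed precomputed by the LLM).
    """
    means = [sum(lps) / len(lps) for lps in step_token_logprobs]
    # Sort step indices by ascending mean log-likelihood; keep the k lowest.
    return sorted(range(len(means)), key=lambda i: means[i])[:k]

# Toy example: three steps with mock per-token log-probabilities.
steps = [[-0.2, -0.1], [-2.5, -3.0], [-0.5, -0.4]]
print(select_key_steps(steps, 1))  # step 1 has the lowest mean, prints [1]
```

The selected steps would then be concatenated with the visual latent tokens as the hybrid input, while the remaining (high-likelihood, easily inferred) steps are dropped.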