Long chains of thought (long CoTs) are widely employed in multimodal reasoning models to tackle complex tasks by capturing detailed visual information. However, these long CoTs are often excessively lengthy and contain redundant reasoning steps, which hinders inference efficiency. Compressing them is a natural solution, yet existing approaches face two major challenges: (1) they may compromise the integrity of visual-textual reasoning by removing essential alignment cues, and (2) the compression process lacks explainability, making it difficult to discern which information is critical. To address these problems, we propose XMCC, an eXplainable Multimodal CoT Compressor that formulates compression as a sequential decision-making process optimized via reinforcement learning. XMCC effectively shortens reasoning trajectories while preserving key reasoning steps and answer correctness, and simultaneously generates natural-language explanations for its compression decisions. Extensive experiments on representative multimodal reasoning benchmarks demonstrate that XMCC not only reduces reasoning length but also yields interpretable compression decisions, validating its effectiveness.
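The framing of compression as a sequential decision-making process can be sketched as follows. This is a minimal illustration, not the paper's actual method: `toy_policy` is a hypothetical heuristic standing in for the learned RL policy, which would instead be trained with a reward balancing trajectory length against answer correctness.

```python
def compress_cot(steps, policy):
    """Sequentially decide, step by step, whether to keep or drop
    each reasoning step; returns the compressed trajectory and the
    decision trace (usable later for explanation generation)."""
    kept, decisions = [], []
    for i, step in enumerate(steps):
        keep = policy(step, kept)  # binary action: keep (True) / drop (False)
        decisions.append((i, keep))
        if keep:
            kept.append(step)
    return kept, decisions

def toy_policy(step, kept):
    # Hypothetical stand-in for the learned policy: keep steps that
    # reference the image or carry a computed numeric value.
    return ("image" in step) or any(c.isdigit() for c in step)

steps = [
    "Look at the image and locate the two bar charts.",
    "Hmm, let me restate the question once more.",  # redundant step
    "The left bar reads 12 and the right bar reads 30.",
    "So the difference is 30 - 12 = 18.",
]
compressed, trace = compress_cot(steps, toy_policy)
print(len(compressed))  # 3 of the 4 steps survive under the toy policy
```

In an RL formulation, each keep/drop choice is an action, the state is the trajectory so far, and the episode reward would combine a length penalty with a correctness check on the final answer.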