Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models that explore reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly image generation, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative, Mull-Tokens: modality-agnostic latent tokens pre-trained to hold intermediate information in either the image or the text modality, letting the model think free-form towards the correct answer. We investigate best practices for training Mull-Tokens, inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any trace supervision, using only the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines that use text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a reasoning-heavy puzzle-solving split compared to our strongest baseline. Adding to conversations around the challenges of grounding textual and visual reasoning, Mull-Tokens offer a simple way to think abstractly in multiple modalities.