Reasoning capabilities of multimodal large language models (MLLMs) have improved considerably in recent years. Existing approaches typically rely on explicit chain-of-thought or continuous latent-space trajectories to enhance multi-step reasoning. However, these methods generally assume that an input admits a single latent interpretation and unfold reasoning along a fixed path or under a uniform computation budget. In real-world multimodal settings, visual observations are often subject to occlusion, blur, viewpoint variation, or semantic ambiguity, giving rise to multiple plausible interpretations. A uniform reasoning strategy not only limits the model's ability to explore multiple hypotheses but also incurs high memory usage and rollout cost. We present DLWM (Diverse Latent World Models), a multimodal reasoning framework that combines latent-space reasoning with reinforcement learning. First, we construct a set of diverse latent world hypotheses in continuous latent space, each capturing a different plausible interpretation of the visual input, and unfold latent reasoning independently on each hypothesis. An orthogonality-based diversity regularizer explicitly prevents hypothesis collapse. Second, we formulate the latent reasoning process as a resource-constrained sequential decision problem and introduce a resource-aware reinforcement learning policy that adaptively allocates computation across hypotheses, dynamically deciding whether to expand, terminate, or merge reasoning paths, thereby substantially reducing memory footprint and improving rollout efficiency. Experiments on multiple multimodal reasoning benchmarks demonstrate that DLWM outperforms existing methods by 2-5 points in accuracy while reducing memory usage by 24%.
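The abstract does not give the form of the orthogonality-based diversity regularizer, so the following is only a minimal sketch of one common way such a penalty could be written. It assumes each latent world hypothesis is a single vector and that diversity is encouraged by penalizing the squared cosine similarity between every pair of distinct hypotheses. The function name `orthogonality_penalty`, the normalization by the number of pairs, and the `eps` parameter are all hypothetical choices made for illustration, not details taken from DLWM.

```python
import math
from typing import Sequence


def orthogonality_penalty(
    hypotheses: Sequence[Sequence[float]], eps: float = 1e-8
) -> float:
    """Mean squared cosine similarity over all ordered pairs of distinct hypotheses.

    Assumption (not from the paper): each hypothesis is one latent vector,
    and the penalty is the average squared off-diagonal entry of the Gram
    matrix of the L2-normalized vectors.
    """
    k = len(hypotheses)
    if k < 2:
        # A single hypothesis has nothing to be orthogonal to.
        return 0.0

    # Normalize each hypothesis to unit length; eps guards against zero vectors.
    unit = []
    for h in hypotheses:
        norm = math.sqrt(sum(x * x for x in h))
        unit.append([x / max(norm, eps) for x in h])

    # Accumulate squared cosine similarity over all ordered pairs with i != j.
    total = 0.0
    for i in range(k):
        for j in range(k):
            if i != j:
                cos = sum(a * b for a, b in zip(unit[i], unit[j]))
                total += cos * cos

    # Divide by the number of ordered pairs so the value lies in [0, 1].
    return total / (k * (k - 1))
```

Under these assumptions the penalty is zero when the hypotheses are mutually orthogonal and approaches one as they collapse onto a single direction, so adding it to a training loss with a positive weight would push hypotheses apart.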