Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self-aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +26.2 percentage points on ScienceQA and +9.1 percentage points on EgoSchema.
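The inference-time procedure described in the abstract (sample several candidate trajectories, filter steps through visual verification, aggregate, and repeat) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `sample_traces`, `verify_step`, `aggregate_once`, and `cashew_style_inference`, the representation of a trace as a list of strings whose last element is the answer, and the use of majority voting for the final answer are all assumptions made for illustration. In practice the sampler and the verifier would be calls to a vision-language model.

```python
from collections import Counter
from typing import Callable, List

# Assumption: a trace is a list of reasoning steps whose last item is the answer.
Trace = List[str]


def aggregate_once(traces: List[Trace], verify_step: Callable[[str], bool]) -> Trace:
    """Merge candidate traces into one trace.

    Keeps each distinct intermediate step that passes verification and
    picks the final answer by majority vote (a hypothetical choice).
    """
    non_empty = [t for t in traces if t]
    if not non_empty:
        raise ValueError("need at least one non-empty candidate trace")

    # Majority vote over the final answers of all candidates.
    answer = Counter(t[-1] for t in non_empty).most_common(1)[0][0]

    merged: Trace = []
    for trace in non_empty:
        for step in trace[:-1]:
            # Drop duplicates and steps the verifier rejects (e.g. hallucinations).
            if step not in merged and verify_step(step):
                merged.append(step)
    return merged + [answer]


def cashew_style_inference(
    sample_traces: Callable[[Trace, int], List[Trace]],
    verify_step: Callable[[str], bool],
    num_candidates: int = 4,
    num_rounds: int = 2,
) -> Trace:
    """Iteratively sample candidates conditioned on the current aggregate."""
    context: Trace = []
    for _ in range(num_rounds):
        candidates = sample_traces(context, num_candidates)
        context = aggregate_once(candidates, verify_step)
    return context


# Toy stand-ins for the model calls (purely illustrative).
def toy_sampler(context: Trace, n: int) -> List[Trace]:
    pool = [
        ["cat on mat", "sky is green", "A"],
        ["cat on mat", "A"],
        ["dog in car", "B"],
    ]
    return pool[:n]


def toy_verifier(step: str) -> bool:
    return step in {"cat on mat"}


result = cashew_style_inference(toy_sampler, toy_verifier, num_candidates=3)
```

With the toy sampler and verifier, the unverified steps are dropped and the majority answer is kept. The learned variant described in the abstract would replace this explicit loop with a single model trained to perform the aggregation itself.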