Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. The benchmark presents a diverse set of challenges across eight categories, ranging from complex visual perception to scientific reasoning, with over 4k reasoning steps in total, enabling robust evaluation of LMMs' abilities to perform accurate and interpretable visual reasoning across multiple steps. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence. The proposed metric offers deeper insights into reasoning performance compared to traditional end-task accuracy metrics. Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach, where tasks are progressively organized to facilitate incremental skill acquisition and problem-solving. The proposed LlamaV-o1 is designed for multi-step reasoning and learns step-by-step through a structured training paradigm. Extensive experiments show that our LlamaV-o1 outperforms existing open-source models and performs favorably against closed-source proprietary models. Compared to the recent Llava-CoT, our LlamaV-o1 achieves an average score of 67.3 with an absolute gain of 3.8\% across six benchmarks while being 5 times faster during inference scaling. Our benchmark, model, and code are publicly available.
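To illustrate what scoring reasoning "at the granularity of individual steps" can look like, the sketch below compares each predicted reasoning step against a reference step using token-overlap F1 and averages over the longer of the two chains, so missing or extra steps are penalized. This is a minimal, hypothetical stand-in for intuition only; the function names (`token_f1`, `stepwise_score`) and the overlap-based scoring are assumptions, not the paper's actual metric.

```python
from collections import Counter


def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between a predicted and a reference step (hypothetical proxy for step correctness)."""
    p_tokens = pred.lower().split()
    r_tokens = ref.lower().split()
    overlap = sum((Counter(p_tokens) & Counter(r_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_tokens)
    recall = overlap / len(r_tokens)
    return 2 * precision * recall / (precision + recall)


def stepwise_score(pred_steps: list[str], ref_steps: list[str]) -> float:
    """Average per-step score; dividing by the longer chain penalizes missing or extra steps."""
    n = max(len(pred_steps), len(ref_steps))
    if n == 0:
        return 0.0
    total = sum(token_f1(p, r) for p, r in zip(pred_steps, ref_steps))
    return total / n
```

Unlike end-task accuracy, this kind of score distinguishes a model that reaches the right answer through coherent intermediate steps from one that skips or garbles them, which is the motivation for step-level evaluation.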