Recent advances in Chain-of-Thought (CoT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks that require both perception and logical reasoning. To address this gap, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs on video understanding. It comprises intricate real-world videos and complex everyday planning tasks posed as multiple-choice questions, demanding sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy (in-distribution, cross-environment, and cross-environment-task scenarios) and provides a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as the base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks; RL even outperforms SFT on general video understanding benchmarks such as LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations, such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base-model reasoning, reward modeling, and RL robustness against noisy signals.
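To make concrete what "easily verifiable ground-truth answers" enables for RL post-training, the following is a minimal Python sketch of a rule-based accuracy reward for multiple-choice answers, of the kind commonly used in GRPO-style training. The `<answer>` tag format, the helper names (`extract_choice`, `accuracy_reward`), and the binary reward values are illustrative assumptions, not the paper's exact implementation.

```python
import re

def extract_choice(response: str) -> str | None:
    """Pull the final answer option (A-D) from a model response.

    Assumes the model is prompted to end with e.g. "<answer>B</answer>";
    this tag format is an assumption for illustration only.
    """
    match = re.search(r"<answer>\s*([A-D])\s*</answer>", response)
    return match.group(1) if match else None

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Binary, rule-based reward: 1.0 for a correct multiple-choice
    answer, 0.0 otherwise. No learned reward model is needed because
    the ground truth is a single verifiable option letter."""
    predicted = extract_choice(response)
    return 1.0 if predicted == ground_truth else 0.0

# Example: score a batch of sampled rollouts for one question,
# as would be done before computing group-relative advantages.
rollouts = [
    "The person picks up the kettle first. <answer>B</answer>",
    "Based on the video, the next step is chopping. <answer>C</answer>",
]
rewards = [accuracy_reward(r, ground_truth="B") for r in rollouts]
print(rewards)  # [1.0, 0.0]
```

Because the reward is a deterministic exact-match check rather than a learned model, it scales cheaply to the large training set and avoids reward-model noise, though (as the analysis above notes) it provides no signal about the logical coherence of the intermediate reasoning chain.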