E-commerce short videos are a high-revenue segment of the online video industry, characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect reasoning about commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher information density across the visual, audio, and textual modalities than mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce the E-commerce Video Ads Benchmark (E-VAds), the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. These questions are organized into two primary dimensions, (1) Perception and (2) Cognition and Reasoning, which together comprise five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.