Videos are unique in their ability to capture actions that span multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a renewed focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the gap between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy while Qwen3-VL-8B reaches only 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where random chance is 50%. Even so, Qwen achieves only 59.2% accuracy. Further relaxing the evaluation setup, we provide $k\in\{1,2,3\}$ in-context examples of the action. Some models benefit in the few-shot setting, while others falter: Qwen improves by $7.0\%$, while Gemini declines by $4.8\%$. Notably, the model gains fall short of the $+13.6\%$ improvement that non-expert humans show when given few-shot examples. Finding that VLMs struggle to fully exploit in-context examples, we shift from test-time improvements to the training side. We collect the first large-scale training dataset for domain-specific actions, totaling nearly 500k video question-answer pairs. Fine-tuning a Molmo2-4B model on our data, we surpass all open-weight 8B models on the VideoNet benchmark.