UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks

Real-world user-generated videos, especially on platforms like TikTok, often feature rich and intertwined audio-visual content. However, existing video captioning benchmarks and models remain predominantly visual-centric, overlooking the crucial role of audio in conveying scene dynamics, speaker intent, and narrative context. The lack of omnimodal datasets and of lightweight yet capable models hampers progress in fine-grained, multimodal video understanding. To address these challenges, we introduce UGC-VideoCap, a new benchmark and model framework specifically designed for detailed omnimodal captioning of short-form user-generated videos. Unlike prior datasets, UGC-VideoCap emphasizes balanced integration of the audio and visual modalities. It features 1000 TikTok videos annotated through a structured three-stage human-in-the-loop pipeline covering audio-only, visual-only, and joint audio-visual semantics. The benchmark also includes 4000 carefully crafted QA pairs probing both unimodal and cross-modal understanding. Alongside the dataset, we propose UGC-VideoCaptioner(3B), a 3B-parameter captioning model distilled from Gemini 2.5 Flash. Using a novel two-stage training strategy (supervised fine-tuning followed by Group Relative Policy Optimization, GRPO), our approach enables efficient adaptation from limited data while maintaining competitive performance. Together, our benchmark and model offer a high-quality foundation and a data-efficient solution for advancing omnimodal video captioning in unconstrained real-world UGC settings.
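The abstract does not describe the reward design or training code used in the GRPO stage. As a rough illustration of the general idea behind GRPO only, the sketch below computes group-relative advantages: several captions are sampled for the same video, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The function name, the reward values, the `eps` term, and the choice of population standard deviation are all assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    rewards: scalar rewards for captions sampled from the same prompt/video.
    eps: small constant (an assumption here) to avoid division by zero
         when every sample in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four captions sampled for one video.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Captions scoring above their group's mean get a positive advantage and those below get a negative one, so no separate learned value model is needed to provide a baseline.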

