Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channels and TikTok, dominate the mobile internet. However, current large multimodal models lack the temporally structured, detailed, and in-depth video comprehension capabilities that are the cornerstone of effective video search and recommendation, as well as of emerging video applications. Understanding real-world short videos is genuinely challenging because of their complex visual elements, the high information density of both their visuals and audio, and their fast pacing centered on emotional expression and viewpoint delivery; this demands advanced reasoning to effectively integrate multimodal information across vision, audio, and text. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model supports multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and a final round of instruction fine-tuning. Quantitative evaluations on our introduced benchmark, ShortVid-Bench, together with qualitative comparisons, demonstrate strong performance on real-world video comprehension, and the model supports zero-shot use, or fine-tuning with only a few samples, for diverse downstream applications. Production deployment of the model has yielded tangible and measurable improvements in user engagement and satisfaction, a success underpinned by its efficiency: stress tests indicate an inference time of just 10 seconds for a one-minute video on an H20 GPU.
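To make the notion of structured, timestamped comprehension concrete, below is a minimal sketch of consuming such output downstream. The JSON schema, the field names (`summary`, `segments`, `start`, `end`, `caption`), and the sample content are illustrative assumptions, not the model's documented output format or API.

```python
import json

# Hypothetical structured output for a one-minute short video; the schema is an
# illustrative assumption, not the actual ARC-Hunyuan-Video output format.
raw_output = """
{
  "summary": "A creator reviews a budget espresso machine and shares a verdict.",
  "segments": [
    {"start": 0.0,  "end": 12.5, "caption": "Unboxing and first impressions of the machine."},
    {"start": 12.5, "end": 41.0, "caption": "Pulling a shot and tasting, with on-screen text overlays."},
    {"start": 41.0, "end": 60.0, "caption": "Final verdict and a call to action for viewers."}
  ]
}
"""

def parse_timestamped_captions(text: str) -> list[tuple[float, float, str]]:
    """Parse structured caption JSON into (start_sec, end_sec, caption) tuples."""
    data = json.loads(text)
    return [(seg["start"], seg["end"], seg["caption"]) for seg in data["segments"]]

if __name__ == "__main__":
    # Print one line per temporal segment, e.g. for indexing in video search.
    for start, end, caption in parse_timestamped_captions(raw_output):
        print(f"[{start:5.1f}s - {end:5.1f}s] {caption}")
```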