The proliferation of mobile devices and social media has revolutionized content dissemination, with short-form video becoming increasingly prevalent. This shift has created the need for video reframing: adapting a video to different screen aspect ratios while keeping its most compelling parts in frame. Traditionally, video reframing is a manual, time-consuming task that requires professional expertise and incurs high production costs. One potential solution is to automate the process with machine learning models, such as video salient object detection. However, these methods often generalize poorly because they rely on specific training data. The advent of powerful large language models (LLMs) opens new avenues for AI capabilities. Building on this, we introduce Reframe Any Video Agent (RAVA), an LLM-based agent that leverages visual foundation models and human instructions to restructure visual content for video reframing. RAVA operates in three stages: perception, where it interprets user instructions and video content; planning, where it determines aspect ratios and reframing strategies; and execution, where it invokes editing tools to produce the final video. Our experiments validate the effectiveness of RAVA in video salient object detection and real-world reframing tasks, demonstrating its potential as a tool for AI-powered video editing.
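To make the planning and execution stages concrete, the following is a minimal sketch of the geometric core of reframing: given a frame size, a target aspect ratio, and a focus point, it computes the largest crop window of that ratio and formats it as an FFmpeg `crop` filter string. It is an illustration under stated assumptions, not RAVA's implementation. The names `Crop`, `plan_crop`, and `to_ffmpeg_filter` are hypothetical, the focus point stands in for whatever the perception stage would return, and the choice of FFmpeg as the editing tool is an assumption made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    """A crop window in pixels; (x, y) is the top-left corner."""
    x: int
    y: int
    width: int
    height: int


def plan_crop(frame_w: int, frame_h: int,
              ratio_w: int, ratio_h: int,
              focus_x: int, focus_y: int) -> Crop:
    """Hypothetical planning step: largest ratio_w:ratio_h window
    centered on the focus point and clamped to the frame."""
    # Compare aspect ratios by cross-multiplication to stay in integers.
    if frame_w * ratio_h > frame_h * ratio_w:
        # Frame is wider than the target: keep full height, trim width.
        crop_h = frame_h
        crop_w = frame_h * ratio_w // ratio_h
    else:
        # Frame is taller than (or equal to) the target: keep full width.
        crop_w = frame_w
        crop_h = frame_w * ratio_h // ratio_w

    # Round down to even sizes, which encoders using 4:2:0 chroma
    # subsampling commonly require.
    crop_w -= crop_w % 2
    crop_h -= crop_h % 2

    # Center on the focus point, then clamp so the window stays in frame.
    x = min(max(focus_x - crop_w // 2, 0), frame_w - crop_w)
    y = min(max(focus_y - crop_h // 2, 0), frame_h - crop_h)
    return Crop(x, y, crop_w, crop_h)


def to_ffmpeg_filter(crop: Crop) -> str:
    """Hypothetical execution step: format the window using the
    FFmpeg crop filter syntax, crop=w:h:x:y."""
    return f"crop={crop.width}:{crop.height}:{crop.x}:{crop.y}"


if __name__ == "__main__":
    # Landscape 1920x1080 source, vertical 9:16 target, subject right of center.
    window = plan_crop(1920, 1080, 9, 16, focus_x=1500, focus_y=540)
    print(to_ffmpeg_filter(window))
```

A real system would compute a focus point per frame or per shot and smooth the resulting window over time to avoid jitter; this sketch handles a single static window only.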