As information becomes more accessible, user-generated videos are growing in length, placing a burden on viewers who must sift through vast amounts of content for valuable insights. This trend underscores the need for algorithms that extract key video information efficiently. Despite significant advances in highlight detection, moment retrieval, and video summarization, current approaches focus primarily on selecting specific time intervals, often overlooking the relevance between segments and the potential of segment arrangement. In this paper, we introduce a novel task called Video Trimming (VT), which focuses on detecting wasted footage, selecting valuable segments, and composing them into a final video with a coherent story. To address this task, we propose Agent-based Video Trimming (AVT), structured into three phases: Video Structuring, Clip Filtering, and Story Composition. Specifically, we employ a Video Captioning Agent to convert video slices into structured textual descriptions, a Filtering Module to dynamically discard low-quality footage based on the structured information of each clip, and a Video Arrangement Agent to select and compile the remaining clips into a coherent final narrative. For evaluation, we develop a Video Evaluation Agent to assess trimmed videos, conducting assessments in parallel with human evaluations. Additionally, we curate a new benchmark dataset for video trimming using raw user videos from the internet. AVT received more favorable evaluations in user studies and achieved superior mAP and precision on YouTube Highlights, TVSum, and our own dataset for the highlight detection task. The code and models are available at https://ylingfeng.github.io/AVT.