GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing

Music-grounded mashup video creation is a challenging form of non-linear video editing in which a system must compose a coherent timeline from large collections of source videos while respecting music rhythm, user intent, story completeness, and long-range structural constraints. Existing approaches typically rely on fixed pipelines or simplified retrieval-and-concatenation paradigms, which limits their ability to adapt to diverse prompts and heterogeneous source materials. In this paper, we present GLANCE, a global-local coordination multi-agent framework for music-grounded non-linear video editing. GLANCE adopts a bi-loop architecture: an outer loop performs long-horizon planning and task-graph construction, and an inner loop follows an "Observe-Think-Act-Verify" flow to carry out and refine segment-wise editing tasks. To address the cross-segment and global conflicts that emerge once sub-timelines are composed, we introduce a dedicated global-local coordination mechanism with both preventive and corrective components: a newly designed context controller, a conflict region decomposition module, and a bottom-up dynamic negotiation mechanism. To support rigorous evaluation, we construct MVEBench, a new benchmark that factorizes editing difficulty along task type, prompt specificity, and music length, and we propose an agent-as-a-judge evaluation framework for scalable multi-dimensional assessment. Experimental results show that GLANCE consistently outperforms both prior research baselines and open-source product baselines when all systems use the same backbone models. With GPT-4o-mini as the backbone, GLANCE improves over the strongest baseline by 33.2% and 15.6% on two task settings, respectively. Human evaluation further confirms the quality of the generated videos and validates the effectiveness of the proposed evaluation framework.
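To make the bi-loop idea concrete, the following is a minimal toy sketch, not GLANCE's implementation. It assumes the outer loop simply cuts the music into segment tasks at beat boundaries, and it replaces the LLM agents of the inner loop with a duration heuristic. All names (`SegmentTask`, `plan_segments`, `edit_segment`, `build_timeline`) are hypothetical, and the global-local coordination mechanism is reduced to a shared set of already-used clips.

```python
from dataclasses import dataclass


@dataclass
class SegmentTask:
    """One node of a (hypothetical) task graph: a span of music to fill."""
    index: int
    start: float  # seconds
    end: float    # seconds


def plan_segments(beat_times, beats_per_segment):
    """Outer loop (toy version): cut the music into tasks at beat boundaries."""
    cuts = beat_times[::beats_per_segment]
    # Make sure the final beat closes the last segment.
    if cuts[-1] != beat_times[-1]:
        cuts = cuts + [beat_times[-1]]
    return [SegmentTask(i, s, e) for i, (s, e) in enumerate(zip(cuts, cuts[1:]))]


def edit_segment(task, clips, used, max_rounds=3):
    """Inner loop (toy version): Observe-Think-Act-Verify for one segment.

    clips maps clip name -> duration in seconds; used is the set of clips
    already placed, a stand-in for context shared between segments.
    """
    need = task.end - task.start
    rejected = set()
    for _ in range(max_rounds):
        # Observe: clips that are neither placed elsewhere nor rejected here.
        candidates = [c for c in clips if c not in used and c not in rejected]
        if not candidates:
            break
        # Think: prefer the clip whose duration is closest to the segment length.
        choice = min(candidates, key=lambda c: abs(clips[c] - need))
        # Act and Verify: accept only if the clip is long enough to be trimmed.
        if clips[choice] >= need:
            return choice
        rejected.add(choice)  # refine: try another candidate next round
    return None


def build_timeline(beat_times, clips, beats_per_segment=4):
    """Run the outer plan, then the inner loop per segment, sharing `used`."""
    timeline = []
    used = set()
    for task in plan_segments(beat_times, beats_per_segment):
        clip = edit_segment(task, clips, used)
        if clip is not None:
            used.add(clip)
        timeline.append((task.start, task.end, clip))
    return timeline
```

For example, with nine beats spaced 0.5 s apart and clips `{"a": 1.8, "b": 2.5, "c": 3.0}`, the first segment initially picks `a` (closest in length), fails verification because `a` is too short, and falls back to `b`. Passing `used` between segments is only a crude stand-in for the preventive side of coordination; the corrective components described above (conflict region decomposition and negotiation) are not modeled here.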
