Editing a long-form video from heterogeneous footage requires more than selecting clips: an agent must preserve narrative intent across material preparation, timeline construction, post-production, and revision while leaving enough evidence to diagnose failures. We present \textbf{Crayotter}, an open-source multimodal multi-agent system for prompt-driven video editing. Crayotter organizes production into three phases: coverage-aware material preparation, artifact-based editing research, and tool-grounded timeline execution. Each phase externalizes inspectable artifacts, including coverage reports, multimodal analyses, editing blueprints, tool calls, and intermediate renders. These artifacts make an editing run traceable and allow failed segments to be diagnosed and selectively revised instead of requiring a full restart. We evaluate Crayotter on 23 editing themes against CapCut-Mate and CutClaw. Under human evaluation, Crayotter achieves an average score of 3.40/5, compared with 2.44 and 1.70 for the two baselines, with consistent gains in theme alignment, narrative coherence, and editing smoothness. We additionally describe a replayable trajectory schema and verifiable reward design that prepare these workflows for future policy optimization. Code, traces, and examples are publicly available at https://github.com/idwts/Crayotter.