A Versatile Multimodal Agent for Multimedia Content Generation

With the advancement of AIGC (AI-generated content) technologies, a growing number of generative models are reshaping fields such as video editing, music generation, and even film production. However, most current AIGC models can serve only as individual components within specific application scenarios and cannot complete real-world tasks end to end. In practice, editing experts work with a wide variety of image and video inputs and produce multimodal outputs: a video typically includes audio, text, and other elements. Current models cannot achieve this level of integration across modalities effectively. The rise of agent-based systems, however, has made it possible to use AI tools to tackle complex content generation tasks. To handle these complex scenarios, we propose MultiMedia-Agent, an agent system designed to automate complex content creation. The system includes a data generation pipeline, a tool library for content creation, and a set of metrics for evaluating preference alignment. Notably, we draw on skill acquisition theory to model training data curation and agent training. We design a two-stage correlation strategy for plan optimization, consisting of self-correlation and model preference correlation. We then use the generated plans to train MultiMedia-Agent with a three-stage approach: base plan fine-tuning, success plan fine-tuning, and preference optimization. Comparison results demonstrate that our approach is effective and that MultiMedia-Agent generates better multimedia content than recent models.
