Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

Powerful foundation models have substantially advanced joint audio and video generation. Even so, cohesive multimodal customization, in which the visual identities and vocal timbres of multiple interacting subjects are preserved simultaneously, remains largely underexplored. To bridge this gap, we present Omni-Customizer, an end-to-end framework for the precise binding and seamless fusion of multimodal identity information. Specifically, we introduce an Omni-Context Fusion (OCF) module that enriches the base textual prompt with dense multimodal identity cues, along with a Masked TTS Cross-Attention (MTP-CA) mechanism explicitly designed to prevent the severe "speech leakage" problem. Within this architecture, we propose Semantic-Anchored Multimodal RoPE (SA-MRoPE), which anchors visual and audio reference tokens, along with TTS embeddings, to their corresponding semantic descriptions, enabling structured multimodal fusion and robust identity binding. Furthermore, we devise a training strategy with two components: interleaved audio-video scheduling, which rapidly adapts the audio branch to multilingual scenarios without degrading foundational priors, and a progressive in-pair-to-cross-pair curriculum, which facilitates the learning of high-level, robust identity features. Extensive experiments demonstrate that Omni-Customizer achieves state-of-the-art performance in dual-modal customized generation, excelling in visual identity similarity, timbre consistency, audio-video synchronization, and overall audio-video fidelity.
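To make the idea of masking against "speech leakage" concrete, the following is a minimal sketch of speaker-masked cross-attention. It is not the MTP-CA implementation, whose details the abstract does not specify. It assumes that each audio query token and each TTS key token carries a speaker id, and that a query may attend only to keys with the same id. The function name `masked_cross_attention` and the `query_speaker` / `key_speaker` arguments are hypothetical.

```python
import math


def masked_cross_attention(queries, keys, values, query_speaker, key_speaker):
    """Scaled dot-product cross-attention with a speaker mask.

    Assumption (not from the paper): a query may attend only to keys
    that share its speaker id; all other pairs are masked out.
    """
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for q, q_spk in zip(queries, query_speaker):
        # Masked pairs get a score of -inf, so their softmax weight is 0.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            if k_spk == q_spk
            else -math.inf
            for k, k_spk in zip(keys, key_speaker)
        ]
        top = max(scores)
        if top == -math.inf:
            # No permitted key for this query: emit a zero vector.
            outputs.append([0.0] * out_dim)
            continue
        # Numerically stable softmax over the permitted keys.
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) / total
             for j in range(out_dim)]
        )
    return outputs


# Two speakers, one TTS token each; each audio query sees only its own speaker.
tts_keys = [[1.0, 0.0], [0.0, 1.0]]
tts_values = [[1.0, 0.0], [0.0, 1.0]]
audio_queries = [[1.0, 1.0], [1.0, 1.0]]
result = masked_cross_attention(
    audio_queries, tts_keys, tts_values,
    query_speaker=[0, 1], key_speaker=[0, 1],
)
```

In this toy setup both queries are identical, so unmasked attention would mix the two speakers' TTS values equally; with the mask, each query recovers only the value belonging to its own speaker.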
