Cinematic Audio Source Separation (CASS) aims to decompose mixed film audio into speech, music, and sound effects, enabling applications like dubbing and remastering. Existing CASS approaches are audio-only, overlooking the inherent audio-visual nature of films, where sounds often align with visual cues. We present the first framework for audio-visual CASS (AV-CASS), leveraging visual context to enhance separation quality. Our method formulates CASS as a conditional generative modeling problem using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream setup. Trained entirely on synthetic data, our model generalizes effectively to real-world cinematic content and achieves strong performance on synthetic, real-world, and audio-only CASS benchmarks. Code and demo are available at \url{https://cass-flowmatching.github.io}.
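To make the conditional flow matching formulation and the synthetic-mixture idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear interpolation path between Gaussian noise and the target stems, which the abstract does not specify. The function names `synthesize_mixture`, `cfm_training_pair`, and `cfm_loss` are hypothetical, and the velocity network with its audio-visual conditioning is omitted.

```python
import numpy as np


def synthesize_mixture(speech, music, effects):
    """Sum three isolated stems into one mixture (hypothetical helper).

    Returns the mixture and the stacked stems, which act as the
    separation targets during training.
    """
    stems = np.stack([speech, music, effects])  # shape: (3, num_samples)
    return stems.sum(axis=0), stems


def cfm_training_pair(x1, rng):
    """Build one flow matching training example for target x1.

    Assumption: a linear path x_t = (1 - t) * x0 + t * x1 with x0 drawn
    from a standard Gaussian, so the target velocity is x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)  # noise sample
    t = rng.uniform()                   # time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1        # point on the path at time t
    target_velocity = x1 - x0           # constant velocity along the path
    return xt, t, target_velocity


def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random arrays stand in for real speech, music, and effects audio.
    speech, music, effects = rng.standard_normal((3, 16000))
    mixture, stems = synthesize_mixture(speech, music, effects)
    xt, t, v = cfm_training_pair(stems, rng)
    # A real model would predict the velocity from (xt, t, mixture, video).
    print(mixture.shape, xt.shape, cfm_loss(np.zeros_like(v), v))
```

In a full system, a network conditioned on the mixture and on visual features would be trained to minimize this loss, and separation would integrate the learned velocity field from noise to the stems.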