ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability when the text prompt conflicts with the visual content, and from imprecise stylistic control because temporal and timbre information are entangled in the reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation. We propose ControlFoley, a unified multimodal V2A framework that supports precise control through video, text, and reference-audio conditions. We introduce a joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder to improve alignment and textual controllability. We further propose temporal-timbre decoupling to suppress redundant temporal cues while preserving discriminative timbre features. In addition, we design a modality-robust training scheme with unified multimodal representation alignment (REPA) and random modality dropout. We also present VGGSound-TVC, a benchmark for evaluating textual controllability under varying degrees of visual-text conflict. Extensive experiments demonstrate state-of-the-art performance across multiple V2A tasks, including text-guided, text-controlled, and audio-controlled generation. ControlFoley achieves superior controllability under cross-modal conflict while maintaining strong synchronization and audio quality, and performs competitively with, or better than, an industrial V2A system. Code, models, datasets, and demos are available at: https://yjx-research.github.io/ControlFoley/.
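
To make the idea of random modality dropout concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how dropout is applied, so everything here is an assumption for illustration: the function name `random_modality_dropout`, the independent per-modality drop probability `drop_prob`, and the use of an all-zero vector as the "modality absent" placeholder (a learned null embedding would be another common choice).

```python
import random
from typing import Dict, List, Optional

Embedding = List[float]


def random_modality_dropout(
    conditions: Dict[str, Embedding],
    drop_prob: float = 0.3,
    rng: Optional[random.Random] = None,
) -> Dict[str, Embedding]:
    """Independently blank out each conditioning modality with probability drop_prob.

    Hypothetical helper: a dropped modality is replaced by a zero vector of the
    same length, standing in for an "absent" condition during training.
    """
    if rng is None:
        rng = random.Random()
    dropped: Dict[str, Embedding] = {}
    for name, embedding in conditions.items():
        if rng.random() < drop_prob:
            # Assumed placeholder for a missing modality.
            dropped[name] = [0.0] * len(embedding)
        else:
            # Copy so the caller's embedding is never mutated.
            dropped[name] = list(embedding)
    return dropped
```

Applied once per training example, a scheme of this kind exposes the model to every subset of the video, text, and reference-audio conditions, which is one plausible way to keep generation stable when a modality is missing at inference time.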
