We investigate Counterfactual Video Foley Generation: synthesizing audio for a silent video that adopts a sound-source identity contradicting the visual evidence while remaining temporally synchronized to the video. Existing Video&Text-to-Audio (VT2A) models struggle with this task and often remain anchored to the visually implied sound source when the video and text conflict. We present CounterFlow, an inference-time dual-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt. CounterFlow substantially improves Counterfactual Video Foley Generation compared to naive negative prompting and state-of-the-art baselines. To evaluate replacement quality, we propose a metric that uses a text-audio co-embedding space to measure both target-prompt evidence and residual leakage of the visually implied source. Video demonstrations and code are available at https://gyubin-lee.github.io/counterflow-demo/
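The dual-phase schedule described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `velocity` callable, the switch point `switch`, the guidance weight `w`, the Euler integrator, and the guidance formula (classifier-free-guidance style, with the visually implied source used as the negative prompt) are all assumptions introduced for illustration. `toy_velocity` is a hypothetical stand-in for a pretrained flow-matching VT2A model.

```python
import numpy as np


def dual_phase_sample(velocity, x0, video, target, source,
                      n_steps=20, switch=0.5, w=2.0):
    """Euler-integrate a flow ODE from t=0 to t=1 in two conditioning phases.

    All parameter names are hypothetical; `velocity(x, t, video, text)` is an
    assumed interface to a pretrained model's velocity field.
    """
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    switch_step = int(round(switch * n_steps))
    for i in range(n_steps):
        t = i * dt
        if i < switch_step:
            # Phase 1 (assumed form): keep video conditioning for temporal
            # structure and steer away from the visually implied source.
            v_pos = velocity(x, t, video, target)
            v_neg = velocity(x, t, video, source)
            v = v_neg + w * (v_pos - v_neg)
        else:
            # Phase 2: drop video conditioning, follow the target prompt only.
            v = velocity(x, t, None, target)
        x = x + dt * v
    return x


def toy_velocity(x, t, video, text):
    """Toy constant field standing in for a real model (illustration only)."""
    video_term = 0.0 if video is None else video
    return np.full_like(x, text + video_term)


sample = dual_phase_sample(toy_velocity, np.zeros(4), video=0.5,
                           target=1.0, source=0.0, n_steps=10)
print(sample)
```

In a real setting, `velocity` would wrap the pretrained model's forward pass and `x0` would be Gaussian noise in the audio latent space; the switch point and guidance weight would need tuning.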