We present StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video-to-audio models achieve strong semantic and temporal fidelity, they remain largely limited to mono output or fail to deliver object-aware stereo imaging, constrained by the lack of professionally mixed, spatially accurate video-to-audio datasets. First, we develop and train a base model that generates stereo audio from video, achieving state-of-the-art performance in both semantic accuracy and synchronization. Next, to overcome dataset limitations, we introduce a synthetic data generation pipeline that combines video analysis, object tracking, and audio synthesis with dynamic panning and distance-based loudness controls, enabling spatially accurate, object-aware sound. Finally, we fine-tune the base model on this synthetic dataset, yielding clear object-audio correspondence. Since no established metrics exist for this task, we introduce stereo object-awareness measures and validate them through a human listening study, showing strong correlation with perception. This work establishes the first end-to-end framework for stereo, object-aware video-to-audio generation, addressing a critical gap and setting a new benchmark in the field.
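To make the spatialization step concrete, the following is a minimal sketch of how a tracked object's screen position and distance could drive dynamic panning and distance-based loudness when rendering a mono source to stereo. The abstract does not specify the pan law or attenuation curve; constant-power panning and inverse-distance falloff are assumptions here, and the function name and parameters are hypothetical.

```python
import numpy as np

def spatialize_mono(source: np.ndarray, x_norm: float, distance: float,
                    ref_distance: float = 1.0) -> np.ndarray:
    """Render a mono clip to stereo from a tracked object's position.

    Sketch only: assumes constant-power panning across the screen width
    and 1/d loudness falloff relative to a reference distance.
    """
    # Map normalized horizontal position (0 = left edge, 1 = right edge)
    # to a pan angle in [0, pi/2] for a constant-power pan law.
    theta = np.clip(x_norm, 0.0, 1.0) * np.pi / 2
    gain_l, gain_r = np.cos(theta), np.sin(theta)

    # Inverse-distance loudness scaling, clamped at the reference distance.
    loudness = ref_distance / max(distance, ref_distance)

    left = loudness * gain_l * source
    right = loudness * gain_r * source
    return np.stack([left, right], axis=-1)  # shape: (num_samples, 2)
```

In a per-frame pipeline, the gains would be recomputed from the tracker output at each video frame and interpolated across the corresponding audio samples to avoid zipper artifacts.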