AudioScenic: Audio-Driven Video Scene Editing

Audio-driven visual scene editing aims to modify the visual background according to a given audio signal while leaving the foreground content unchanged. Existing work focuses primarily on image editing, and audio-driven video scene editing remains largely unexplored. In this paper, we introduce AudioScenic, an audio-driven framework for video scene editing. AudioScenic integrates audio semantics into the visual scene through a temporal-aware audio semantic injection process. Because our focus is on background editing, we further introduce a SceneMasker module, which preserves the integrity of the foreground content during editing. AudioScenic also exploits two inherent properties of audio, magnitude and frequency, to guide the editing process, with the aim of controlling temporal dynamics and enhancing temporal consistency. First, we present an audio Magnitude Modulator module that adjusts the temporal dynamics of the scene in response to changes in audio magnitude, enhancing the visual dynamics. Second, the audio Frequency Fuser module is designed to ensure temporal consistency by aligning the frequency of the audio with the dynamics of the video scene, improving the overall temporal coherence of the edited videos. Together, these components enable AudioScenic to enhance visual diversity while maintaining temporal consistency throughout the video. We also propose a new metric, the temporal score, for more comprehensive validation of temporal consistency. Experiments on the DAVIS and AudioSet datasets demonstrate substantial improvements of AudioScenic over competing methods.
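To make the two audio properties concrete, the sketch below computes one magnitude value and one frequency value per video frame from a raw waveform. It is only an illustration under stated assumptions: the abstract does not say how magnitude or frequency is measured, so root-mean-square amplitude and the dominant bin of a naive discrete Fourier transform are used here as stand-ins, and the helper name `frame_audio_features` and its parameters are hypothetical rather than part of AudioScenic.

```python
import cmath
import math


def frame_audio_features(samples, sample_rate, fps):
    """Return one (rms, dominant_hz) pair per video frame.

    Hypothetical helper, not the paper's method: the waveform is cut into
    windows of sample_rate // fps samples, one window per video frame.
    """
    hop = int(sample_rate // fps)  # audio samples per video frame
    features = []
    for start in range(0, len(samples) - hop + 1, hop):
        window = samples[start:start + hop]
        # Magnitude (assumed measure): root-mean-square amplitude.
        rms = math.sqrt(sum(x * x for x in window) / hop)
        # Frequency (assumed measure): strongest naive-DFT bin, DC skipped.
        best_bin, best_mag = 0, 0.0
        for k in range(1, hop // 2 + 1):
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / hop)
                for n, x in enumerate(window)
            )
            if abs(coeff) > best_mag:
                best_bin, best_mag = k, abs(coeff)
        features.append((rms, best_bin * sample_rate / hop))
    return features


if __name__ == "__main__":
    sr, fps = 800, 10  # toy rates so the naive DFT stays fast
    # Two frames of audio: a quiet 100 Hz tone, then a loud 200 Hz tone.
    quiet = [0.1 * math.sin(2 * math.pi * 100 * n / sr) for n in range(80)]
    loud = [0.8 * math.sin(2 * math.pi * 200 * n / sr) for n in range(80)]
    for rms, hz in frame_audio_features(quiet + loud, sr, fps):
        print(f"rms={rms:.3f} dominant_hz={hz:.1f}")
```

In a pipeline of this kind, the per-frame magnitude could plausibly scale how strongly each frame is edited, and the per-frame frequency could serve as a signal for keeping neighboring frames coherent. How the Magnitude Modulator and Frequency Fuser actually consume these quantities is not specified in the abstract.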
