We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos. Generating audio for silent egocentric videos could open new applications in virtual reality, in assistive technologies, and for augmenting existing datasets. Existing work has been limited to domains like speech, music, or impact sounds and cannot capture the broad range of audio frequencies found in egocentric videos. EgoSonics addresses these limitations by building on the strengths of latent diffusion models for conditioned audio synthesis. We first encode and process paired audio-video data to make them suitable for generation. The encoded data is then used to train a model that generates an audio track capturing the semantics of the input video. Our proposed SyncroNet builds on top of ControlNet to provide control signals that enable the generation of temporally synchronized audio. Extensive evaluations and a comprehensive user study show that our model outperforms existing work in audio quality and in synchronization, as measured by our proposed synchronization evaluation method. Furthermore, we demonstrate downstream applications of our model in improving video summarization.
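To make the conditioning idea concrete, the following is a minimal, hypothetical PyTorch sketch of the general pattern described above: a ControlNet-style branch maps per-frame video features to time-aligned control signals that are injected into a spectrogram denoiser, so the generated audio can track the video over time. All module names, shapes, and the toy denoiser are illustrative assumptions, not the EgoSonics implementation.

```python
# Illustrative sketch only: a ControlNet-style conditioning branch that turns
# per-frame video features into control signals added to a spectrogram denoiser.
# Module names, shapes, and the denoiser are assumptions, not EgoSonics code.
import torch
import torch.nn as nn

class VideoControlBranch(nn.Module):
    """Maps per-frame video features (B, T, D) to control signals (B, C, T)."""
    def __init__(self, video_dim=512, control_channels=64):
        super().__init__()
        self.proj = nn.Linear(video_dim, control_channels)
        self.temporal = nn.Conv1d(control_channels, control_channels,
                                  kernel_size=3, padding=1)

    def forward(self, video_feats):            # (B, T, D)
        x = self.proj(video_feats)             # (B, T, C)
        x = x.transpose(1, 2)                  # (B, C, T)
        return self.temporal(x)                # (B, C, T)

class SpectrogramDenoiser(nn.Module):
    """Toy denoiser over mel-spectrogram latents (B, C, F, T); the control
    signal is broadcast over frequency and added, mimicking ControlNet-style
    injection of per-frame conditioning."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, noisy_latent, control):  # (B, C, F, T), (B, C, T)
        control = control.unsqueeze(2)         # (B, C, 1, T), broadcasts over F
        return self.net(noisy_latent + control)

if __name__ == "__main__":
    B, T, D = 2, 32, 512       # batch, video frames, feature dim (assumed)
    C, F = 64, 80              # latent channels, mel bins (assumed)
    video_feats = torch.randn(B, T, D)
    noisy_latent = torch.randn(B, C, F, T)     # audio latent aligned to T frames
    control = VideoControlBranch(D, C)(video_feats)
    denoised = SpectrogramDenoiser(C)(noisy_latent, control)
    print(denoised.shape)      # torch.Size([2, 64, 80, 32])
```

The key design point the sketch illustrates is that the conditioning signal shares the video's temporal axis, which is what allows the generated spectrogram to stay synchronized with the input frames.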