We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos. Generating audio for silent egocentric videos could open new applications in virtual reality, in assistive technologies, and for augmenting existing datasets. Existing work has been limited to domains like speech, music, or impact sounds and cannot capture the broad range of audio frequencies found in egocentric videos. EgoSonics addresses these limitations by building on the strengths of latent diffusion models for conditioned audio synthesis. We first encode and process paired audio-video data to make them suitable for generation. The encoded data is then used to train a model that generates an audio track capturing the semantics of the input video. Our proposed SyncroNet builds on top of ControlNet to provide control signals that enable the generation of temporally synchronized audio. Extensive evaluations and a comprehensive user study show that our model outperforms existing work in audio quality and in synchronization, as measured by our proposed synchronization evaluation method. Furthermore, we demonstrate downstream applications of our model in improving video summarization.
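To make the conditioning idea concrete, the following is a minimal, hypothetical PyTorch sketch of the general pattern described above: a ControlNet-style branch maps per-frame video features to time-aligned control signals that are injected into a spectrogram denoiser, so the generated audio can track the video over time. All module names, shapes, and the toy denoiser are illustrative assumptions, not the EgoSonics implementation.

```python
# Illustrative sketch only: a ControlNet-style conditioning branch that turns
# per-frame video features into control signals added to a spectrogram denoiser.
# Module names, shapes, and the denoiser are assumptions, not EgoSonics code.
import torch
import torch.nn as nn

class VideoControlBranch(nn.Module):
    """Maps per-frame video features (B, T, D) to control signals (B, C, T)."""
    def __init__(self, video_dim=512, control_channels=64):
        super().__init__()
        self.proj = nn.Linear(video_dim, control_channels)
        self.temporal = nn.Conv1d(control_channels, control_channels,
                                  kernel_size=3, padding=1)

    def forward(self, video_feats):            # (B, T, D)
        x = self.proj(video_feats)             # (B, T, C)
        x = x.transpose(1, 2)                  # (B, C, T)
        return self.temporal(x)                # (B, C, T)

class SpectrogramDenoiser(nn.Module):
    """Toy denoiser over mel-spectrogram latents (B, C, F, T); the control
    signal is broadcast over frequency and added, mimicking ControlNet-style
    injection of per-frame conditioning."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, noisy_latent, control):  # (B, C, F, T), (B, C, T)
        control = control.unsqueeze(2)         # (B, C, 1, T), broadcasts over F
        return self.net(noisy_latent + control)

if __name__ == "__main__":
    B, T, D = 2, 32, 512       # batch, video frames, feature dim (assumed)
    C, F = 64, 80              # latent channels, mel bins (assumed)
    video_feats = torch.randn(B, T, D)
    noisy_latent = torch.randn(B, C, F, T)     # audio latent aligned to T frames
    control = VideoControlBranch(D, C)(video_feats)
    denoised = SpectrogramDenoiser(C)(noisy_latent, control)
    print(denoised.shape)      # torch.Size([2, 64, 80, 32])
```

The key design point the sketch illustrates is that the conditioning signal shares the video's temporal axis, which is what allows the generated spectrogram to stay synchronized with the input frames.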