Generating audio from a video's visual context has several practical applications in how we interact with audio-visual media, for example enhancing CCTV footage analysis, restoring historical videos (e.g., silent movies), and improving video generation models. We propose a novel method that generates audio from video with a sequence-to-sequence model, improving on prior work that relied on CNNs and WaveNet and struggled with sound diversity and generalization. Our approach employs a 3D Vector Quantized Variational Autoencoder (VQ-VAE) to capture the video's spatial and temporal structure, together with a custom audio decoder that produces a broader range of sounds. The model is trained on a segment of the YouTube-8M dataset, focusing on specific domains.
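The vector quantization step at the core of a VQ-VAE can be illustrated with a short example. This is a minimal NumPy sketch under stated assumptions, not the implementation used in this work: the function name `quantize`, the latent grid shape `(T, H, W, D)`, and the codebook size are hypothetical values chosen for illustration, and the 3D encoder and the audio decoder are omitted.

```python
import numpy as np


def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    latents:  (T, H, W, D) array, a spatio-temporal grid of encoder outputs
              (shape is an assumption for this sketch)
    codebook: (K, D) array of code vectors
    Returns the code indices, shape (T, H, W), and the quantized latents,
    shape (T, H, W, D).
    """
    flat = latents.reshape(-1, latents.shape[-1])  # (N, D)
    # Squared Euclidean distance from every latent to every code: (N, K)
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)  # nearest code per latent
    quantized = codebook[indices].reshape(latents.shape)
    return indices.reshape(latents.shape[:-1]), quantized


# Illustrative sizes only: 16 codes of dimension 8, a 4 x 6 x 6 latent grid.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))
latents = rng.normal(size=(4, 6, 6, 8))

indices, quantized = quantize(latents, codebook)

# Flattening the index grid gives a discrete token sequence, which is the
# kind of input a sequence-to-sequence model could consume.
tokens = indices.reshape(-1)
```

Flattening the index grid is one simple way to turn a video clip into a token sequence; the ordering of time and space in that flattening is a design choice that this sketch does not attempt to settle.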