In recent years, diffusion-based generative models have attracted huge attention in both the visual and audio generation areas, owing to their realistic generation results and wide range of personalized applications. Compared with the considerable advancements in text2image and text2audio generation, research in audio2visual and visual2audio generation has progressed relatively slowly. Recent audio-visual generation methods usually resort to huge large language models or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back and show that a simple and lightweight generative transformer, which has not been fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space and is trained in a mask-denoising manner. After training, classifier-free guidance can be deployed off the shelf to achieve better performance, without any extra training or modification. Since the transformer model is modality-symmetric, it can also be directly deployed for audio2image generation and co-generation. In the experiments, we show that our simple method surpasses recent image2audio generation methods. Generated audio samples can be found at https://docs.google.com/presentation/d/1ZtC0SeblKkut4XJcRaDsSTuCRIXB3ypxmSi7HTY3IyQ
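The two core mechanisms named above, masked-token denoising over a discrete codebook and off-the-shelf classifier-free guidance, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the token shapes, `mask_ratio`, `mask_id`, and guidance `scale` are assumed values for demonstration.

```python
import numpy as np

def mask_tokens(tokens, mask_ratio, mask_id, rng):
    """Masked-denoising input: randomly replace a fraction of the discrete
    VQ tokens with a special [MASK] id; the model learns to predict them."""
    tokens = tokens.copy()
    n_mask = max(1, int(round(mask_ratio * len(tokens))))
    idx = rng.choice(len(tokens), size=n_mask, replace=False)
    tokens[idx] = mask_id
    return tokens, idx

def cfg_logits(cond_logits, uncond_logits, scale):
    """Classifier-free guidance at sampling time: extrapolate conditional
    logits away from unconditional ones; needs no extra training."""
    return uncond_logits + scale * (cond_logits - uncond_logits)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio_tokens = rng.integers(0, 1024, size=256)      # assumed codebook size 1024
    noisy, masked_idx = mask_tokens(audio_tokens, mask_ratio=0.5,
                                    mask_id=1024, rng=rng)
    print(f"masked {len(masked_idx)} of {len(audio_tokens)} tokens")

    # With scale > 1 the conditional signal is amplified relative to unconditional.
    cond = np.array([1.0, 2.0, 0.5])
    uncond = np.array([0.2, 0.2, 0.2])
    print(cfg_logits(cond, uncond, scale=3.0))
```

At `scale=1.0` the guidance reduces to plain conditional logits; larger scales trade sample diversity for stronger adherence to the image condition.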