Researchers have increasingly recognized that, in some cases, self-supervised pre-training on large-scale Internet data outperforms training on high-quality, manually labeled datasets, and that large multimodal models outperform small single-modal or bimodal ones. In this paper, we propose WavBriVL, a robust audio representation learning method based on Bridging-Vision-and-Language (BriVL). WavBriVL projects audio, images, and text into a shared embedding space, which makes multimodal applications possible. We qualitatively evaluate images generated from WavBriVL's shared embedding space, with two main goals: (1) learning the correlation between audio and images; and (2) exploring a new approach to image generation, namely generating images from audio. Experimental results show that the method can effectively generate appropriate images from audio.
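To make the idea of a shared embedding space concrete, the following is a minimal sketch of how vectors from different modalities could be compared once they are projected into one space. It is not WavBriVL's implementation: the projection heads are random linear maps standing in for trained encoders, and the names `make_projection`, `project`, `cosine`, and `nearest`, as well as all dimensions and feature values, are hypothetical choices made for illustration only.

```python
import math
import random


def make_projection(in_dim, out_dim, seed):
    """Random in_dim x out_dim matrix; a stand-in for a trained projection head."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features, weights):
    """Linearly map a feature vector into the shared space, then L2-normalize it."""
    out_dim = len(weights[0])
    z = [sum(f * row[j] for f, row in zip(features, weights)) for j in range(out_dim)]
    norm = math.sqrt(sum(v * v for v in z)) or 1.0
    return [v / norm for v in z]


def cosine(a, b):
    """For unit-length vectors, cosine similarity reduces to the dot product."""
    return sum(x * y for x, y in zip(a, b))


def nearest(query, candidates):
    """Index of the candidate embedding most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))


if __name__ == "__main__":
    shared_dim = 4  # assumed size of the shared space, for illustration only
    audio_head = make_projection(6, shared_dim, seed=0)  # 6-d toy audio features
    image_head = make_projection(8, shared_dim, seed=1)  # 8-d toy image features

    audio_vec = project([1.0, 0.5, -0.2, 0.0, 0.3, 0.9], audio_head)
    image_vecs = [
        project([0.1 * (i + k) for k in range(8)], image_head) for i in range(3)
    ]

    # Cross-modal retrieval: which toy image lies closest to the audio clip?
    print("closest image index:", nearest(audio_vec, image_vecs))
```

With untrained random heads the retrieved index carries no meaning. The sketch only shows the mechanics: once each modality has its own projection into a common space, audio-to-image comparison becomes a plain vector similarity.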