Multimodal large models have shown advantages across a wide range of downstream tasks, and their development is an important step toward general artificial intelligence. In this paper, we propose UniBriVL, a novel universal language representation learning method based on Bridging-Vision-and-Language (BriVL). UniBriVL embeds audio, image, and text into a shared space, enabling a variety of multimodal applications. Our approach addresses major challenges in robust language (both text and audio) representation learning and effectively captures the correlation between audio and image. We also present a qualitative evaluation of images generated from UniBriVL, which highlights the potential of our approach for creating images from audio. Overall, our experimental results demonstrate the efficacy of UniBriVL in downstream tasks and its ability to select appropriate images from audio. The proposed approach has potential applications in speech recognition, music signal processing, and captioning systems.
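To make the idea of selecting images from audio in a shared embedding space concrete, the following is a minimal sketch, not the UniBriVL implementation. It assumes that audio and image encoders have already mapped their inputs to vectors of the same dimension, and it ranks candidate images by cosine similarity to an audio embedding. The function names, file names, and the toy 3-dimensional vectors are all hypothetical and chosen only for illustration; the abstract does not specify the similarity measure used.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve_images(audio_embedding, image_embeddings, top_k=1):
    """Rank candidate images by similarity to the audio embedding.

    image_embeddings maps an image name to its embedding vector.
    Returns the top_k (name, score) pairs, best match first.
    """
    scored = [
        (name, cosine_similarity(audio_embedding, emb))
        for name, emb in image_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical, hand-made embeddings standing in for encoder outputs.
image_embeddings = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "ocean.jpg": [0.0, 0.2, 0.9],
    "piano.jpg": [0.1, 0.9, 0.1],
}
audio_embedding = [0.8, 0.2, 0.1]  # assumed embedding of a barking clip

best = retrieve_images(audio_embedding, image_embeddings, top_k=2)
for name, score in best:
    print(f"{name}: {score:.3f}")
```

In this toy setup the audio vector lies closest to the "dog.jpg" vector, so that image is ranked first. With real encoders, the same ranking step would apply unchanged to their output embeddings, provided both modalities are projected into the same space.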