This paper presents BGTAI, an approach that simplifies multimodal understanding by using gloss-based annotation as an intermediate step for aligning text and audio with images. Textual and audio inputs carry dynamic temporal factors, including various predicate adjectives that shape the meaning of an entire sentence, whereas images depict static scenes. By representing text and audio as gloss notations that omit these complex semantic nuances, a closer alignment with images can potentially be achieved. To explore the feasibility of this idea, we first propose the Langue2Gloss model, the first of its kind, and then integrate it into the multimodal model UniBriVL for joint training. To strengthen the adaptability of gloss to text/audio and to overcome the efficiency and instability issues in multimodal training, we further propose DS-Net (Data-Pair Selection Network), a Result Filter module, and a novel SP-Loss function. Our approach outperforms previous multimodal models in the main experiments, demonstrating its efficacy in enhancing multimodal representations and improving compatibility among text, audio, visual, and other sequence modalities.
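To make the intended data flow concrete, the sketch below illustrates the language-to-gloss-to-image alignment idea. It is a minimal illustration, not the paper's released implementation: the class and function names (Langue2GlossModel, gloss_image_alignment_loss) are hypothetical stand-ins, and the loss shown is a generic InfoNCE-style contrastive objective rather than the actual SP-Loss, whose definition is not given here.

```python
# Illustrative sketch only. The interfaces below are hypothetical, not the
# paper's API. Data flow: text/audio features -> gloss representation ->
# shared embedding space, aligned with image embeddings contrastively.
import torch
import torch.nn.functional as F


class Langue2GlossModel(torch.nn.Module):
    """Hypothetical stand-in: projects language (text/audio) features toward
    a gloss vocabulary; the real Langue2Gloss model is more elaborate."""

    def __init__(self, in_dim: int, gloss_vocab_size: int):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, gloss_vocab_size)

    def forward(self, lang_feats: torch.Tensor) -> torch.Tensor:
        # One gloss-logit distribution per input; a sequence model in practice.
        return self.proj(lang_feats)


def gloss_image_alignment_loss(gloss_emb: torch.Tensor,
                               image_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Generic symmetric contrastive loss between gloss and image embeddings,
    standing in for the paper's SP-Loss (assumption, not the actual SP-Loss)."""
    gloss_emb = F.normalize(gloss_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = gloss_emb @ image_emb.t() / temperature
    # Matched gloss/image pairs lie on the diagonal of the batch.
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Under this reading, Langue2Gloss strips temporally varying semantics from the language side before alignment, so the contrastive objective compares two comparably static representations, which is the motivation the abstract gives for the gloss intermediate step.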