Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering. In this work, we find that such representations must not only capture features from both modalities jointly but should also be diverse in order to generalize well. To this end, we propose joint vision-language representation learning that diversifies the tokenization learning process, so that the tokens learned from both modalities are sufficiently disentangled from each other. We observe that our approach outperforms the baseline models in a majority of settings and is competitive with state-of-the-art methods.