With the rapid growth of social media platforms, users are sharing billions of multimedia posts containing audio, images, and text. Researchers have focused on building autonomous systems capable of processing such multimedia data to solve challenging multimodal tasks, including cross-modal retrieval, matching, and verification. Existing works use separate networks to extract embeddings of each modality and bridge the gap between them. The modular structure of these branched networks underlies numerous multimodal applications and has become a de facto standard for handling multiple modalities. In contrast, we propose a novel single-branch network capable of learning discriminative representations for unimodal as well as multimodal tasks without changing the network. An important feature of our single-branch network is that it can be trained using either a single modality or multiple modalities without sacrificing performance. We evaluate our proposed single-branch network on a challenging multimodal problem, face-voice association, for cross-modal verification and matching tasks with various loss formulations. Experimental results demonstrate the superiority of our proposed single-branch network over existing methods in a wide range of experiments. Code: https://github.com/msaadsaeed/SBNet
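The contrast between branched and single-branch designs can be illustrated with a minimal sketch. This is not the SBNet implementation from the repository. It assumes that face and voice features have already been extracted to the same dimensionality by some upstream encoder, and the names `SingleBranch` and `cosine_score`, the layer sizes, the tanh activation, and the untrained random weights are all hypothetical choices made for illustration. The only point it shows is that one set of weights embeds both modalities, so cross-modal verification reduces to comparing two outputs of the same network.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class SingleBranch:
    """Hypothetical toy model: one shared projection used for every modality."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(in_dim)
        # A single weight matrix; there is no per-modality branch.
        self.weights = [
            [rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def embed(self, features):
        """Map a feature vector (face or voice) into the shared embedding space."""
        if len(features) != len(self.weights[0]):
            raise ValueError("feature size must match in_dim")
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, features)) + b)
            for row, b in zip(self.weights, self.bias)
        ]
        return l2_normalize(hidden)


def cosine_score(a, b):
    """Cosine similarity of two unit-length embeddings (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Stand-in features; assumed to come from pretrained face and voice encoders.
rng = random.Random(1)
face_features = [rng.gauss(0.0, 1.0) for _ in range(8)]
voice_features = [rng.gauss(0.0, 1.0) for _ in range(8)]

branch = SingleBranch(in_dim=8, out_dim=4)
face_emb = branch.embed(face_features)    # same weights ...
voice_emb = branch.embed(voice_features)  # ... for both modalities
score = cosine_score(face_emb, voice_emb)
print(f"face-voice similarity: {score:.3f}")
```

In a trained system the score would be thresholded to decide whether a face and a voice belong to the same identity. A branched design would instead hold two separate `weights` matrices, one per modality.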