Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications

Multimodality Representation Learning, a technique for learning to embed information from different modalities together with their correlations, has achieved remarkable success in a variety of applications, such as Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision Language Retrieval (VLR). In these applications, cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform multimodal tasks (e.g., understanding, recognition, retrieval, or generation) optimally. Researchers have proposed diverse methods to address these tasks, and variants of transformer-based architectures have performed particularly well across multiple modalities. This survey presents a comprehensive review of the literature on the evolution and enhancement of deep learning multimodal architectures that handle textual, visual, and audio features for diverse cross-modal and modern multimodal tasks. It summarizes (i) recent task-specific deep learning methodologies, (ii) pretraining types and multimodal pretraining objectives, (iii) the progression from state-of-the-art pretrained multimodal approaches to unifying architectures, and (iv) multimodal task categories and possible future improvements for better multimodal learning. Moreover, we provide a dataset section for new researchers that covers most of the benchmarks for pretraining and finetuning. Finally, we explore major challenges, gaps, and potential research topics. A continuously updated paper list related to our survey is maintained at https://github.com/marslanm/multimodality-representation-learning.
