Recent advances in visiolinguistic (VL) learning have produced models and techniques capable of solving a variety of tasks that require vision and language to work together. However, the datasets currently used for VL pre-training contain only a limited amount of visual and linguistic knowledge, which significantly limits the generalization capabilities of many VL models. External knowledge sources such as knowledge graphs (KGs) and large language models (LLMs) can cover these generalization gaps by filling in missing knowledge, which has led to the emergence of hybrid architectures. In this survey, we analyze tasks that have benefited from such hybrid approaches. We also categorize existing knowledge sources and types, and then discuss the KG vs. LLM dilemma and its potential impact on future hybrid approaches.