Audio-Visual Correlation Learning (AVCL) aims to capture and understand the natural correlations between audio and visual data. The rapid growth of deep learning has propelled the development of methods that process audio-visual data, as reflected in the number of proposals published in recent years, motivating a comprehensive survey of the field. Besides analyzing the models used in this context, we also discuss the task definitions and paradigms applied in AI multimedia. In addition, we investigate frequently used objective functions and discuss how audio-visual data is exploited in the optimization process, i.e., the different methodologies for representing knowledge in the audio-visual domain. In particular, we focus on how human-understandable mechanisms, i.e., structured knowledge that reflects comprehensible concepts, can guide the learning process. Most importantly, we summarize the recent progress of AVCL and discuss future research directions.