The rise in loosely-structured data available through text, images, and other modalities has called for new ways of querying them. Multimedia Information Retrieval has filled this gap and has witnessed exciting progress in recent years. Tasks such as search and retrieval of extensive multimedia archives have undergone massive performance improvements, driven to a large extent by recent developments in multimodal deep learning. However, methods in this field remain limited in the kinds of queries they support and, in particular, their inability to answer database-like queries. For this reason, inspired by recent work on neural databases, we propose a new framework, which we name Multimodal Neural Databases (MMNDBs). MMNDBs can answer complex database-like queries that involve reasoning over different input modalities, such as text and images, at scale. In this paper, we present the first architecture able to fulfill this set of requirements and test it with several baselines, showing the limitations of currently available models. The results show the potential of these new techniques to process unstructured data coming from different modalities, paving the way for future research in the area. Code to replicate the experiments will be released at https://github.com/GiovanniTRA/MultimodalNeuralDatabases
翻译:随着文本、图像及其他模态中半结构化数据的激增,传统查询方式已难以满足需求。多媒体信息检索填补了这一空白,并在近年来取得了令人瞩目的进展。大规模多媒体档案的搜索与检索等任务性能大幅提升,这主要得益于多模态深度学习的近期发展。然而,该领域的方法在支持的查询类型上仍存在局限,尤其是无法回答类似数据库的查询问题。为此,受神经数据库领域最新研究的启发,我们提出了一种名为多模态神经数据库的新框架。该框架能够大规模回答涉及不同输入模态(如文本与图像)推理的复杂数据库式查询。本文介绍了首个满足上述需求的架构,并通过多个基线测试验证了现有模型的局限性。实验结果表明,这些新技术在处理来自不同模态的非结构化数据方面潜力巨大,为未来研究奠定了基础。实验复现代码已发布于https://github.com/GiovanniTRA/MultimodalNeuralDatabases。