We propose a new problem: retrieving audio files relevant to multimodal design documents, i.e., inputs that combine textual elements and visual imagery, such as birthday or greeting cards. Beyond enhancing user experience, pairing these documents with audio that matches their theme and style also improves their accessibility (e.g., visually impaired users can listen to the audio instead). Recent work on audio retrieval exists, but its methods and datasets explicitly target natural images. Our problem instead concerns multimodal design documents created by users with creative software, which differ substantially from natural photographs. To this end, our first contribution is a new large-scale dataset, Melodic-Design (MELON), which we collect and curate; it comprises design documents spanning various styles, themes, templates, illustrations, etc., paired with music audio. Building on this paired image-text-audio dataset, our second contribution is a novel multimodal cross-attention audio retrieval (MMCAR) algorithm that trains neural networks to learn a shared feature space across the image, text, and audio modalities. Using these learned features, we demonstrate that our method outperforms existing state-of-the-art methods, and we establish a new reference benchmark on our dataset for the research community.
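To make the retrieval setting concrete, the following is a minimal sketch of how a design document might be matched against audio in a shared feature space. It is not the MMCAR algorithm itself: the abstract does not specify the architecture, so this sketch assumes a single-head scaled dot-product cross-attention without learned projections, mean pooling, and cosine-similarity ranking. All function names (`cross_attention`, `embed_document`, `retrieve_audio`) and the token and embedding shapes are hypothetical, and the inputs stand in for features that trained encoders would produce.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned projection matrices; a trained model would have them.
    queries: (n_q, d), keys_values: (n_kv, d) -> output: (n_q, d)
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (n_q, n_kv)
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ keys_values


def embed_document(image_tokens, text_tokens):
    """Fuse image and text tokens into one unit-length document embedding.

    Image tokens attend over text tokens; the residual sum is mean-pooled.
    """
    attended = cross_attention(image_tokens, text_tokens)
    fused = (image_tokens + attended).mean(axis=0)
    return l2_normalize(fused)


def retrieve_audio(doc_embedding, audio_embeddings, k=3):
    """Return indices and cosine similarities of the k closest audio clips."""
    audio = l2_normalize(audio_embeddings)
    sims = audio @ doc_embedding
    top = np.argsort(-sims)[:k]
    return top, sims[top]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_tokens = rng.normal(size=(4, 8))   # placeholder image features
    text_tokens = rng.normal(size=(5, 8))    # placeholder text features
    audio_bank = rng.normal(size=(10, 8))    # placeholder audio embeddings

    doc = embed_document(image_tokens, text_tokens)
    indices, scores = retrieve_audio(doc, audio_bank, k=3)
    print(indices, scores)
```

In a trained system, the encoders and attention weights would be learned, for example with a contrastive objective over paired documents and audio, so that matching pairs score higher than mismatched ones; that training step is omitted here.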