Multimodal learning models have become increasingly important as they surpass single-modality approaches on diverse tasks ranging from question answering to autonomous driving. Despite this importance, existing efforts focus on NLP applications, where the number of modalities is typically fewer than four (audio, video, text, images). In other domains, however, inputs are more varied: in the medical field they may include X-rays, PET scans, MRIs, genetic screening, clinical notes, and more, creating a need for information fusion that is both efficient and accurate. Many state-of-the-art models rely on pairwise cross-modal attention, which does not scale well to applications with more than three modalities. For $n$ modalities, pairwise attention requires $\binom{n}{2}$ attention operations, a count that grows quadratically in $n$ and can demand considerable computational resources. To address this, we propose a new domain-neutral attention mechanism, One-Versus-Others (OvO) attention, that scales linearly with the number of modalities and requires only $n$ attention operations, a significant reduction in computational complexity compared to existing cross-modal attention algorithms. Using three diverse real-world datasets and an additional simulation experiment, we show that our method improves performance over popular fusion techniques while decreasing computational cost.
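To make the scaling claim concrete, the following minimal sketch counts attention operations under the two schemes described in the abstract. It does not implement OvO attention itself; it only enumerates which modality groupings each scheme would attend over. The function names and the five-modality list are hypothetical, chosen to mirror the medical example, and the pairwise count follows the abstract's $\binom{n}{2}$ figure (unordered pairs).

```python
from itertools import combinations
from math import comb


def pairwise_attention_ops(modalities):
    """Enumerate unordered modality pairs: one cross-modal attention per pair."""
    return list(combinations(modalities, 2))


def one_versus_others_ops(modalities):
    """Enumerate one operation per modality, each against all remaining ones."""
    return [
        (m, tuple(other for other in modalities if other != m))
        for m in modalities
    ]


# Hypothetical modality names, mirroring the medical example in the abstract.
medical = ["xray", "pet", "mri", "genetics", "notes"]

pairs = pairwise_attention_ops(medical)
ovo = one_versus_others_ops(medical)

print(f"pairwise: {len(pairs)} operations")        # C(5, 2) = 10
print(f"one-versus-others: {len(ovo)} operations")  # n = 5

# The gap widens as the number of modalities grows.
for n in (3, 5, 10, 20):
    print(n, comb(n, 2), n)
```

With three modalities the two counts coincide (3 each), which is consistent with the abstract's remark that pairwise attention becomes a problem only beyond three modalities; at 20 modalities the counts are 190 versus 20.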