Vanilla fusion methods still dominate a large share of mainstream audio-visual tasks, yet their effectiveness remains open to question from a theoretical perspective. This paper therefore reconsiders the fused signal in the multimodal case from a bionics perspective and proposes a simple, plug-and-play attention module for vanilla fusion, grounded in fundamental signal theory and uncertainty theory. In addition, previous work on multimodal dynamic gradient modulation still relies on decoupling the modalities. A decoupling-free gradient modulation scheme is therefore designed to work in conjunction with the attention module, offering several advantages over its decoupled counterpart. Experimental results show that just a few lines of code can improve several multimodal classification methods by up to 2.0%. Finally, quantitative evaluation on other fusion tasks reveals the method's potential for additional application scenarios.