The recent surge of interest in Multimodal Neural Networks (MM-NN) is attributed to their ability to process and integrate information from diverse data sources. In an MM-NN, features are extracted from multiple modalities by suitable unimodal backbones and fused by dedicated fusion networks. Although this strengthens the multimodal information representation, designing such networks is labor-intensive: it requires tuning the architectural parameters of the unimodal backbones, choosing the fusion point, and selecting the fusion operations. Furthermore, multimodal AI is emerging as a cutting-edge option in Internet of Things (IoT) systems, where inference latency and energy consumption are critical metrics in addition to accuracy. In this paper, we propose Harmonic-NAS, a framework for the joint, hardware-aware optimization of unimodal backbones and multimodal fusion networks on resource-constrained devices. Harmonic-NAS uses a two-tier optimization approach covering the unimodal backbone architectures and the fusion strategy and operators. Evaluation results on various devices and multimodal datasets show that, by incorporating the hardware dimension into the optimization, Harmonic-NAS outperforms state-of-the-art approaches, achieving up to 10.9% accuracy improvement, 1.91x latency reduction, and 2.14x energy efficiency gain.
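To make the two-tier idea concrete, the following is a minimal sketch and not the Harmonic-NAS implementation. It assumes one tier chooses the unimodal backbone settings and the other chooses the fusion point and operator, with latency and energy treated as hard budgets. The search spaces, the `proxy_metrics` formulas, and all names are hypothetical, and exhaustive enumeration stands in for whatever search algorithms the framework actually uses.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical toy search spaces; the abstract does not specify the real ones.
BACKBONE_DEPTHS = [2, 4, 6]                  # candidate depth per unimodal backbone
FUSION_POINTS = ["early", "mid", "late"]     # where the modalities are fused
FUSION_OPS = ["concat", "sum", "attention"]  # how they are fused


@dataclass(frozen=True)
class Candidate:
    depths: tuple       # one backbone depth per modality
    fusion_point: str
    fusion_op: str


def proxy_metrics(c):
    """Made-up stand-ins for measured accuracy, latency (ms), and energy (mJ)."""
    acc = 0.60 + 0.03 * sum(c.depths) / len(c.depths)
    acc += {"early": 0.00, "mid": 0.02, "late": 0.01}[c.fusion_point]
    acc += {"concat": 0.00, "sum": 0.005, "attention": 0.03}[c.fusion_op]
    latency = 2.0 * sum(c.depths) + {"concat": 1.0, "sum": 0.5, "attention": 4.0}[c.fusion_op]
    energy = 1.5 * latency
    return acc, latency, energy


def score(c, max_latency, max_energy):
    """Accuracy if both hardware budgets are met, otherwise -inf."""
    acc, latency, energy = proxy_metrics(c)
    if latency <= max_latency and energy <= max_energy:
        return acc
    return float("-inf")


def search_fusion(depths, max_latency, max_energy):
    """Fusion tier: best fusion point and operator for fixed backbones."""
    candidates = [Candidate(depths, p, o) for p, o in product(FUSION_POINTS, FUSION_OPS)]
    return max(candidates, key=lambda c: score(c, max_latency, max_energy))


def two_tier_search(num_modalities=2, max_latency=20.0, max_energy=30.0):
    """Backbone tier: enumerate backbone settings, running the fusion tier for each."""
    best = None
    for depths in product(BACKBONE_DEPTHS, repeat=num_modalities):
        cand = search_fusion(depths, max_latency, max_energy)
        if best is None or score(cand, max_latency, max_energy) > score(best, max_latency, max_energy):
            best = cand
    return best


if __name__ == "__main__":
    print(two_tier_search())
```

In this toy setting the deepest backbones are rejected because they exceed the latency budget, so the search settles on a mid-sized configuration. That is the kind of accuracy versus hardware-cost trade-off the abstract describes. If no candidate meets the budgets, the sketch still returns a candidate whose score is `-inf`, so a real implementation would need to handle that case explicitly.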