The recent surge of interest in Multimodal Neural Networks (MM-NNs) is attributed to their ability to effectively process and integrate multiscale information from diverse data sources. MM-NNs extract and fuse features from multiple modalities using suitable unimodal backbones and dedicated fusion networks. Although this strengthens the multimodal information representation, designing such networks is labor-intensive: it requires tuning the architectural parameters of the unimodal backbones, choosing the fusion point, and selecting the fusion operations. Furthermore, multimodal AI is emerging as a cutting-edge option in Internet of Things (IoT) systems, where inference latency and energy consumption are critical metrics in addition to accuracy. In this paper, we propose Harmonic-NAS, a framework for the joint, hardware-aware optimization of unimodal backbones and multimodal fusion networks on resource-constrained devices. Harmonic-NAS uses a two-tier optimization approach that covers the unimodal backbone architectures as well as the fusion strategy and operators. With the hardware dimension incorporated into the optimization, evaluation results on various devices and multimodal datasets demonstrate the superiority of Harmonic-NAS over state-of-the-art approaches, with up to 10.9% accuracy improvement, 1.91x latency reduction, and 2.14x energy efficiency gain.
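To make the idea of a two-tier, hardware-aware search more concrete, the following is a minimal sketch, not the Harmonic-NAS implementation. It assumes that the first tier samples unimodal backbone configurations, that the second tier samples a fusion point and fusion operator for each backbone set, and that candidates are compared by Pareto dominance over accuracy, latency, and energy. All names (`Candidate`, `evaluate`, `two_tier_search`), the search spaces, and the cost formulas are hypothetical placeholders; in particular, `evaluate` returns synthetic numbers rather than measured results.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    backbones: tuple   # hypothetical per-modality backbone widths
    fusion: tuple      # hypothetical (fusion_point, fusion_op) pair
    accuracy: float    # higher is better
    latency: float     # lower is better
    energy: float      # lower is better


def evaluate(backbones, fusion, rng):
    """Toy stand-in for training plus on-device measurement (synthetic values)."""
    width = sum(backbones)
    point, op = fusion
    op_cost = {"sum": 1.0, "concat": 1.5, "attention": 3.0}[op]
    noise = rng.uniform(-0.01, 0.01)
    accuracy = 0.5 + 0.002 * width + 0.01 * op_cost + 0.005 * point + noise
    latency = 0.1 * width + op_cost * (1 + point)
    energy = 0.05 * width + 0.5 * op_cost
    return Candidate(backbones, fusion, accuracy, latency, energy)


def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one."""
    no_worse = (a.accuracy >= b.accuracy and a.latency <= b.latency
                and a.energy <= b.energy)
    better = (a.accuracy > b.accuracy or a.latency < b.latency
              or a.energy < b.energy)
    return no_worse and better


def pareto_front(cands):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def two_tier_search(seed=0, n_backbones=8, n_fusions=4):
    rng = random.Random(seed)
    widths = [16, 32, 64]  # assumed backbone search space
    fusion_space = list(itertools.product([0, 1, 2], ["sum", "concat", "attention"]))
    evaluated = []
    for _ in range(n_backbones):  # first tier: sample unimodal backbones
        backbones = (rng.choice(widths), rng.choice(widths))
        for fusion in rng.sample(fusion_space, n_fusions):  # second tier: fusion
            evaluated.append(evaluate(backbones, fusion, rng))
    return evaluated, pareto_front(evaluated)


if __name__ == "__main__":
    evaluated, front = two_tier_search()
    print(f"{len(front)} Pareto-optimal candidates out of {len(evaluated)}")
```

Random sampling is used here only to keep the sketch short; a real search would likely replace it with a guided strategy and replace `evaluate` with actual training and hardware profiling.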