Heterogeneous graphs provide a compact, efficient, and scalable way to model data involving multiple disparate modalities, which makes them an attractive option for modeling audiovisual data. However, graph structure does not arise naturally in audiovisual data: the graphs are constructed manually, which is both difficult and sub-optimal. In this work, we address this problem by (i) proposing a parametric graph construction strategy for the intra-modal edges and (ii) learning the crossmodal edges. To this end, we develop a new model, the heterogeneous graph crossmodal network (HGCN). Owing to its parametric construction, the model can adapt to various spatial and temporal scales, while the learnable crossmodal edges connect the relevant nodes across modalities. Experiments on a large benchmark dataset (AudioSet) show that our model achieves state-of-the-art performance (0.53 mean average precision), outperforming transformer-based models and other graph-based models.
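The two ideas named in the abstract, parametric intra-modal edges and learned crossmodal edges, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the construction parameters or the edge-learning mechanism, so the `span` and `dilation` parameters, the projected dot-product scoring, and every function name below are assumptions chosen for illustration. Only the forward computation is shown; in a trained model the projection matrices would be optimized, whereas here they are randomly initialized.

```python
import numpy as np


def intra_modal_adjacency(num_nodes, span, dilation=1):
    """Hypothetical parametric construction within one modality.

    Each node is linked to up to `span` neighbours on each side,
    spaced `dilation` steps apart along the node ordering.
    """
    adj = np.zeros((num_nodes, num_nodes), dtype=np.float32)
    for i in range(num_nodes):
        for k in range(1, span + 1):
            j = i + k * dilation
            if j < num_nodes:
                adj[i, j] = adj[j, i] = 1.0  # undirected edge
    return adj


def crossmodal_edge_weights(audio_feats, video_feats, w_audio, w_video):
    """Hypothetical soft audio-to-video edges.

    Both modalities are projected to a shared dimension, scored by a
    scaled dot product, and normalized with a row-wise softmax.
    """
    q = audio_feats @ w_audio            # (num_audio_nodes, d)
    k = video_feats @ w_video            # (num_video_nodes, d)
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
audio = rng.normal(size=(6, 8))    # 6 audio nodes, 8-dim features (made up)
video = rng.normal(size=(4, 16))   # 4 video nodes, 16-dim features (made up)
w_a = rng.normal(size=(8, 5)) * 0.1   # stand-ins for learnable projections
w_v = rng.normal(size=(16, 5)) * 0.1

adj_audio = intra_modal_adjacency(6, span=2, dilation=1)
adj_video = intra_modal_adjacency(4, span=1, dilation=2)
cross = crossmodal_edge_weights(audio, video, w_a, w_v)
```

In this sketch, changing `span` and `dilation` alters how far intra-modal edges reach, which is one simple way a parametric construction could cover different temporal scales, while each row of `cross` is a distribution over video nodes for one audio node.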