Audiovisual data is everywhere in this digital age, which places higher demands on the deep learning models built on it. Handling the information in such multi-modal data well is the key to a better audiovisual model. We observe that audiovisual data naturally carries temporal attributes, such as the time information of each frame in a video. More concretely, such data is inherently multi-modal, comprising audio and visual cues that proceed in strict chronological order. This indicates that temporal information is important in multi-modal acoustic event modeling, both within and across modalities. However, existing methods process the features of each modality independently and simply fuse them, which neglects temporal relations and thus leads to sub-optimal performance. Motivated by this, we propose a Temporal Multi-modal graph learning method for Acoustic event Classification, called TMac, which models such temporal information via graph learning techniques. In particular, we construct a temporal graph for each acoustic event by dividing its audio data and video data into multiple segments. Each segment is treated as a node, and the temporal relationships between nodes are represented as timestamps on their edges. In this way, we can smoothly capture both intra-modal and inter-modal dynamic information. Several experiments demonstrate that TMac outperforms other SOTA models. Our code is available at https://github.com/MGitHubL/TMac.
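The graph construction described above can be illustrated with a minimal sketch. It is not the TMac implementation: the names `Node`, `build_temporal_graph`, and `edges_up_to` are hypothetical, and the edge wiring (consecutive segments within a modality, plus audio and video segments sharing an index) is an assumption made for illustration, since the abstract does not specify it. The sketch only shows segments as nodes and temporal relationships as timestamps on edges.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    modality: str   # "audio" or "video"
    segment: int    # index of the segment within the event


# (source node, target node, timestamp)
Edge = Tuple[Node, Node, int]


def build_temporal_graph(num_segments: int) -> Tuple[List[Node], List[Edge]]:
    """Build a toy temporal graph for one acoustic event.

    Assumed wiring (not taken from the paper):
      * intra-modal edges link consecutive segments of the same modality;
      * inter-modal edges link the audio and video segments sharing an index.
    Each edge carries the index of its target segment as its timestamp.
    """
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")

    modalities = ("audio", "video")
    nodes = [Node(m, k) for m in modalities for k in range(num_segments)]

    edges: List[Edge] = []
    for k in range(num_segments):
        # Inter-modal edge between co-occurring audio and video segments.
        edges.append((Node("audio", k), Node("video", k), k))
        if k > 0:
            # Intra-modal edges from the previous segment to the current one.
            for m in modalities:
                edges.append((Node(m, k - 1), Node(m, k), k))
    return nodes, edges


def edges_up_to(edges: List[Edge], t: int) -> List[Edge]:
    """Return the edges whose timestamp is at most t, in chronological order."""
    return sorted((e for e in edges if e[2] <= t), key=lambda e: e[2])


nodes, edges = build_temporal_graph(num_segments=4)
```

In a real pipeline each node would also carry a feature vector extracted from its audio or video segment, and a temporal graph model would consume the edges in timestamp order; `edges_up_to` mimics that chronological view on the toy graph.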