Audiovisual data is ubiquitous in the digital age, which places higher demands on the deep learning models built on it. Handling the information in multi-modal data well is the key to a better audiovisual model. We observe that audiovisual data naturally carries temporal attributes, such as the time information of each frame in a video. More concretely, such data is inherently multi-modal, comprising audio and visual cues that unfold in strict chronological order. This indicates that temporal information is important for multi-modal acoustic event modeling, both within each modality (intra-modal) and across modalities (inter-modal). However, existing methods process each modality's features independently and simply fuse them, which neglects temporal relations and leads to sub-optimal performance. Motivated by this, we propose a Temporal Multi-modal graph learning method for Acoustic event Classification, called TMac, which models such temporal information with graph learning techniques. In particular, we construct a temporal graph for each acoustic event by dividing its audio and video data into multiple segments. Each segment is treated as a node, and the temporal relationships between nodes are represented as timestamps on their edges. In this way, we can smoothly capture both intra-modal and inter-modal dynamic information. Several experiments demonstrate that TMac outperforms other state-of-the-art (SOTA) models. Our code is available at https://github.com/MGitHubL/TMac.
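The graph construction described above can be illustrated with a minimal sketch. It is not taken from the TMac repository: the names `Node` and `build_temporal_graph` are hypothetical, and the edge rules (linking consecutive segments within a modality, and linking audio and video segments that share a time slot) are assumptions chosen only to show what "segments as nodes, timestamps on edges" might look like.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    """One segment of an acoustic event (hypothetical structure)."""
    modality: str  # "audio" or "video"
    segment: int   # position of the segment within the event


# An edge is (source node, target node, timestamp).
Edge = Tuple[Node, Node, int]


def build_temporal_graph(num_segments: int) -> Tuple[List[Node], List[Edge]]:
    """Build a toy temporal graph for one event.

    Assumption: audio and video are split into the same number of
    segments, and the timestamp of an edge is the index of the later
    segment it touches.
    """
    modalities = ("audio", "video")
    nodes = [Node(m, i) for m in modalities for i in range(num_segments)]

    edges: List[Edge] = []
    for t in range(num_segments):
        # Inter-modal edge: audio and video segments in the same time slot.
        edges.append((Node("audio", t), Node("video", t), t))
        if t + 1 < num_segments:
            # Intra-modal edges: consecutive segments of the same modality.
            for m in modalities:
                edges.append((Node(m, t), Node(m, t + 1), t + 1))
    return nodes, edges


nodes, edges = build_temporal_graph(3)
for src, dst, ts in edges:
    print(f"t={ts}: {src.modality}[{src.segment}] -> {dst.modality}[{dst.segment}]")
```

Because edges are emitted in non-decreasing timestamp order, a model could consume them as a chronological stream of interactions; how TMac itself orders and encodes them is not specified in this abstract.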