Audiovisual Moments in Time: A Large-Scale Annotated Dataset of Audiovisual Actions

We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of audiovisual action events. In an extensive annotation task, 11 participants labelled a subset of 3-second audiovisual videos from the Moments in Time (MIT) dataset. For each trial, participants assessed whether the labelled audiovisual action event was present and whether it was the most prominent feature of the video. The dataset includes annotations for 57,177 audiovisual videos, each independently evaluated by 3 of the 11 trained participants. From this initial collection, we created a curated test set of 16 distinct action classes with 60 videos each (960 videos in total). We also offer 2 sets of pre-computed audiovisual feature embeddings, using VGGish/YamNet for audio data and VGG16/EfficientNetB0 for visual data, thereby lowering the barrier to entry for audiovisual DNN research. We explored whether AVMIT annotations and feature embeddings improve performance on audiovisual event recognition. A series of 6 Recurrent Neural Networks (RNNs) were trained on either AVMIT-filtered audiovisual events or modality-agnostic events from MIT, and then tested on our audiovisual test set. In all RNNs, training exclusively on audiovisual events increased top-1 accuracy by 2.71-5.94%, a gain that outweighed the benefit of a three-fold increase in training data. We anticipate that the newly annotated AVMIT dataset will serve as a valuable resource for research and comparative experiments involving computational models and human participants, specifically for research questions where audiovisual correspondence is of critical importance.
